The current COVID-19 pandemic is rife with problems that hackers have attacked with gusto. From 3D printed face shields and homebrew face masks to replacements for full-fledged mechanical ventilators, the outpouring of ideas has been inspirational and heartwarming. At the same time there have been many efforts in a different area: research aimed at fighting the virus itself.
Getting to the root of the problem seems to have the most potential for ending this pandemic and getting ahead of future ones, and that’s the “know your enemy” problem that the distributed computing effort known as Folding@Home aims to address. Millions of people have signed up to donate cycles from spare PCs and GPUs, and in the process have created the largest supercomputer in history.
But what exactly are all these exaFLOPS being used for? Why is protein folding something to direct so much computational might toward? What’s the biochemistry behind this, and why do proteins need to fold in the first place? Here’s a brief look at protein folding: what it is, how it happens, and why it’s important.
First Things First: What Do Proteins Do?
Proteins are crucial to life. They provide not only the structural elements of the cell, but also serve as the enzymes that catalyze just about every biochemical reaction. Proteins, whether structural or enzymatic, are long chains of amino acids that are linked end-to-end in a specific sequence. The functions of proteins are determined by which amino acids are present at various locations on and in the protein. If a protein needs to bind to a positively charged molecule, for example, the binding site might be full of negatively charged amino acids.
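To make the charge-matching idea concrete, here is a minimal Python sketch that tallies the side-chain charges in a stretch of residues. It is a deliberately crude assumption-laden model: aspartate (D) and glutamate (E) count as -1, lysine (K) and arginine (R) as +1, and everything else, including histidine, as neutral at roughly physiological pH. The `binding_site` sequence and the function name are made up for illustration.

```python
# Crude side-chain charges near neutral pH (an approximation for illustration;
# histidine and the chain termini are ignored).
CHARGE_AT_NEUTRAL_PH = {"D": -1, "E": -1, "K": +1, "R": +1}


def net_side_chain_charge(residues: str) -> int:
    """Sum the approximate side-chain charges of a one-letter amino acid string."""
    return sum(CHARGE_AT_NEUTRAL_PH.get(residue, 0) for residue in residues.upper())


# Hypothetical binding-site residues, rich in aspartate and glutamate.
binding_site = "DEEDGSE"
print(net_side_chain_charge(binding_site))  # -5
```

A strongly negative total like this is the sort of patch that would attract a positively charged partner, though real binding also depends on shape and on where the charges sit in three dimensions.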
To understand how proteins achieve the structure that defines their function, a quick review of the basics of molecular biology and the flow of information in the cell is in order.
The production, or expression, of a protein begins with the process of transcription. During transcription, the double-stranded DNA that holds the genetic information in a cell is partially unwound, exposing the nitrogenous bases of the DNA to an enzyme called RNA polymerase, often referred to as RNAPol. RNAPol’s job is to make an RNA copy, or transcript, of the gene. This copy of the gene, called messenger RNA or mRNA, is a single-stranded molecule that is perfect for directing the protein manufacturing machinery of the cell, the ribosomes, in a process called translation.
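Stripped of all the enzymatic detail, transcription is a base-pairing lookup, which can be sketched in a few lines of Python. This is only an illustration: the `transcribe` function and the template sequence are hypothetical, and the template strand is assumed to be written 3'→5' so that the resulting mRNA reads 5'→3'.

```python
# Base-pairing rules for building RNA from a DNA template strand:
# A pairs with U (RNA uses uracil instead of thymine), T with A, G with C, C with G.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_strand: str) -> str:
    """Return the mRNA (5'->3') complementary to a DNA template given 3'->5'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_strand.upper())


# Hypothetical 12-base template strand, written 3'->5'.
template = "TACGGCAAAATT"
print(transcribe(template))  # AUGCCGUUUUAA
```

Real transcription also involves promoters, termination signals, and processing of the transcript, none of which this sketch attempts to model.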
Ribosomes act like a jig, taking the mRNA template and matching it up to other small bits of RNA called transfer RNA, or tRNA. Each tRNA has two main active areas — one that has a three-base section called an anticodon that matches up with complementary codons on the mRNA, and a region for binding an amino acid that’s specific for that codon. During translation, tRNA molecules randomly try to bind to the mRNA in the ribosome using their anticodon. When a match is made, the tRNA molecule attaches its amino acid to the previous amino acid, forming another link in the chain of amino acids coded for by the mRNA.
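The codon-to-amino-acid matching described above can be sketched as a table lookup. The Python below builds the standard genetic code, reads an mRNA three bases at a time starting from the first AUG, and stops at a stop codon. The function names and the example mRNA are hypothetical, and the sketch ignores everything a real ribosome does beyond the lookup itself.

```python
from itertools import product

# Standard genetic code, listed with each codon position cycling through U, C, A, G.
# "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def anticodon(codon: str) -> str:
    """Return the matching tRNA anticodon, written 5'->3'."""
    return "".join(RNA_PAIR[base] for base in reversed(codon))


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is made
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: release the finished chain
        peptide.append(amino_acid)
    return "".join(peptide)


print(anticodon("AUG"))           # CAU
print(translate("AUGCCGUUUUAA"))  # MPF
```

The output string of one-letter codes (here methionine, proline, phenylalanine) is exactly the kind of linear sequence that the rest of the folding story starts from.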
This sequence of amino acids is the first tier of structural hierarchy in a protein, and is referred to as the protein’s primary structure. The entire three-dimensional structure of the protein, and indeed its function, comes directly from the primary structure through the different properties of each of those amino acids and how they interact with each other. If it weren’t for these chemical properties and interactions between amino acids, polypeptides would just remain linear sequences with no three-dimensional structure. We see what happens when those interactions are disrupted every time we cook, since much of cooking comes down to the heat-induced denaturation of the three-dimensional structure of proteins.
Long-Distance Connections Between Parts of Proteins
The level of structure beyond the primary structure is cleverly called the secondary structure, and includes fairly short-range hydrogen bonds between amino acids. These stabilizing interactions form two main motifs: the alpha-helix and the beta-pleated sheet. The alpha-helix forms a tightly coiled polypeptide region, while the beta-sheet is a flat, broad area. Both motifs have structural properties as well as functional properties, depending on the characteristics of the amino acids within them. For example, if an alpha-helix has primarily hydrophilic amino acids within it, like arginine and lysine, it’s likely to be involved in aqueous reactions.
Proteins combine these two motifs, as well as variations on their themes, to form the next level of structure, the tertiary structure. Unlike the simple motifs of the secondary structure, tertiary structure tends to be driven more by hydrophobicity. Most proteins tend to have hydrophobic amino acids, like alanine and methionine, at their core, where water is excluded due to the “greasy” nature of the residues. Hydrophobic stretches like these also often show up in transmembrane proteins, which are embedded in the lipid bilayer membrane surrounding cells. The hydrophobic domains on the protein are thermodynamically stable inside the fatty interior of the membrane, while the hydrophilic regions of the protein are exposed to the aqueous environment on either side of the membrane.
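One common way to spot greasy stretches in a sequence is a hydropathy plot: assign each residue a hydrophobicity score and average over a sliding window. The sketch below uses the widely cited Kyte-Doolittle scale. The window size, the function name, and the example sequence are assumptions chosen for illustration, not anything taken from a specific tool.

```python
# Kyte-Doolittle hydropathy scale: positive values are hydrophobic,
# negative values are hydrophilic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 5) -> list[float]:
    """Average hydropathy over each window of consecutive residues."""
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical sequence: charged ends flanking a greasy middle.
sequence = "KKDELLIVVALLIGKRDE"
for start, score in enumerate(hydropathy_profile(sequence)):
    print(f"{sequence[start:start + 5]}  {score:+.2f}")
```

Windows with strongly positive averages are candidates for a buried core or a membrane-spanning segment. Published analyses typically use wider windows for membrane proteins than the five residues assumed here.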
Tertiary structures also tend to be stabilized by long-distance bonds between amino acids. The classic example of this is the disulfide bridge, which forms between two cysteine residues. If you’ve ever been to a hair salon and smelled the slight rotten-egg stink of someone getting a perm, you’re witnessing the partial denaturation of the tertiary structure of keratin in hair by the reduction of disulfide bonds using sulfur-containing thiol compounds.
Disulfide bridges can occur between cysteine residues in the same polypeptide chain, or between cysteines located in completely different chains. Interactions between different polypeptide chains are the fourth level of protein structure, the quaternary structure. The hemoglobin in your blood is a perfect example of quaternary structure. Each adult hemoglobin molecule is formed by four globin protein subunits, two alpha chains and two beta chains, each of which is held in a specific conformation by interactions within the polypeptide as well as bonding with the iron-containing heme molecule. In hemoglobin's case, the four globin subunits are held together not by disulfide bridges but by noncovalent interactions such as hydrogen bonds, salt bridges, and hydrophobic contacts, and the entire molecule acts as one to bind up to four oxygen molecules at once, and to release them when needed.
Modeling Structures In a Search for Solutions to the Illness
Polypeptide chains begin folding into their final shape during translation, as the growing chain is extruded from the ribosome, similar to the way a piece of straightened memory wire can snap into a complex shape when heated. But as is always the case with biology, there’s much more to the story.
In many cells, the transcribed genes are extensively edited before translation, which can make the primary structure vastly different from what the raw base sequence of the gene would suggest. The translational machinery also often enlists the help of molecular chaperones, proteins that temporarily bind to the nascent polypeptide chain to keep it from getting stuck in an intermediate structure that would stop it from reaching its final shape.
All this is to say that predicting the final shape of a protein from the primary structure is not trivial. For decades, the only way to explore protein structure was with physical methods like X-ray crystallography. It wasn’t until the late 1960s that biophysical chemists started building computational models for protein folding, focused mainly on modeling the secondary structure of a protein. These methods and their descendants take a vast amount of input data in addition to the primary structure sequence, such as tables of bond angles between amino acids, lists of hydrophobicity, charge states, and even conservation of structure and function over evolutionary timescales to make a best guess at what a protein is going to look like.
Current computational methods for secondary structure prediction, like those running on Folding@Home’s network right now, run at about 80% accuracy, which is pretty good considering the complexity of the task. The data generated by the folding prediction models for proteins like the SARS-CoV-2 spike protein will be coupled with physical study data to come up with a firm structure for the protein, and perhaps give us insights into how the virus binds to the human angiotensin-converting enzyme 2 (ACE2) receptors that line the respiratory tract, which is its path into the body. If we can figure out the structure, we might be able to find drugs to block binding and prevent infection.
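Accuracy figures for secondary structure prediction are commonly reported per residue: each position is labeled helix (H), sheet (E), or coil (C), and the score is the fraction of positions where the prediction matches the experimentally observed state, often called Q3. Assuming that is the kind of figure quoted above, a minimal sketch of the calculation looks like this. Both label strings and the function name are made up for illustration.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state (H, E, or C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must be the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical 20-residue example: H = alpha-helix, E = beta-sheet, C = coil.
observed = "CCHHHHHHCCEEEEECCCCC"
predicted = "CCCHHHHCCCCEEECCCCCC"
print(f"{q3_accuracy(predicted, observed):.0%}")  # 80%
```

Note that all four errors in this toy example sit at the edges of a helix or sheet, so a prediction can score 80% while still finding every structural element.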
Protein folding research is central to our understanding of so many diseases and infections that even once we figure out a way to beat COVID-19, the Folding@Home network, which has seen such explosive growth over the past month, will not go idle for long. The network is a research tool well-suited to exploring protein models central to dozens of diseases that are related to misfolded proteins, such as Alzheimer’s and variant Creutzfeldt-Jakob disease, often incorrectly called mad-cow disease. And when the next virus inevitably comes along, all that horsepower, and all the experience being gained in managing it, will be ready to go again.