Exam Prep Book - Genomics, Proteomics, and Systems Biology - PDF
This document is a chapter from a study guide on genomics, proteomics, and systems biology. The chapter provides an overview of the field and highlights key concepts, such as genomes of bacteria, yeast, and other organisms, along with learning objectives for understanding these topics.
CHAPTER 5
Genomics, Proteomics, and Systems Biology

Recent years have seen major changes in the way scientists approach cell and molecular biology, with large-scale experimental and computational approaches being applied to understand the complexities of biological systems. Traditionally, cell and molecular biologists studied one or a few genes or proteins at a time. This was changed by genome sequencing projects, which introduced large-scale experimental approaches that generated vast amounts of data to the study of biological systems. The complete genome sequences of a wide variety of organisms, including many individual humans, provide a wealth of information that forms a new framework for studies of cell and molecular biology and opens new possibilities in medical practice. Not only can the sequences of complete genomes be obtained and analyzed, but it is also now possible to undertake large-scale analyses of all of the RNAs and proteins expressed in a cell. These global experimental approaches form the basis of the new field of systems biology, which seeks a quantitative understanding of the integrated behavior of complex biological systems. This chapter considers the development of these new technologies and their impact on understanding the molecular biology of cells.

5.1 Genomes and Transcriptomes

Learning Objectives
You should be able to:
• Compare the numbers of genes in bacterial and yeast genomes.
• Summarize the contributions of coding and noncoding sequences to the genomes of Drosophila, C. elegans, and Arabidopsis.
• Compare the approximate size and amount of protein-coding sequence in the human genome with that of Drosophila.
• Explain why studies of other vertebrates are useful in understanding human genomics.
• Outline the basic approach used in next-generation sequencing.
• Describe the global methods used to study gene expression.
Obtaining the complete sequence of the human genome was the first large-scale experimental project undertaken in the life sciences. When it was initiated in the 1980s, it appeared a daunting task to completely sequence all three billion base pairs of the human genome. After all, at that time the largest DNA molecule to have been sequenced was a viral genome of less than 200,000 bases. The Human Genome Project became the largest collaborative undertaking in biology and yielded an initial draft sequence in 2001, with a more refined complete sequence of the human genome published in 2004. Along the way, the complete genome sequences of many other species were obtained, providing important insights into genome evolution. Since then, tremendous advances in the technology of DNA sequencing have been made, and new sequencing methodologies allow rapid and economical sequencing of individual genomes or transcribed RNAs. These advances have changed the way scientists think about the structure and function of our genomes, as well as allowing new approaches to disease diagnosis and treatment based on personal genome sequencing.

The genomes of bacteria and yeast
Genomes of bacteria and yeast consist primarily of protein-coding sequences.

The first complete sequence of a cellular genome, reported in 1995 by a team of researchers led by Craig Venter, was that of the bacterium Haemophilus influenzae, a common inhabitant of the human respiratory tract. The genome of H. influenzae is a circular molecule containing approximately 1.8 × 10^6 base pairs, more than 1000 times smaller than the human genome. Once the complete DNA sequence was obtained, it was analyzed to identify the genes encoding rRNAs, tRNAs, and proteins.
Potential protein-coding regions were identified by computer analysis of the DNA sequence to detect open-reading frames: long stretches of nucleotide sequence that can encode polypeptides because they contain none of the three chain-terminating codons (UAA, UAG, and UGA). Since these chain-terminating codons occur randomly once in every 21 codons (three chain-terminating codons out of 64 total), open-reading frames that extend for more than 100 codons usually represent functional genes. This analysis identified 1743 potential protein-coding regions in the H. influenzae genome as well as six copies of rRNA genes and 54 different tRNA genes (Figure 5.1). The predicted coding sequences have an average size of approximately 900 base pairs, so they cover about 1.6 Mb of DNA, corresponding to nearly 90% of the genome of H. influenzae. The genome of E. coli is approximately twice the size of H. influenzae, 4.6 × 10^6 base pairs long and containing about 4200 genes, again with nearly 90% of the DNA used as protein-coding sequence. The use of almost all the DNA to encode proteins is typical of bacterial genomes, thousands of which have now been sequenced.

Figure 5.1 The genome of Haemophilus influenzae. Predicted protein-coding regions are designated by colored bars. Numbers indicate base pairs of DNA. (From R. D. Fleischmann et al., 1995. Science 269: 496.)

A model for a simple eukaryotic genome, which was sequenced in 1996, is found in the yeast Saccharomyces cerevisiae. As discussed in Chapter 1, yeast are simple unicellular organisms, but they have all of the characteristics of eukaryotic cells. The genome of S. cerevisiae consists of 12 × 10^6 base pairs, which is about 2.5 times the size of the genome of E. coli. It contains about 6000 protein-coding genes.
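The open-reading-frame scan described above for H. influenzae can be sketched in a few lines of Python. This is a minimal illustration, not the method used by the sequencing teams: the function names are hypothetical, only the three reading frames of the given strand are scanned (a real analysis would also scan the complementary strand and look for start codons), and stop codons are written in their DNA form (TAA, TAG, TGA). The second function estimates how often a stop-free run of a given length would arise by chance if all 64 codons were equally likely.

```python
# DNA equivalents of the chain-terminating codons UAA, UAG, and UGA.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(dna, min_codons=100):
    """Yield (frame, start, end) for each stop-free run of codons.

    Only the three frames of the given strand are scanned. `start` and
    `end` are 0-based positions; `end` is exclusive and the run does not
    include the stop codon that terminates it.
    """
    dna = dna.upper()
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    yield (frame, run_start, pos)
                run_start = pos + 3
        # A run may also reach the end of the sequence without a stop.
        end = frame + ((len(dna) - frame) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            yield (frame, run_start, end)


def chance_of_random_orf(n_codons):
    """Probability that n_codons random codons contain no stop codon,
    assuming all 64 codons are equally likely (61 are not stops)."""
    return (61 / 64) ** n_codons


if __name__ == "__main__":
    # Short made-up sequence; min_codons is lowered so the toy input qualifies.
    for orf in open_reading_frames("ATGGCCAAATAGGCTCCGGGATTTCAA", min_codons=4):
        print(orf)
    print(chance_of_random_orf(100))
```

Under the equal-frequency assumption, a stop-free run of 100 codons arises by chance less than 1% of the time, which is why long open-reading frames are taken as likely genes.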
Despite the greater complexity of a eukaryotic cell, yeast thus contain only about 50% more genes than E. coli. Protein-coding sequences account for approximately 70% of total yeast DNA, so yeast, like bacteria, have a high density of protein-coding sequence.

The genomes of Caenorhabditis elegans, Drosophila melanogaster, and Arabidopsis thaliana
C. elegans, Drosophila, and Arabidopsis contain fewer protein-coding genes and more noncoding sequence than expected relative to bacteria and yeast. The number of protein-coding genes does not correlate with biological complexity.

The next major advance in genomics was the sequencing of the genomes of the relatively simple multicellular organisms, C. elegans, Drosophila, and Arabidopsis. Distinctive features of each of these organisms make them important models for genome analysis: C. elegans and Drosophila are widely used for studies of animal development, and Drosophila has been especially well analyzed genetically. Likewise, Arabidopsis is a model for studies of plant molecular biology and development. The genomes of all three of these organisms are approximately 100 × 10^6 base pairs, about ten times larger than the yeast genome but about 30 times smaller than the human genome. The determinations of their complete sequences in 1998 (C. elegans) and 2000 (Drosophila and Arabidopsis) were major steps forward, which extended genome sequencing from unicellular bacteria and yeasts to multicellular organisms.

An unanticipated result of sequencing these genomes was that they contained fewer protein-coding genes than expected relative to bacterial or yeast genomes (Table 5.1).

Table 5.1 Representative Genomes

Organism                  Genome size (Mb)a   Number of protein-coding genes   Protein-coding sequence
Bacteria
  H. influenzae           1.8                 1700                             90%
  E. coli                 4.6                 4200                             88%
Yeasts
  S. cerevisiae           12                  6000                             70%
Invertebrates
  C. elegans              97                  19,000                           25%
  Drosophila              180                 14,000                           10%
Plants
  Arabidopsis thaliana    125                 26,000                           25%
Mammals
  Human                   3000                20,000                           1.2%

a Mb = millions of base pairs

Video 5.1 A Brief Introduction to C. elegans

The C. elegans genome is 97 × 10^6 base pairs and contains about 19,000 predicted protein-coding sequences: approximately eight times the amount of DNA but only three times the number of genes in yeast. When the Drosophila genome was sequenced in 2000, it was surprising to find that it contains only about 14,000 protein-coding genes, substantially fewer than the number of genes in C. elegans, even though Drosophila is a more complex organism. Moreover, it is striking that a complex animal like Drosophila has only a little more than twice the number of genes found in yeast. Protein-coding sequences correspond to only about 10% of the Drosophila genome and 25% of the C. elegans genome, compared with 70% of the yeast genome, so the larger sizes of the Drosophila and C. elegans genomes are substantially due to increased amounts of non-protein-coding sequences rather than protein-coding genes.

The completion of the genome sequence of Arabidopsis thaliana in 2000 extended genome sequencing from animals to plants. The Arabidopsis genome, approximately 125 × 10^6 base pairs of DNA, contains approximately 26,000 protein-coding genes, significantly more genes than were found in either C. elegans or Drosophila. Even more genes have since been found in other plant genomes. For example, the genome of the apple contains about 57,000 protein-coding genes, further emphasizing the lack of relationship between gene number and complexity of an organism.

FYI: Apples contain more than twice the number of genes present in the human genome because of genome-wide repeats (see Chapter 6).
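The comparisons drawn from Table 5.1 come down to simple arithmetic: genome size multiplied by the protein-coding fraction gives the amount of coding DNA, and gene number divided by genome size gives gene density. The sketch below uses only the values from Table 5.1; the dictionary and function names are illustrative and not part of the original text.

```python
# Values copied from Table 5.1:
# (genome size in Mb, number of protein-coding genes, protein-coding fraction).
TABLE_5_1 = {
    "H. influenzae": (1.8, 1700, 0.90),
    "E. coli": (4.6, 4200, 0.88),
    "S. cerevisiae": (12, 6000, 0.70),
    "C. elegans": (97, 19000, 0.25),
    "Drosophila": (180, 14000, 0.10),
    "Arabidopsis thaliana": (125, 26000, 0.25),
    "Human": (3000, 20000, 0.012),
}


def coding_mb(organism):
    """Approximate amount of protein-coding sequence, in Mb."""
    size_mb, _, fraction = TABLE_5_1[organism]
    return size_mb * fraction


def noncoding_mb(organism):
    """Approximate amount of non-protein-coding sequence, in Mb."""
    size_mb, _, _ = TABLE_5_1[organism]
    return size_mb - coding_mb(organism)


def genes_per_mb(organism):
    """Gene density: protein-coding genes per Mb of genome."""
    size_mb, genes, _ = TABLE_5_1[organism]
    return genes / size_mb


if __name__ == "__main__":
    for name in TABLE_5_1:
        print(f"{name}: {coding_mb(name):.1f} Mb coding, "
              f"{noncoding_mb(name):.1f} Mb noncoding, "
              f"{genes_per_mb(name):.1f} genes/Mb")
```

With these rounded table values, the human genome carries roughly 36 Mb of protein-coding sequence against roughly 18 Mb in Drosophila, only about a twofold difference, while the noncoding portions differ by well over tenfold.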
The human genome

For many scientists, the ultimate goal of the Human Genome Project was determination of the complete nucleotide sequence of the human genome: approximately 3 × 10^9 base pairs of DNA. Because of its large size, determination of the human genome sequence was a phenomenal undertaking, and its publication in draft form in 2001 was heralded as a scientific achievement of historic magnitude.

The draft sequences of the human genome published in 2001 were produced by two independent teams of researchers, each using different approaches (see Key Experiment). Both of these sequences were initially incomplete drafts in which approximately 90% of the genome had been sequenced and assembled. Continuing efforts then closed the gaps and improved the accuracy of the draft sequences, leading to publication of a high-quality human genome sequence in 2004 by the International Human Genome Sequencing Consortium.

A major surprise from the genome sequence was the unexpectedly low number of human genes (see Table 5.1). The human genome consists of only about 20,000 protein-coding genes, which is not many more than the number of genes in simpler animals like C. elegans and Drosophila and fewer than in Arabidopsis or other plants. Whereas protein-coding sequences correspond to the majority of the genomes of bacteria and yeast and 10–25% of the genomes of Drosophila and C. elegans, they represent only about 1% of the human genome. The nature of the additional sequences in the human genome and their roles in gene regulation, which may contribute more to biological complexity than simply the number of genes, are discussed in the next chapter.

FYI: For many years, scientists generally accepted an estimate of approximately 100,000 genes in the human genome. On publication of the draft genome sequence in 2001, the number was drastically reduced to between 30,000 and 40,000. Current estimates, based on the high-quality sequence published in 2004 and using improved computational tools to identify genes, reduce the number of human protein-coding genes still further, to approximately 20,000.

Over 40% of the predicted human proteins are related to proteins in simpler sequenced eukaryotes, including yeast, Drosophila, and C. elegans. Many of these conserved proteins function in basic cellular processes, such as metabolism, DNA replication and repair, transcription, translation, and protein trafficking. Most of the proteins that are unique to humans are made up of protein domains that are also found in other organisms, but these domains are arranged in novel combinations to yield distinct proteins in humans. Compared with Drosophila and C. elegans, the human genome contains expanded numbers of genes involved in functions related to the greater complexity of vertebrates, such as the immune response, the nervous system, and blood clotting, as well as increased numbers of genes involved in development, cell signaling, and the regulation of gene expression.

The genomes of other vertebrates

In addition to the human genome, a large and growing number of vertebrate genomes have been sequenced, including the genomes of fish, frogs, chickens, dogs, rodents, and primates (Figure 5.2). The genomes of these other vertebrates are similar in size to the human genome and contain a similar number of genes. Their sequences provide interesting comparisons to that of the human genome and are proving useful in facilitating studies of different model organisms and in identifying a variety of different types of functional sequences, including regulatory elements that control gene expression. For example, a comparison of the human, mouse, chicken, and zebrafish genomes indicates that about half of the protein-coding genes are common to all vertebrates, whereas approximately 3000 genes are unique to each of these four species (Figure 5.3).
Figure 5.2 Evolution of sequenced vertebrates. The estimated times (millions of years ago) when species diverged are indicated at branch points in the diagram.

Figure 5.3 Comparison of vertebrate genomes. The number of genes shared between human, mouse, chicken, and zebrafish genomes is indicated. (From K. Howe et al., 2013. Nature 496: 498.)

The genomes of other vertebrates provide useful comparisons with the human genome. The mammalian genomes that have been sequenced, in addition to the human genome, include the genomes of the platypus, opossum, mouse, rat, dog, rhesus macaque, and chimpanzee. As discussed in earlier chapters, the mouse is the key model system for experimental studies of mammalian genetics and development, while the rat is an important model for human physiology and medicine. Mice, rats, and humans have 90% of their genes in common, so the mouse and rat genome sequences provide essential databases for research in these areas.

The many distinct breeds of pet dogs make the sequence of the dog genome particularly important in understanding the genetic basis of morphology, behavior, and a variety of complex diseases that afflict both dogs and humans. There are approximately 300 breeds of dogs, which differ in their physical and behavioral characteristics as well as in their susceptibility to a variety of diseases, including several types of cancer, blindness, deafness, and metabolic disorders. These characteristics are highly specific properties of different breeds, greatly facilitating identification of the responsible genes. For example, as discussed in Chapter 6, studies of breed-specific differences in dog legs led to the identification of a specific type of gene rearrangement responsible for this characteristic.
Recent analyses of dog genomes have also identified genes responsible for coat color and for the body size of small breeds. Similar types of analysis are under way to understand the genetic basis of multiple diseases, including several types of cancer, that are common in some breeds of dogs. Since many of these diseases afflict both dogs and humans, the results of these studies can be expected to impact human health as well as veterinary medicine. In the future, we can also expect genetic analysis of behavior in dogs. Since many canine behaviors, such as separation anxiety, are also common in humans, psychologists may have much to learn from the species that has been our closest companion for thousands of years. The sequences of the genomes of other primates, including the chimpanzee, bonobo, orangutan, and rhesus macaque, may help pinpoint the unique features of our genome that distinguish humans from other primates. Interestingly, however, comparison of these sequences does not suggest an easy answer to the question of what makes us human. The genome sequences of humans and chimpanzees are about 99% identical. Perhaps surprisingly, the sequence differences between humans and chimpanzees frequently alter the coding sequences of genes, leading to changes in the amino acid sequences of most proteins. Although many of these amino acid changes may not affect protein function, it appears that there are changes in the structure as well as in the expression of thousands of genes between chimpanzees and humans, so identifying those differences that are key to the origin of humans is not a simple task. The genome of Neandertals, our closest evolutionary relatives, has also been recently sequenced. It is estimated that Neandertals and modern humans diverged about 300,000–400,000 years ago. The genomes of Neandertals and modern humans are more than 99.9% identical, significantly more closely related to each other than either is to chimpanzees. 
Interestingly, these differences alter the coding sequence of only about 90 genes that are conserved in modern humans. These include genes that are involved in the skin, skeletal development, metabolism, and cognition. Further studies of these genes may elucidate their potential roles in the evolution of modern humans.

Key Experiment
The Human Genome

Initial Sequencing and Analysis of the Human Genome
International Human Genome Sequencing Consortium
Nature, Volume 409, 2001, pages 860–921

The Sequence of the Human Genome
J. Craig Venter and 273 others
Science, Volume 291, 2001, pages 1304–1351

A Skeptical Beginning
The idea of sequencing the entire human genome was conceived in the mid-1980s. It was initially met with broad skepticism among biologists, most of whom felt it was simply not a feasible undertaking. At the time, the largest genome that had been completely sequenced was that of Epstein-Barr virus, which totaled approximately 180,000 base pairs of DNA. From this perspective, sequencing the human genome, which is almost 20,000 times larger, seemed inconceivable to many. However, the idea of such a massive project in biology captivated the imagination of others, including Charles DeLisi, who was then head of the Office of Health and Environmental Research at the U.S. Department of Energy. In 1986 DeLisi succeeded in launching the Human Genome Initiative as a project within the Department of Energy.

The project gained broader support in 1988 when it was endorsed by a committee of the U.S. National Research Council. This committee recommended a broader effort, including sequencing the genomes of several model organisms and the parallel development of detailed genetic and physical maps of the human chromosomes. This effort was centered at the U.S. National Institutes of Health, initially under the direction of James Watson (co-discoverer of the structure of DNA), and then under the leadership of Francis Collins.
The first complete genome to be sequenced was that of the bacterium Haemophilus influenzae, reported by Craig Venter and colleagues in 1995. Venter had been part of the genome sequencing effort at the National Institutes of Health but had left to head a nonprofit company, The Institute for Genomic Research, in 1991. In the meantime, considerable progress had been made in mapping the human genome, and the initial sequence of H. influenzae was followed by the sequences of other bacteria, yeast, and C. elegans in 1998.

In 1998 Venter formed a new company, Celera Genomics, and announced plans to use advanced sequencing technologies to obtain the entire human genome sequence in three years. Collins and other leaders of the publicly funded genome project responded by accelerating their efforts, resulting in a race that eventually led to the publication of two draft sequences of the human genome in February 2001.

Photographs: Eric Lander (Rick Friedman/Corbis Historical/Getty Images.) Craig Venter (Evan Hurd/Alamy Stock Photo.)

The Sequence
The two groups of scientists used different approaches to obtain the human genome sequence. The publicly funded team, The International Human Genome Sequencing Consortium, headed by Eric Lander, sequenced DNA fragments derived from bacterial artificial chromosome (BAC) clones that had been previously mapped to human chromosomes, similar to the approach used to determine the sequence of the yeast and C. elegans genomes (see figure). In contrast, the Celera Genomics team used a whole-genome shotgun sequencing approach that Venter and colleagues had first used to sequence the genome of H. influenzae. In this approach, DNA fragments were sequenced at random, and overlaps between fragments were then used to reassemble a complete genome sequence. Both sequences covered only the euchromatin portion of the human genome (approximately 2900 million base pairs (Mb) of DNA), with the heterochromatin repeat-rich portion of the genome (approximately 300 Mb) remaining unsequenced. Both of these initially published versions were draft, rather than completed, sequences. Subsequent efforts completed the sequence, leading to publication of a highly accurate sequence of the human genome in 2004.

Figure: Strategy for genome sequencing using bacterial artificial chromosome (BAC) clones that had been organized into overlapping clusters (contigs) and mapped to human chromosomes. A target genome is fragmented and cloned, resulting in a library of a large-fragment cloning vector (BAC). The genomic DNA fragments are then organized into a physical map, and individual BAC clones are selected and sequenced by the random shotgun strategy. The clone sequences are assembled to reconstruct the sequence of the genome (for example, ACCGTAAATGGGCTGATCATGCTTAAACCCTGTGCATCCTACTG). (After E. S. Lander et al., 2001. Nature 409: 860.)

Fewer Genes than Expected
Several important conclusions immediately emerged from the human genome sequences. Most strikingly, the number of human genes was surprisingly small and appeared to be between 20,000 and 25,000 in the completed sequence. This unexpected result has led to the recognition of the roles of non-protein-coding sequences in our genome, particularly with respect to the multiple mechanisms by which they regulate gene expression.

Beyond the immediate conclusions drawn in 2004, the sequence of the human genome, together with the genome sequences of other organisms, has provided a new basis for biology and medicine.
The impact of the genome sequence has been and continues to be felt in discovering new genes and their functions, understanding gene regulation, elucidating the basis of human diseases, and developing new strategies for prevention and treatment based on the genetic makeup of individuals. Knowledge of the human genome may ultimately contribute to meeting what Venter and colleagues refer to as “The real challenge of human biology . . . to explain how our minds have come to organize thoughts sufficiently well to investigate our own existence.”

Question: How can genome sequencing contribute to the diagnosis and treatment of cancer?

Next-generation sequencing and personal genomes
Next-generation sequencing allows individual genomes to be sequenced for about $1000.

The human genome and most of the genomes of model organisms discussed above were sequenced using the dideoxynucleotide technique discussed in Chapter 4, first described by Fred Sanger in 1977 (see Figure 4.20). Automation of this basic method increased its speed and capacity to allow whole genome sequencing, culminating in the successful completion of a high-quality human genome sequence in 2004. However, even with robust automation, genome sequencing by this approach was slow and expensive, so that the sequencing of a complete genome was a major undertaking. For example, the initial sequencing of the human genome took 15 years at a cost of approximately $3 billion.

Starting around 2005, a number of new sequencing methods, collectively called next-generation sequencing, were developed that have substantially increased the speed and lowered the cost of genome sequencing (Figure 5.4). Since 2001, the cost of sequencing a human genome has decreased about 100,000-fold, from approximately $100 million to about $1000. The speed of sequencing has increased even more, so it is now possible to sequence a complete human genome in a few days.
These dramatic changes in sequencing technology have opened the door to sequencing the complete genomes of large numbers of different individuals, allowing new approaches to understanding the genetic basis for many of the diseases that afflict mankind, including cancer, heart disease, and degenerative diseases of the nervous system such as Parkinson’s and Alzheimer’s disease. In addition, understanding our unique genetic makeup as individuals is expected to lead to the development of new tailor-made strategies for disease prevention and treatment.

Figure 5.4 Progress in DNA sequencing. The cost of sequencing a human genome has dropped from approximately $100 million in 2001 to about $1000 in 2015. (Data from the National Human Genome Research Institute.) [Plot of cost per genome, $1K to $100M, against year, 2001 to 2015.]

Next-generation DNA sequencing (also called massively parallel sequencing) refers to several different methods in which millions of templates are sequenced simultaneously in a single reaction. The basic general strategy of these methods is illustrated in Figure 5.5. First, the DNA is fragmented and adapter sequences, which serve as primers for amplification and sequencing reactions, are added to the ends of each fragment. Single DNA molecules are then attached to a solid surface and amplified by PCR to produce clusters of spatially separated templates. Millions of templates can then be sequenced in parallel by using lasers to monitor the incorporation of fluorescent nucleotides. The sequences derived from this large collection of overlapping fragments can then be assembled to yield a continuous genome sequence. Alternatively, if a known genome sequence (for example, the human genome) is already available, it can be used as a reference genome on which fragment sequences from a particular individual can be aligned.
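The idea of assembling a continuous sequence from overlapping fragments can be illustrated with a toy greedy assembler. This is only a sketch under strong simplifying assumptions: reads are error-free, come from one strand, and the sequence contains no repeats long enough to create false overlaps. Real assemblers use far more elaborate methods, and the function names and the `min_len` parameter here are hypothetical.

```python
def overlap(a, b, min_len=5):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of sequences left when no pair overlaps by at
    least `min_len` bases (a single sequence if assembly is complete).
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps; leave the pieces separate
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Made-up example: three overlapping fragments of a short sequence.
    target = "ACCGTAAATGGGCTGATCATGCTTAAACCCTGTGCATCCTACTG"
    fragments = [target[24:44], target[0:20], target[12:32]]
    print(greedy_assemble(fragments))
```

Greedy merging is chosen here only because it is short and easy to follow; it can misassemble repetitive sequence, which is one reason alignment to a reference genome is attractive when a reference is available.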
The first individual human genomes to be sequenced included those of Craig Venter and James Watson, reported in 2007 and 2008. Since then, the genome sequences of thousands of individuals have been determined. With the cost of sequencing an individual genome now in the range of $1000, it can be anticipated that personal genome sequencing will become part of medical practice. This will allow therapies to be specifically tailored to the needs of individual patients, both with respect to disease prevention and treatment. Perhaps the best current example, discussed in Chapter 20, is the development of new drugs for cancer treatment, which are specifically targeted against mutations that can be identified by sequencing the cancer genomes of individual patients. In the future, we may expect genome sequencing of healthy people to play an important role in disease prevention by identifying genes that confer susceptibility to disease, followed by taking appropriate measures to intervene. For example, genome sequencing could identify women carrying mutations in genes that confer a high risk for development of breast cancer, which might be prevented by mastectomy. There is also little doubt that continuing progress in genomics will not only lead to increasing applications in medicine, but will also help us to elucidate the contribution of our genes to other unique characteristics, such as athletic ability or intelligence, and to better understand the interactions between genes and environment that lead to complex human behaviors.

Animation 5.1 Next-Generation Sequencing
Global analysis of gene expression

The availability of complete genome sequences has enabled researchers to study gene expression on a genome-wide global level. It is thus now possible to analyze all of the RNAs that are transcribed in a cell (the transcriptome), rather than analyzing the expression of one gene at a time. One commonly used method for global expression analysis is hybridization to DNA microarrays, which allows expression of tens of thousands of genes to be analyzed simultaneously. A DNA microarray consists of a glass or silicon chip onto which oligonucleotides are printed by a robotic system in small spots at a high density (Figure 5.6). Each spot on the array consists of a single oligonucleotide. Tens of thousands of oligonucleotides can be printed onto a typical chip, so it is readily possible to produce DNA microarrays containing sequences representing all of the genes in cellular genomes. As illustrated in Figure 5.6, one widespread application of DNA microarrays is in studies of gene expression; for example, a comparison of the genes expressed by two different types of cells. In an experiment of this type, cDNAs are synthesized from the mRNAs expressed in each of the two cell types (e.g., cancer cells and normal cells) by reverse transcription. The cDNAs are labeled with fluorescent dyes and hybridized to DNA microarrays in which 20,000 or more human genes are represented by oligonucleotide spots.
The arrays are then analyzed using a high-resolution laser scanner, and the relative extent of transcription of each gene is indicated by the intensity of fluorescence at the appropriate spot on the array. It is then possible to analyze the relative levels of gene expression between the cancer cells and normal cells by comparing the intensity of hybridization of their cDNAs to oligonucleotides representing each of the genes in the cell.
Figure 5.5 Next-generation sequencing. Cellular DNA is fragmented and adapters are ligated to the ends of each fragment. Single molecules are then anchored to a solid surface and amplified by PCR, forming millions of clusters of molecules. Four color-labeled reversible chain terminating nucleotides are added together with DNA polymerase and a primer that recognizes the adapter sequence. Incorporation of a labeled nucleotide into each cluster of DNA molecules is detected by a laser. Unincorporated nucleotides are removed, chain termination is reversed, and the cycle is repeated to obtain the sequences of millions of clusters simultaneously.
Figure 5.6 DNA microarrays. An example of comparative analysis of gene expression in cancer cells and normal cells. mRNAs extracted from cancer cells and normal cells are used as templates for synthesis of cDNAs labeled with a fluorescent dye. The labeled cDNAs are then hybridized to a DNA microarray containing spots of oligonucleotides corresponding to 20,000 or more distinct human genes. The relative level of expression of each gene is indicated by the intensity of fluorescence at each position on the microarray, and the levels of expression in cancer cells and normal cells can be compared. Examples of genes expressed at higher levels in cancer cells are indicated by arrows.
Animation 5.2 DNA Microarray Technology
All of the RNAs expressed in a cell can be determined by next-generation sequencing.
The continuing development of next-generation sequencing has also made it feasible to use DNA sequencing to determine and quantify all of the RNAs expressed in a cell. In this approach, called RNA-seq, cellular mRNAs are isolated, converted to cDNAs by reverse transcription, and subjected to next-generation sequencing (Figure 5.7). In contrast to microarray analysis, RNA-seq reveals the complete extent of transcribed sequences in a cell, rather than just detecting those that hybridize to a probe on a microarray. The frequency with which individual sequences are detected in RNA-seq is proportional to the quantity of RNA in the cell, so this analysis determines the abundance as well as the identity of all transcribed sequences. The sensitivity of RNA-seq is high enough to allow analysis at the single cell level, so the transcriptomes of individual cells can be determined. Transcriptome analysis of human cells indicates that each type of cell expresses approximately 11,000 of the 20,000 protein-coding genes in the human genome. About 6,000 of these genes are common to multiple cell types, suggesting that they serve general cell maintenance functions. Surprisingly, RNA-seq analysis has also shown that many more RNAs are transcribed than are accounted for by the protein-coding genes of human cells. As discussed in the next chapter, these studies have led to the identification of new classes of RNAs that play critical roles in gene regulation.
Figure 5.7 RNA-seq. Cellular mRNAs are reverse transcribed to cDNAs, which are subjected to next-generation sequencing. The results yield the sequences of all mRNAs in a cell. The relative amount of each mRNA is indicated by the frequency at which its sequence is represented in the total number of sequences read.
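The idea that read frequency reflects mRNA abundance can be illustrated with a small sketch. The gene names and read counts below are hypothetical, and the reads are assumed to have already been assigned to genes; real analyses also correct for factors such as transcript length and sequencing depth, which this sketch ignores.

```python
from collections import Counter

def relative_abundance(read_gene_ids):
    """Return each gene's share of the total number of reads."""
    counts = Counter(read_gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

# Hypothetical reads, each labeled with the gene it was assigned to.
cancer_reads = ["geneA"] * 50 + ["geneB"] * 30 + ["geneC"] * 20
normal_reads = ["geneA"] * 10 + ["geneB"] * 30 + ["geneC"] * 60

cancer = relative_abundance(cancer_reads)
normal = relative_abundance(normal_reads)

# Compare the relative expression of each gene in the two cell types.
for gene in sorted(cancer):
    ratio = cancer[gene] / normal[gene]
    print(f"{gene}: cancer {cancer[gene]:.2f}, normal {normal[gene]:.2f}, ratio {ratio:.2f}")
```

In this made-up data set, geneA accounts for a larger share of the reads in cancer cells than in normal cells, which would be read as higher expression in the cancer cells.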
5.1 Review
Bacterial and yeast genomes are compact, with protein-coding sequences accounting for most of the DNA. The S. cerevisiae genome contains about 6000 genes. Protein-coding sequences account for 10–25% of the genomes of C. elegans, Drosophila, and Arabidopsis, which contain approximately 19,000, 14,000, and 26,000 genes, respectively. The human genome contains approximately 20,000 protein-coding genes, not much more than the number of genes found in simpler animals like Drosophila and C. elegans, and fewer than in Arabidopsis and other plants, emphasizing the lack of relationship between gene number and complexity of an organism. Enormous progress in the technology of DNA sequencing has now made it feasible to determine the complete sequence of individual genomes and of all the RNAs expressed in a cell.
Questions
1. How can protein-coding sequences be identified in the DNA sequence of a genome?
2. Find an open reading frame in the following sequence: ACTTAGCCCCGTAAGCTCGATT
3. How does the total amount of protein-coding sequence in the human genome compare to that in Drosophila? How about the total amounts of noncoding sequence?
4. Why are dogs useful models for human medicine?
5. You are using RNA-seq to analyze gene expression in breast cancers compared to normal breast cells. One gene of potential interest yields about five times the frequency of reads in normal cells. What does this say about the gene’s level of expression in the two cell types? Do you think this gene could have a role in cancer?
5.2 Proteomics
Learning Objectives
You should be able to:
• Explain how proteins are identified by mass spectrometry.
• Summarize the approaches used for analysis of the proteome of subcellular organelles.
• Describe the approaches used for identification of protein interactions.
Mass spectrometry is the basic method of protein identification.
The analyses of cell genomes and transcriptomes are only the first steps in understanding the workings of a cell. Since proteins are directly responsible for carrying out almost all cell activities, it is necessary to understand not only what proteins can be encoded by a cell’s genome but also what proteins are actually expressed and how they function within the cell. A complete understanding of cell function therefore requires not only the analysis of the sequence and transcription of its genome, but also a systematic analysis of its protein complement. This large-scale analysis of cell proteins (proteomics) has the goal of identifying and quantifying all of the proteins expressed in a given cell (the proteome), as well as establishing the localization of these proteins to different subcellular organelles and elucidating the networks of interactions between proteins that govern cell activities.
Identification of cell proteins
The number of distinct species of proteins in eukaryotic cells is typically far greater than the number of genes. This arises because many genes can be expressed to yield several distinct mRNAs, which encode different polypeptides as a result of alternative splicing (discussed in Chapter 6). In addition, proteins can be modified in a variety of different ways, including the addition of phosphate groups, carbohydrates, and lipid molecules. The human genome, for example, contains approximately 20,000 different protein-coding genes, and the number of these genes expressed in any given cell is around 10,000. However, because of alternative splicing and protein modifications, it is estimated that these genes can give rise to more than 100,000 different proteins. In addition, these proteins can be expressed at a wide range of levels.
Characterization of the complete protein complement of a cell, the goal of proteomics, thus represents a considerable challenge. Although major progress has been made in the last several years, substantial technological hurdles remain to be overcome before a complete characterization of cell proteomes can be achieved. The major tool used in proteomics is mass spectrometry, which was developed in the 1990s as a powerful method of protein identification (Figure 5.8). Proteins to be analyzed are digested with a protease to cleave them into small fragments (peptides) approximately 20 amino acid residues long. A commonly used protease is trypsin, which cleaves proteins at the carboxy-terminal side of lysine and arginine residues. The peptides are then ionized by irradiation with a laser or by passage through a field of high electrical potential and introduced into a mass spectrometer, which measures the mass-to-charge ratio of each peptide. This generates a mass spectrum in which each tryptic peptide is indicated by a peak corresponding to its mass-to-charge ratio. Computer algorithms can then be used to compare the experimentally determined mass spectrum with a database of theoretical mass spectra representing tryptic peptides of all known proteins, allowing identification of the unknown protein. More detailed sequence information than just the mass of the peptides can be obtained by tandem mass spectrometry (Figure 5.9). In this technique, individual peptides from the initial mass spectrum are automatically selected to enter a “collision cell” in which they are partially degraded by random breakage of peptide bonds. A second mass spectrum of the partial degradation products of each peptide is then determined. Because the amino acids differ in molecular weight (with the exception of leucine and isoleucine, which are isomers of identical mass), the amino acid sequence of the peptide can be deduced from these data.
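The deduction of a sequence from fragment masses can be sketched as follows. This is a simplified illustration, not a description of how real spectra are processed: it assumes an idealized, complete series of fragments whose masses are plain sums of residue masses, whereas real fragment ions carry additional mass offsets and charges and spectra are noisy and incomplete. The function names and the example peptide are hypothetical; the residue masses are standard monoisotopic values for a subset of amino acids.

```python
# Monoisotopic residue masses in daltons (standard values, subset only).
# "L" also stands for isoleucine, which has the same mass.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "D": 115.02694,
    "K": 128.09496, "E": 129.04259, "F": 147.06841, "R": 156.10111,
}

def ladder_for(peptide):
    """Build an idealized fragment ladder: cumulative residue masses."""
    total = 0.0
    ladder = []
    for residue in peptide:
        total += RESIDUE_MASS[residue]
        ladder.append(total)
    return ladder

def sequence_from_ladder(fragment_masses, tolerance=0.01):
    """Read a sequence from fragments that differ by single residues."""
    sequence = []
    previous = 0.0
    for mass in sorted(fragment_masses):
        step = mass - previous  # mass added by one more residue
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - step) <= tolerance]
        sequence.append(matches[0] if matches else "?")
        previous = mass
    return "".join(sequence)

ladder = ladder_for("GASPVK")          # hypothetical peptide
print(sequence_from_ladder(ladder))
```

Each step between neighboring fragments is matched to the residue of that mass, so the order of the steps gives the order of the amino acids. A step that matches no residue in the table is reported as "?".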
Protein modifications, such as phosphorylation, can also be identified because they alter the mass of the modified amino acid.
Figure 5.8 Identification of proteins by mass spectrometry. A protein is digested with a protease that cleaves it into small peptides. The peptides are then ionized and analyzed in a mass spectrometer, which determines the mass-to-charge ratio of each peptide. The results are displayed as a mass spectrum, which is compared to a database of theoretical mass spectra of all known proteins for protein identification.
Figure 5.9 Tandem mass spectrometry. A mixture of peptides is separated in a first mass spectrometer (mass spectrometer 1). A randomly selected peptide is then fragmented by collision-induced breakage of peptide bonds. The fragments are then separated in a second mass spectrometer (mass spectrometer 2). Since the fragments differ by single amino acids, the amino acid sequence of the peptide can be deduced.
Mass spectrometry can also be used to analyze mixtures of proteins, not just single isolated species. In this approach, called “shotgun mass spectrometry,” a mixture of cell proteins is digested with a protease (e.g., trypsin), and the complex mixture of peptides is subjected to sequencing by tandem mass spectrometry. The sequences of individual peptides are then used for database searching to identify the proteins present in the starting mixture. Additional methods have been developed to compare the amounts of proteins in two different samples, allowing a quantitative analysis of protein levels in different types of cells or in cells that have been subjected to different treatments.
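The database-search step of shotgun mass spectrometry can be sketched in a few lines. The protein names and sequences below are hypothetical, and the cleavage rule is the simple one stated in the text (cut after every lysine or arginine); it ignores exceptions such as the reduced cleavage seen before proline, as well as missed cleavages.

```python
def tryptic_digest(protein):
    """Cleave a protein sequence after every lysine (K) or arginine (R)."""
    peptides, current = [], ""
    for residue in protein:
        current += residue
        if residue in "KR":
            peptides.append(current)
            current = ""
    if current:  # C-terminal peptide with no trailing K or R
        peptides.append(current)
    return peptides

# Hypothetical protein database (names and sequences are made up).
DATABASE = {
    "protein_1": "MASKLVERGGK",
    "protein_2": "MTTRAAPLKFW",
    "protein_3": "MGGKSSERLLA",
}

def identify_proteins(observed_peptides, database):
    """Map sequenced peptides back to the proteins that could yield them."""
    index = {}
    for name, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            index.setdefault(peptide, set()).add(name)
    hits = {}
    for peptide in observed_peptides:
        for name in sorted(index.get(peptide, ())):
            hits.setdefault(name, []).append(peptide)
    return hits

# Peptide sequences assumed to come from tandem mass spectrometry.
observed = ["LVER", "FW", "AAPLK"]
print(identify_proteins(observed, DATABASE))
```

Every protein in the database is digested in silico, and each observed peptide is looked up in the resulting index. Here the observed peptides point to protein_1 and protein_2 but give no evidence for protein_3.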
Although several problems with the sensitivity and accuracy of these methods remain to be solved, the analysis of complex mixtures of proteins by “shotgun” mass spectrometry provides a powerful approach to the systematic analysis of cell proteins.
Global analysis of protein localization
Proteomes of subcellular organelles can be characterized by mass spectrometry or immunofluorescence.
Understanding the function of eukaryotic cells requires not only the identification of the proteins expressed in a given cell type, but also characterization of the locations of those proteins within the cell. As reviewed in Chapter 1, eukaryotic cells contain a nucleus and a variety of subcellular organelles. Systematic analysis of the proteins present in these organelles is an important goal of proteomic approaches to cell biology. The protein composition of a variety of organelles has been determined by combining classical cell biology methods with mass spectrometry. Organelles of interest are isolated from cells by subcellular fractionation techniques, as discussed in Chapter 1 (see Figures 1.40–1.42). The proteins present in isolated organelles can then be determined by mass spectrometry. The proteomes of a variety of organelles and large subcellular structures, such as nucleoli, have been characterized by this approach. For example, more than 700 different proteins have been identified by mass spectrometry of isolated mitochondria and 1000–2000 different proteins in plasma membranes. By performing such studies with organelles isolated from cells of different tissues or grown under different conditions, it is also possible to determine the changes in protein composition that are associated with different cell types or physiological states. An alternative approach to determining the proteomes of subcellular organelles is large-scale analysis by immunofluorescence.
An extensive recent analysis has used almost 14,000 immunofluorescent antibodies to determine the subcellular locations of 12,003 human proteins. This analysis defined the proteomes of 30 subcellular structures and 13 organelles, examples of which are shown in Figure 5.10. Consistent with the results of mass spectrometry, this analysis identified approximately 1000 proteins in mitochondria and 1500 proteins in the plasma membrane. However, some smaller organelles, such as nucleoli, contained more proteins than previously recognized (more than 1000). Unexpectedly, more than half of the proteins analyzed were localized to more than one compartment, suggesting that they may have different functions in different locations.
Figure 5.10 Analysis of subcellular organelle proteomes. Examples of immunofluorescence images used to localize human proteins to subcellular organelles. The number of proteins localized to each organelle is indicated below the image. (From P. J. Thul et al., 2017. Science 356: 820.)
Protein interactions
Proteins almost never act alone within the cell. Instead, they generally function by interacting with other proteins in protein complexes and networks. Elucidating the interactions between proteins therefore provides important clues as to the function of novel proteins and helps in understanding the complex networks of protein interactions that govern cell behavior. Along with global studies of subcellular localization, the systematic analysis of protein complexes and interactions has therefore become an important goal of proteomics. One approach to the analysis of protein complexes is to isolate a protein from cells under gentle conditions, so that it remains associated with the proteins it normally interacts with inside the cell.
Typically, an antibody against a protein of interest would be used to isolate that protein from a cell extract by immunoprecipitation (Figure 5.11). A cell extract is incubated with an antibody, which binds to its antigenic target protein. In order to maintain associations between the target protein and the proteins with which it normally interacts inside the cell, extracts are prepared under gentle conditions and adjacent proteins are sometimes chemically crosslinked. The resulting antigen–antibody complexes are then isolated, and interacting proteins will be present together with the target protein in the immunoprecipitates. Such immunoprecipitated protein complexes can then be analyzed, for example by mass spectrometry, to identify not only the protein against which the antibody was directed, but also other proteins with which it was associated in the cell extract (Figure 5.12).
Coimmunoprecipitation can be used to identify protein complexes.
Figure 5.11 Immunoprecipitation. A mixture of cell proteins is incubated with an antibody bound to beads. The antibody forms complexes with the protein (green) against which it is directed (the antigen). These antigen–antibody complexes are collected on the beads and the target protein is isolated.
This approach to analysis of protein interactions has been used to characterize a variety of protein–protein interactions in different types of cells and under different physiological conditions, leading to the identification of numerous interactions between proteins involved in processes such as cell signaling or gene expression. Alternative approaches to systematic analyses of protein complexes include screens for protein interactions in vitro as well as genetic screens that detect interactions between pairs of proteins that are introduced into yeast cells. Expression of cloned genes in yeast is particularly useful because simple methods of yeast genetics can be employed to identify proteins that interact with one another. In this type of analysis, called the yeast two-hybrid system, two different cDNAs (for example, from human cells) are joined to two distinct domains of a protein that stimulates expression of a target gene in yeast (Figure 5.13). Yeast are then transformed with the hybrid cDNA clones to test for interactions between the two proteins. If the human proteins do interact with each other, they will bring the two domains of the yeast protein together, resulting in stimulation of target gene expression in the transformed yeast. Expression of the target gene can be easily detected by growth of the yeast in a specific medium or by production of an enzyme that produces a blue yeast colony, so the yeast two-hybrid system provides a straightforward method to test protein–protein interactions. High-throughput analysis by both mass spectrometry and the yeast two-hybrid system has been applied to proteome-scale studies of the interactions between proteins of higher eukaryotes, including Drosophila, C. elegans, and humans. These screens have identified thousands of protein–protein interactions, which can be presented as maps that depict an extensive network of interacting proteins within the cell (Figure 5.14). In human cells, interaction maps have been obtained for about 25% of protein-coding genes. Continuing elucidation of these protein interaction networks will be a major step forward in our understanding of the complexities of cell regulation, as well as illuminating the functions of many so-far-unidentified proteins.
The yeast two-hybrid system is a genetic screen for protein interactions.
Figure 5.12 Analysis of protein complexes. A known protein (blue) is isolated from cells as a complex with other interacting proteins (orange and red). The entire complex can be analyzed by mass spectrometry to identify the interacting proteins.
Figure 5.13 The yeast two-hybrid system. cDNAs of two human proteins are cloned as fusions with two domains (designated 1 and 2) of a yeast protein that stimulates transcription of a target gene. The two recombinant cDNAs are introduced into a yeast cell. If the two human proteins interact with each other, they bring the two domains of the yeast protein together. Domain 1 binds DNA sequences at a site upstream of the target gene, and domain 2 stimulates target gene transcription. The interaction between the two human proteins can thus be detected by expression of the target gene in transformed yeast.
Figure 5.14 A protein interaction map of Drosophila. Interactions among 2346 proteins are depicted, with each protein represented as a circle placed according to its subcellular localization (membrane and extracellular proteins, cytoplasmic proteins, and nuclear proteins). (From L. Giot et al., 2003. Science 302: 1727.)
5.2 Review
Characterization of the complete protein complement of cells is a major goal of proteomics.
Mass spectrometry provides a powerful tool for protein identification, which can be used to identify either isolated proteins or proteins present in mixtures. The protein compositions of subcellular organelles can be analyzed by mass spectrometry or large-scale immunofluorescence. The purification of protein complexes from cells and analysis of interactions of proteins introduced into yeast can identify interacting proteins and may lead to elucidation of the complex networks of protein interactions that regulate cell behavior.
Questions
1. You have obtained the mass spectrum of an unknown protein. How can you determine the protein’s complete amino acid sequence?
2. How could you compare the protein composition of mitochondria from cancer cells and normal cells?
3. You are interested in a protein called Mox, which is found in the nucleus of cancer cells but the cytoplasm of normal cells by immunofluorescence. Mass spectrometry indicates that one peptide of Mox differs by a molecular mass of 78 in cancer versus normal cells, with the higher-mass peptide found in the nucleus. How do you interpret these results?
4. You are studying a protein called Nef, which is phosphorylated in nerve cells but not in other cell types. You hypothesize that phosphorylation affects the participation of Nef in protein–protein interactions. How can you compare the proteins that interact with phosphorylated versus nonphosphorylated Nef? Would the yeast two-hybrid system be useful in this analysis?
5. If you assume that each human protein interacts with one other protein and that each gene encodes only a single protein (both of which are minimal assumptions), how many protein–protein interactions could occur in the human proteome?
5.3 Systems Biology
Learning Objectives
You should be able to:
• Contrast the approaches of traditional biological experimentation and systems biology.
• Summarize the methods used for large-scale screens of gene function.
• Explain the approaches used to identify gene regulatory sequences.
• Illustrate the types of interactions between pathways in a regulatory network.
• Define synthetic biology.
Figure 5.15 Systems biology. Traditional biological experiments study individual molecules and pathways. Systems biology uses global experimental data for quantitative modeling of integrated systems and processes. (Figure labels: traditional biology, single gene and protein experiments, understanding individual molecules and pathways; systems biology, genome- and proteome-wide experiments, understanding integrated cell processes.)
The genome sequencing projects have led to a fundamental change in the way in which many problems in biology are being approached, with large-scale experimental approaches that generate vast amounts of data now in common use. Handling the enormous amounts of data generated by whole genome sequencing required sophisticated computational analysis and spawned the new field of bioinformatics. This field lies at the interface between biology and computer science and is focused on developing the computational methods needed to analyze and extract useful biological information from the sequence of billions of bases of DNA. Other types of large-scale biological experimentation, including global analysis of gene expression and proteomics, similarly yield vast amounts of data, far beyond the scope of traditional biological experimentation. These large-scale experimental approaches form the basis of the new field of systems biology, which seeks a quantitative understanding of the integrated dynamic behavior of complex biological systems and processes. In contrast to traditional approaches, systems biology is characterized by the use of large-scale datasets for quantitative experimental analysis and modeling (Figure 5.15). Some of the research areas that are amenable to large-scale experimentation, bioinformatics, and systems biology are discussed below.
Systematic screens of gene function
Gene functions can be systematically studied by gene knockouts or RNA interference.
The identification of all of the genes in an organism opens the possibility for a large-scale systematic analysis of gene function. One approach is to systematically inactivate (or knock out) each gene in the genome by homologous recombination with an inactive mutant allele (see Figure 4.31). Complete collections of strains with mutations in all known genes are available for several model organisms, including E. coli, yeast, Drosophila, C. elegans, and Arabidopsis thaliana. These collections of mutant strains can be analyzed to determine which genes are involved in any biological property of interest. A large-scale international project to systematically knock out all genes in the mouse is also under way, and targeted mutagenesis has now indicated functions of more than 7000 mouse genes. More recently, genome-wide screens using the CRISPR/Cas system (see Figure 4.33) have similarly been applied to systematically identify sets of genes in human cells that are responsible for properties such as survival or resistance to anticancer drugs. Studies of this sort have recently identified a set of approximately 2000 human genes that are essential for cell survival, about 10% of human protein-coding genes. Alternatively, large-scale screens based on RNA interference (RNAi) are being used to systematically dissect gene function in a variety of organisms, including Drosophila, C. elegans, and mammalian cells in culture. In RNAi screens, double-stranded RNAs are used to induce degradation of the homologous mRNAs in cells (see Figure 4.35). With the availability of complete genome sequences, libraries of double-stranded RNAs can be designed and used in genome-wide screens to identify all of the genes involved in any biological process that can be assayed in a high-throughput manner. For example, genome-wide RNAi analysis can be used to identify genes required for the growth and viability of cells in culture (Figure 5.16). Individual double-stranded RNAs from the genome-wide library are tested in microwells in a high-throughput format to identify those that interfere with the growth of cultured cells, thereby characterizing the entire set of genes that are required for cell growth or survival under particular sets of conditions. Similar RNAi screens have been used to identify genes involved in a variety of biological processes, including cell signaling pathways, protein degradation, and transmission at synapses in the nervous system.
Figure 5.16 Genome-wide RNAi screen for cell growth and viability. Each microwell contains siRNA corresponding to an individual gene. Tissue culture cells are added to each well and incubated to allow cell growth. Those wells in which cells fail to grow identify genes required for cell growth or viability.
Regulation of gene expression
Regulatory elements that control gene expression are usually short sequences of DNA.
Genome sequences can, in principle, reveal not only the protein-coding sequences of genes, but also the regulatory elements that control gene expression. As discussed in subsequent chapters, regulation of gene expression is critical to many aspects of cell function, including the development of complex multicellular organisms. Understanding the mechanisms that control gene expression, including transcription and alternative splicing, is therefore a central undertaking in contemporary cell and molecular biology, and the availability of genome sequences contributes substantially to this task. Unfortunately, it is far more difficult to identify gene regulatory sequences than it is to identify protein-coding sequences.
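One reason for this difficulty is that a short sequence is expected to turn up many times in a large genome purely by chance. A rough estimate can be sketched as follows. The assumptions are illustrative only: an element of about 10 base pairs, a genome of about 3 billion base pairs, and random DNA in which the four bases are equally frequent and independent, which real genomes are not.

```python
def expected_chance_matches(motif_length, genome_length):
    """Expected exact matches to one specific sequence in random DNA.

    Assumes four equally frequent, independent bases, so any given
    position matches with probability (1/4) ** motif_length.
    """
    positions = genome_length - motif_length + 1
    return positions / 4 ** motif_length

GENOME_LENGTH = 3_000_000_000  # assumed approximate genome size in base pairs

for length in (6, 10, 16):
    matches = expected_chance_matches(length, GENOME_LENGTH)
    print(f"{length}-bp sequence: about {matches:.1f} chance matches expected")
```

Under these assumptions a 10-base-pair sequence is expected to occur thousands of times on one strand alone, whereas a 16-base-pair sequence is expected less than once, which is why short elements cannot be picked out by sequence matching alone.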
Most regulatory elements are short sequences of DNA, typically spanning only about 10 base pairs. Consequently, sequences resembling regulatory elements occur frequently by chance in genomic DNA, so physiologically significant elements cannot be identified from DNA sequence alone. The identification of functional regulatory elements and elucidation of the signaling networks that control