18 Microbial Genomics

“Synthetic Life”: Oxymoron or the Future?

“To live, to err, to fall, to triumph, to recreate life out of life!” When James Joyce wrote these words in 1916 in A Portrait of the Artist as a Young Man, he never imagined that they would be encrypted in the genetic code of a synthetic genome in a novel bacterium. But 94 years later, Daniel Gibson and his co-workers at the J. Craig Venter Institute (JCVI) did just that. Building on 15 years of investigation, the JCVI researchers used a computer to design a genome and a DNA synthesizer to make short stretches of DNA. These were stitched together to create a 1.08 million base pair genome based on the chromosome of Mycoplasma mycoides (see Techniques & Applications 17.2). The genome was transplanted into M. capricolum, which after a few rounds of replication consisted entirely of molecules whose synthesis was directed by a chromosome that started as computer code and four bottles of deoxyribonucleotides. The new microbe is known as M. mycoides JCVI-Syn 1.0, or just Syn 1.0.

Why construct this microbe? Craig Venter views it as a $30 million proof of concept leading to the construction of genomes specifically designed for the development of vaccines, pharmaceuticals, clean water and food products, and biofuels. To reach this goal, Venter and his team took the 901 genes in the Syn 1.0 genome and, through trial and error, whittled the genome down to just 473 genes, thereby creating the smallest self-replicating organism, called Syn 3.0. The idea is that this leaves room to add the DNA sequence needed to engineer microbes capable of special tasks.

But is synthetic life ethical? That depends on whom you ask. Some environmental groups say synthetic genome research should stop until specific regulations are in place. Most scientists and bioethicists take a more measured view, envisioning two ways such a microbe might be released: bioterror and bioerror. Protection against bioterror relies on a host of security measures, many of which are already in place. Bioerror brings us back to James Joyce. The Syn 1.0 genome includes a “watermark”: sequences unique to this microbe so that if it were to escape the lab, its identity could be determined. The researchers developed a cipher to convert the genetic code so that when decoded, the watermark spells out the Joyce quote, as well as the words of the late physicist Richard Feynman, “What I cannot create, I do not understand.”

To comprehend how we have arrived in a world where a computer is the starting point for a new organism, we need to review the individual technologies required. This involves the integration of a number of genetic and biotechnology principles; be sure you succeed in the readiness check, because the world of genomics is a vast and amazing place well worth understanding.

Readiness Check:
Based on what you have learned previously, you should be able to:
✓ Identify the key structural elements of DNA, RNA, and protein (section 13.2)
✓ Discuss what is meant by the term genome
✓ Explain the process and utility of PCR (section 17.2)
✓ Diagram the fundamental principles of gel electrophoresis (Techniques & Applications 17.1)
✓ Summarize the purpose, construction, and screening of a genomic library (section 17.3)

18.1 DNA Sequencing Methods

After reading this section, you should be able to:
a. Explain how DNA is sequenced by the Sanger chain termination method
b. Contrast and compare the advantages and disadvantages of the Sanger method with next-generation sequencing

The genomic era is a 21st-century phenomenon. However, one of the most crucial steps leading to the genomic revolution was figuring out how to sequence DNA. This was accomplished in 1977 by Alan Maxam and Walter Gilbert (collaborating on one technique) and by Frederick Sanger. Sanger’s method became the most commonly used and is discussed next.

Sanger DNA Sequencing

The Sanger method involves the synthesis of a new strand of DNA using the DNA to be sequenced as a template. The reaction begins when single strands of template DNA are mixed with an oligonucleotide primer (DNA about 12 to 20 nucleotides long that is complementary to the region to be sequenced; it primes the initiation of new strand synthesis), DNA polymerase, the four deoxynucleoside triphosphates (dNTPs), and dideoxynucleoside triphosphates (ddNTPs). ddNTPs differ from dNTPs in that the 3′ carbon lacks a hydroxyl group (figure 18.1). In such a reaction mixture, DNA synthesis will continue until a ddNTP, rather than a dNTP, is added to the growing chain. Without a 3′-OH group to attack the 5′-PO4 of the next dNTP to be incorporated, synthesis stops (see figure 13.10). Indeed, Sanger’s technique is frequently referred to as a chain-termination DNA sequencing method.

Figure 18.1 Dideoxyadenosine Triphosphate (ddATP). Note the lack of a hydroxyl group on the 3′ carbon, which prevents further chain elongation by DNA polymerase.
MICRO INQUIRY What is the function of the 3′-OH during DNA synthesis?

To obtain sequence information, four separate synthesis reactions must be prepared, one for each ddNTP (figure 18.2). The ddNTP is mixed with all four normal dNTPs, and as DNA synthesis proceeds, only sometimes will the ddNTP be incorporated into the growing DNA strand, rather than its dNTP cousin. This results in a collection of DNA fragments of varying lengths, each ending in the same ddNTP. For example, a reaction prepared with ddATP + dATP, dTTP, dGTP, and dCTP produces fragments ending with an A, those with ddTTP produce fragments with T termini, and so forth. After DNA synthesis is completed, the DNA is made single stranded, usually by heating. DNA is usually prepared for automated sequencing (figure 18.3). Here each ddNTP is labeled with a different colored fluorescent dye. The resulting fragments are then separated by electrophoresis. Recall that each fragment’s migration rate is inversely proportional to the log of its molecular weight. Simply put, the smaller a fragment is, the faster it moves through the gel. Because synthesis always adds a nucleotide to the 3′-OH of the growing strand, the ddNTP at the end of the shortest fragment is assigned to the 5′ end of the DNA sequence, while the largest fragment is the 3′ end. In this way, the DNA sequence can be read directly from the gel from the smallest to the largest fragment.
See also: Gel electrophoresis (Techniques & Applications 17.1)

Figure 18.2 The Sanger Method of DNA Sequencing.
1. Isolated unknown DNA fragment (the original DNA to be sequenced).
2. DNA is denatured to produce a single template strand.
3. Labeled specific primer molecule hybridizes to the DNA strand.
4. DNA polymerase and deoxynucleotides (dATP, dCTP, dGTP, and dTTP) are added to all four tubes. To each of these tubes, a single dideoxynucleotide (ddATP, ddCTP, ddGTP, or ddTTP) is added so that each tube is a mixture of all dNTPs and a single ddNTP. Each ddNTP is labeled with a tracer so it can be visualized later.
5. Newly replicated strands are terminated at the point of addition of a dideoxynucleotide.
6. Schematic view of how all possible positions on the fragment are occupied by a labeled nucleotide.
7. Running the reaction tubes in four separate gel lanes separates them by size and nucleotide type. Reading from bottom to top, one base at a time, provides the correct DNA sequence.
Steps 1–6 are used for both manual and automated sequencing. Step 7 shows preparation of a gel for manual sequencing in which radiolabeled ddNTPs are used. Although manual sequencing is rarely used, it is instructive in understanding how DNA sequence is determined.

Automated sequencing uses a laser beam to detect DNA fragments as they exit the bottom of the gel from smallest to largest. This is possible because each ddNTP is labeled with a different fluorescent color. This is recorded on a graph called a chromatogram in which the amplitude of each spike represents the fluorescent intensity of each particular fragment (figure 18.3b). The corresponding DNA sequence is listed above the chromatogram. Typically, an automated Sanger chain termination sequencing system can accurately read 500 to 800 bp in a single electrophoresis run.

Figure 18.3 Automated Sanger DNA Sequencing. (a) Part of an automated DNA sequencing run. Here the ddNTPs are labeled with fluorescent dyes. (b) Data generated during an automated DNA sequencing run. Bases 538 to 580 are shown.

Next-Generation DNA Sequencing

Sanger’s chain termination method was used to complete the first human genome sequence in 2001. This cost about $300 million and took about a decade to finish. Although this was an amazing feat at the time, it illustrates two limitations of Sanger sequencing: It is expensive and time consuming. Because scientists want to sequence genomes faster and more cheaply, innovative, newer DNA sequencing techniques have been invented.

This desire stimulated the development of next-generation sequencing (NGS) technologies. The setup is quite different from the Sanger approach; rather than long genomic strands in solution, NGS uses short, sheared pieces of DNA templates with oligonucleotides called adapters attached to each end (figure 18.4, step 1). The adapter at one end of each DNA fragment attaches it to a solid substrate. The adapter at the other end also attaches the DNA fragment to the substrate, but by annealing it to a primer used to initiate the polymerase chain reaction (PCR; figure 18.4, steps 2 and 3). PCR results in the production of many copies of the same fragment, which are sequenced simultaneously. Because thousands of identical DNA fragments are sequenced at the same time, these methods are sometimes called massively parallel sequencing techniques. In addition to making genomic sequencing faster and cheaper, as we discuss later in this chapter, NGS avoids the need to insert (i.e., clone) individual DNA fragments into vectors. This is important because it is almost impossible to clone every DNA fragment into any given genomic library.
See also: Polymerase chain reaction amplifies targeted DNA (section 17.2); Genomic libraries: cloning genomes in pieces (section 17.3)

Although several NGS techniques have been marketed, reversible chain termination sequencing is currently the most frequently used. Here, the DNA is sequenced as each nucleotide is incorporated. This is called sequencing by synthesis. It takes place in a flow cell, which is a glass slide into which grooves have been cut. Flow cells allow reagents to be added, flushed, and new reagents added without the loss of the attached DNA. PCR amplification of DNA templates within the flow cell is called “bridge amplification,” and it creates clusters of double-stranded fragments of DNA scattered over the surface of the slide (figure 18.4, step 4). As we will see, it is critical that fragments with identical nucleotide sequences are clustered together in the flow cell. In the next step, the double-stranded fragments are denatured, and the flow cell is flushed. This leaves bundles of single-stranded linear fragments ready for sequencing (figure 18.4, steps 5 and 6).

Similar to Sanger sequencing, sequencing by synthesis uses a modified fluorescent nucleotide that, when introduced into the growing strand, stops the reaction because the 3′-OH is blocked (figure 18.5). However, rather than using a dideoxynucleotide as in Sanger sequencing (figure 18.1), here a small chemical group is added that can be enzymatically removed. Another big difference is that the modified nucleotide does not fluoresce until it is incorporated into the growing DNA strand. Because sequencing proceeds in a flow cell, reagents needed for DNA synthesis can be introduced to catalyze the incorporation of a modified dNTP, which is then detected by the onset of fluorescence. These synthesis reagents are then flushed from the cell, and the enzyme cocktail that removes the fluorescent tag and blocking group is infused. The cycle can then repeat (figure 18.4, steps 7 through 9).

Figure 18.4 Reversible Chain Termination Sequencing.
1. Genomic DNA is fragmented and adapters are attached to both ends of each fragment.
2. Single-stranded fragments are bound to the flow cell surface (a dense lawn of primers).
3. PCR reagents, including unlabeled nucleotides, are added to begin bridge PCR amplification of each fragment.
4. Bridge PCR amplification generates double-stranded fragments bound to the flow cell surface at both ends.
5. Fragments are denatured to become single stranded.
6. Each fragment serves as PCR template to generate millions of clusters of identical fragments.
7. The first sequencing cycle begins when reagents, including primers and labeled nucleotides with reversible terminators, are added.
8. Laser excitation is followed by image capture of the fluorescent signal from each cluster of fragments. This identifies the first base, which is recorded.
9. Enzymatic cleavage of the reversible terminator enables the second sequencing cycle to begin.
10. Laser excitation generates the signal that identifies the second base in each cluster of fragments.
11. Cleavage of terminator and sequencing cycles are repeated to determine the base sequence of each cluster of fragments.
12. Sequences are aligned and compared with a reference database, so that known variants are identified and unknown variants are called. SNP, single nucleotide polymorphism.
MICRO INQUIRY Why is it important that identical fragments of DNA to be sequenced are clustered together?

Figure 18.5 Modified Base Used in Reversible Chain Termination Sequencing. This example shows modified dATP, with a 3′-OH blocking group and a fluorescent tag attached to the adenine. The base only fluoresces when the PPi is removed during incorporation into the growing strand. The 3′ blocking group is then enzymatically removed, exposing the 3′-OH needed for continued DNA synthesis.

In summary, once the templates have been amplified in clusters, the incorporation of each nucleotide, determined by laser optics, involves the following events: (1) The DNA polymerase, all four modified dNTPs, and other reagents needed for DNA synthesis, using the tethered fragments as template, are added to the flow cell. (2) The incorporation of a modified nucleotide as determined by the template strand occurs, causing its fluorescent tag to emit light while at the same time stalling synthesis. This pause allows the identity of the incorporated base to be determined by the color of the fluorescent tag. (3) Reagents needed for synthesis are then flushed from the flow cell, and (4) cleavage enzyme reagents are introduced. (5) The fluorescent label and the blocking molecule are removed, exposing the 3′-OH. Finally, (6) the enzyme cocktail is flushed, synthesis reagents are added once again, and the cycle repeats.

Because the light emitted by any single nucleotide is too dim to accurately record, each cluster of identical fragments must grow synchronously so that a large enough signal is generated. The incorporation of the same nucleotide at the same time into each identical fragment in a cluster generates a signal with sufficient amplitude for detection (figure 18.4, steps 8 and 10). This explains why the length of each read (the number of nucleotides determined in a single run) is much shorter than it is for Sanger sequencing: After about 150 to 300 bases have been incorporated, the synchronicity of nucleotide incorporation deteriorates. That is, the synthesis of complementary strands in a cluster of identical templates becomes “out of sync.” Without a strong synchronous signal, the data become ambiguous. Nonetheless, short read length is compensated by the volume of data produced: At least 1.5 Gb (billion bases) are read per flow cell in about a day, with newer systems reading up to 120 Gb in about 2 days. Compare this with the output of automated Sanger sequencing, about 1 million bases per day. With the average bacterial genome measuring about 4 million base pairs, one can begin to appreciate the enormous power of NGS.

In addition to the method described here, various new approaches are being developed. For example, newer sequencing platforms seek to streamline sample preparation, improve signal detection, optimize surfaces to more rapidly attain high reagent concentrations while preventing reagents from sticking, and attain longer read lengths. Although there is room for improvement, next-generation DNA sequencing technologies have already revolutionized genomics by drastically reducing time and cost, and increasing accuracy.

18.2 Genome Sequencing

After reading this section, you should be able to:
a. List the steps used in whole-genome shotgun cloning
b. Compare and contrast genome assembly using Sanger and next-generation sequencing (NGS) approaches
c. Describe the multiple strand displacement method and how this technique is used

In 1995 J. Craig Venter, Hamilton Smith, and their collaborators were the first to sequence a bacterial genome. Prior to that, only the small genomes of viruses had been sequenced. To reduce bacterial genomes to manageable sizes, they developed whole-genome shotgun sequencing and the computer software needed to assemble sequence data into a complete genome. They used their new method to sequence the genomes of the bacteria Haemophilus influenzae and Mycoplasma genitalium. With this accomplishment, Venter and Smith ushered in the genomic era. Within 20 years the number of complete genomes published grew from two to thousands of sequenced genomes spanning all three domains of life.

Whole-Genome Shotgun Sequencing

Although whole-genome shotgun sequencing was revolutionary when it was introduced, sample preparation is now considered one of its biggest disadvantages. This is because it requires cloning all the DNA to be sequenced as genomic fragments inserted into cloning vectors (i.e., the construction of a genomic library; figure 18.6). The entire sequencing process can be broken into four stages starting with library construction, which is discussed in section 17.3. Here we briefly describe the subsequent stages: random sequencing, fragment alignment and gap closure, and editing.

1. Random sequencing. The vectors carrying the cloned DNA are purified and thousands of DNA fragments are sequenced using Sanger sequencing, employing primers that recognize the plasmid DNA sequences adjacent to the cloned, chromosomal insert. Usually all stretches of the genome are sequenced between 8 and 10 times to increase the accuracy of the final results.
2. Fragment alignment and gap closure. Using computer analysis, the DNA sequence information of each fragment is assembled into longer stretches of sequence. Two fragments are joined together to form a larger stretch of DNA if the sequences at their ends overlap and match. This comparison process results in a set of larger, contiguous nucleotide sequences called contigs. Sometimes an overlapping sequence is missing, generating gaps between contigs. There are several strategies to obtain the missing sequences. Ultimately, however, the contigs are aligned in the proper order to form the complete genome sequence. The term scaffold is used to describe sequence data with gaps that persist between contigs.
3. Editing. The sequence is then carefully proofread to resolve any ambiguities or frameshifts in the sequence. Proofreading is accomplished by ensuring that all reads of the same sequence are identical and the sequences of the two DNA strands are complementary.

Figure 18.6 Whole-Genome Shotgun Sequencing. Multiple copies of the microbial genome (including plasmids) are extracted from the microorganism of interest.
1. Digest genome with restriction enzymes, yielding millions of genome fragments.
2. Ligate genomic fragments into plasmid or cosmid vectors, yielding a library of vectors, each with a different genomic insert.
3. Perform Sanger sequencing on each genomic insert.
4. Construct scaffolds by aligning contigs with overlapping ends; fill gaps.
Key: genomic fragment; sequenced region of fragment (“read”); region of fragment not yet sequenced, whose length is deduced.
MICRO INQUIRY Which step (or steps) in this process is (are) not used in next-generation sequencing? Which are the same?

Next-Generation Genomic Sequencing

Sanger sequencing ushered in the genomic era. However, the advent of NGS techniques has made genomic sequencing, particularly of microorganisms, much more practical in terms of time, money, and improved outcome. Like Sanger sequencing, preparing genomic DNA for sequencing by synthesis involves shearing the DNA into smaller pieces. The important difference is that, rather than inserting genomic fragments into cloning vectors as is the case with Sanger sequencing, adapters are added to the genomic DNA fragments, which are then attached to a solid substrate (figure 18.4). This is vastly more efficient; typically close to 100% of the genomic fragments with adapters bind to the solid substrate (e.g., flow cell surface), whereas cloning more than 80% of the genomic fragments into a genomic library requires a lot of skill and luck.

The advent of NGS has had a significant impact on the quality of the final genomic sequence, as assessed by two specific factors: depth of coverage and breadth of coverage. Coverage refers to the average number of times each nucleotide is sequenced in a genome (or other sequencing project). This redundancy of coverage is also referred to as the depth of sequencing, so “depth” and “coverage” can be used interchangeably. Depending on the technique used, a single nucleotide might be read (sequenced) 18 times or 80 times. The latter is described as deep sequencing; each nucleotide has been sequenced a very high average number of times. Breadth of coverage refers to how much of the entire genome was sequenced. A breadth of coverage of 100% means the genome sequence is complete, without gaps between contigs. Ideally one wants deep sequencing with 100% breadth of coverage. Because NGS does not involve the construction of a genomic library, it almost always yields higher breadth of coverage. The depth of coverage is also quite different. When Sanger genomic sequencing is used, any given region of the genome is typically sequenced no more than ten times. By contrast, NGS results in the same genomic fragment being read 30 to 100 times, greatly increasing accuracy. This is important because the DNA polymerase used in sequencing reactions lacks proofreading capability, so it cannot correct mismatched bases. Deep NGS overcomes the limitations of the DNA polymerase reaction by re-reading the same stretch of genome many times. A base at a given position that is not in agreement with the majority of the reads is recognized as a mistake or a rare variant worthy of further study.

One feature that is not terribly different between the two types of approaches is the assembly of a complete genome. Recall that NGS generates much shorter reads (about 150 bases rather than 500 to 800 bases for Sanger); nonetheless, the reads are aligned by overlapping ends (figure 18.6). A great deal of computing power is required to align and assemble the thousands of short, overlapping sequences.

Single-Cell Genomic Sequencing

It is now possible to amplify the few femtograms (10⁻¹⁵ gram) of DNA present in a single microbial cell to the several micrograms (10⁻⁶ gram) needed for sequencing. This is an important breakthrough because the vast majority of microbes cannot be grown axenically (i.e., in pure culture). The capacity to sequence the DNA from a single cell extracted from its natural environment is prompting new research strategies in microbial genetics, ecology, and infectious disease.

The process of single-cell genomic sequencing requires DNA amplification, but rather than PCR, a method called multiple displacement amplification (MDA) is used (figure 18.7). Unlike PCR DNA amplification, MDA occurs at a single temperature and uses the DNA polymerase from bacteriophage phi29 to synthesize new strands of DNA from the genome template. This polymerase is used because it does not readily dissociate from (“fall off”) the template strand; the importance of this feature will be made clear shortly. In addition, phi29 polymerase rarely incorporates the wrong base; that is to say, it has higher fidelity than most thermostable DNA polymerases used in PCR. Another difference between PCR and MDA is the use of a collection of primers with random sequences, six bases in length (hexamers), to initiate synthesis. The primers hydrogen bond to complementary sequences scattered throughout the genome. As DNA synthesis proceeds from each primer, the growing 3′ end of one newly made strand will eventually bump into and then displace the 5′ end of another newly growing strand. Recall that the phi29 polymerase does not easily dissociate once bound to the DNA, so both new strands will continue to grow. In this way, many new strands are rapidly synthesized. The new strands have an average length of about 12,000 bases (12 kb) but can be as long as 100 kb. This makes them suitable for DNA sequencing. NGS is then performed to avoid the requirement of genomic library construction and to ensure higher breadth and depth of coverage.

Figure 18.7 Single-Cell Genomic Sequencing. (a) Many copies of DNA extracted from a cell are generated by multiple displacement amplification. Random hexamer primers (red) bind to complementary template sequences, and DNA polymerase from bacteriophage phi29 is used to catalyze synthesis in the 5′ to 3′ direction (purple arrowheads). When the end of a newly synthesized strand meets double-stranded DNA, one strand is displaced by the growing DNA. (b) Single-cell genomic sequencing can be used to discover uncultivated bacteria and archaea from natural samples. The workflow runs from environmental samples through isolation of single cells, whole genome amplification, and genome sequencing and assembly to draft genomes.

Microbiologists have used single-cell genomics to sequence the genomes of hundreds of uncultivated microbes representing multiple undefined bacterial and archaeal taxa sampled from many different environments (figure 18.7b). These microbes constitute what the researchers call “microbial dark matter,” because there was no prior knowledge of their existence. Although the average breadth of coverage is typically only about 40%, reflecting a major limitation of single-cell genomics, a variety of discoveries have been made, including at least 20,000 new hypothetical protein families, evidence for new superphyla, and the surprising presence of genes encoding sigma factors in archaea.

Comprehension Check
1. Why is the Sanger technique of DNA sequencing also called the chain-termination method?
2. Explain the difference between a dideoxynucleotide used in Sanger sequencing and the modified bases used in reversible chain termination sequencing.
3. Why does reversible chain termination sequencing yield short reads?
4. How would one recognize a gap in the genome sequence following nucleotide sequencing?
5. Suggest a medical and an ecological application of single-cell genomic sequencing.

18.3 Metagenomics Provides Access to Uncultured Microbes

After reading this section, you should be able to:
a. Differentiate between the construction and screening of a genomic library and a metagenomic library
b. List two applications of metagenomics in any field of microbiology

DNA sequencing is a powerful tool, but its impact would be diminished if it were applied only to the small minority of microbes currently held in cultures. Fortunately this is not the case. Metagenomics, the study of microbial genomes based on DNA extracted directly from the environment, has emerged as a key technology across a number of biological disciplines, including ecology, environmental microbiology, infectious disease, and immunology. Metagenomics samples the entire pool of nucleic acids found in any given ecosystem (e.g., soil, water, feces), and it is most frequently used to determine the members of the microbial community living there. Before the development of NGS, DNA extracted from the environment was typically used as a template to amplify small subunit rRNA genes or other target genes using PCR. This approach continues to be used, but many microbiologists now use shotgun metagenomics instead. The term shotgun is used because it involves sequencing all the DNA extracted from an environment, rather than using PCR primers to target a specific gene. In addition to taxonomic information, shotgun metagenomics also catalogs most of the genes present in the sample, providing important clues about microbial activity (figure 18.8).
See also: Microbial taxonomy and phylogeny are largely based on molecular characterization (section 19.3)

To illustrate the impact of coupling direct DNA extraction with NGS, consider a microbial census based only on the microorganisms that can be cultured from an environmental sample (e.g., soil). Based on the average recovery of microbes in culture, this approach will miss about 98% of the microbial species present. Such an approach would be like taking a census of a big city by randomly sampling a few individuals. Next best would be to base a microbial census on DNA extracted from soil and cloned into bacterial vectors that are then sequenced by Sanger chain termination, as this would certainly improve representation, but by an unknown amount. Keeping with our city census analogy, this would be like taking a census by sending out a questionnaire to each household, hoping everyone would respond, but understanding that only an unknown fraction of residents will. Metagenomic analysis using next-generation sequencing, on the other hand, is the equivalent of sending a team of census takers into the community to ensure that each person is counted: The team probably won’t reach every single resident, but depth and breadth of coverage will be very high. Similarly, without the need to construct a genomic library and by sequencing deeply, metagenomic analysis coupled with NGS can detect organisms present in low numbers.

Once nucleotide sequences are obtained from DNA extracted directly from the environment, partial or full genomes can be detected in two ways. First, they can be assembled as previously discussed, by aligning overlapping sequences at the ends of reads (figure 18.6). However, this approach can be difficult with shotgun metagenomics, because a collection of reads 100 to 300 bp in length is generated; thus the number of gapped sequences tends to be large. Instead, one can attempt to align these short reads to previously sequenced genomes in an effort to identify similar sequences from known microorganisms. This too has a major drawback: many taxa obtained by metagenomics are uncultured and are therefore unrepresented in existing genomic databases; they too are microbial dark matter. This problem prompted the establishment of the Genomic Encyclopedia of Bacteria and Archaea Project. The goal of this program is to improve reference databases by sequencing the genomes of a wide diversity of cultured microorganisms. Metagenomic sampling of natural environments has so far led to the discovery of hundreds of new bacterial and archaeal phyla.

Figure 18.8 Metagenomic Analysis. DNA has been extracted directly from environments such as (a) bacterial mats at Yellowstone National Park, (b) the human colon, (c) cabbage white butterfly larvae, and (d) tube worms from hydrothermal vents. The DNA can be used to construct a metagenomic library and NGS is used to obtain nucleotide sequence information. The workflow runs from genomic DNA extraction through DNA fragmentation (physical/enzymatic) and end conversion and adapter addition to NGS.
(a) ©Yi Xiang Yeng/iStock/Getty Images; (b) ©McGraw-Hill Education; (c) ©Nigel Cattlin/Science Source; (d) Source: OAR/National Undersea Research Program (NURP)/College of William & Mary/NOAA

Comprehension Check
1. NGS techniques are considered well suited for metagenomics because a genomic library does not have to be constructed. Apart from convenience, explain why this is important for metagenomic analysis.
2. Examine figure 18.8. How might metagenomics be used to isolate genes encoding a potentially new peptide antibiotic?

18.4 Bioinformatics: What Does the Sequence Mean?

After reading this section, you should be able to:
a. Explain how a potential protein-coding gene is recognized within a genome sequence
b. Compare the meaning of the terms orthologue and paralogue
c. Differentiate between a conserved hypothetical protein and a putative protein of unknown function
d. Describe the construction of a physical genome map

Bioinformaticists convert raw nucleotide data into the location and potential function of genes or presumed genes on sequenced genomes using a complex process called genome annotation. Once genes have been identified, bioinformaticists can perform computer or in silico analysis to further examine the genome.

Obviously, obtaining nucleotide sequences without understanding the location and function of individual genes would be a pointless exercise. The goal of genome annotation is to identify every potential (putative) protein-coding gene as well as each rRNA- and tRNA-coding gene. A protein-coding gene is usually recognized as an open reading frame (ORF); to find all ORFs, both strands of DNA must be analyzed in all three reading frames (figure 18.9). A bacterial or archaeal ORF is generally defined as a sequence of at least 100 codons (300 base pairs) that is not interrupted by a stop codon and has terminator sequences at the 3′ end. The 5′ end of the gene should also bear a ribosome-binding site. Only if these elements are present is an ORF considered a putative protein-coding gene. This process is performed by gene prediction programs designed to find genes that encode proteins or functional RNA products. Ideally, computer-identified genes are then manually inspected by bioinformaticists to verify the computer-generated gene assignments. This process is called genome curation.

ORFs that appear to encode proteins are called coding sequences (CDS).
Bioinformaticists have developed algorithms As you might imagine, sequencing entire genomes generates a to compare the sequence of predicted CDS with those in large huge volume of information. The field of bioinformatics com- databases containing nucleotide and amino acid sequences of bines biology, mathematics, computer science, and statistics to known proteins. The base-by-base comparison of two or more wil11886_ch18_425-446.indd 434 23/10/18 10:24 am 18.4 Bioinformatics: What Does the Sequence Mean? 435 Reading direction for sequence of top DNA strand Considerable information can often be inferred from translated amino acid sequences of potential 3 N- ile leu phe arg val ile arg pro thr arg asn phe thr arg -C Reading 2 N- tyr phe ile ser ser asn ser thr leu asn ala lys leu his leu thr -C genes. Often a short pattern of amino acids, called a frames 1 N- leu phe tyr phe glu phe asp leu lys arg glu thr ser leu asn-C motif or domain, will represent a functional unit within a protein, such as the active site of an enzyme. 5′ TTATTTTATTTCGAGTAATTCGACCTTAAACGCGAAACTTCACTTAAC 3′ For instance, figure 18.10 shows the C-terminal do- DNA 3′ AATAAAATAAAGCTCATTAAGCTGGAATTTGCGCTTTGAAGTGAATTG 5′ main of the cell division protein MinD from a number of microbes. Because these amino acids are found in 3 C- lys ile glu leu leu glu val lys phe ala phe ser lys val -N Reading 2 C- ile lys asn arg thr ile arg gly val arg phe lys val arg -N such a wide range of organisms, they are considered frames 1 C- asn lys ser thr asn ser arg leu arg ser val glu ser leu ser-N phylogenetically well conserved. In this case, the conserved region is predicted to form a coil needed Reading direction for sequence of bottom DNA strand for proper localization of the protein to the membrane. Figure 18.9 Finding Potential Protein-Coding Genes. 
Annotation of genomic sequence Finding this high level of conservation (similarity) al- requires that both strands of DNA be translated from the 5′ to 3′ direction in each of three lows the genome curator to confidently assign a function possible reading frames. Stop codons are shown in green. to the domain. Cytokinesis (section 7.2) Genes from different organisms with such simi- lar ORFs are called orthologues. Sometimes there appear to be duplicated genes on the same genome. This is discovered when gene sequences is called alignment. Alignments can also be two or more genes have very similar nucleotide sequences. Such performed by comparing amino acid sequences between two genes are called paralogues. proteins. Scientists most often use BLAST (basic local align- As the number of sequenced genomes has expanded, so has the ment search tool) programs to perform this task. These pro- need to carefully define how new genes and proteins are named. grams compare the nucleotide (or amino acid) sequence of The use of a structured vocabulary is called ontology, and a stand- interest, called the query sequence, to all other sequences en- ard gene ontology (GO) has been adopted as the means by which tered in the database. The results (“hits”) are ranked in order of proteins, or motifs within proteins, are commonly named. This is decreasing similarity. An E-value is assigned to each align- based on the similarities of amino acid sequences among ortholo- ment; this value measures the possibility of obtaining the align- gous proteins. A GO term not only reflects protein function but also ment by chance; thus highly homologous sequences have very defines the cellular process in which the protein participates (e.g., low E-values. motility) and the cellular location of the protein (e.g., flagellum). 
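The six-frame search for open reading frames described earlier in this section can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how real gene prediction programs work: it looks for stop-free stretches of at least min_codons codons on both strands and does not check for a ribosome-binding site or terminator sequences. The names translate, find_orfs, and min_codons are hypothetical, chosen for this example; the standard genetic code is assumed, and the test sequence is the 48-bp top strand printed in figure 18.9.

```python
from itertools import product

# Standard genetic code, built in TCAG order (assumption: standard code applies).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_orfs(seq, min_codons=100):
    """List (strand, frame, peptide) for stop-free stretches >= min_codons.

    Simplification: stretches at the ends of the sequence are kept even
    though they are not bounded by a stop codon on both sides.
    """
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for stretch in translate(s[frame:]).split("*"):
                if len(stretch) >= min_codons:
                    hits.append((strand, frame + 1, stretch))
    return hits


# Top strand from figure 18.9, read 5' to 3'.
FIG_18_9_TOP_STRAND = "TTATTTTATTTCGAGTAATTCGACCTTAAACGCGAAACTTCACTTAAC"

print(translate(FIG_18_9_TOP_STRAND))  # LFYFE*FDLKRETSLN
for hit in find_orfs(FIG_18_9_TOP_STRAND, min_codons=10):
    print(hit)
```

With the default threshold of 100 codons, the 48-bp example yields no candidates, as expected for such a short fragment; lowering min_codons to 10 serves only to show the mechanics of scanning all six frames.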
Figure 18.10 Analysis of Conserved Regions of Phylogenetically Well-Conserved Proteins. C-terminal amino acid residues of MinD from 15 organisms and chloroplasts are aligned to show strong similarities. Amino acid residues identical to E. coli are boxed in yellow, and conservative substitutions (e.g., one hydrophobic residue for another) are boxed in orange. Dashes indicate the absence of amino acids in those positions; such gaps may be included to maintain an alignment. The number of the last residue shown relative to the entire amino acid sequence is shown at the extreme right of each line.
(Labels in the figure: consensus helical region; putative membrane targeting sequence.)

Gram-negative bacteria
  Escherichia coli                P F R F I E E E K K G F L - - - K R L F G G                270
  Yersinia pestis                 P F R F V E E E K K G F L - - - K R L F G G                270
  Vibrio cholerae                 E F R F L T E A K K G I F - - - K R L F G G                276
  Pseudomonas aeruginosa          P H R F L D V Q K K G F L - - - Q R L F G G R E            271
  Xylella fastidiosa              P M R F T T V E K K G F F - - - S K L F G G                269
Gram-positive bacteria
  Listeria monocytogenes          P L M S I E T K K A G F F A R L K Q L F S G K              266
  Clostridium acetobutylicum      P F E K Y E T Q - T G F I A A I K K I F S K                263
  Bacillus subtilis               P L Q V L E E Q N K G M M A K I K S F F G V R S            268
Hyperthermophilic bacteria
  Aquifex aeolicus                P L K R Y G - E K K G L L - - - S R L L G G                262
  Thermotoga maritima             L E N D F V T V S K G L I D T L K D F F S K L K R G        271
Archaea
  Methanocaldococcus jannaschii   E D E I K I I R K E S F I D K I K R L F R M Y              263
  Archaeoglobus fulgidus          P A E V K E K K K E G A L A K M L R I F R R R              263
  Pyrococcus furiosus             T P P E P E S P V K R I F - - - K A L F G G K R            264
Chloroplasts
  Mesostigma viride               Y L V N L E T G N K G L L K R V Q Q F L T G S E E N V      286
  Nephroselmis olivacea           P S P S D S A P S R G W F A A I R R L W S                  274

MICRO INQUIRY Which amino acids are most highly conserved?
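One simple way to put a number on the similarity visible in figure 18.10 is to count identical residues between two rows of the alignment. The sketch below does only that: it scores rows that are already aligned and does not search for an alignment the way BLAST does. The function names identity_counts and percent_identity are hypothetical, and the three rows are transcribed from the figure on the assumption that their columns line up as printed. Definitions of percent identity vary; here, columns that are gaps in both rows are skipped, and a gap opposite a residue counts as a mismatch.

```python
def identity_counts(row_a, row_b, gap="-"):
    """Return (identical, compared) for two rows of an existing alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    identical = compared = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # gap in both rows: nothing to compare in this column
        compared += 1
        if a == b:
            identical += 1
    return identical, compared


def percent_identity(row_a, row_b):
    """Percentage of compared columns that hold the same residue."""
    identical, compared = identity_counts(row_a, row_b)
    return 100.0 * identical / compared if compared else 0.0


# Rows from figure 18.10 (assumption: columns align as printed).
E_COLI = "PFRFIEEEKKGFL---KRLFGG"
Y_PESTIS = "PFRFVEEEKKGFL---KRLFGG"
V_CHOLERAE = "EFRFLTEAKKGIF---KRLFGG"

for name, row in (("Y. pestis", Y_PESTIS), ("V. cholerae", V_CHOLERAE)):
    same, total = identity_counts(E_COLI, row)
    pct = percent_identity(E_COLI, row)
    print(f"E. coli vs {name}: {same}/{total} identical ({pct:.1f}%)")
```

For these rows, the count is 18 of 19 compared positions for Y. pestis and 13 of 19 for V. cholerae, which matches the impression from the figure that the two enteric bacteria are nearly identical in this region.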
Proteins that do not align with known amino acid sequences fall into two classes: (1) Conserved hypothetical proteins are encoded by genes that have matches in the database but no function has yet been assigned to any of the sequences. (2) Proteins of unknown function are the products of genes unique to that organism. On the one hand, as more genomes are published, the likelihood of finding a match in another organism is increased. But on the other hand, as more metagenomic data become available, the pool of putative proteins with unknown functions (i.e., conserved hypothetical proteins) grows as well.

Once all the genes have been annotated, a physical map representing the entire genome may be drawn. A physical map is typically drawn as concentric circles depicting a bacterial or archaeal chromosome with each gene described by functional class (e.g., energy metabolism), which is color coded. Deviation from the mean % G + C is often indicated in the inner circle. Such deviations are common when DNA has been acquired by horizontal gene transfer. If the microbe has multiple chromosomes or plasmids, a genome map will show each chromosome and plasmid. Recall that the term genome includes all of the cell’s DNA.

Comprehension Check
1. What is the goal of genome annotation? Why does it require knowledge of mathematics, statistics, biology, and computer science?
2. What three elements must be present in a sequence of DNA for it to be considered a potential open reading frame?
3. Why is conservation of nucleotide or amino acid sequence important in annotating a putative gene?

18.5 Functional Genomics Links Genes to Phenotype

After reading this section, you should be able to:
a. Explain how genome annotation can be used to graphically represent the metabolism, transport, motility, and other key features of a microbe
b. Contrast and compare microarray analysis with RNA-Seq in the study of transcriptomes
c. Explain how 2-D gel electrophoresis is able to separate proteins of identical molecular weight
d. Summarize the importance of mass spectrometry in analyzing protein structure
e. Explain why DNA-protein interactions are of interest and how they can be experimentally identified

Functional genomics seeks to place genomic information in a biological context. For instance, the careful annotation of a microbe’s genome can be used to piece together metabolic pathways, transport systems, and potential regulatory and signal transduction mechanisms. A common outcome of a genome project is to infer the functional metabolic pathways, transport mechanisms, and other physiological features of the microbe (figure 18.11). One of the first microbes for which such analysis was performed was the causative agent of syphilis, Treponema pallidum. Because T. pallidum cannot be grown in pure culture, genomics is key to learning about its metabolism and the ways it avoids host defenses. The sequencing and annotation of the T. pallidum genome revealed that T. pallidum lacks genes encoding key enzymes in the TCA cycle and oxidative phosphorylation. T. pallidum also lacks many biosynthetic pathways and must rely on molecules supplied by its host. This inference is supported by the observation that about 5% of its genes code for transport proteins. Genomic analysis is a common approach to understanding microbial physiology. It is particularly rewarding when, like T. pallidum, the microbe cannot be cultured.
Phylum Spirochaetes: bacteria with a corkscrew morphology (section 21.6)

Transcriptome Analysis

Once the identity and function of the genes that comprise a genome have been analyzed, the key question remains, “Which genes are expressed at any given time?” The entire collection of RNAs produced at any one time by an organism is its transcriptome, which is studied in the field of transcriptomics. Prior to the genomic era, researchers could identify only a limited number of genes whose expression was altered under specific circumstances. There are two technologies that allow scientists to look at the expression level of a vast collection of genes: DNA microarrays and RNA-Seq. The first to be developed, DNA microarrays, consist of solid supports, usually of glass or silicon, upon which DNA is attached in an organized grid fashion. Each spot of DNA, called a probe, represents a single gene. The probe may be a PCR product generated from genes of interest or complementary DNA (cDNA). The location and identity of each probe on the grid are carefully recorded.

The analysis of gene expression using microarray technology, like many other molecular genetic techniques, is based on hybridization between the single-stranded probe DNA (i.e., the genes attached to the microarray) and the target nucleotides from the microbe of interest. Rather than using mRNA, which is easily degraded, cDNA is made from all the mRNA molecules extracted from the microbe at a particular time of interest (for example, before and after infecting a host organism). The target cDNAs are labeled with fluorescent dyes and incubated with the microarray under conditions that ensure proper binding of target cDNA to its complementary probe (gene). Unbound target is washed off and the microarray is then scanned with laser optics. Fluorescence at each spot or probe indicates that cDNA hybridized to its corresponding gene. Recall that each cDNA is derived from an mRNA molecule, so the color and intensity of each probe indicate the relative level of gene expression. However, microarray technology is difficult to standardize and large changes in gene expression cannot be detected. Furthermore, one can only detect changes in transcription level of those genes represented as probes. For these and other reasons, RNA-Seq is now the transcriptomic method of choice.

(Labels in figure 18.11: energy production, including glycolysis, the pentose phosphate pathway, and V-type ATPases; murein synthesis; cell wall synthesis; protein secretion; transporters for cations, amino acids, sugars, and nucleosides/nucleotides; 22 putative lipoproteins.)
Figure 18.11 Metabolic Pathways and Transport Systems of Treponema pallidum. This depicts T. pallidum metabolism as deduced from genome annotation. Note the limited biosynthetic capabilities and extensive array of transporters. Although glycolysis is present, the TCA cycle and respiratory electron transport are lacking.
Question marks indicate where uncertainties exist or expected activities have not been found.
MICRO INQUIRY Based on this genomic reconstruction, can you determine if T. pallidum has a respiratory or fermentative metabolism?

The advent of NGS techniques has made it possible to directly sequence total cellular mRNA, a process known as RNA-Seq. RNA-Seq has replaced microarray technology for many microbial transcriptomic applications. Quantification of mRNA by RNA-Seq is accomplished by measuring the number of reads (sequenced products) matching each gene.

The process of RNA-Seq begins with extracting all the RNA from the microbe of interest and converting all the cellular mRNA into cDNA by incubating the mRNA in the presence of the enzyme reverse transcriptase and dNTPs (figure 18.12; also see figure 17.5). The cDNA is then prepared for NGS by adding adapter sequences to each cDNA fragment. Each cDNA, representing an individual mRNA, is then sequenced using NGS. The resulting nucleotide sequences can be analyzed in two ways. Typically, the microorganism being studied has a sequenced ge

and compared with databases that house millions of protein sequences.

RNA-Seq generates a wealth of information. Because the number of reads is not limited, there is no upper limit to the level of gene induction that can be detected. For instance, under specific growth conditions, a 9,000-fold increase in Saccharomyces cerevisiae (baker’s yeast) gene expression was estimated for several genes when 16 million mapped reads (that is, sequences that could be aligned to the reference genome) were analyzed. Also, because RNA-Seq yields nucleotide sequence, information is revealed about the transcripts themselves. For instance, variations in transcribed sequences can be detected, as can the nucleotides where transcription starts and stops.

Transcriptomic experiments yield a vast amount of informa-