Principles of Molecular Biology

ABSTRACT

Background
Molecular diagnostics and its parent field, molecular pathology, examine the origins of disease at the molecular level, primarily by studying nucleic acids. Deoxyribonucleic acid (DNA), which contains the blueprint for constructing a living organism, is the centerpiece of research and clinical analysis. Molecular pathology is an outgrowth of the enormous amount of successful research in molecular biology, which over the last seven decades has uncovered the basic biological and chemical processes by which a living cell functions. The success of molecular biology, reflected in the large number of Nobel prizes awarded for its discoveries, is now being applied to clinical diagnosis and to the development and use of therapeutics.

Content
The following chapters are devoted to describing this field and the specific applications currently being used.

HISTORICAL DEVELOPMENTS IN GENETICS AND MOLECULAR BIOLOGY

Molecular diagnostics would not be possible without the many significant pioneering efforts in genetics and molecular biology. Early observations in genetics began with the discovery of the inheritance of biological traits by Gregor Mendel in 1866 and the observation by Thomas Morgan in 1910 that genes are associated with chromosomes. The initial findings that helped establish DNA as the transmissible genetic material were made by Griffith in 1928 and by Avery, MacLeod, and McCarty in 1944 [1,2]. The definitive studies, published by Hershey and Chase in 1952, demonstrated that radiolabeled phosphate incorporated into the DNA of a bacteriophage was found in newly synthesized DNA-containing bacteriophage, whereas radiolabeled sulfur incorporated into protein was not, which showed that DNA and not protein was the genetic material [3]. Deciphering the structure of DNA required several crucial findings.
These included the observation by Erwin Chargaff that the quantity of adenine is generally equal to the quantity of thymine and the quantity of guanine is similar to that of cytosine [4], and the pivotal x-ray crystallography results produced by Rosalind Franklin and Maurice Wilkins [5,6]. Molecular biology has historically traced its beginnings to the first description of the structure of DNA by James Watson and Francis Crick in 1953 [7,8].

Principles and Applications of Molecular Diagnostics. Copyright © 2018 Elsevier Inc. All rights reserved.

Along with the discovery of the basic biology of genes and their expression, many important techniques were invented. For example, the isolation of restriction enzymes [18] and DNA ligase allowed for the construction of recombinant DNA [19], which could be transferred from one organism to another, leading to the cloning of DNA [20] and the emergence of genetic engineering. The Southern blot method, which identifies specific electrophoretically separated pieces of DNA, contributed to many discoveries and was one of the first molecular diagnostic methods used to test for genetic diseases [21]. DNA sequencing technologies were invented [22,23], and further advances in these technologies led to the first large biological science research undertaking, the Human Genome Project. Along with DNA sequencing, further technical discoveries, including the polymerase chain reaction in 1986 [24] and microarray technology in 1995 [25], became methodologic foundations for molecular diagnostics.

MOLECULAR BIOLOGY ESSENTIALS

Whether it is a bacterium, virus, or eukaryotic cell, the genetic material located in these organisms dictates their form and function. For the most part the genetic material is DNA, which is composed of two strands, each with a sugar-phosphate backbone, held together in a double helix by hydrogen bonds between the bases (two purines and two pyrimidines) attached to the sugar molecule, deoxyribose (Figs. 1.1 and 1.2).
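The base-pairing rule described above (adenine with thymine, guanine with cytosine) can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical and chosen only for illustration; the sketch assumes an unambiguous sequence written with the letters A, T, G, and C.

```python
# Watson-Crick pairing: adenine with thymine, guanine with cytosine.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def base_counts(strand: str) -> dict:
    """Count each base across both strands of the duplex formed by `strand`."""
    duplex = strand.upper() + reverse_complement(strand)
    return {base: duplex.count(base) for base in "ATGC"}


# Hypothetical 10-base example sequence.
example = "ATGCGGCTAA"
partner = reverse_complement(example)
counts = base_counts(example)
print(partner)
print(counts)
```

Because every A on one strand faces a T on the other, and every G faces a C, the duplex-wide counts of A and T match, as do those of G and C, which is the pattern Chargaff observed.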
DNA in human cells is wrapped around histone proteins and packaged into nucleosome units, which are compacted further to form chromosomes (Fig. 1.3). There are 23 pairs of chromosomes, one pair of which consists of the sex chromosomes, X and Y. Each chromosome is a single length of DNA with a stretch of short repeats at the ends called telomeres and additional repeats in the centromere region. In humans, there are two sets of 23 chromosomes, one derived from the mother’s egg and one from the father’s sperm. Each egg and sperm therefore carries a single, or haploid, set of 23 chromosomes, and the combination of the two creates a diploid set of human DNA.

FIGURE 1.1 A, Purine and pyrimidine bases (thymine with adenine, cytosine with guanine) and the formation of hydrogen bonds between them. (*In RNA, uracil takes the place of thymine and lacks the thymine methyl group.) B, A single-stranded DNA chain.

FIGURE 1.2 The DNA double helix, with the sugar-phosphate backbone and pairing of the bases in the core forming planar structures. One helical turn = 3.4 nm. (From Jorde LB, Carey JC, Bamshad MJ, editors: Medical genetics. 4th ed. Philadelphia: Mosby; 2010.)

FIGURE 1.3 Structural organization of human chromosomes. DNA wraps around a core of histone proteins to form nucleosomes, and nuclear DNA in its compact state, together with its associated proteins, forms chromosomes. The ends of the chromosome are the telomeres. (From … Philadelphia: Mosby; 2010.)

Types of Deoxyribonucleic Acid

Double-stranded DNA in living cells is generally found as the right-handed B-DNA helical structure, which has specific dimensions. Each turn of the helix is 3.4 nm long and consists of 10 base pairs.
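The helix dimensions given above imply a rise of 0.34 nm per base pair (3.4 nm divided by 10). The short Python sketch below works through that arithmetic. The 250-million-base-pair figure is a hypothetical example standing in for a large chromosome, not a value from the text, and the function names are illustrative only.

```python
NM_PER_TURN = 3.4   # length of one helical turn of B-DNA, from the text
BP_PER_TURN = 10    # base pairs per helical turn, from the text


def helix_length_nm(base_pairs: int) -> float:
    """Contour length in nanometers of a B-DNA helix with `base_pairs` pairs."""
    return base_pairs * NM_PER_TURN / BP_PER_TURN


def helical_turns(base_pairs: int) -> float:
    """Number of complete turns the two strands make around each other."""
    return base_pairs / BP_PER_TURN


# Hypothetical example: a chromosome of 250 million base pairs.
example_bp = 250_000_000
length_cm = helix_length_nm(example_bp) * 1e-7  # 1 nm = 1e-7 cm
print(f"{length_cm:.1f} cm, {helical_turns(example_bp):,.0f} turns")
```

Under these assumptions a single chromosome-sized molecule would be several centimeters long if fully extended, which is why the packaging into nucleosomes and chromatin described earlier is necessary.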
The DNA sugar-phosphate backbone is on the outside of the helix, with the paired bases in the core (see Fig. 1.2). Non-B DNA structures, which include the A-motif, tetraplex G-quadruplex, i-motif, hairpin, cruciform, and triplex, are abundant in the human genome because a large percentage of the genome contains various repeats. Non-B DNA is associated with many biological processes, including transcriptional control. However, these structures can also create genetic instability, which can lead to various diseases such as neurologic disorders [26].

Molecular Composition of Ribonucleic Acid

The composition of RNA is similar to that of DNA in that it contains four nucleotides linked together by phosphodiester bonds, but there are several important differences. RNA consists of a ribose sugar with a hydroxyl group at the 2′ carbon instead of the hydrogen atom found in DNA. The bases attached to the ribose sugar are adenine, cytosine, and guanine, but not thymine, because RNA uses another pyrimidine, uracil, as a substitute for thymine.

Structure of Ribonucleic Acid

One significant difference between DNA and RNA is that RNA does not normally exist as two strands bound to one another, although a single strand can bind internally to itself, creating functionally important secondary structures. Although the known complexity and number of different RNAs have greatly expanded over the past several decades, the majority of cellular RNA is composed of a rather small number of RNA types. These include messenger RNA (mRNA), ribosomal RNA (rRNA), and transfer RNA (tRNA).

Ribonucleic Acids Associated With Protein Production

mRNA is the most diverse of the three major types of RNA but constitutes only a small percentage of the total RNA. mRNAs are transcribed from DNA that codes for proteins and therefore serve as the template for the translation of proteins. In prokaryotes the mRNA is colinear with the protein that is translated; in eukaryotes, however, the mRNA begins as a precursor RNA called premessenger RNA or heterogeneous nuclear RNA (hnRNA), which includes untranslated intron and translated exon regions.
The hnRNA is then spliced into mature mRNA lacking the introns. The mature mRNA contains only exons and can be further modified by the addition of a 7-methylguanosine cap at the 5′ end, which protects the mRNA from degradation, and a polyadenosine (polyA) sequence at the 3′ end. In eukaryotes the production and processing of hnRNA into mRNA take place in the nucleus, and the final form of the mRNA is then transported to the cytoplasm to be translated.

rRNA is associated with ribosomes, which are the primary structures that produce protein through the biological process of translation. rRNA, unlike mRNA, does not code for proteins. The ribosome is composed of two subunits: the 50S and 30S subunits in prokaryotes and the 60S and 40S subunits in eukaryotes. The “S” stands for Svedberg units, a measure of the centrifugal sedimentation rate, which depends on the mass, density, and shape of a particle. The ribosome is a mixture of RNA and protein. In eukaryotes there are four major rRNAs: the 18S rRNA found in the 40S subunit and the 28S, 5.8S, and 5S rRNAs found in the 60S subunit. In prokaryotes, the 50S subunit contains the 23S and 5S rRNAs and the 30S subunit contains the 16S rRNA. Eukaryotic rRNA is synthesized as a large 45S precursor RNA that is enzymatically cleaved to form all the rRNAs except the 5S rRNA.

FIGURE 1.4 Schematic illustration of a nucleosome unit. A segment of DNA is wound around a nucleosome core particle consisting of an octamer of two each of the histone proteins H2A, H2B, H3, and H4. Tails with modifications (red star) are shown protruding from H3 and H4. Adjacent nucleosomes are separated by a segment of linker DNA, with which the linker histone, H1, is associated.

The nucleosomes are condensed into filaments and even more compact structures to form a chromosome (see Fig. 1.3).
There are 23 pairs of chromosomes: 22 pairs of autosomes and 1 pair of sex chromosomes, X and Y, with an XX pair denoting female and an XY pair denoting male. The DNA in each chromosome is continuous and can be as much as several hundred million base pairs in length for the largest chromosomes.

From a cytogenetic viewpoint, regions of the chromosomes can be classified by their transcriptional activity. The more condensed heterochromatin is transcriptionally inactive and stains with Giemsa, a mixture of several dyes that bind to AT-rich regions of DNA. The less condensed euchromatin is transcriptionally active and does not stain with Giemsa. The ends of the chromosomes, called telomeres, contain a repeat sequence (TTAGGG in humans) and shorten with age. The centromeres, at the center of most chromosomes, are important for linking sister chromatids during mitosis and contain various satellite DNAs, such as α-satellite tandem repeats (171 bp) that span several million base pairs (Mb).

Surprisingly, most human DNA does not code for protein. As much as 50% of human DNA consists of many types of interspersed repeat sequences, such as satellites, telomeres, microsatellites, minisatellites, short and long interspersed nuclear elements (SINEs and LINEs), and retrovirus elements [31]. As in other eukaryotes, human genes are in pieces, with the protein-encoding regions, exons, alternating with introns, which do not code for protein sequence and occupy more than a quarter of human DNA [33]. Other regions around the genes, such as the promoter regions and the 3′ untranslated regions, are also not translated into protein. After all the noncoding sequences are removed, the protein-coding DNA sequence spans only approximately 1.2% to 1.5% of human DNA.
Even though most human DNA is not associated with protein-producing genes, the Encyclopedia of DNA Elements (ENCODE) project has found that much of the human genome is expressed into RNA.

FIGURE 1.5 DNA replication. Double-stranded parent DNA is separated at the replication fork. The leading strand is synthesized continuously, whereas the lagging strand is synthesized discontinuously in fragments that are joined later by DNA ligase.

DNA replication is part of the cell cycle and occurs during the synthesis (S) phase. The S phase lies within interphase, between the first growth phase (G1) and the second growth phase (G2). The mitosis (M) phase, in which one cell splits into two, occurs after the G2 phase. Mitosis is divided into six subphases: prophase, prometaphase, metaphase, anaphase, telophase, and cytokinesis. At important control points in the cell cycle the cell must commit significant resources to proceed further. One of these control points lies between the G1 and S phases, just before the cell begins DNA replication. The G1/S control point is disrupted in many cancers. Neoplasms commonly have mutations in the retinoblastoma gene (RB1), whose protein product regulates cell cycle progression from G1 to S. Another control point lies between G2 and M, just as the cell commits to creating two cells from one.

Deoxyribonucleic Acid Repair

The integrity of DNA is compromised in a variety of ways that culminate in changes, or mutations, in the DNA sequence.
DNA bases may be damaged, removed, cross-linked, or incorrectly paired with one another, and single- or double-stranded breaks may also occur [37,38]. When the cell senses that its DNA has become damaged, it stops the progression of its cell cycle and initiates DNA repair processes [39]. Cells repair these lesions by employing multiple DNA repair mechanisms that are specific for the type of DNA lesion, including base excision repair, nucleotide excision repair, mismatch repair, and homologous recombination repair.

Mechanisms

Base excision repair removes bases that are damaged by deamination, oxidation, and alkylation. Deamination of guanine, cytosine, and adenine converts them into structures that base pair incorrectly, creating transition mutations, which are changes between similar nitrogenous bases, such as a purine to a purine. A transversion mutation is a change from a purine to a pyrimidine or vice versa. DNA glycosylases, such as uracil-DNA glycosylase, cleave the damaged base, and a 5′-deoxyribose phosphate lyase removes the nucleotide.

Deoxyribonucleic Acid Modification Enzymes

There are two groups of nucleases: endonucleases, which cut through the sugar-phosphate backbone, and exonucleases, which digest DNA from its ends. The commercially important restriction endonucleases, which bacteria have acquired to protect themselves from viral infections, cleave DNA at specific nucleotide sequences, or restriction sites [42]. Several thousand restriction endonucleases have been characterized and are used extensively to manipulate DNA in molecular biology and molecular diagnostics.
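Sequence-specific cleavage by a restriction endonuclease can be sketched in a few lines of Python. The example assumes EcoRI, a widely used enzyme that recognizes GAATTC and cuts after the first G; the text does not name a particular enzyme, so this choice, the function names, and the test sequence are illustrative assumptions. The sketch treats only the top strand of a linear molecule.

```python
def find_sites(dna: str, site: str) -> list:
    """Return the 0-based start position of every occurrence of `site`."""
    dna, site = dna.upper(), site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def digest(dna: str, site: str, cut_offset: int) -> list:
    """Cut the top strand `cut_offset` bases into each site; return the fragments."""
    dna = dna.upper()
    fragments, previous = [], 0
    for position in find_sites(dna, site):
        cut = position + cut_offset
        fragments.append(dna[previous:cut])
        previous = cut
    fragments.append(dna[previous:])
    return fragments


# Assumed enzyme for illustration: EcoRI, which cuts G^AATTC.
ECORI_SITE = "GAATTC"
ECORI_CUT_OFFSET = 1

# Hypothetical sequence containing two EcoRI sites.
sequence = "TTGAATTCAAGGGAATTCTT"
fragments = digest(sequence, ECORI_SITE, ECORI_CUT_OFFSET)
print(find_sites(sequence, ECORI_SITE), fragments)
```

Joining the fragments in order restores the original sequence, which is a quick sanity check that the digest neither drops nor duplicates bases.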
Recent work has described new nucleases, such as the RNA-guided engineered nuclease system CRISPR/Cas, that can precisely cleave genomic DNA [43]. DNA glycosylases are a family of enzymes associated with base excision repair that act in the first step of DNA repair to remove the damaged base without disrupting the sugar-phosphate backbone. An important member of that family, uracil-DNA glycosylase, repairs the most common mutation found in humans, the spontaneous deamination of cytosine to uracil, by removing the uracil base.

Gene Structure

The structure of prokaryotic genes is straightforward: almost all of the gene sequence is used to make protein. This is not the case with eukaryotic genes. One of the unique hallmarks of eukaryotic genes is that the protein-coding DNA is interspersed with regions that do not code for protein, an observation made by Richard Roberts and Phillip Sharp in 1977. A mature mRNA retains only the protein-coding sequences, called exons; the sequences between the exons are non-protein-encoding sequences called introns, which are removed during mRNA maturation (Fig. 1.6) [44]. In addition to introns and exons, eukaryotic genes contain regulatory regions, such as promoters and enhancers, and 3′ regions that contain termination and polyadenylation signals. The expression of eukaryotic genes can be regulated at all levels, from transcription to splicing to translation to degradation; however, most gene regulation occurs at the initiation of transcription through various promoters and enhancers [45]. There are two groups of regulatory elements. One is close to the transcriptional start site and is made up of the core promoter and ancillary promoters slightly further from the start of transcription. The other group of regulatory elements can be much further away, not only upstream but also downstream from the gene.
FIGURE 1.6 DNA transcription and messenger RNA processing. A gene with a promoter region and variable numbers of introns and exons is transcribed into pre-mRNA, or heterogeneous nuclear RNA (hnRNA), which is processed into capped mature messenger RNA.

Transcription factors have several functional domains. One functional domain of the transcription factor binds to a specific promoter DNA sequence via one of several structures, such as the helix-turn-helix, zinc finger, and leucine zipper. Another domain binds to the other transcription factor of the dimer pair, and a third domain may bind to the RNA polymerase complex that carries out transcription [46]. Even when promoters and the transcription factors bound to them are far from the transcription initiation complex, the promoter DNA folds back on itself to allow the transcription factors to interact with the RNA polymerase complex [53].

Important recurring sequences are found in the core promoter. For example, the core promoter of an RNA polymerase II gene contains a TATAAA sequence, called a TATA box, located 25 to 40 nucleotides upstream of the transcriptional start site. Only 20% to 30% of eukaryotic promoters contain TATA boxes, but genes with them are highly regulated compared with those without TATA boxes, which are mostly housekeeping genes [45,54,55]. The first step in mRNA transcription is the binding of transcription factor IID (TFIID) to the TATA box, which in turn promotes the binding of other transcription factors (TFIIA, TFIIB, TFIIE, TFIIF, and TFIIH), RNA polymerase II, and proteins attached to the upstream promoter sites.

To form a functional transcription complex, the double-stranded DNA of the promoter region separates and the transcription complex moves away from the core promoter region [45]. Once started, the RNA polymerase adds nucleotides to the 3′ free hydroxyl group in a manner similar to that of DNA replication. Transcription is eventually terminated by one of several termination mechanisms.
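The position of the TATA box relative to the transcriptional start site can be illustrated with a minimal Python sketch that scans the 25 to 40 nucleotide upstream window for the TATAAA sequence. The function name, the toy promoter, and the convention of measuring distance from the first base of the box are assumptions made for illustration; real promoter analysis tolerates sequence variation that this exact-match sketch does not.

```python
TATA_BOX = "TATAAA"  # core promoter sequence given in the text


def tata_boxes_upstream(dna: str, tss: int, nearest: int = 25, farthest: int = 40) -> list:
    """Return distances upstream of the start site at which a TATA box begins.

    `tss` is the 0-based index of the transcriptional start site. Distance is
    measured from the first base of the box (an assumed convention).
    """
    dna = dna.upper()
    hits = []
    for distance in range(nearest, farthest + 1):
        start = tss - distance
        if start >= 0 and dna[start:start + len(TATA_BOX)] == TATA_BOX:
            hits.append(distance)
    return hits


# Hypothetical promoter: a TATA box beginning 30 nucleotides upstream of the start site.
promoter = "G" * 20 + "TATAAA" + "C" * 24 + "ATGGCC"
tss = 50
print(tata_boxes_upstream(promoter, tss))
```

A sequence with no TATAAA in the window returns an empty list, which corresponds to the majority of eukaryotic promoters that lack a TATA box.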
In bacteria a termination factor bound to the RNA polymerase recognizes a DNA sequence termination signal. For genes transcribed by RNA polymerase II, termination is coupled with the polyadenylation step (see Fig. 1.6).

Two posttranscriptional processing events are performed on the newly formed hnRNA, one at each end of the RNA. At the 5′ end, the hnRNA is capped with a 7-methylguanosine molecule to help protect it from degradation. At the 3′ end, a polyadenosine (poly A) stretch is added by poly A polymerase after the RNA sequence AAUAAA is synthesized. Some transcribed mRNAs, such as histone mRNAs, are not polyadenylated [56].

Transcription initially produces an hnRNA that contains both exons and introns and must be processed, or spliced, into mature mRNA before it can be properly translated into protein. RNA splicing involves cleavage and removal of intron RNA segments and joining of exon RNA segments. The process uses consensus splice site sequences located at the 5′ (GU) and 3′ (AG) ends of the intron and an internal intron sequence. Splicing requires a number of proteins and small RNAs that come together to form a spliceosome, which directs the splicing of exons and removal of introns [57]. Splicing begins with the binding of the U1 small nuclear ribonucleoprotein (snRNP) to the donor splice site and the U2 snRNP to the internal intron sequence, followed by the binding of the U4, U5, and U6 snRNPs, resulting in excision of the intron and joining (splicing) of the ends of the two exons on either side of the excised intron (see Fig. 1.6) [57].

An important modification of the splicing process, alternative splicing, allows for the generation of different mRNAs from the same primary RNA transcript.

Tertiary structure is the folding of the polypeptide chain into a three-dimensional form.
Quaternary structure is the structural relationship of more than one polypeptide joined together, as in immunoglobulin molecules, which contain light and heavy chains bound together through cysteine residues.

Once proteins are synthesized, they can be modified in various ways. One of the most common modifications is phosphorylation of the amino acids serine, threonine, and tyrosine, which can regulate protein activity. Other modifications include proteolytic cleavage, such as removal of the signal transport sequence, and acetylation of the N-terminus of most eukaryotic proteins, which helps to prevent degradation. Glycosylation of secreted and membrane proteins on asparagine, serine, and threonine residues and formation of disulfide bonds via cysteine cross-linking are additional modifications.

Taking into consideration these posttranslational modifications and the alternatively spliced forms mentioned in an earlier section, the total number of proteins in the more than 200 human cell types is estimated to range from 250,000 to several million [61].

The genetic code, which was deciphered in the early 1960s, is required to convert a nucleic acid sequence into an amino acid sequence [13]. It was reasoned that, because there are 20 amino acids, a code of at least 3 nucleotides was necessary to provide enough combinations; a 2-nucleotide code would give only 16. A 3-nucleotide code gives 64 combinations, and therefore one hallmark of the genetic code is that it is redundant, meaning that several codons specify one amino acid. That is the case for most amino acids, but not all; for example, methionine and tryptophan each have only one codon. The redundancy is usually in the third base of the codon. All of the 64 possible 3-nucleotide codons code for an amino acid, except 3 that serve as stop codons (UAA, UGA, and UAG) (Fig. 1.7).
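The counting argument and the redundancy of the code can be checked with a short Python sketch of the standard genetic code. The compact one-letter amino acid string, the helper names, and the example mRNA are choices made here for illustration (with "*" marking a stop codon); they are not taken from the chapter.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the 64 codons in UCAG order
# (first base, then second, then third); "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid: str) -> list:
    """List every codon that specifies the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def translate(mrna: str) -> str:
    """Translate an mRNA from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(BASES) ** 2, len(BASES) ** 3)        # doublet vs. triplet combinations
print(codons_for("*"), codons_for("M"), codons_for("W"))
print(translate("AUGUUUUGGUAA"))               # hypothetical 12-base mRNA
```

The six leucine codons returned by `codons_for("L")` show the redundancy described above, whereas methionine and tryptophan each return a single codon.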
FIGURE 1.7 Genetic code. Codons with U as the second letter:
UUU, UUC: phenylalanine
UUA, UUG: leucine
CUU, CUC, CUA, CUG: leucine
AUU, AUC, AUA: isoleucine
AUG: methionine
GUU, GUC, GUA, GUG: valine

FIGURE 1.8 Translation. A ribosome moves along the capped messenger RNA, and amino acid-bound transfer RNAs pair with each codon as the polypeptide chain grows. (Modified from … Louis, Elsevier; 2017.)

A tRNA specific for the next 3-base codon (for example, lysine) binds to the acceptor site of the ribosome, and with the help of elongation factors (eg, eEF2), the amino acid in the peptidyl site is joined to the amino acid in the acceptor site by the formation of a peptide bond. A peptide bond is created between the amino group of one amino acid and the carboxyl group of the next amino acid through a condensation reaction that releases water. At the same time the tRNAs shift positions, with the methionine tRNA shifting to the exit site and the tRNA carrying the growing chain of amino acids shifting to the peptidyl site. The ribosome simultaneously moves forward one codon, the tRNA whose anticodon matches the next codon binds in the acceptor site, and the process is repeated until a termination codon is reached (Fig. 1.8). Termination factors then bind and stop the translation process [62].

Protein synthesis occurs in the eukaryotic cytoplasm or at the endoplasmic reticulum, where multiple ribosomes, called polyribosomes, translate an individual mRNA.

Regulation of translation is not as extensive as that of transcription. However, there is global regulation of eukaryotic translation at the initiation step through phosphorylation of initiation factor 2B by four different protein kinases.
This occurs when cells are under stress, such as amino acid starvation or DNA damage [64]. In addition, mRNA-specific translational regulation can occur through binding to specific sequences located in the 5′ and 3′ untranslated regions. Furthermore, there are over 1000 microRNAs in humans [65], many of which regulate translation. The microRNA genes are transcribed as precursor RNA and then processed into a mature 22-nucleotide form by the processing enzymes Drosha and Dicer.

Deoxyribonucleic Acid Methylation

DNA methylation is a well-known epigenetic change that is important in X chromosome inactivation, gene imprinting (eg, Prader-Willi and Angelman syndromes), and cancer. The most common methylation event is the methylation of cytosine to form 5-methylcytosine. DNA methylation typically occurs at cytosines directly upstream of guanines, that is, at CpG dinucleotides. Cytosine is both methylated and demethylated by a variety of enzymes. The initial methylation state is established in early embryonic development by one type of DNA cytosine-5-methyltransferase, whereas the methylated state is maintained during each subsequent cell division by another type of DNA cytosine-5-methyltransferase [75]. Demethylation involves three members of the ten-eleven translocation (TET) family of dioxygenases, which catalyze the conversion of 5-methylcytosine to other modified forms, such as 5-hydroxymethylcytosine [76]. 5-Hydroxymethylcytosine is found in high amounts in neural cells and is postulated to regulate gene expression [76].

Gene expression is altered by methylation via several mechanisms. The most direct effect is through altering the ability of transcription factors to bind to promoters. Methylation decreases the affinity of transcription factors for a DNA promoter and enhances the binding of methylation-specific transcription factors (Fig. 1.9).
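Because methylation targets a cytosine that is immediately followed by a guanine on the same strand, candidate sites can be located by scanning for the dinucleotide CG read 5′ to 3′. The Python sketch below is a minimal illustration under that assumption; the function name and test sequences are hypothetical, and the sketch says nothing about whether a given site is actually methylated.

```python
def cpg_positions(dna: str) -> list:
    """Return 0-based positions of cytosines that are followed by a guanine."""
    dna = dna.upper()
    return [i for i in range(len(dna) - 1) if dna[i:i + 2] == "CG"]


# Hypothetical sequence with three CpG dinucleotides.
sequence = "ACGTCGGCG"
sites = cpg_positions(sequence)
print(sites, len(sites))
```

Order matters here: a guanine followed by a cytosine (GpC) is not a CpG site, so the sequence GCGC contains only one candidate cytosine.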
Additionally, methylation compacts the chromatin structure, thus reducing the access of transcription factors to a promoter [77]. Cancer is the most common human disease associated with aberrant DNA methylation [78]. Interestingly, the overall level of 5-methylcytosine in cancer cells is 60% less than in normal cells; however, certain promoter-specific CpG islands are hypermethylated [78]. Other human diseases associated with methylation include lupus and many neurologic diseases.

FIGURE 1.9 Epigenetics. Top, DNA methylation (Me) of a CpG-containing promoter is associated with loss of gene expression; in the absence of Me, gene expression is unaffected. Histone modifications include methylation, acetylation, and phosphorylation. (Modified from Zaidi SK, Young DW, Montecino … maintenance of epigenetic information. J …)

Adding a negatively charged group to the histone repels the histone from the negatively charged DNA and loosens the chromatin structure [80]. Other modifications include poly(ADP-ribosyl)ation, ubiquitination, SUMOylation, and glycosylation [81].

Histone variants have been known for decades, but many of their functions are not well established.
Histone protein variants H3.3 and H2A.Z are the best known and have been shown to function in the regulation of gene expression [82]. Histone variant H3.3 is incorporated into chromatin independently of replication and is associated with active chromatin [83,84].

ATP-dependent remodeling enzymes use the energy from the hydrolysis of ATP to change the structure of chromatin [84,85]. They are grouped into four families: SWItch/Sucrose NonFermentable (SWI/SNF), imitation switch (ISWI), inositol requiring 80 (INO80), and chromodomain (CHD) [79,85]. The remodeling enzymes share several properties: (1) specific interaction with nucleosomes, (2) attraction to the modified histone tail residues found in nucleosomes, (3) an ATPase domain, (4) ATPase regulatory function, and (5) the ability to interact with transcription factors and chromatin-associated proteins [81,85]. The primary function of these enzymes is to remodel the chromatin structure. The SWI/SNF proteins function in the sliding and ejecting of nucleosomes but do not function in chromatin assembly. The ISWI family of enzymes changes the nucleosome spacing through sliding, which is necessary after DNA replication. This family interacts with unmodified histone tails and functions to regulate transcription. The CHD family slides and ejects nucleosomes, by which it regulates transcription. The INO80 family of proteins has an insertion in the middle of its ATPase domain and functions in promoting transcription and DNA repair. A mammalian member of this family, SWR1, can exchange histones to facilitate DNA repair [81,85-87].

Noncoding Ribonucleic Acids

Most of the expressed RNA in a cell is not translated into protein. Only the mRNAs are translated into protein, and they represent only 1% to 5% of the total RNA, depending on cell type. Much of this noncoding RNA is well known and includes rRNA and tRNAs.
However, over the last several decades two large groups of noncoding RNAs have been discovered: the short and the long noncoding RNAs. The ENCODE project tested for expression from DNA not associated with genes by using probes that overlapped one another regardless of the location of genes. Over 80% of human DNA could be assigned a biochemical function, although biochemical function was liberally defined [88]. Nonetheless, it was determined that the bulk of the human genome is expressed into RNA [89].

The short noncoding RNAs consist of microRNAs, small interfering RNAs, and piwi-interacting RNAs [90,91]. MicroRNAs regulate gene expression by binding to a specific sequence of the mRNA and inhibiting its translation. Small interfering RNAs (siRNAs) also bind to a region of the mRNA but inhibit translation by initiating degradation of the mRNA by the associated Argonaute protein. Piwi-interacting RNAs (piRNAs) function in the repression of transposons and are important in the development of gametes in many multicellular eukaryotic species.

The long RNAs are arbitrarily designated to be greater than 200 nucleotides, whereas the short RNAs begin at about 20 nucleotides.

The Exome Aggregation Consortium (ExAC) has sequenced over 60,000 exomes to delineate common genetic variation within human exomes. The SNP database, International HapMap Project, 1000 Genomes Project, ExAC, and genome-wide association studies have helped to define genetic variability within individuals and populations and to clarify the basis of many genetic diseases [98].

A more fundamental biology project is the Encyclopedia of DNA Elements, or ENCODE, whose goal is a catalog of the functional elements of the genomes of humans and other species. The functional elements include the genes and all their expressed RNA forms and epigenetic modifications [51]. One of its most important findings is that much of the human genome is expressed into RNA.
With the introduction of the first massively parallel DNA sequencing instrument in 2005 and subsequent instruments from 2006 onward, the current technologic era of genomics has progressed over the last decade to make significant inroads into applying genomics to patient care [99]. Along with the technologic innovation in DNA sequencing, there has been innovation in bioinformatics, which is required to manage and interpret the large amount of information generated by massively parallel DNA sequencing instruments.

Although the Human Genome Project was a significant feat, the human genome was not the first whole genome to be sequenced. Whole genome sequencing initially focused on infectious pathogens because of their impact on human health and their small genome size. The first free-living organism to be sequenced was Haemophilus influenzae in 1995 [100]. Subsequently, many species from a cross-section of living organisms have been sequenced. The first individual human to have their whole genome sequenced was Craig Venter, who led one of the two groups that sequenced the human genome. The second person to have their whole genome sequenced was James Watson, whose genome was the first to be sequenced by massively parallel DNA sequencing.

An important clinical application of genomics is cancer diagnostics (see Chapters 7 and 8); however, the diversity and complexity of cancer mean that a significant amount of basic biological information is required to interpret molecular diagnostic test results from patient samples. The first whole genome sequencing of a cancer was of an acute myeloid leukemia in 2008 [101], and many others have subsequently been sequenced. The Cancer Genome Atlas project includes large numbers of the most common cancers, with the aim of identifying all their associated mutations.
For example, a recent study describes mutational data for 12 of the most common cancers.102 The significant amount of basic information now available on human cancers and the availability of new therapeutics targeting specific cancer-associated genes allow the clinical use of molecular profiling in cancer patients.103

With the increasing use of genetic and genomic information to characterize a patient’s disease, an interesting convergence of electronic medical records and genomics is emerging. The implementation of electronic medical records throughout the United States will allow for greater access to the large amount of genomic data that will be available on patients, which will eventually be a source for scientific research and discovery. The Electronic Medical Records and Genomics Network is currently developing tools and conditions under which genomic research can be pursued using electronic medical records.104

12. Kornberg A. Biologic synthesis of deoxyribonucleic acid. Science 1960;131:1503-8.
13. N

Principles of Molecular Biology
John Greg Howe

characterize and help treat patients with a variety of ailments, including hereditary genetic diseases, cancer neoplasms, and infectious diseases. In this chapter the fundamentals of molecular biology are reviewed, followed by a focus on genomes and their variants in Chapter 2. In Chapters 3 and 4 techniques for isolating and analyzing nucleic acids are discussed. The clinically important subdivisions of molecular diagnostics are then reviewed and include microbiology in Chapter 5, genetics in Chapter 6, solid tumors in Chapter 7, and hematopoietic malignancies in Chapter 8. Chapters 9 and 10 are devoted to the molecular diagnostic analysis of circulating tumor cells and circulating nucleic acids. Finally, pharmacogenetics and identity assessment are the focus of Chapters 11 and 12.
and Francis Crick in 1953.7,8 The description of the DNA structure initiated a dramatic increase in knowledge of the biology and chemistry of our genetic machinery. The impact of the Watson and Crick discovery was so significant that it is considered one of the most important scientific discoveries of the 20th century.9

One reason the work of Watson and Crick had such a dramatic impact on scientific discovery was that they not only described the structure of DNA but also hypothesized about many of its properties, which took decades to confirm experimentally.7,8,10 One of those properties was the replication of DNA, which was shown to be semiconservative by Meselson and Stahl11 in 1958. At the same time, DNA polymerase, which replicates the DNA, was discovered by Arthur Kornberg.12 Deciphering the genetic code was vital for understanding the information stored in DNA, and cracking the code in 1965 required many scientists, most prominently Marshall Nirenberg.13 Additional studies described the transcription and translation processes and uncovered several startling findings. One finding was the isolation of reverse transcriptase, an enzyme that synthesizes DNA from ribonucleic acid (RNA), which demonstrates that genetic information can be transferred in part in a bidirectional manner.14,15 Another finding showed that the eukaryotic gene structure was composed of alternating non-protein-encoding introns and protein-encoding exons.16,17 Along

https://doi.org/10.1016/B978-0-12-816061-9.00001-1

individual to possess two different sequences, genes, and alleles on each chromosome, one from each parent. Each child has a unique combination of alleles because of homologous recombination between homologous chromosomes during meiosis in the development of gametes (egg and sperm cells). This creates genetic diversity within the human population.
If a child has a random DNA sequence change or mutation, the child’s genotype is different from that inherited from either of the parents (de novo variant). If the child’s genotype leads to visible disease, the child has acquired a different phenotype from the parents.

Human cells have a limited lifespan and die through a process called apoptosis. Therefore most cells replace themselves as they progress naturally through their cell cycle. As a cell moves through the phases of the cell cycle, its DNA doubles during the synthesis phase, when the double-stranded DNA molecule separates. Each strand of DNA is used as a template by DNA polymerase to make a complementary strand in a process called DNA replication. Eventually, during the final mitotic phase of the cell cycle, two cells are created from one.

DNA is composed of genes that code for proteins and RNA. For DNA to convert its store of vital information into functional RNA and protein, the DNA strands need to separate so that RNA polymerase can bind to the start region of the gene. With the help of transcription factors that bind to upstream promoters, the RNA polymerase produces single strands of RNA that are further processed to remove the introns and retain the protein-encoding exons. The mature, processed RNA molecule, the messenger RNA (mRNA), migrates to the cytoplasm, where it is used in the production of protein.

To start the process of protein synthesis or translation, the mRNA is bound by various protein factors and a ribosome, which contains ribosomal RNA (rRNA) and protein. The mRNA-bound ribosome begins to produce a polypeptide chain by binding a methionine-bound transfer RNA (tRNA) to the mRNA’s initiating AUG codon or triplet code.
The conversion of the nucleic acid triplet code to a polypeptide is accomplished by the tRNA, which contains a nucleic acid triplet code (anticodon) in its RNA sequence that is specific for an amino acid bound to one end of the tRNA molecule. After synthesis, the protein migrates to its functional location and eventually is removed and degraded.

[Figure residue: chemical structure of a polynucleotide chain, labeled with the 5′ and 3′ ends, deoxyribose (ribose†), bases, and the phosphodiester linkage. Surviving caption fragments: "... and the formation of complementary base pairs. Dashed lines indicate the ... is replaced by uracil, which differs from thymine only in its lack of the ... chain. Repeating nucleotide units are linked by phosphodiester bonds that join ... of the next. Each nucleotide monomer consists of a sugar moiety, a ... the sugar is ribose, which adds a 2′-hydroxyl to deoxyribose.)"]

NUCLEIC ACID STRUCTURE AND FUNCTION

DNA is a rather simple molecule with a limited number of components compared to those of proteins. DNA is composed of a deoxyribose sugar, phosphate group, and four nitrogen-containing bases. Deoxyribose is a pentose sugar containing five carbon atoms that are numbered from 1′ to 5′, starting with the carbon that will be attached to the base in DNA and progressing around the ring until the last carbon that is not part of the ring structure. The bases consist of the purines, adenine and guanine, and the pyrimidines, cytosine and thymine; an additional base, uracil, replaces thymine in RNA. A basic building block is the nucleotide, which consists of a deoxyribose sugar with an attached base at the 1′ carbon and a phosphate group at the 5′ carbon. The triphosphate nucleotide is the building block for making newly synthesized DNA.
Newly synthesized DNA forms a polynucleotide chain that connects the individual nucleotides through the 5′ and 3′ carbons of each deoxyribose sugar via phosphodiester bonds.

Structure of Deoxyribonucleic Acid

DNA is double stranded, and the two strands bind to one another through hydrogen bonds between the bases on each strand. Hydrogen bonding is augmented by hydrophobic attraction (stacking) between bases on adjacent rungs of the DNA ladder. Neither hydrogen bonds nor base stacking is covalent; both are weak interactions that can be broken and reestablished. This important property is exploited by many of the methods that are used in molecular diagnostics. DNA contains equal quantities of guanine and cytosine and equal quantities of adenine and thymine because, in general, guanine binds to cytosine and adenine binds to thymine.4,7 There are two hydrogen bonds between adenine (A) and thymine (T) and three hydrogen bonds between cytosine (C) and guanine (G); because of this difference in the number of hydrogen bonds, separating a guanine-cytosine (G-C) pair takes more energy than separating an adenine-thymine (A-T) pair (see Fig. 1.1).

Each of the two DNA strands is formed by a phosphate sugar backbone that starts at the 5′ phosphate and ends at a 3′ hydroxyl group, with the complementary bases binding to one another between the two phosphate sugar backbones. Each strand is therefore a polar opposite of the other (see Fig. 1.2). When the two strands are bound to one another they progress in opposite 5′ to 3′ directions in an antiparallel configuration. By convention, the DNA sequence is denoted in a 5′ to 3′ direction. As discussed later, both the replication of new DNA and the transcription of DNA to RNA progress in the 5′ to 3′ direction. In addition, the conversion of RNA to protein, a process called translation, proceeds from the 5′ end of the RNA to the 3′ end.
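The pairing rules and antiparallel orientation described above can be illustrated with a minimal sketch. The function names and the example sequence are hypothetical and are not taken from the chapter; the hydrogen-bond counts (two per A-T pair, three per G-C pair) are the ones stated in the text.

```python
# Watson-Crick pairing partners for the four DNA bases.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hydrogen bonds contributed by the pair each base forms (A-T: 2, G-C: 3).
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5' to 3' by convention.

    Because the strands are antiparallel, the complement is read in the
    reverse order of the input strand.
    """
    return "".join(PAIR[base] for base in reversed(strand.upper()))


def hydrogen_bonds(strand: str) -> int:
    """Count the hydrogen bonds holding this strand to its complement."""
    return sum(H_BONDS[base] for base in strand.upper())


print(reverse_complement("ATGC"))  # GCAT
print(hydrogen_bonds("ATGC"))      # 10
```

A G-C-rich duplex gives a larger bond count than an A-T-rich duplex of the same length, which is consistent with the statement that G-C pairs take more energy to separate.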
The combination of the base pairing and the directionality of the two DNA strands allows for the deciphering of the DNA sequence of one strand of DNA when the other complementary strand sequence is known.

[Figure residue: levels of chromosomal DNA packaging, labeled DNA double helix, Histone, Nucleosomes, Solenoid, Chromatin loop (contains approximately 100,000 bp of DNA), and Chromatid. Surviving caption fragments: "... chromosomal DNA. Double-stranded DNA is wound around the octamer ... which are further compacted into a helical structure called a solenoid. ... structural proteins is known as chromatin. Chromatin in its most ... constriction of a chromosome is the centromere, and the ... Jorde LB, Carey JC, Bamshad MJ, editors. Medical genetics. 4th ed."]

outside of the helix, and the bases of each strand are inside bound to their complement on the other strand by hydrogen bonds. Other conformational structures of DNA occur, mostly associated with DNA sequences that are repeated. These non-B DNA forms include a left-handed Z-form,

the 5S RNA, which is transcribed separately. Ribosomal RNAs have secondary and tertiary structures that are well conserved, with various loops, stem loops, and pseudoknots that contribute to their function. Ribosomal RNA and protein, as the components of ribosomes, function to carry out the translation of proteins. The sequence of the 16S rRNA has alternating conserved and divergent regions that can be used to identify microorganisms. The structure of the ribosome is now known, and the rRNA is more important than ribosomal proteins in ribosome functioning. The RNA acts as a catalytic agent called a ribozyme.27,28

Another important group of RNAs are the tRNAs, which function as key molecules that act as a bridge between the nucleic acids and the proteins. They have a unique cloverleaf secondary structure, with the 3′ end covalently attached to the amino acid by specific aminoacyl tRNA synthetases.
In the middle of the tRNA structure is the anticodon sequence that binds to a specific complementary codon in the mRNA. Therefore the codon directs the binding of a specific tRNA linked to its corresponding amino acid. The genetic code, which consists of 64 three-base codons, specifies the appropriate amino acid to be attached to the growing polypeptide chain (see Figs. 1.7 and 1.8, later in the chapter). There are several different classes of aminoacyl tRNA synthetases, but there is at least one aminoacyl tRNA synthetase for each of the 20 amino acids. There is also at least one tRNA for each amino acid; however, there can be more depending on the species.29

Besides the three major types of RNAs, other RNAs include nuclear, nucleolar, and cytoplasmic small RNAs, signaling RNAs, telomerase RNA, and microRNAs.30 This list appears to be growing with each passing year. Some of the first characterized small RNAs, the nuclear and nucleolar small RNAs, are involved with the processing of precursor RNAs to mature RNAs, including splicing of hnRNA to mRNA and precursor rRNA to mature rRNAs. More recently a large number of microRNAs have been discovered that partly function in the regulation of translation. In addition, there are many other noncoding RNAs whose functions are just beginning to be understood.

Human Chromosome

Human double-stranded DNA that is contained in the sperm or egg is a single copy or haploid amount of DNA made up of approximately 3 billion base pairs (bp). To be more precise, the Human Genome Project consensus sequence of the human genome was 2.91 × 10⁹ bp31 and the first human to be sequenced, Craig Venter, had a genome size of 2.81 × 10⁹ bp,32 not including remaining gaps of highly repetitive sequences, many near centromeres and telomeres (see Chapter 2). The DNA in the cell is bound by many proteins to form chromatin (see Fig. 1.3).
The proteins in chromatin consist of histones, which are bound in precise amounts per length of DNA, and other proteins called nonhistone proteins that are bound more irregularly and in widely varying amounts. The histone proteins consist of eight proteins (two copies each of H2A, H2B, H3, and H4) that bind as a unit to 147 bp of DNA to make up a nucleosome, and the protein H1, which binds between the nucleosomes (Fig. 1.4). The nucleosomes are the basic structure with which many other proteins interact and which they modify to regulate gene expression. For example, the access to DNA by transcription factors is controlled by proteins that remodel the histone proteins through phosphorylation, acetylation, and methylation.

The ENCODE project has shown that much of the non-protein-encoding DNA is transcribed into noncoding RNAs, most with unknown function.

CENTRAL DOGMA OF MOLECULAR BIOLOGY

Francis Crick originated the concept of the central dogma of biology, which describes the transfer of genetic information into functional macromolecules.34 This was generally depicted to show the movement of genetic information from DNA to RNA via transcription using RNA polymerase and further translated into protein via ribosomes and various factors. This is a simplistic version of the original concept, which took into consideration every possible transfer of information even though no evidence existed at the time. However, since the original publication a number of other postulated transfers have been described. DNA can be enzymatically replicated by DNA polymerase, and RNA can be made into DNA using reverse transcriptase.35 Many of these enzymes are used in molecular diagnostics assays.

Deoxyribonucleic Acid Replication

A general principle underlying the synthesis or replication of new DNA is that it uses one of the two DNA strands as a template to make a new complementary strand.
This is termed semiconservative replication and was first theorized by Watson and Crick.7 DNA replication begins at an adenine and thymine (AT)-rich structure called an origin of replication. In bacteria there is generally only one origin of replication, but in eukaryotic cells there are thousands. Because DNA can be supercoiled, a topoisomerase is first required to relax this structure so that the DNA is accessible. A DNA helicase binds to the double-stranded DNA and separates the two strands, providing two single-stranded DNA templates. Replication progresses in a 5′ to 3′ direction; therefore one strand, the leading strand, is synthesized as one continuous strand using the 3′ to 5′ template, and the other strand, called the lagging strand, is synthesized in small segments called Okazaki fragments from the 5′ to 3′ template. Because the DNA polymerase requires a primer, small RNA primers are made by a primase enzyme on the 5′ to 3′ template, and the Okazaki fragments are synthesized starting from the primer. Okazaki fragments are finally linked by a ligase (Fig. 1.5).36

DNA polymerases of various types have been identified and they function in many different roles, the most important being the replication of new DNA and the repair of existing DNA. Using the template strand as a guide, the DNA polymerase adds a nucleotide triphosphate to the primer at a free 3′ hydroxyl group, releasing pyrophosphate. The specific nucleotide selected depends on the base on the template strand; for example, an adenine nucleotide is used if a thymine nucleotide is in the template strand. In summary, a complementary sequence is synthesized opposite the template strand. The insertion of the correct nucleotide does not always occur. Mistakes occur approximately every 100,000 nucleotides; therefore a major function of a DNA polymerase is error correction or proofreading, which is accomplished by an intrinsic 3′ to 5′ exonuclease activity.
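As a rough worked example of why proofreading matters, the error interval quoted above (about one mistake per 100,000 nucleotides) can be combined with the haploid genome size of approximately 3 billion bp given earlier in the chapter. The function name is hypothetical, and the calculation is a back-of-the-envelope sketch that ignores proofreading and mismatch repair.

```python
def expected_misincorporations(bases_copied: int,
                               bases_per_error: int = 100_000) -> float:
    """Estimate insertion mistakes before any proofreading or repair.

    bases_per_error is the approximate interval between mistakes quoted
    in the text; it is treated here as a simple average.
    """
    return bases_copied / bases_per_error


# Copying one haploid human genome (approximately 3 billion bp).
print(expected_misincorporations(3_000_000_000))  # 30000.0
```

Under these assumptions, a single round of copying would introduce on the order of tens of thousands of mistakes if none were corrected.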
DNA polymerases are important in molecular diagnostics because they are used in the polymerase chain reaction (PCR) and DNA sequencing.

upstream of the removed base. DNA polymerase and ligase then add a new nucleotide, repairing the damage. One of the inherited disorders associated with this repair process that leads to a predisposition to various neoplasms is caused by mutations in MUTYH, a DNA glycosylase gene.38,40

Nucleotide excision repair removes base modifications that change the helical structure of DNA, including bulky DNA distortions and covalently bound structures that may be created by ultraviolet radiation and certain cancer drugs. The damage is recognized by global and transcription-mediated repair processes. After the repair is initiated, the transcription factor TFIIH binds to a complex of proteins and makes an incision. The damaged DNA is unwound, and the gap is filled by DNA polymerase and finally sealed by DNA ligase. Mutations in the nucleotide excision repair genes cause xeroderma pigmentosum, which leaves affected individuals susceptible to specific tumors.38,41

Mismatch repair recognizes base incorporation errors and base damage. DNA polymerase has a 3′ to 5′ editing exonuclease with a proofreading function that is not completely effective and allows some mismatches to occur that can lead to mutations after DNA replication. The mismatched nucleotides must be repaired on the newly synthesized strand of DNA, which in prokaryotes is recognized by its unmethylated state.
In eukaryotes the mechanism is different, and it is proposed that proteins associated with the replication apparatus, specifically the proliferating cell nuclear antigen protein, determine the appropriate DNA strand for repair.38 These mismatches are corrected by DNA mismatch repair proteins, which identify the mismatches by their methylation patterns, excise the surrounding sequence, and then repair the excision with new sequence. Mutations in the human mismatch repair genes are associated with Lynch syndrome (hereditary nonpolyposis colorectal cancer).

Double-strand breaks are a very destructive form of DNA damage that destabilizes the genome, sometimes resulting in gross chromosomal changes, such as the translocations that are frequently found in cancer. Double-strand breaks are caused by several processes, including ionizing radiation and chemotherapy drugs, and are repaired by either homologous recombination or nonhomologous end joining.38,41 The homologous recombination repair pathway is initiated by recognition of a double-strand break, followed by resection using exonucleases to create a 3′ single-stranded overhang. With the assistance of many proteins, RAD51 is bound to the single-stranded DNA, which invades the intact homologous double-stranded DNA of the sister chromatid and uses it as a template for new double-stranded DNA repair.38

DNA repair mechanisms operate independently to repair simple lesions. However, the repair of more complex lesions involves multiple DNA processing steps regulated by the DNA damage response pathway. When single- and double-stranded DNA breaks occur, a cascade of responses is initiated that culminates in DNA repair, cell cycle arrest, or programmed cell death.
After DNA damage has occurred, the DNA damage response pathway activates the protein kinases ATM (ataxia telangiectasia mutated) and ATR (ataxia telangiectasia and Rad3-related protein) to phosphorylate signaling proteins, such as p53, which eventually leads to cell cycle arrest at the G1/S boundary. This gives time for the DNA repair mechanism to repair the damaged DNA; however, if the damage is too extensive, the cell initiates apoptosis or cell death.39

group is made up of enhancers, silencers, insulators, and locus-specific control regions.45,46 These regulatory elements contain specific sequences that bind to transcription factors that can upregulate or downregulate the expression of a gene. There are only several thousand human transcription factors, far fewer than the number of human genes; therefore each gene has many regulatory elements to provide the complexity needed to function in 200 different human cell types.45

A surprising property of human genes is that there are so few compared to less complex species. Humans have approximately 20,000 genes, many fewer than found in rice and only slightly more than found in the roundworm, Caenorhabditis elegans.47-49 Recently, results from the ENCODE project have challenged the concept of “one gene, one protein.”50 Their studies show that the exon of one gene can be spliced into the exon of another gene.51 This result, along with alternative splicing, demonstrates that one gene can make multiple proteins and is probably the reason humans have such a small number of genes.

Ribonucleic Acid Transcription and Splicing

RNA transcription involves synthesizing an RNA strand using DNA as a template. This requires many different proteins, the most important being the RNA polymerases, of which there are three types in eukaryotic cells. RNA polymerase I is specific for the 28S, 18S, and 5.8S rRNAs, which are initially transcribed as a single primary transcript of 45S.
RNA polymerase II transcribes all genes that encode proteins and the small nuclear RNA (snRNA) genes. RNA polymerase III transcribes a variety of small RNAs, including the 5S rRNA and tRNA. Additional proteins called transcription factors function in combination to recognize and regulate transcription of different genes.52

The synthesis of RNA proceeds in a 5′ to 3′ direction using DNA as a template, and a specific DNA sequence acts as a transcription start site. Transcription progresses through three phases: initiation, elongation, and termination. The initiation phase includes the binding of transcription factors to promoters upstream from the start site; these comprise the core promoter immediately upstream and the ancillary promoters farther away. However, some of the small RNA gene promoters are in the middle of the gene. Transcription factors binding to upstream promoters act as regulators of the transcription of genes. These factors generally bind in pairs or dimers and

[Figure residue: RNA processing diagram, labeled Exons, Introns, Transcription, mRNA processing, 3′, and AAAAA. Surviving caption fragments: "RNA processing. A gene that encodes for a protein contains a promoter ... exons. Transcription commences at the transcription start site. Premessenger ... is processed by capping, polyadenylation, and intron splicing and becomes a ..."]

joining of the RNA strand at different locations. Among the types of alternative splicing are exon skipping, alternative 3′ and 5′ splice sites, and intron retention.
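To make the splicing outcomes named above concrete, here is a minimal sketch that joins exons from a toy pre-mRNA. The sequences, coordinates, and the function name `splice` are invented for illustration and do not correspond to any real gene; introns are written in lowercase only so they are easy to spot in the output.

```python
# Toy pre-mRNA built from labeled segments (hypothetical sequences).
segments = [
    ("exon", "AUGGCC"),
    ("intron", "guaaguag"),
    ("exon", "GAUCUU"),
    ("intron", "guccacag"),
    ("exon", "UGGUAA"),
]
pre_mrna = "".join(seq for _, seq in segments)

# Record the (start, end) coordinates of each exon in the pre-mRNA.
exons = []
position = 0
for kind, seq in segments:
    if kind == "exon":
        exons.append((position, position + len(seq)))
    position += len(seq)


def splice(transcript, exon_coords, skip=()):
    """Join the listed exon regions in order, omitting any skipped indices."""
    return "".join(
        transcript[start:end]
        for index, (start, end) in enumerate(exon_coords)
        if index not in skip
    )


# Constitutive splicing: all introns removed, all exons kept.
print(splice(pre_mrna, exons))            # AUGGCCGAUCUUUGGUAA

# Exon skipping: the middle exon is left out of the mature mRNA.
print(splice(pre_mrna, exons, skip={1}))  # AUGGCCUGGUAA

# Intron retention: the first intron stays by merging exons 1 and 2.
retained = [(exons[0][0], exons[1][1]), exons[2]]
print(splice(pre_mrna, retained))         # AUGGCCguaaguagGAUCUUUGGUAA
```

The same pre-mRNA yields three different mature transcripts, which is the sense in which one gene can give rise to more than one product.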
It is estimated that 92% to 95% of all human genes are alternatively spliced.58,59

The movement of cellular signals from the surface of a cell to the nucleus is called signal transduction, and one of the eventual targets is the modification (eg, phosphorylation) of transcription factors, which can modulate their dimerization and the binding of other transcription factors to DNA, thereby controlling gene expression.60 A common cascade of signaling begins with the activation of a receptor on the cell surface, such as a tyrosine kinase receptor. The tyrosine kinase receptor can be activated by binding to a hormone or growth factor, for example, which causes dimerization and autophosphorylation of the receptor. This in turn activates a cytoplasmic protein, such as the guanine nucleotide exchange factor that activates the G-protein Ras, which can then activate the protein kinase Raf, which propagates the signal to a common signaling pathway, the mitogen-activated protein (MAP) kinases. The final enzyme in the pathway can then act on downstream targets, including other protein kinases and transcription factors. Some mutations in the tyrosine kinase receptor or Ras protein switch them to an unregulated “on” position, which can lead to uncontrolled growth of the cell and eventually to cancer.60

Translation

The final phase of the transfer of information from DNA is to proteins, the structural and functional molecules that make up the majority of a living organism, such as the human body. Proteins are long single strands of various amino acids and are synthesized by a process called translation, which requires the functioning of many protein factors, tRNAs, and ribosomes.

Amino acids have a common structure consisting of a carbon atom bound to amino and carboxylic acid groups and a unique side chain. There are 20 amino acids, each with a different side chain that gives it its unique properties.
The side chains can be divided into four types: nonpolar (hydrophobic), polar (hydrophilic uncharged), negatively charged, and positively charged. Nonpolar (hydrophobic) amino acids include alanine, leucine, isoleucine, valine, proline, methionine, phenylalanine, and tryptophan. The uncharged polar (hydrophilic) amino acids include glycine, serine, threonine, cysteine, tyrosine, glutamine, and asparagine. The negatively charged (acidic) amino acids are aspartic acid and glutamic acid, and the positively charged (basic) amino acids are arginine, histidine, and lysine. A protein’s amino acid makeup and sequence in the polypeptide chain determine the overall structure and function of the protein. Some amino acids have a more significant presence than others. For example, proline, which disrupts secondary structure, and cysteine, which can cross-link to another cysteine through disulfide bonds, can change the structure of a protein.

Protein structures are grouped into four different classes. The primary structure is the sequence of the amino acids in the protein. There are several common types of secondary structure, such as β-pleated sheets and α-helices. Proteins can be constructed with a combination of these different types of secondary structures. Tertiary structure applies to the

Protein synthesis or translation occurs in the cytoplasm and proceeds in three steps: initiation, elongation, and termination. The process requires tRNA and rRNA molecules, as well as ribosomes and initiation, elongation, and termination factors. One of the most important groups of molecules is the tRNAs, which are recognized by aminoacyl tRNA synthetase enzymes that attach amino acids to the 3′ end of specific tRNA molecules. Each tRNA has a 3-base sequence (anticodon) that facilitates the specific recognition and interaction with a codon in the mRNA.
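The four side-chain groups listed earlier in this section can be captured in a small lookup table. This is only an illustrative sketch: the grouping follows the chapter's own lists, while the dictionary layout and the function name `classify` are assumptions made for the example.

```python
# Side-chain groups exactly as listed in the text.
GROUPS = {
    "nonpolar": ["alanine", "leucine", "isoleucine", "valine", "proline",
                 "methionine", "phenylalanine", "tryptophan"],
    "polar uncharged": ["glycine", "serine", "threonine", "cysteine",
                        "tyrosine", "glutamine", "asparagine"],
    "negatively charged": ["aspartic acid", "glutamic acid"],
    "positively charged": ["arginine", "histidine", "lysine"],
}

# Invert the table so each amino acid maps to its side-chain group.
SIDE_CHAIN = {name: group for group, names in GROUPS.items() for name in names}


def classify(peptide):
    """Return the side-chain group of each residue in a list of names."""
    return [SIDE_CHAIN[residue.lower()] for residue in peptide]


print(len(SIDE_CHAIN))  # 20
print(classify(["Methionine", "Lysine", "Aspartic acid"]))
# ['nonpolar', 'positively charged', 'negatively charged']
```

The count of 20 entries is a quick check that the four lists together cover every standard amino acid once.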
The initiation step of protein synthesis is the most complex and begins with the binding of initiation factor 4E to the cap structure on the 5′ end of the mRNA and binding of polyadenosine-binding protein (PABP) to the 3′ polyadenosine tail. The binding of initiation factor 4G to both initiation factor 4E and PABP circularizes the mRNA and prepares it for binding to the preinitiation complex containing the 40S ribosomal subunit, initiation factor 2, and methionine tRNA. The preinitiation complex then scans the mRNA until it finds a methionine start codon (AUG), at which point the 60S ribosomal subunit binds, forming the 80S initiation complex, and initiates translation elongation.62 This is a simplistic description of the initiation process because over a dozen additional initiation and auxiliary factors are involved.

Ribosomes have at least three structural positions where tRNAs can bind: the acceptor (A), peptidyl (P), and exit (E) sites. The acceptor site binds the incoming aminoacyl-tRNA. The peptidyl site holds the peptidyl-tRNA that is covalently linked to the growing polypeptide chain, and the exit site binds to the outgoing empty tRNA that carries no amino acid.62,63 The first codon (AUG) always codes for methionine; therefore to initiate translation the methionine tRNA binds to the peptidyl-tRNA binding site of the ribosome. The

[Figure residue: genetic code table. Only the second-letter A and G columns survived extraction; the stray labels Serine, Proline, Threonine, and Alanine belong to the second-letter C column, whose codons were lost. Within each group the third letter runs U, C, A, G.]

First letter U, second letter A: UAU, UAC Tyrosine; UAA, UAG stop codons
First letter U, second letter G: UGU, UGC Cysteine; UGA stop codon; UGG Tryptophan
First letter C, second letter A: CAU, CAC Histidine; CAA, CAG Glutamine
First letter C, second letter G: CGU, CGC, CGA, CGG Arginine
First letter A, second letter A: AAU, AAC Asparagine; AAA, AAG Lysine
First letter A, second letter G: AGU, AGC Serine; AGA, AGG Arginine
First letter G, second letter A: GAU, GAC Aspartic acid; GAA, GAG Glutamic acid
First letter G, second letter G: GGU, GGC, GGA, GGG Glycine

Surviving caption fragment: "... RNA to amino acids during protein synthesis."
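The scanning and decoding steps described above can be sketched in a few lines. This is a deliberately simplified model under stated assumptions: it starts at the first AUG it finds, reads codons until a stop codon, and ignores all initiation and elongation factors. The compact 64-letter string is the standard genetic code in U, C, A, G order using one-letter amino acid codes, which are not used in the chapter; the function name and the example mRNA are hypothetical.

```python
BASES = "UCAG"
# Standard genetic code, ordered by first, second, then third base (U, C, A, G).
# One-letter amino acid codes; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")  # simplified stand-in for ribosome scanning
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG, or UGA ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))               # 64
print(translate("GGAUGGCCUGGUAAGC"))  # MAW
```

In the example, the reading frame is AUG (methionine), GCC (alanine), UGG (tryptophan), then UAA (stop), so the two bases before the AUG and the bases after the stop codon are not translated.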
[Figure residue: ribosome bound to mRNA during translation, labeled 3′, mRNA, Poly-A, Growing (polypeptide), Activated amino acid, Peptide bonds, Anticodon (mRNA bonding site), Codon, E, P, and A sites, Large ribosome unit, Ribosome, and Direction of protein synthesis. Surviving caption fragments: "... bound to a messenger RNA converting the messenger RNA triplet code ... transfer RNA containing a complementary anticodon sequence. There are three ... transfer RNA first arrives on the ribosome at the A or acceptor site at the ... to the P or peptidyl site where the amino acid on the newly arrived transfer ... chain. Finally the now empty transfer RNA moves to the E, or exit site, where ... from Huether SE, McCance KL. Understanding pathophysiology. 6th ed. St."]

can bind to specific sites on mRNA while associated with the Argonaute protein and either reversibly inhibit translation or degrade the mRNA.62,66 For example, the microRNAs miR-15a/16-1 are deleted in chronic lymphocytic leukemia, thereby increasing Bcl2 expression and inhibiting apoptosis or cell death to prolong the life span of the cell.67

After proteins are synthesized there are two major processes to remove excess or damaged proteins. One process degrades ingested proteins and uses nonspecific proteases, such as pepsin and trypsin, to digest proteins associated with foodstuff in the gut into amino acids so they can be absorbed. The second process digests extracellular and intracellular proteins either by general proteinases within lysosomes or by protein degradation via ubiquitination.
With the latter mechanism, proteins are tagged for degradation by binding to ubiquitin, which is recognized by a large multiprotein structure, the proteasome, that degrades the ubiquitinated proteins by proteolysis.68

EPIGENETICS

Although the original meaning of epigenetics encompassed all molecular pathways that affect the expression of genes, over time the definition has focused on the regulation of gene expression by heritable modifications that do not change the DNA sequence.69 More recently this has been broadened to include nonheritable modifications.70-73 Currently there are three major areas of epigenetic modifications or marks: (1) DNA methylation; (2) chromatin conformation regulation through histone modifications, including ATP-dependent remodeling enzymes and histone variants; and (3) noncoding RNAs.74

Chromatin Conformation Regulation

Many basic cellular functions require proteins to interact with DNA. However, DNA is generally not freely accessible but is wound around histones to form nucleosomes and further condensed or compacted into heterochromatin that decreases gene expression. The cell requires the DNA to be accessible to carry out DNA replication, repair, and transcription.74,79 The chromatin, therefore, is a very dynamic structure; at any one point in time portions of the DNA are being exposed and other portions are being covered. The mechanisms that control chromatin conformation include histone modifications, histone variants, and ATP-dependent remodeling enzymes.

Specific histones are reversibly and posttranslationally modified at their N-terminal tails and globular regions to change the chromatin from a euchromatin state to a heterochromatin state and back (see Fig. 1.9). These modifications include acetylation of lysine residues at the N-terminal tails of H2A, H3, and H4 by histone acetyltransferases (HATs) and deacetylation by histone deacetylases (HDACs).
Histone acetylation removes the positive charge on the lysine residue, leaving the lysine less attracted to the negatively charged DNA phosphate backbone and thereby opening the chromatin.77

Histone methylation of lysine and arginine residues occurs mostly on histone protein H3, but also on histone protein H4; it is carried out by histone methyltransferases (HMTs) and reversed by histone demethylases (HDMs). The effect of methylation on chromatin structure ranges from active to poised to repressed. Lysine residues can be mono-, di-, or trimethylated and arginine residues mono- or dimethylated, but in each case the positive charge is unchanged.39,79 Histone methylation is associated with DNA transcription, replication, and repair.

Histone phosphorylation at serine, threonine, and tyrosine residues is associated with DNA repair and transcription. The addition of a negatively charged phosphate [...]

[Figure: DNA methylation and histone modifications in the regulation of gene expression. Labels: unmethylated CpG-containing promoter; gene expression; histone modifications (Me, Ac, P). Caption, partially lost in extraction: methylation of CpG island regions, indicated by Me, in and around gene promoters is associated with silencing of the gene; when CpG islands are unmethylated, [...]. Bottom, modifications of the tails of histone proteins, shown as Me, Ac, and P, can increase gene expression. ([...] M, van Wijnen AJ, Stein JL, Lian JB, et al. Bookmarking the genome: [...] Biol Chem 2011;286:18355-18361.)]

[...] and 200 nucleotides.92 Only recently has the extent of long noncoding RNAs been appreciated.89 Long noncoding RNAs are predicted to number in the hundreds of thousands in vertebrates, and their expression patterns are highly regulated during the development of an organism.
A well-described example of a long noncoding RNA is XIST, which associates with Polycomb repressive complex 2 and inactivates the X chromosome by inducing heterochromatin formation and repressing gene expression.93 Examples such as XIST and a similarly acting long noncoding RNA, HOTAIR, have given rise to the possibility that the noncoding regions of the human genome have important functions.92 The function of most noncoding RNAs is unknown, but it is speculated that coding and noncoding RNAs, referred to as competing endogenous RNAs (ceRNAs), compete for shared microRNAs through microRNA binding sites, such as those in the untranslated regions of mRNAs, thereby regulating one another's expression. The ceRNA hypothesis proposes a new layer of regulation of gene expression that could help explain the function of the large percentage of the human genome that expresses non-protein-coding RNA.94-96

UNDERSTANDING OUR GENOME

Genomics has been recognized as a distinct field since the genomes of the first free-living organisms were completely sequenced in the 1990s. With the publication of the first draft of the human genome in 2001 and the final results of the Human Genome Project in 2004, the genomics field began to exert greater influence on biomedical research and its application to medicine.31,97 Genomics is characterized by comprehensive data collection and by the technical development needed to obtain, analyze, store, and make available such large amounts of data.
There are also ethical, legal, and social implications of the research and clinical application of genomics.98 Large research projects that were initiated during the latter years of the Human Genome Project produced comprehensive biological catalogs of genetic variants, important DNA functional sequences, and expressed products from not only humans but also many other organisms.98

Single nucleotide variants (SNVs) are the most common DNA differences found in the human population. They number in the millions, with any two individuals differing on average at about 1 in 1000 nucleotides. Human SNVs (including both benign polymorphisms and causative mutations) are cataloged in the SNP database (http://www.ncbi.nlm.nih.gov/SNP). Genome-wide association studies employ microarray tests that use large numbers of SNVs to find associations between genetic variations and diseases. DNA variants are often clustered into regions, shaped by genetic recombination during the formation of sperm and eggs, that are inherited as a unit, so that a unique SNV pattern, or haplotype, can be passed from generation to generation. The International HapMap Project also uses SNVs to investigate haplotype associations and disease.

The 1000 Genomes Project complements the previously mentioned projects by sequencing a large number of diverse human samples from around the world. The goal is to build a comprehensive catalog of the most common human genetic variants, including single nucleotide variants as well as insertions, deletions, and copy number variants, that are found in the population at a frequency greater than 1%. The Exome [...]

All of the previously discussed advances have made the field of molecular diagnostics an important and exciting area that is going to have an even greater impact on medicine in the future.
As an increasing number of diseases are characterized at the molecular (eg, nucleic acid and protein) level, new therapeutics and diagnostics specifically targeting these molecular changes will continue to emerge.

POINTS TO REMEMBER

The two strands of DNA are bound together by hydrogen bonds and stacking forces that can be broken and reformed without permanent damage to the DNA. This important property is exploited by many of the methods that are used in molecular diagnostics.

[...] This is a requirement for most DNA diagnostic assays.

Even though human DNA has approximately 20,000 genes, this is far fewer than would be expected given the number of proteins in a human cell. The higher number of proteins results from alternative splicing, which occurs in more than 95% of human genes.

Only 1.2% to 1.5% of the human genome is translated into protein; however, much more of the genome is transcribed into RNA.

The conversion of DNA information into protein is facilitated by aminoacyl tRNA synthetases and their ability to create amino acid-specific tRNAs.

The genetic code is redundant; the 3-base code can have 64 different combinations (4 × 4 × 4), but only 20 amino acids are encoded.

REFERENCES

1. Griffith F. The significance of pneumococcal types. J Hyg 1928;27:113-59.
2. Avery OT, MacLeod CM, McCarty M. Studies on the chemical nature of the substance inducing transformation of pneumococcus types: induction of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III. J Exp Med 1944;79:137-57.
3. Hershey AD, Chase M. Independent functions of viral protein and nucleic acid in growth of bacteriophage. J Gen Physiol 1952;36:39-56.
4. Chargaff E, Zamenhof S, Green C. Composition of human desoxypentose nucleic acid. Nature 1950;165:756-7.
5. Franklin RE, Gosling RG. Molecular structure of nucleic acids: molecular configuration in sodium thymonucleate, 1953. Ann N Y Acad Sci 1995;758:16-7.
6. Wilkins MH, Stokes AR, Wilson HR. Molecular structure of deoxypentose nucleic acids. Nature 1953;171:738-40.
7. Watson JD, Crick FH. Genetical implications of the structure of deoxyribonucleic acid. Nature 1953;171:964-7.
8. Watson JD, Crick FH. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 1953;171:737-8.
9. Lightman A. The discoveries: great breakthroughs in 20th-century science, including the original papers. New York: Vintage; 2006.
10. Watson JD, Crick FH. The structure of DNA. Cold Spring Harb Symp Quant Biol 1953;18:123-31.
11. Meselson M, Stahl FW. The replication of DNA. Cold Spring Harb Symp Quant Biol 1958;23:9-12.
34. Crick FH. On protein synthesis. Symp Soc Exp Biol 1958;12:138-63.
