Principles of Molecular Biology PDF
Zarqa University
2018
John Greg Howe
Summary
This textbook provides a comprehensive introduction to the principles of molecular biology, tracing its historical development and highlighting key concepts. It explores the structure of DNA and its role in inheritance and gene expression. The book also discusses various applications of molecular biology in clinical diagnostics.
1 Principles of Molecular Biology
John Greg Howe

ABSTRACT

Background
Molecular diagnostics and its parent field, molecular pathology, examine the origins of disease at the molecular level, primarily by studying nucleic acids. Deoxyribonucleic acid (DNA), which contains the blueprint for constructing a living organism, is the centerpiece for research and clinical analysis. Molecular pathology is an outgrowth of the enormous amount of successful research in the field of molecular biology, which over the last seven decades has discovered the basic biological and chemical processes of how a living cell functions. The success of molecular biology, reflected in the large number of Nobel prizes awarded for its discoveries, is now being applied to clinical diagnosis and to the development and use of therapeutics.

Content
The following chapters are devoted to describing this field and the specific applications currently being used to characterize and help treat patients with a variety of ailments, including hereditary genetic diseases, cancer neoplasms, and infectious diseases. In this chapter the fundamentals of molecular biology are reviewed, followed by a focus on genomes and their variants in Chapter 2. In Chapters 3 and 4 techniques for isolating and analyzing nucleic acids are discussed. The clinically important subdivisions of molecular diagnostics are then reviewed and include microbiology in Chapter 5, genetics in Chapter 6, solid tumors in Chapter 7, and hematopoietic malignancies in Chapter 8. Chapters 9 and 10 are devoted to the molecular diagnostic analysis of circulating tumor cells and circulating nucleic acids. Finally, pharmacogenetics and identity assessment are the focus of Chapters 11 and 12.

Principles and Applications of Molecular Diagnostics. https://doi.org/10.1016/B978-0-12-816061-9.00001-1 Copyright © 2018 Elsevier Inc. All rights reserved.

HISTORICAL DEVELOPMENTS IN GENETICS AND MOLECULAR BIOLOGY

Molecular diagnostics would not be possible without the many significant pioneering efforts in genetics and molecular biology. Earlier observations in genetics began with the discovery of the inheritance of biological traits made by Gregor Mendel in 1866 and the observation in 1910 by Thomas Morgan that genes were associated with chromosomes. The initial findings that contributed to determining that DNA was the transmittable genetic material were made by Griffith in 1928 and by Avery, MacLeod, and McCarty in 1944 [1,2]. The definitive studies, published by Hershey and Chase in 1952, demonstrated that radiolabeled phosphate incorporated into the DNA of a bacteriophage, rather than radiolabeled sulfur incorporated into its protein, was found in newly synthesized DNA-containing bacteriophage, which showed that DNA and not protein was the genetic material [3].

Deciphering the structure of DNA required several crucial findings. These included the observation by Erwin Chargaff that the quantity of adenine is generally equal to the quantity of thymine, and the quantity of guanine is similar to the amount of cytosine [4], and the pivotal x-ray crystallography results produced by Rosalind Franklin and Maurice Wilkins [5,6].

Molecular biology has historically traced its beginnings to the first description of the structure of DNA by James Watson and Francis Crick in 1953 [7,8]. The description of the DNA structure initiated the dramatic increase in the knowledge of the biology and chemistry of our genetic machinery. The impact of the Watson and Crick discovery was so significant that it is considered one of the most important scientific discoveries of the 20th century [9].

One reason the work of Watson and Crick had such a dramatic impact on scientific discovery was that they not only described the structure of DNA, but hypothesized about many of its properties, which took decades to confirm experimentally [7,8,10]. One of those properties was the replication of DNA, which was shown to be semiconservative by Meselson and Stahl [11] in 1958. At the same time, DNA polymerase, which replicates the DNA, was discovered by Arthur Kornberg [12]. Deciphering the genetic code was vital for understanding the information stored in DNA, and cracking the code in 1965 required many scientists, most prominently Marshall Nirenberg [13]. Additional studies described the transcription and translation processes and uncovered several startling findings. One finding was the isolation of reverse transcriptase, an enzyme that synthesizes DNA from ribonucleic acid (RNA), which demonstrates that genetic information can be transferred in part in a bidirectional manner [14,15]. Another finding showed that the eukaryotic gene structure was composed of alternating non-protein-encoding introns and protein-encoding exons [16,17].

Along with the discovery of the basic biology of genes and their expression, many important techniques were invented. For example, the isolation of restriction enzymes [18] and DNA ligase allowed for the construction of recombinant DNA [19], which could be transferred from one organism to another, leading to the cloning of DNA [20] and the emergence of genetic engineering. The Southern blot method, which identified specific electrophoretically separated pieces of DNA, participated in many discoveries and was one of the first molecular diagnostics methods to be used to test for genetic diseases [21]. DNA sequencing technologies were invented [22,23], and further advances in these technologies led to the first large biological science research undertaking, the Human Genome Project. Along with DNA sequencing, further technical discoveries, including the polymerase chain reaction in 1986 [24] and microarray technology in 1995 [25], became methodologic foundations for molecular diagnostics.

MOLECULAR BIOLOGY ESSENTIALS

Whether it is a bacterium, virus, or eukaryotic cell, the genetic material located in these organisms dictates their form and function. For the most part the genetic material is DNA, which is composed of two strands of a sugar-phosphate backbone that are bound together by hydrogen bonds between two purines and two pyrimidines attached to the sugar molecule, deoxyribose, in a double helix (Figs. 1.1 and 1.2). DNA in human cells is wrapped around histone proteins and packaged into nucleosome units, which are compacted further to form chromosomes (Fig. 1.3). There are 23 pairs of chromosomes, one pair of which is the sex chromosomes, X and Y. Each chromosome is a single length of DNA with a stretch of short repeats at the ends called telomeres and additional repeats in the centromere region. In humans, there are two sets of 23 chromosomes that are a mixture of DNA from the mother's egg and father's sperm. Each egg and sperm therefore carries a single or haploid set of 23 chromosomes, and the combination of the two creates a diploid set of human DNA, allowing each individual to possess two different sequences, genes, and alleles on each chromosome, one from each parent. Each child has a unique combination of alleles because of homologous recombination between homologous chromosomes during meiosis in the development of gametes (egg and sperm cells). This creates genetic diversity within the human population. If a child has a random DNA sequence change or mutation, the child's genotype is different from that inherited from either of the parents (de novo variant). If the child's genotype leads to visible disease, the child has acquired a different phenotype from the parents.

Human cells have a limited lifespan and die through a process called apoptosis. Therefore most cells replace themselves as they progress naturally through their cell cycle. As a cell moves through phases of the cell cycle, its DNA doubles during the synthesis phase when the double-stranded DNA molecule separates. Each strand of DNA is used as a template to make a complementary strand by DNA polymerase in a process called DNA replication. Eventually during the cell cycle, two cells are created from one during the final mitotic phase.

DNA is composed of genes that code for proteins and RNA. For DNA to convert its store of vital information into functional RNA and protein, the DNA strands need to separate so that RNA polymerase can bind to the start region of the gene. With the help of transcription factors that bind upstream to promoters, the RNA polymerase produces single strands of RNA that are further processed to remove the introns and retain the protein-encoding exons. The mature, processed RNA molecule, the messenger RNA (mRNA), migrates to the cytoplasm, where it is used in the production of protein. To start the process of protein synthesis or translation, the mRNA is bound by various protein factors and a ribosome, which contains ribosomal RNA (rRNA) and protein. The mRNA-bound ribosome begins to produce a polypeptide chain by binding a methionine-bound transfer RNA (tRNA) to the mRNA's initiating AUG codon or triplet code. The conversion of the nucleic acid triplet code to a polypeptide is accomplished by the tRNA, which contains a nucleic acid triplet code (anticodon) in its RNA sequence that is specific for an amino acid bound to one end of the tRNA molecule. After synthesis, the protein migrates to its functional location and eventually is removed and degraded.

FIGURE 1.1 A, Purine and pyrimidine bases and the formation of complementary base pairs. Dashed lines indicate the formation of hydrogen bonds. (*In RNA, thymine is replaced by uracil, which differs from thymine only in its lack of the methyl group.) B, A single-stranded DNA chain. Repeating nucleotide units are linked by phosphodiester bonds that join the 5′ carbon of one sugar to the 3′ carbon of the next. Each nucleotide monomer consists of a sugar moiety, a phosphate residue, and a base. (†In RNA, the sugar is ribose, which adds a 2′-hydroxyl to deoxyribose.)

NUCLEIC ACID STRUCTURE AND FUNCTION

DNA is a rather simple molecule with a limited number of components compared to those of proteins. DNA is composed of a deoxyribose sugar, phosphate group, and four nitrogen-containing bases. Deoxyribose is a pentose sugar containing five carbon atoms that are numbered from 1′ to 5′, starting with the carbon that will be attached to the base in DNA and progressing around the ring until the last carbon that is not part of the ring structure. The bases consist of the purines, adenine and guanine, and the pyrimidines, cytosine and thymine; an additional base, uracil, replaces thymine in RNA. A basic building block is the nucleotide, which consists of a deoxyribose sugar with an attached base at the 1′ carbon and a phosphate group at the 5′ carbon. The triphosphate nucleotide is the building block for making newly synthesized DNA. Newly synthesized DNA forms a polynucleotide chain that connects the individual nucleotides through the 5′ and 3′ carbons of each deoxyribose sugar via phosphodiester bonds.

Structure of Deoxyribonucleic Acid

DNA is double stranded, and the two strands bind to one another through hydrogen bonds between the bases on each strand. Hydrogen bonding is augmented by hydrophobic attraction (stacking) between bases on adjacent rungs of the DNA ladder. Hydrogen bonds and base stacking are both noncovalent, weak interactions that can be broken and reestablished. This important property is exploited by many of the methods that are used in molecular diagnostics. The composition of DNA is equal quantities of guanine and cytosine and equal quantities of adenine and thymine, because, in general, guanine binds to cytosine and adenine binds to thymine [4,7]. There are two hydrogen bonds between adenine (A) and thymine (T) and three hydrogen bonds between cytosine (C) and guanine (G), and because of this difference in the number of hydrogen bonds, separating a guanine-cytosine (G-C) pair takes more energy than separating an adenine-thymine (A-T) pair (see Fig. 1.1).

Each of the two DNA strands is formed by a phosphate sugar backbone that starts at the 5′ phosphate and ends at a 3′ hydroxyl group, with the complementary bases binding to one another between the two phosphate sugar backbones. Each strand is therefore a polar opposite of the other (see Fig. 1.2). When the two strands are bound to one another they progress in opposite 5′ to 3′ directions in an antiparallel configuration. By convention, the DNA sequence is denoted in a 5′ to 3′ direction. As discussed later, both the replication of new DNA and the transcription of DNA to RNA progress in the 5′ to 3′ direction. In addition, the conversion of RNA to protein, a process called translation, proceeds from the 5′ end of the RNA to the 3′ end. The combination of the base pairing and the directionality of the two DNA strands allows for the deciphering of the DNA sequence of one strand of DNA when the other complementary strand sequence is known.

FIGURE 1.2 The DNA double helix, with sugar-phosphate backbone and pairing of the bases in the core-forming planar structures. Figure labels: sugar-phosphate backbone; bases (adenine, thymine, guanine, cytosine); 5′ and 3′ strand ends; one helical turn = 3.4 nm. (From Jorde LB, Carey JC, Bamshad MJ, editors: Medical genetics. 4th ed. Philadelphia: Mosby; 2010.)
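The last point, that one strand can be read off from its complement, can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `reverse_complement` and `hydrogen_bonds` are hypothetical, the input is assumed to contain only the four unambiguous bases A, C, G, and T, and the hydrogen-bond count simply applies the two-per-A-T and three-per-G-C rule described above.

```python
# Illustrative sketch only: names are hypothetical, and the sequence is
# assumed to use only the unambiguous bases A, C, G, and T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def _validated(strand: str) -> str:
    """Uppercase the strand and reject anything other than A, C, G, T."""
    bases = strand.upper()
    unknown = set(bases) - set(COMPLEMENT)
    if unknown:
        raise ValueError(f"unexpected bases: {sorted(unknown)}")
    return bases


def reverse_complement(strand_5_to_3: str) -> str:
    """Return the complementary strand, also written 5' to 3'."""
    bases = _validated(strand_5_to_3)
    # Pair each base with its partner (A-T, G-C), reading the input
    # backwards because the two strands are antiparallel.
    return "".join(COMPLEMENT[b] for b in reversed(bases))


def hydrogen_bonds(strand: str) -> int:
    """Count base-pair hydrogen bonds in the duplex: 2 per A-T, 3 per G-C."""
    return sum(3 if b in "GC" else 2 for b in _validated(strand))


if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # complementary strand, written 5' to 3'
    print(hydrogen_bonds("ATGC"))      # two A-T pairs and two G-C pairs
```

Reversing the sequence before complementing it keeps both strands written in the conventional 5′ to 3′ direction. The hydrogen-bond count gives a rough sense of why G-C-rich duplexes take more energy to separate than A-T-rich ones.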
FIGURE 1.3 Structural organization of human chromosomal DNA. Double-stranded DNA is wound around the octamer core of histone proteins to form nucleosomes, which are further compacted into a helical structure called a solenoid. Nuclear DNA in conjunction with its associated structural proteins is known as chromatin. Chromatin in its most compact state forms chromosomes. The primary constriction of a chromosome is the centromere, and the chromosome's ends are the telomeres. Figure labels: DNA double helix; histone; nucleosomes; solenoid; chromatin; chromatin loop (contains approximately 100,000 bp of DNA); chromatid; centromere; telomeres. (From Jorde LB, Carey JC, Bamshad MJ, editors. Medical genetics. 4th ed. Philadelphia: Mosby; 2010.)

Types of Deoxyribonucleic Acid

Double-stranded DNA in living cells is generally found as the right-handed B-DNA helical structure, which has specific dimensions. Each turn of the helix is 3.4 nm long and consists of 10 base pairs. The DNA sugar-phosphate backbone is on the outside of the helix, and the bases of each strand are inside, bound to their complement on the other strand by hydrogen bonds. Other conformational structures of DNA occur, mostly associated with DNA sequences that are repeated. These non-B DNA forms include a left-handed Z-form, A-motif, tetraplex G-quadruplex, i-motif, hairpin, cruciform, and triplex, and are abundant in the human genome because a large percentage of the genome contains various repeats. Non-B DNA is associated with many biological processes, including transcriptional control. However, these structures also can create genetic instability, which can lead to various diseases such as neurologic disorders [26].

Molecular Composition of Ribonucleic Acid

The composition of RNA is similar to that of DNA because it contains four nucleotides linked together by a phosphodiester bond, but with several important differences. RNA consists of a ribose sugar with a hydroxyl group at the 2′ carbon instead of the hydrogen atom in DNA. The bases attached to the ribose sugar are adenine, cytosine, and guanine, but not thymine, because RNA uses another pyrimidine, uracil, as a substitute for thymine.

Structure of Ribonucleic Acid

One significant difference between DNA and RNA is that RNA does not normally exist as two strands bound to one another, although a single strand can bind internally to itself, creating functionally important secondary structures. Although in the past several decades the complexity and number of different RNAs has greatly expanded, the majority of cellular RNA is composed of a rather small number of RNA types. These include mRNA, rRNA, and tRNA.

Ribonucleic Acids Associated With Protein Production

mRNA is the most diverse group of the three major types of RNAs, but constitutes only a small percentage of the total RNA. mRNAs are transcribed from DNA that codes for proteins and therefore are used as the template for the translation of proteins. In the case of prokaryotes the mRNA is colinear with the protein that is translated; however, in eukaryotes the mRNA begins as a precursor RNA called premessenger or heterogeneous nuclear RNA (hnRNA) that includes untranslated intron and translated exon regions. After transcription the hnRNA is spliced into mature mRNA lacking the introns. The mature mRNA contains only exons and can be further modified by the addition of a 7-methylguanosine cap at the 5′ end, which protects the mRNA from degradation, and a polyadenosine (polyA) sequence at the 3′ end. In eukaryotes the production and processing of the hnRNA to mRNA takes place in the nucleus, and the final form of the mRNA is then transported to the cytoplasm to be translated.

rRNA is associated with ribosomes, which are the primary structures that produce protein through the biological process of translation. rRNA, unlike mRNA, does not code for proteins. The ribosome is composed of two structures, the 50S and 30S subunits found in prokaryotes and the 60S and 40S subunits found in eukaryotes. The "S" stands for Svedberg units and is determined by the centrifugal sedimentation rate. The Svedberg unit reflects the mass, density, and shape of an object. The ribosome is a mixture of RNA and protein. In eukaryotes there are four major rRNAs: the 18S rRNA found in the 40S subunit and the 28S, 5.8S, and 5S rRNAs found in the 60S subunit. In prokaryotes, the 50S subunit contains the 23S and 5S rRNAs and the 30S subunit contains the 16S rRNA. Synthesis of eukaryotic rRNA occurs as a large 45S precursor RNA that is enzymatically cleaved to form all the rRNAs except the 5S RNA, which is transcribed separately. Ribosomal RNAs have secondary and tertiary structures that are well conserved, with various loops, stem loops, and pseudoknots that contribute to their function. Ribosomal RNA and protein, as the components of ribosomes, function to carry out the translation of proteins. The sequence of the 16S rRNA has alternating conserved and divergent regions that can be used to identify microorganisms. The structure of the ribosome is now known, and the rRNA is more important than ribosomal proteins in ribosome functioning. The RNA acts as a catalytic agent called a ribozyme [27,28].

Another important group of RNAs are the tRNAs, which function as key molecules that act as a bridge between the nucleic acids and the proteins. They have a unique cloverleaf secondary structure, with the 3′ end covalently attached to the amino acid by specific aminoacyl tRNA synthetases. In the middle of the tRNA structure is the anticodon sequence that binds to a specific complementary codon in the mRNA. Therefore the codon directs the binding of a specific tRNA linked to its corresponding amino acid. The genetic code, which consists of 64 three-base codons, specifies the appropriate amino acid to be attached to the growing polypeptide chain (see Figs. 1.7 and 1.8, later in the chapter). There are several different classes of aminoacyl tRNA synthetases, but there is at least one aminoacyl tRNA synthetase for each of the 20 amino acids. There is also at least one tRNA for each amino acid; however, there can be more depending on the species [29].

Besides the three major types of RNAs, other RNAs include nuclear, nucleolar, and cytoplasmic small RNAs, signaling RNAs, telomerase RNA, and microRNAs [30]. This list appears to be growing with each passing year. Some of the first characterized small RNAs, the nuclear and nucleolar small RNAs, are involved with the processing of precursor RNAs to mature RNAs, including splicing of hnRNA to mRNA and precursor rRNA to mature rRNAs. More recently a large number of microRNAs have been discovered that partly function in the regulation of translation. In addition, there are many other noncoding RNAs whose functions are just beginning to be understood.

Human Chromosome

Human double-stranded DNA that is contained in the sperm or egg is a single copy or haploid amount of DNA made up of approximately 3 billion base pairs (bp). To be more precise, the Human Genome Project consensus sequence of the human genome was 2.91 × 10^9 bp [31], and the first human to be sequenced, Craig Venter, had a genome size of 2.81 × 10^9 bp [32], not including remaining gaps of highly repetitive sequences, many near centromeres and telomeres (see Chapter 2). The DNA in the cell is bound by many proteins to form chromatin (see Fig. 1.3). The proteins in chromatin consist of histones, which are bound in precise amounts per length of DNA, and other proteins called nonhistone proteins that are bound more irregularly and in widely varying amounts. The histone proteins consist of eight proteins (two copies each of H2A, H2B, H3, and H4) that bind as a unit to 147 bp of DNA to make up a nucleosome, and the protein H1, which binds between the nucleosomes (Fig. 1.4). The nucleosomes are the basic structure with which many other proteins interact, and which they modify, to regulate gene expression. For example, the access to DNA by transcription factors is controlled by proteins that remodel the histone proteins through phosphorylation, acetylation, and methylation. The nucleosomes are condensed into filaments and even more compact structures to form a chromosome (see Fig. 1.3). There are 23 pairs of chromosomes: 22 pairs of autosomal chromosomes and one pair of sex chromosomes, X and Y, with an XX pair denoting female and an XY pair denoting male. The DNA in chromosomes is continuous for each chromosome and can be as much as several hundred million base pairs in length for the largest chromosomes.

FIGURE 1.4 Schematic illustration of a nucleosome unit. A segment of DNA is wound around a nucleosome core particle consisting of an octamer of two each of the histone proteins H2A, H2B, H3, and H4. Tails with modifications (indicated by a red star) are shown to protrude from H3 and H4. Adjacent nucleosomes are separated by a segment of linker DNA and the linker histone, H1.

From a cytogenetic viewpoint, regions of the chromosomes can be classified by their transcriptional activity. The more condensed heterochromatin DNA is transcriptionally inactive and stains with Giemsa, a mixture of several dyes that bind to AT-rich regions of DNA. The less condensed euchromatin DNA is transcriptionally active and does not stain with Giemsa. The ends of the chromosomes, called telomeres, contain a repeat sequence, such as the TTAGGG repeat found in humans, that shortens with age. The centromeres, at the center of most chromosomes, are important for linking sister chromatids during mitosis and contain various satellite DNAs, such as α-satellite tandem repeats (171 bp) that are over several million base pairs (Mb) in length.

Surprisingly, most of the human DNA does not code for the expression of protein. As much as 50% of human DNA consists of many types of interspersed repeat sequences, such as satellites, telomeres, microsatellites, minisatellites, short and long interspersed nuclear elements (SINEs, LINEs), and retrovirus elements [31]. Like other eukaryotes, human genes are in pieces, with the protein-encoding regions, exons, alternating with the introns, which do not code for protein sequence and occupy more than a quarter of the human DNA [33]. Other regions around the genes, such as the promoter regions and the 3′ untranslated regions, are also not translated into proteins. After all the noncoding sequences are removed, the protein-coding DNA sequence spans only approximately 1.2 to 1.5% of human DNA. Even though most human DNA is not associated with protein-producing genes, the Encyclopedia of DNA Elements (ENCODE) project has shown that much of the non-protein-encoding DNA is transcribed into noncoding RNAs, most with unknown function.

CENTRAL DOGMA OF MOLECULAR BIOLOGY

Francis Crick originated the concept of the central dogma of biology, which describes the transfer of genetic information into functional macromolecules [34]. This was generally depicted to show the movement of genetic information from DNA to RNA via transcription using RNA polymerase and further translated into protein via ribosomes and various factors. This is a simplistic version of the original concept, which took into consideration every possible transfer of information even though no evidence existed at the time. However, since the original publication a number of other postulated transfers have been described. DNA can be enzymatically replicated by DNA polymerase, and RNA can be made into DNA using reverse transcriptase [35]. Many of these enzymes are used in molecular diagnostics assays.

Deoxyribonucleic Acid Replication

A general principle underlying the synthesis or replication of new DNA is that it uses one of the two DNA strands as a template to make a new complementary strand. This is termed semiconservative replication and was first theorized by Watson and Crick [7]. DNA replication begins at an adenine and thymine (AT)-rich structure called an origin of replication. In bacteria there is generally only one origin of replication, but in eukaryotic cells there are thousands. Because DNA can be supercoiled, a topoisomerase is required to first unwind this structure so that the DNA is accessible. A DNA helicase binds to the double-stranded DNA and separates the two strands, providing two single-stranded DNA templates. Replication progresses in a 5′ to 3′ direction; therefore one strand, the leading strand, is synthesized as one continuous strand using the 3′ to 5′ template, and the other strand, called the lagging strand, is synthesized in small segments called Okazaki fragments from the 5′ to 3′ template. Because the DNA polymerase requires a primer, small RNA primers are made by a primase enzyme on the 5′ to 3′ template and the Okazaki fragments are synthesized starting from the primer. Okazaki fragments are finally linked by a ligase (Fig. 1.5) [36].

DNA polymerases of various types have been identified, and they function in many different roles, the most important being the replication of new DNA and the repair of existing DNA. Using the template strand as a guide, the DNA polymerase adds a nucleotide triphosphate to the primer at a free 3′ hydroxyl group, releasing pyrophosphate. The specific nucleotide selected depends on the base on the template strand; for example, an adenine nucleotide is used if a thymine nucleotide is in the template strand. In summary, a complementary sequence is synthesized opposite the template strand. The insertion of the correct nucleotide does not always occur. Mistakes occur approximately every 100,000 nucleotides; therefore a major function of a DNA polymerase is error correction or proofreading, which is accomplished by an intrinsic 3′ to 5′ exonuclease activity. DNA polymerases are important in molecular diagnostics because they are used in the polymerase chain reaction (PCR) and DNA sequencing.

FIGURE 1.5 DNA replication. Double-stranded DNA is separated at the replication fork. The leading strand is synthesized continuously, whereas the lagging strand is synthesized discontinuously but is joined later by DNA ligase.

DNA replication is part of the cell cycle and occurs during the synthesis phase. The rest of the cell cycle is the interphase, further divided into the first growth phase (G1) and the second growth phase (G2), along with the DNA replication or synthesis (S) phase that lies between G1 and G2. The mitosis phase, which involves the splitting of one cell into two cells, occurs after the G2 phase. Mitosis is divided into six subphases: prophase, prometaphase, metaphase, anaphase, telophase, and cytokinesis.

At important control points in the cell cycle the cell will commit significant resources to proceed further. One of these control points is between the G1 and S phase, just before the cell begins DNA replication. The G1/S boundary control point is disrupted in many cancers. It is common for neoplasms to have mutations in the retinoblastoma gene (RB1), whose protein product regulates cell cycle progression from G1 to S. Another control point is between G2 and M, just as the cell commits to creating two cells from one.

Deoxyribonucleic Acid Repair

The integrity of DNA is damaged in a variety of ways that culminate in changes or mutations in the DNA sequence. DNA bases may be damaged, removed, cross-linked, or incorrectly paired with one another, and single- or double-stranded breaks may also occur [37,38]. When the cell senses that its DNA has become damaged, it stops the progression of its cell cycle and initiates DNA repair processes [39]. Cells repair these lesions by employing multiple DNA repair mechanisms that are specific for the type of DNA lesion and include base excision repair, nucleotide excision repair, mismatch repair, and homologous recombination repair.

Mechanisms

Base excision repair removes bases that are damaged by deamination, oxidation, and alkylation. Deamination of guanine, cytosine, and adenine converts them into structures that will incorrectly base pair, creating transition mutations, which are changes between similar nitrogenous bases such as a purine to a purine. A transversion mutation is a change from a purine to a pyrimidine or vice versa. DNA glycosylases, such as uracil-DNA-glycosylase, cleave the damaged base, and a 5′-deoxyribose phosphate lyase removes the nucleotide upstream of the removed base. DNA polymerase and ligase then add a new nucleotide, repairing the damage. One of the inherited disorders associated with this repair process that leads to a predisposition to various neoplasms is caused by mutations in MUTYH, a DNA glycosylase gene [38,40].

Nucleotide excision repair removes base modifications that change the helical structure of DNA, including bulky DNA distortions and covalently bound structures that may be created by ultraviolet radiation and certain cancer drugs. The damage is recognized by global and transcription-mediated repair processes. After the repair is initiated, the transcription factor TFIIH binds to a complex of proteins and makes an incision. The damaged DNA is unwound, and the gap is filled by DNA polymerase and finally sealed by DNA ligase. Mutations in the nucleotide excision repair genes cause xeroderma pigmentosum, which leaves affected individuals susceptible to specific tumors [38,41].

Mismatch repair recognizes base incorporation errors and base damage. DNA polymerase has a 3′ to 5′ editing exonuclease with a proofreading function that is not completely effective and allows some mismatches to occur that can lead to mutations after DNA replication. The mismatched nucleotides must be repaired on the newly synthesized strand of DNA, which in prokaryotes is recognized by its unmethylated state. In eukaryotes the mechanism is different, and it is proposed that proteins associated with the replication apparatus, specifically the proliferating cell nuclear antigen protein, determine the appropriate DNA strand for repair [38]. These mutations are corrected with DNA mismatch repair proteins, which identify the mismatches by their methylation patterns, excise the surrounding sequence, and then repair the excision with new sequence. Mutations in the human mismatch repair genes are associated with Lynch syndrome (hereditary nonpolyposis colorectal cancer).

Double-strand breaks are a very destructive form of DNA damage that destabilizes the genome, sometimes resulting in gross chromosomal changes, such as translocations that are frequently found in cancer. Double-stranded breaks are caused by several processes, including ionizing radiation and chemotherapy drugs, and are repaired by either homologous recombination or nonhomologous end joining [38,41]. The homologous recombination repair pathway is initiated by recognition of a double-stranded break, followed by resection using exonucleases to create a 3′ single-stranded overhang. With the assistance of many proteins, RAD51 is bound to the single-stranded DNA, which invades the intact homologous double-stranded DNA of the sister chromatid and uses it as a template for new double-stranded DNA repair [38].

DNA repair mechanisms operate independently to repair simple lesions. However, the repair of more complex lesions involves multiple DNA processing steps regulated by the DNA damage response pathway. When single- and double-stranded DNA breaks occur, a cascade of responses is initiated that culminates in either DNA repair, stopping the cell cycle, or programmed cell death. After DNA damage has occurred, the DNA damage response pathway activates the protein kinases ATM (ataxia telangiectasia mutated) and ATR (ataxia telangiectasia and Rad3-related protein) to phosphorylate signaling proteins, such as p53, which eventually leads to cell cycle arrest at the G1/S boundary. This gives time for the DNA repair mechanism to repair the damaged DNA; however, if the damage is too extensive, the cell initiates apoptosis or cell death [39].

Deoxyribonucleic Acid Modification Enzymes

There are two groups of nucleases, the endonucleases that cut through the sugar-phosphate backbone and exonucleases that digest the ends of DNA. The commercially important restriction endonucleases, which bacteria have acquired to protect themselves from viral infections, are used to cleave DNA at a specific nucleotide sequence or restriction site [42]. Several thousand restriction endonucleases have been characterized and are used extensively to manipulate DNA in molecular biology and molecular diagnostics. Recent work has described new nucleases, such as the RNA-guided engineered nuclease, the CRISPR/Cas system, that can precisely cleave genomic DNA [43].

DNA glycosylases are a family of enzymes associated with base excision repair that are used in the first step of DNA repair to remove the damaged base without disrupting the sugar-phosphate backbone. An important member of that family, uracil DNA glycosylase, repairs the most common mutation found in humans, the spontaneous deamination of cytosine to uracil, by removing the uracil base.

Gene Structure

The structure of prokaryotic genes is straightforward; almost all of the gene sequence is used to make protein; however, this

group is made up of enhancers, silencers, insulators, and locus-specific control regions [45,46]. These regulatory elements contain specific sequences that bind to transcription factors that can upregulate or downregulate the expression of a gene. There are only several thousand human transcription factors, far fewer than the number of human genes; therefore each gene has many regulatory elements to provide the needed complexity to function in 200 different human cell types [45].

A surprising property of human genes is that there are so few compared to less complex species. Humans have approximately 20,000 genes, many fewer than found in rice and only slightly more than found in the roundworm, Caenorhabditis elegans [47-49]. Recently, results from the ENCODE project have challenged the concept of "one gene, one protein" [50]. Their studies show that the exon of one gene can be spliced into the exon of another gene [51]. This result, along with alternative splicing, demonstrates that one gene can make multiple proteins and is probably the reason humans have such a small number of genes.

Ribonucleic Acid Transcription and Splicing

RNA transcription involves synthesizing an RNA strand using DNA as a template.
This requires many different proteins, the is not the case with eukaryotic genes. One of the unique hall- most important being the RNA polymerases, of which there marks of eukaryotic genes is that the protein-coding DNA is are three types in eukaryotic cells. RNA polymerase I is interspersed with regions that do not code for DNA, an specific for the rRNAs, 28S, 18S, and 5.8S, which are initially observation made by Richard Roberts and Phillip Sharp in transcribed as a single primary transcript of 45S. RNA poly- 1977. A mature mRNA retains only the protein-coding merase II transcribes all genes that encode proteins and the sequences called exons, and the sequences between the exons small nuclear RNA (snRNA) genes. RNA polymerase III tran- are noneprotein-encoding sequences called introns that are scribes a variety of small RNAs, including the 5S rRNA, and removed during mRNA maturation (Fig. 1.6).44 tRNA. Additional proteins called transcription factors In addition to introns and exons, eukaryotic genes consist function in combination to recognize and regulate transcrip- of regulatory regions, such as promoters and enhancers, and tion of different genes.52 30 regions that contain termination and polyadenylation sig- The synthesis of RNA proceeds in a 50 to 30 direction using nals. The regulation of the expression of eukaryotic genes DNA as a template and a specific DNA sequence acts as a tran- can occur at all levels from transcription to splicing to transla- scription start site. Transcription progresses through three tion to degradation; however, most gene regulation occurs at phases: initiation, elongation, and termination. 
The initiation the initiation of transcription by various promoters and phase includes the binding of transcription factors to pro- enhancers.45 There are two groups of regulatory elements: moters upstream from the start site and includes the core pro- one is close to the transcriptional start site and is made up moter immediately upstream and the ancillary promoters of the core promoter and ancillary promoters slightly further further away. However, some of the small RNA gene promoters away from the start of transcription. The other group of are in the middle of the gene. Transcription factors binding to regulatory elements can be much further away, not only upstream promoters act as regulators of the transcription of upstream but also downstream from the gene. This second genes. These factors generally bind in pairs or dimers and DNA Transcription start Exons 5⬘ 3⬘ Promoter Transcription Introns Pre-mRNA Mature mRNA mRNA processing Cap AAAAA FIGURE 1.6 DNA transcription and messenger RNA processing. A gene that encodes for a protein contains a promoter region and variable numbers of introns and exons. Transcription commences at the transcription start site. Premessenger RNA or heterogeneous nuclear RNA (hnRNA) is processed by capping, polyadenylation, and intron splicing and becomes a mature messenger RNA. CHAPTER 1 Principles of Molecular Biology 9 have several functional domains. One functional domain of joining of the RNA strand at different locations. Among the the transcription factor binds to a specific promoter DNA types of alternative splicing are exon skipping, alternative 30 sequence via several structures, such as the helix-turn-helix, and 50 splice sites, and intron retention. It is estimated that zinc finger, and leucine zipper structures. 
Another domain 92% to 95% of all human genes are alternatively spliced.58,59 binds to the other transcription factor of the dimer pair, and The movement of cellular signals from the surface of a cell a third domain may bind to the RNA polymerase complex to the nucleus is called signal transduction, and one of the that carries out transcription.46 Even though promoters and eventual targets is the modification (eg, phosphorylation) of the transcription factors binding to them are far away from transcription factors, which can modulate the binding of the transcription initiation complex, the promoter DNA folds other transcription factors to DNA and their dimerization, back on itself to allow for the transcription factors to interact thereby controlling gene expression.60 A common cascade with the RNA polymerase complex.53 of signaling begins with the activation of a receptor on the Important recurring sequences are found in the core pro- cell surface, such as a tyrosine kinase receptor. The tyrosine moter. For example, the core promoter of an RNA polymerase kinase receptor in the form of a dimer can be activated by II gene contains a TATAAA sequence, called a TATA box located binding to a hormone or growth factor, for example, which upstream 25 to 40 nucleotides from the transcriptional start causes a dimerization and autophosphorylation of the tyro- site. Only 20% to 30% of eukaryotic promoters contain sine receptor protein kinase. 
This in turn activates a cyto- TATA boxes, but they are highly regulated compared to those plasmic protein, such as the guanine nucleotide exchange without TATA boxes that are mostly housekeeping genes.45,54,55 factor that activates the G-protein Ras, which can then The first step in mRNA transcription is the binding of modify another G-protein, Raf, which propagates the signal transcription factor IID (TFIID) to the TATA box, which in to a common signaling pathway, the mitogen-activated turn promotes the binding of other transcription factors protein (MAP) kinases. The final enzyme in the pathway (TFIIA, TFIIB, TFIIE, TFIIF, and TFIIH), RNA polymerase can then act on downstream targets, including other protein II, and proteins attached to the upstream promoter sites. To kinases, and transcriptional factors. Some mutations in the form a functional transcription complex, the promoter tyrosine kinase receptor or Ras protein switches them to an region’s doubled-stranded DNA separates and the transcrip- unregulated “on” position, which can lead to uncontrolled tion complex moves away from the core promoter region.45 growth of the cell and eventually to cancer.60 Once started, the RNA polymerase adds nucleotides to the 30 free hydroxyl group in a manner similar to that of DNA Translation replication. Transcription is eventually terminated by one of The final phase of the transfer of information from DNA is to several termination mechanisms. In bacteria a termination proteins, the structural and functional molecules that make factor bound to the RNA polymerase recognizes a DNA up the majority of a living organism, such as the human sequence termination signal. In the case of genes transcribed body. Proteins are long single strands of various amino acids by RNA polymerase II, termination is coupled with the and are synthesized by a process called translation, which polyadenylation step (see Fig. 1.6). 
requires the functioning of many protein factors, tRNAs, Two posttranscriptional processing events are performed and ribosomes. on the newly formed hnRNA, one at each end of the RNA. Amino acids have a common structure consisting of a car- At the 50 end, the hnRNA is capped with a 7-methyl guano- bon atom bound to amino and carboxylic acid groups and a sine molecule to help protect the hnRNA from degradation. unique side chain. There are 20 amino acids each with a At the 30 end, a polyadenosine (poly A) stretch is added by different side chain that give them their unique properties. poly A polymerase after the RNA sequence AAUAAA is syn- The side chains can be divided into four types: nonpolar thesized. Some transcribed mRNAs are not polyadenylated, (hydrophobic), polar (hydrophilic uncharged), and negative such as histone mRNAs.56 and positively charged. Nonpolar (hydrophobic) amino acids Transcription initially produces an hnRNA that contains include alanine, leucine, isoleucine, valine, proline, methio- both exons and introns, which needs to be processed or nine, phenylalanine, and tryptophan. The uncharged polar spliced into mature mRNA for it to be properly translated (hydrophilic) amino acids include glycine, serine, threonine, into protein. RNA splicing involves cleavage and removal of cysteine, tyrosine, glutamine, and asparagine. The negatively intron RNA segments and splicing of exon RNA segments. charged (acidic) amino acids are aspartic acid and glutamic The process uses consensus splice site sequences located at acid, and the positively charged (basic) amino acids are argi- both the 50 (GU) and 30 (AG) ends of the intron and an inter- nine, histidine, and lysine. A protein’s amino acid makeup nal intron sequence. Splicing requires the effort of a number and sequence in the polypeptide chain determine the overall of proteins and small RNAs that come together to form a spli- structure and function of the protein. 
Some amino acids have ceosome, which directs the splicing of exons and removal of a more significant presence than others. For example, proline, introns.57 Splicing begins with the binding of the U1 small which disrupts secondary structure, and cysteine, which can nuclear ribonucleic protein (snRNP) to the donor splice site cross-link to another cysteine through disulfide bonds, can and the U2 snRNP to the internal intron sequence, followed change the structure of a protein. by the binding of U4, U5, and U6 snRNPs, resulting in Protein structures are grouped into four different classes. excising the intron and joining (splicing) of the ends of the The primary structure is the sequence of the amino acids in two exons on either side of the excised intron (see Fig. 1.6).57 the protein. There are several common types of secondary An important modification of the splicing process, alter- structure, such as b-pleated sheets and a helixes. Proteins native splicing, allows for the generation of different mRNAs can be constructed with a combination of these different types from the same primary RNA transcript by the cutting and of secondary structures. Tertiary structure applies to the 10 Principles and Applications of Molecular Diagnostics folding of the polypeptide chain into a three-dimensional Protein synthesis or translation occurs in the cytoplasm form. Quaternary structure is the structural relationship of and proceeds in three steps: initiation, elongation, and termi- more than one polypeptide/protein joining together, such as nation. The process requires tRNA and rRNA molecules, as in immunoglobulin molecules, that contains light and heavy well as ribosomes and initiation, elongation, and termination proteins bound together by cysteine residues. factors. One of the most important groups of molecules are Once proteins are synthesized, they can be modified in the tRNAs, which are recognized by aminoacyl tRNA synthe- various ways. 
One of the most common modifications is tase enzymes that attach amino acids to the 30 end of specific phosphorylation of the amino acids serine, threonine, and tRNA molecules. Each tRNA has a 3-base sequence (anti- tyrosine, which can regulate protein activity. Other modifica- codon) that facilitates the specific recognition and interaction tions include proteolytic cleavage, such as removal of the with a codon in the mRNA. signal transport sequence, and acetylation of the N- The initiation step of protein synthesis is the most complex terminus of most eukaryotic proteins that helps to prevent and begins with the binding of initiation factor 4E to the cap degradation. Glycosylation of secreted and membrane structure on the 50 end of the mRNA and binding of poly- proteins on asparagine, serine, and threonine residues and adenosineebinding protein (PABP) to the 30 PABP polyade- formation of disulfide bonds via cysteine cross-linking are nosine tail. The binding of initiation factor 4G to both additional modifications. initiation factor 4E and PABP circularizes the mRNA and pre- Taking into consideration these posttranslational modifi- pares it for binding to the preinitiation complex containing cations and alternatively spliced forms mentioned in an the 40S ribosomal subunit, initiation factor 2, and methionine earlier section, the total number of proteins in the more tRNA. 
The preinitiation complex then scans the mRNA until than 200 human cell types is estimated to range from it finds a methionine start codon (AUG), at which point the 250,000 to several million.61 60S ribosomal subunit binds forming the 80S initiation com- The genetic code, which was deciphered in the early 1960s, plex and initiates translation elongation.62 This is a simplistic is required to convert a nucleic acid sequence into an amino description of the initiation process because over a dozen acid sequence.13 It was reasoned that if there are 20 amino additional initiation and auxiliary factors are involved. acids, a code of at least 3 nucleotides was necessary to have Ribosomes have at least three structural positions where enough combinations. A 3-nucleotide code gives 64 combi- tRNAs can bind, the acceptor (A), peptidyl (P), and exit nations, and therefore one hallmark of the genetic code is (E) sites. The acceptor site binds the incoming aminoacyl- that it is redundant, meaning that there are several codes tRNA. The peptidyl site holds the peptidyl-tRNA that is cova- for one amino acid. That is the case for most amino acids, lently linked to the growing polypeptide chain, and the exit but not all; for example, methionine and tryptophan have site binds to the outgoing empty tRNA that carries no amino only one code. The redundancy is usually in the third base acid.62,63 of the code. All of the 64 3-nucleotide codon possibilities The first codon (AUG) always codes for methionine; code for an amino acid, except 3 that serve as stop codons therefore to initiate translation the methionine tRNA binds (UAA, UGA, and UAG) (Fig. 1.7). to the aminoacyl-tRNA binding site of the ribosome. 
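The scanning and decoding steps described in this section can be illustrated with a minimal sketch. It assumes the standard genetic code and uses one-letter amino acid abbreviations; the function name `translate` and the example sequence are hypothetical, and finding the first AUG with a string search is only a crude stand-in for ribosomal scanning, not a model of the initiation factors involved.

```python
from itertools import product

# Standard genetic code, listed in the U, C, A, G order used by codon charts.
# "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Hypothetical helper: decode from the first AUG to the first stop codon."""
    start = mrna.find("AUG")  # simplified stand-in for scanning to the start codon
    if start == -1:
        return ""  # no start codon, so nothing is translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":  # UAA, UAG, or UGA terminates translation
            break
        peptide.append(residue)
    return "".join(peptide)


# AUG (Met), AAA (Lys), UGG (Trp), then the stop codon UAA
print(translate("GGAUGAAAUGGUAAGG"))  # prints MKW
```

The leading GG and trailing GG in the example play the role of untranslated sequence: they are skipped because decoding starts at AUG and ends at the first in-frame stop codon.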
FIGURE 1.7 Genetic code. Translation of messenger RNA to amino acids during protein synthesis.

First   Second letter                                                                   Third
letter  U                   C                   A                   G                   letter
U       UUU Phenylalanine   UCU Serine          UAU Tyrosine        UGU Cysteine        U
        UUC Phenylalanine   UCC Serine          UAC Tyrosine        UGC Cysteine        C
        UUA Leucine         UCA Serine          UAA Stop codon      UGA Stop codon      A
        UUG Leucine         UCG Serine          UAG Stop codon      UGG Tryptophan      G
C       CUU Leucine         CCU Proline         CAU Histidine       CGU Arginine        U
        CUC Leucine         CCC Proline         CAC Histidine       CGC Arginine        C
        CUA Leucine         CCA Proline         CAA Glutamine       CGA Arginine        A
        CUG Leucine         CCG Proline         CAG Glutamine       CGG Arginine        G
A       AUU Isoleucine      ACU Threonine       AAU Asparagine      AGU Serine          U
        AUC Isoleucine      ACC Threonine       AAC Asparagine      AGC Serine          C
        AUA Isoleucine      ACA Threonine       AAA Lysine          AGA Arginine        A
        AUG Methionine      ACG Threonine       AAG Lysine          AGG Arginine        G
G       GUU Valine          GCU Alanine         GAU Aspartic acid   GGU Glycine         U
        GUC Valine          GCC Alanine         GAC Aspartic acid   GGC Glycine         C
        GUA Valine          GCA Alanine         GAA Glutamic acid   GGA Glycine         A
        GUG Valine          GCG Alanine         GAG Glutamic acid   GGG Glycine         G

FIGURE 1.8 Translation. Shown is a ribosome bound to a messenger RNA converting the messenger RNA triplet code (codon) via a specific amino acid-bound transfer RNA containing a complementary anticodon sequence. There are three transfer RNA positions. A new amino acid-bound transfer RNA first arrives on the ribosome at the A or acceptor site at the front of the moving ribosome and then moves to the P or peptidyl site where the amino acid on the newly arrived transfer RNA combines with the growing polypeptide chain. Finally the now empty transfer RNA moves to the E, or exit site, where it prepares to leave the ribosome. (Modified from Huether SE, McCance KL. Understanding pathophysiology. 6th ed. St. Louis, Elsevier; 2017.)
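The redundancy of the genetic code summarized in Figure 1.7 can be checked with a short sketch. It assumes the standard genetic code with one-letter amino acid abbreviations; the variable names are illustrative only. It counts how many combinations a code of 1, 2, or 3 nucleotides allows, then tallies the codons assigned to each amino acid.

```python
from collections import Counter
from itertools import product

# Standard genetic code in U, C, A, G chart order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of distinct combinations for codes of 1, 2, and 3 nucleotides.
# Only the 3-nucleotide code reaches the 20 combinations that are needed.
combinations = {n: len(BASES) ** n for n in (1, 2, 3)}

codons_per_residue = Counter(CODON_TABLE.values())
stop_codons = sorted(c for c, r in CODON_TABLE.items() if r == "*")
single_codon = sorted(r for r, n in codons_per_residue.items() if n == 1)
amino_acid_count = len(codons_per_residue) - 1  # exclude the stop marker

print(combinations)      # {1: 4, 2: 16, 3: 64}
print(amino_acid_count)  # 20
print(stop_codons)       # ['UAA', 'UAG', 'UGA']
print(single_codon)      # ['M', 'W'] (methionine and tryptophan)
```

With 61 sense codons shared among 20 amino acids, most amino acids necessarily have more than one codon, which is the redundancy the text describes.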
The tRNA specific for the next 3-base codon (for example, lysine) binds to the acceptor site of the ribosome and, with the help of elongation factors (eg, eEF2), the amino acid in the peptidyl site is bound to the amino acid in the acceptor site by the formation of a peptide bond. A peptide bond is created between the amino group of one amino acid and the carboxyl group of the next amino acid through condensation, releasing water. At the same time the tRNAs shift positions, with the methionine tRNA shifting to the exit site and the tRNA containing the growing chain of amino acids shifting to the peptidyl site. At the same time, the ribosome moves forward one codon and the next tRNA specific for the next codon through its anticodon binds in the acceptor site, and the process is repeated until a termination codon is reached (Fig. 1.8). Termination factors then bind and stop the translation process.62 Protein synthesis occurs in the eukaryotic cytoplasm and on the endoplasmic reticulum, where multiple ribosomes, called polyribosomes, are involved in translating an individual mRNA.

Regulation of translation is not as extensive as that for transcription. However, there is global regulation of eukaryotic translation at the initiation step with phosphorylation of initiation factor 2 by four different protein kinases. This occurs when the cells are under stress, such as amino acid starvation or DNA damage.64 In addition, mRNA-specific translational regulation can occur through binding to specific sequences located in the 5′ and 3′ untranslated regions. Furthermore, there are over 1000 microRNAs in humans,65 many of which regulate translation. The microRNA genes are transcribed as precursor RNA and then processed into a mature 22-nucleotide form by the processing enzymes Drosha and Dicer. The mature form of microRNAs can bind to specific sites on mRNA while associated with the Argonaute protein and either reversibly inhibit translation or degrade the mRNA.62,66 For example, microRNAs miR-15a/16-1 are deleted in chronic lymphocytic leukemia, thereby increasing Bcl2 expression and inhibiting apoptosis or cell death to prolong the life span of the cell.67

After proteins are synthesized there are two major processes to remove excess or damaged proteins. One process degrades ingested proteins and uses nonspecific proteases, such as pepsin and trypsin, to digest proteins associated with foodstuff in the gut into amino acids so they can be absorbed. The second process digests extracellular and intracellular proteins by either general proteinases within lysosomes or by protein degradation via ubiquitination. With the latter mechanism, proteins are tagged for degradation by binding to ubiquitin, which is recognized by a large multiprotein structure, the proteasome, that degrades the ubiquitinated proteins by proteolysis.68

EPIGENETICS

Although the original meaning of epigenetics encompassed all molecular pathways that affect the expression of genes, over time the definition has focused on the regulation of gene expression by heritable modifications that do not change the DNA sequence.69 More recently this has been broadened to include nonheritable modifications.70-73 Currently there are three major areas of epigenetic modifications or marks: (1) DNA methylation; (2) chromatin conformation regulation through histone modifications, including ATP-dependent remodeling enzymes and histone variants; and (3) noncoding RNAs.74

Deoxyribonucleic Acid Methylation

DNA methylation is a well-known epigenetic change that is important in X chromosome inactivation, gene imprinting (eg, Prader-Willi, Angelman syndromes), and cancer. The most common methylation event is the methylation of cytosine to form 5-methylcytosine. DNA methylation typically occurs at cytosines directly upstream of guanines, or CpG dinucleotides. Cytosine is both methylated and demethylated by a variety of enzymes. The initial methylation state is catalyzed by one type of DNA cytosine-5-methyltransferase, whereas the maintenance of the methylated state is performed by another type of DNA cytosine-5-methyltransferase and occurs during each cell division after being established in early embryonic development.75

Demethylation involves three members of the ten-eleven translocation (TET) family of dioxygenases, which catalyze the conversion of 5-methylcytosine to other modified forms, such as 5-hydroxymethylcytosine, during demethylation.76 5-Hydroxymethylcytosine is found in high amounts in neural cells and is postulated to regulate gene expression.76

Gene expression is altered by methylation via several mechanisms. The most direct effect is through altering the ability of transcription factors to bind to promoters. Methylation decreases the affinity of transcription factors to a DNA promoter and enhances the binding of methylation-specific transcription factors (Fig. 1.9). Additionally, methylation compacts the chromatin structure, thus reducing the access of transcription factors to a promoter.77 Cancer is the most common human disease associated with aberrant DNA methylation.78 Interestingly, the overall level of 5-methylcytosine in cancer cells is 60% less than in normal cells; however, certain promoter-specific CpG islands are hypermethylated.78 Other human diseases that are associated with methylation include lupus and many neurologic diseases.

Chromatin Conformation Regulation

Many basic cellular functions require proteins to interact with DNA. However, DNA is generally not freely accessible but is wound around histones to form nucleosomes and further condensed or compacted into heterochromatin that decreases gene expression. The cell requires the DNA to be accessible to carry out DNA replication, repair, and transcription.74,79 The chromatin, therefore, is a very dynamic structure; at any one point in time portions of the DNA are being exposed and other portions are being covered. The mechanisms that control chromatin conformation include histone modifications, histone variants, and ATP-dependent remodeling enzymes.

Specific histones are reversibly and posttranslationally modified at their N-terminal tails and globular regions to change the chromatin from a euchromatin state to a heterochromatin state and back (see Fig. 1.9). These modifications include acetylation of lysine residues at the N-terminal tails of H2A, H3, and H4 by histone acetyltransferases (HATs) and deacetylation by histone deacetylases (HDACs). Histone acetylation removes the positive charge on the lysine residue, leaving the lysine less attracted to the negatively charged DNA phosphate backbone and thereby opening the DNA.77

Histone methylation of lysine and arginine residues occurs mostly on histone protein H3, but also histone protein H4, and is carried out by histone methyltransferases (HMTs) and reversed by histone demethylases (HDMs). The effect of methylation on chromatin structure ranges from active to poised to repressed. Histone lysine and arginine residues can be mono-, di-, and tri-methylated, but the positive charge is unchanged.39,79 Histone methylation is found associated with DNA transcription, replication, and repair.

Histones are phosphorylated at serine, threonine, and tyrosine residues, and this phosphorylation is associated with DNA repair and transcription. The addition of a negatively charged phosphate group to the histone will repel the histone away from the negatively charged DNA and loosen up the chromatin structure.80 Other modifications include poly(ADP-ribosyl)ation, ubiquitination, SUMOylation, and glycosylation.81

FIGURE 1.9 Epigenetics. Top, DNA methylation of CpG island regions indicated by Me in and around gene promoters is associated with loss of gene expression and silencing of the gene. When CpG islands are unmethylated, shown by absence of Me, gene expression is unaffected. Bottom, Modifications of the tails of histone proteins, such as methylation, acetylation, and phosphorylation, shown as Me, Ac, and P, respectively, can increase gene expression. (Modified from Zaidi SK, Young DW, Montecino M, van Wijnen AJ, Stein JL, Lian JB, et al. Bookmarking the genome: maintenance of epigenetic information. J Biol Chem 2011;286:18355-18361.)

Histone variants have been known for decades, but many of their functions are not well established. Histone protein variants H3.3 and H2A.Z are the most well-known and are shown to function in regulation of gene expression.82 Histone variant H3.3 incorporates into chromatin independent of replication and is associated with active chromatin.83,84

ATP-dependent remodeling enzymes use the energy from the hydrolysis of ATP to change the structure of chromatin.84,85 ATP-dependent remodeling enzymes are grouped into four families: SWItch/Sucrose NonFermentable (SWI/SNF), imitation switch (ISWI), inositol requiring 80 (INO80), and chromodomain (CHD).79,85

The remodeling enzymes have similar properties, including (1) specific interaction with nucleosomes, (2) attraction to the modified histone tail residues found in nucleosomes, (3) an ATPase domain, (4) ATPase regulatory function, and (5) ability to interact with transcription factors and chromatin-associated proteins.81,85 The primary role of the enzymes is to remodel the chromatin structure. The SWI/SNF proteins function in the sliding and ejecting of nucleosomes, but do not function in chromatin assembly. The ISWI family of enzymes changes the nucleosome spacing through sliding that is necessary after DNA replication. This family interacts with unmodified histone tails and functions to regulate transcription. The CHD family functions to slide and eject nucleosomes, by which it regulates transcription. The INO80 family of proteins has an insertion in the middle of its ATPase domain and functions in promoting transcription and DNA repair. A member of this family, SWR1, can exchange histones to facilitate DNA repair.81,85-87

Noncoding Ribonucleic Acids

Most of the expressed RNA in a cell is not translated into protein. Only the mRNAs are translated into protein, and they represent only 1% to 5% of the total RNA depending on cell type. Much of this noncoding RNA is known and includes rRNA and tRNAs. However, over the last several decades two large groups of noncoding RNAs have been discovered, the short and long noncoding RNAs. The ENCODE project tested for the expression from DNA not associated with genes by using probes that overlapped one another regardless of the location of genes. Over 80% of the human DNA could be assigned a biochemical function, although biochemical function was liberally defined.88 Nonetheless, it was determined that the bulk of the human genome is expressed into RNA.89

The short noncoding RNAs consist of microRNAs, small interfering RNAs, and piwi interacting RNAs.90,91 MicroRNAs regulate gene expression by binding to a specific sequence of the mRNA and inhibiting its translation. Small interfering RNAs (siRNA) inhibit translation by also binding to a region of the mRNA, but do so by initiating the degradation of the mRNA by the associated Argonaute protein. Piwi interacting RNAs (piRNA) function in the repression of transposons and are important in the development of gametes in many multicellular eukaryotic species.

The long RNAs are arbitrarily designated to be greater than 200 nucleotides while the short RNAs are between 20 and 200 nucleotides.92 Only recently has the extent of long noncoding RNAs been appreciated.89 The diversity of the long noncoding RNAs is predicted to be in the hundreds of thousands in vertebrates and their expression pattern is highly regulated during the development of an organism. A well-described example of a long noncoding RNA is XIST, which associates with the Polycomb group complex 2 and inactivates the X chromosome by inducing heterochromatin formation and repressing gene expression.93 Examples such as XIST and a similarly acting long noncoding RNA, HOTAIR, have given rise to the possibility that the noncoding regions of the human genome have important functions.92

The function of most noncoding RNAs is unknown, but it is speculated that coding and noncoding RNAs, referred to as competing endogenous RNAs (ceRNAs), are in competition for shared microRNA binding sites in untranslated regions of mRNAs, thereby regulating their expression. The ceRNA hypothesis proposes a new layer of regulation of gene expression that could help explain the function of the large percentage of the human genome that expresses non-protein-coding RNA.94-96

UNDERSTANDING OUR GENOME

Genomics has been recognized as a distinct field since the first free-living organisms were completely sequenced in the 1990s. With the publication of the first draft of the human genome in 2001 and the final results of the Human Genome Project in 2004, the genomics field started to impart greater influence on biomedical research and its application to medicine.31,97 Genomics is characterized by the comprehensive nature of its collection of data and the technical development necessary to obtain, analyze, store, and make available such large amounts of data. There are also ethical, legal, and social implications of the research and clinical application of genomics.98

Large research projects that were initiated during the latter years of the Human Genome Project produced comprehensive biological catalogs of genetic variants, important DNA functional sequences, and expressed products from not only humans but also many other organisms.98

Single nucleotide variants (SNVs) are the most common DNA differences found in the human population, and they number in the millions, with each individual differing on average by 1 in 1000 nucleotides. Human SNVs (including both benign polymorphisms and causative mutations) are cataloged in the SNP database (http://www.ncbi.nlm.nih.gov/SNP).

Genome-wide association studies employ microarray tests that use large numbers of SNVs to find associations between genetic variations and diseases. DNA variants are often clustered into regions by genetic recombination during the formation of sperm and eggs that are inherited as a unit, such that a unique SNV pattern or haplotype can be passed from generation to generation. The International HapMap Project also uses SNVs to investigate haplotype associations and disease. The 1000 Genomes Project complements the previously mentioned projects by sequencing a large number of diverse human samples from around the world. The goal is to build a comprehensive catalog of the most common human genetic variants, which includes single nucleotide variants, as well as insertions, deletions, and copy number variants that are found in the population at greater than 1%. The Exome Aggregation Consortium (ExAC) has sequenced over 60,000 exomes to delineate common genetic variation within human exomes. The SNP database, International HapMap Project, 1000 Genomes Project, ExAC, and genome-wide association studies have helped to define genetic variability within individuals and populations to understand the basis of many genetic diseases.98

A more fundamental biology project is the encyclopedia of DNA elements, or ENCODE, whose goal is a catalog of the functional elements of the genomes of humans and other species. The functional elements include the genes and all their expressed RNA forms and epigenetic modifications.51 One of the most important findings is the discovery that much of the human genome is expressed into RNA.

With the introduction of the first massively parallel DNA sequencing instrument in 2005 and subsequent instruments from 2006 onward, the current technologic era of genomics has progressed over the last decade to make significant inroads into applying genomics to patient care.99 Along with the technologic innovation in DNA sequencing, there has been innovation in bioinformatics, which is required to manage and interpret the large amount of information generated by massively parallel DNA sequencing instruments.

Although the Human Genome Project was a significant feat, the human genome was not the first whole genome to be sequenced. Whole genome sequencing initially focused on infectious pathogens, because of their impact on human health and also their size. The first free-living organism to be sequenced was Haemophilus influenzae in 1995.100 Subsequently, many species from a cross-section of living organisms have been sequenced. The first individual human to have their whole genome sequenced was Craig Venter, who led one of the two groups that first sequenced the human genome.

All of the previously discussed advances have made the field of molecular diagnostics an important and exciting area that is going to have an even greater impact on medicine in the future. As an increasing number of diseases are characterized at the molecular (eg, nucleic acid and protein) level, new therapeutics and diagnostics specifically targeting these molecular changes will continue to emerge.

POINTS TO REMEMBER

The two strands of DNA are bound together by hydrogen bonds and stacking forces that can be broken and reformed without permanent damage to the DNA. This important property is exploited by many of the methods that are used in molecular diagnostics and is a requirement for most of the DNA diagnostic assays.

Even though human DNA has approximately 20,000 genes, this is far fewer than would be expected given the number of proteins in a human cell. The higher number of proteins results from alternative splicing, which occurs in 92% to 95% of human genes.

Only 1.2% to 1.5% of the human genome is translated into protein; however, much more of the genome is made into RNA.

The conversion of DNA information into protein is facilitated by aminoacyl tRNA synthetases and their ability to create amino acid-specific tRNAs.

The genetic code is redundant; the 3-base code can have 64 different combinations, but only 20 amino acids are recognized.
The second person to have their whole genome sequenced was James Watson, whose REFERENCES genome was the first to be sequenced by using massively parallel DNA sequencing. 1. Griffith F. The significance of pneumococcal types. J Hyg 1928; An important clinical application of genomics is cancer 27:113e59. diagnostics (see Chapters 7 and 8); however, the diversity 2. Avery OT, MacLeod CM, McCarty M. Studies on the chemical and complexity of cancer requires a significant amount of nature of the substance inducing transformation of pneumo- basic biological information to interpret molecular diagnostic coccus types: induction of transformation by a desoxy- testing results of patient samples. The first whole genome ribonucleic acid fraction isolated from pneumococcus type III. sequencing of a cancer was an acute myeloid leukemia in J Exp Med 1944;79:137e57. 2008,101 and many others have subsequently been sequenced. 3. Hershey AD, Chase M. Independent functions of viral protein The Cancer Genome Atlas project includes large numbers of and nucleic acid in growth of bacteriophage. J Gen Physiol 1952; the most common cancers to identify all their associated 36:39e56. mutations. For example, a recent study describes mutational 4. Chargaff E, Zamenhof S, Green C. Composition of human data for 12 of the most common cancers.102 The significant desoxypentose nucleic acid. Nature 1950;165:756e7. amount of basic information now available on human 5. Franklin RE, Gosling RG. Molecular structure of nucleic acids: cancers and the availability of new therapeutics targeting molecular configuration in sodium thymonucleate, 1953. Ann specific cancer-associated genes allow the clinical use of N Y Acad Sci 1995;758:16e7. molecular profiling in cancer patients.103 6. Wilkins MH, Stokes AR, Wilson HR. Molecular structure of With the increasing use of genetic and genomic informa- deoxypentose nucleic acids. Nature 1953;171:738e40. 
tion to characterize a patient’s disease, an interesting conver- 7. Watson JD, Crick FH. Genetical implications of the structure of gence of electronic medical records and genomics is deoxyribonucleic acid. Nature 1953;171:964e7. emerging. The implementation of electronic medical records 8. Watson JD, Crick FH. Molecular structure of nucleic acids: a throughout the United States will allow for greater access to structure for deoxyribose nucleic acid. Nature 1953;171:737e8. the large amount of genomic data that will be available on pa- 9. Lightman A. The discoveries: great breakthroughs in 20th-century tients, which will eventually be a source for scientific research science, including the original papers. New York: Vintage; 2006. and discovery. The Electronic Medical Records and Genomics 10. Watson JD, Crick FH. The structure of DNA. Cold Spring Harb Network is currently developing tools and conditions under Symp Quant Biol 1953;18:123e31. which genomic research can be pursued using electronic 11. Meselson M, Stahl FW. The replication of DNA. Cold Spring medical records.104 Harb Symp Quant Biol 1958;23:9e12. CHAPTER 1 Principles of Molecular Biology 15 12. Kornberg A. Biologic synthesis of deoxyribonucleic acid. 34. Crick FH. On protein synthesis. Symp Soc Exp Biol 1958;12: Science 1960;131:1503e8. 138e63. 13. N