Lec 4 Mol genetics PDF
Summary
This document details molecular genetics, including functional genomics, proteomics, and bioinformatics at an introductory level.
Full Transcript
Because learning changes everything. ®
Chapter 23
GENOMICS II: FUNCTIONAL GENOMICS, PROTEOMICS, AND BIOINFORMATICS
Genetics: Analysis & Principles, SEVENTH EDITION, Robert J. Brooker
© 2021 McGraw Hill. All rights reserved. Authorized only for instructor use in the classroom. No reproduction or further distribution permitted without the prior written consent of McGraw Hill.
©Alfred Pasieka/SPL/Science Source

OMICS
In biology, the word omics refers to the sum of constituents within a cell at all levels.

INTRODUCTION
The goal of functional genomics is to elucidate the roles of genetic sequences in a given species
    In most cases, it aims to understand gene function
Epigenomics deals with the chemical compounds that can tell the genome what to do
The entire collection of proteins that an organism can make is termed the proteome
The goal of proteomics is to understand the functional roles of the proteins of a species
    It aims to understand the interplay among many different proteins
Bioinformatics is the analysis of biological information using a mathematical/computational approach
    Often aimed at extracting information from genetic data
Transcriptomics analyzes the complete set of RNA transcripts, such as mRNAs, that are expressed from an organism's genes
Metabolomics is the large-scale study of small molecules, commonly known as metabolites, within cells, biofluids, tissues, or organisms

23.1 FUNCTIONAL GENOMICS
In the past, our ability to study genes involved many of the techniques described in Chapter 20
    Gene cloning, Northern blotting, gene editing, etc.
Recent genome-sequencing projects have enabled us to consider gene function at a more complex level
    We can now examine groups of many genes simultaneously

A Microarray Can Identify Genes That Are Transcribed
Researchers developed a technology called DNA microarrays (also called gene chips)
    This technology makes it possible to monitor thousands of genes simultaneously
A DNA microarray is a small silica, glass, or plastic slide that is dotted with many sequences of DNA
    Each of these sequences corresponds to a known gene
    These fragments are made synthetically
    These sequences of DNA act as probes to identify genes that are transcribed

DNA Microarrays
The DNA fragments on a microarray can be either
    Amplified by PCR and then spotted onto the microarray
    Synthesized directly on the microarray itself
A single slide contains tens of thousands of different spots in an area the size of a postage stamp
    The relative location of each spot is known
The technology for making DNA microarrays is quite amazing
    It involves spotting technologies that are quite similar to the way that an inkjet printer works
Once a DNA microarray has been made, it is used as a hybridization tool

Figure 23.1

Table 23.1 Applications of DNA Microarrays
Application: Description
Cell-specific gene expression: A comparison of microarray data using cDNAs derived from RNA of different cell types can identify genes that are expressed in a cell-specific manner.
Gene regulation: Environmental conditions play an important role in gene regulation. A comparison of microarray data may reveal genes that are induced under one set of conditions and repressed under another.
Elucidation of metabolic pathways: Genes that encode proteins that participate in a common metabolic pathway are often expressed in a parallel manner. This application overlaps with the study of gene regulation via microarrays.
Tumor profiling: Different types of cancer cells exhibit striking differences in their profiles of gene expression. Such a profile can be revealed by a DNA microarray analysis. This approach is gaining widespread use as a method of subclassifying tumors that are sometimes morphologically indistinguishable. Tumor profiling may provide information that can improve a patient's clinical treatment.
Genetic variation: A mutant allele may not hybridize to a spot on a microarray as well as a wild-type allele. Therefore, microarrays have been used as a tool for detecting genetic variation. For example, they are used to identify disease-causing alleles in humans and mutations that contribute to quantitative traits in plants and other species. In addition, microarrays are used to detect chromosomal deletions and duplications.
Microbial strain identification: Microarrays can distinguish between closely related bacterial species and subspecies.
DNA-protein binding: Chromatin immunoprecipitation, which is illustrated in Figure 24.2, can be used with DNA microarrays to determine where in the genome a particular protein binds to the DNA.
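The gene regulation row in Table 23.1 describes comparing microarray data to find genes that are induced under one condition and repressed under another. A minimal sketch of that comparison is shown here, assuming the spot intensities have already been read and normalized. The intensity values, the gene names other than lacY, the pseudocount, and the two-fold cutoff are all hypothetical choices made for illustration, not values from the slides.

```python
import math


def log2_ratios(condition_a, condition_b, pseudocount=1.0):
    """Return {gene: log2(b / a)} for genes measured in both conditions.

    The pseudocount (an assumption of this sketch) avoids dividing by zero
    when a spot has no detectable signal.
    """
    ratios = {}
    for gene, a in condition_a.items():
        if gene in condition_b:
            b = condition_b[gene]
            ratios[gene] = math.log2((b + pseudocount) / (a + pseudocount))
    return ratios


def classify(ratios, threshold=1.0):
    """Split genes into induced and repressed lists using a log2 cutoff."""
    induced = sorted(g for g, r in ratios.items() if r >= threshold)
    repressed = sorted(g for g, r in ratios.items() if r <= -threshold)
    return induced, repressed


# Hypothetical spot intensities for the same four spots in two conditions.
without_lactose = {"lacZ": 49.0, "lacY": 24.0, "rpoB": 399.0, "geneX": 799.0}
with_lactose = {"lacZ": 799.0, "lacY": 399.0, "rpoB": 399.0, "geneX": 99.0}

ratios = log2_ratios(without_lactose, with_lactose)
induced, repressed = classify(ratios)
print("induced:", induced)
print("repressed:", repressed)
```

A log2 ratio is used so that a two-fold increase and a two-fold decrease sit symmetrically at +1 and -1; real microarray analyses add background correction, replicate spots, and statistical testing that this sketch leaves out.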
Analysis of DNA Microarrays
Cells respond to environmental changes via the coordinate regulation of their genes
    Some genes are turned on while others are turned off
In the past, gene regulation was studied using tools that can analyze the expression of only a few genes at a time
Microarrays make it possible to study the expression of the whole genome under different environmental conditions

RNA-Seq Is a Newer Method to Identify Expressed Genes
RNA sequencing (RNA-Seq) is also used to study the simultaneous transcription of many genes
It has several important applications in comparing transcriptomes
    A transcriptome is the set of all RNA molecules, including mRNAs and non-coding RNAs, that are transcribed in one cell or a population of cells
RNA-Seq is used to compare transcription in:
    Different cell types
    Healthy vs. diseased cells
    Different stages of development
    Response to different environmental agents such as hormones or toxic chemicals

Figure 23.3 (1)
Figure 23.3 (2)

Advantages of RNA-Seq
RNA-Seq has several advantages over microarrays
    More accurate at quantifying the amount of each RNA transcript
    Superior at detecting RNA transcripts that are in low abundance
    Identifies the exact boundaries between exons and introns; identifies new splice variants
    Identifies the 5' and 3' ends of RNA transcripts

Gene Knockout Collections Allow Researchers to Study Gene Function at the Genomic Level
A broad goal is to determine the function of every gene in a species' genome
One approach is to generate a collection of organisms from one species, each with one gene knocked out
    The phenotype could indicate function
    Knockouts could be combined to study pathways
There are many ways to generate collections of knockouts
    Transposable element jumps
    CRISPR-Cas technology
Knockout collections exist for E. coli, S. cerevisiae, C. elegans, and mice

23.2 PROTEOMICS
Proteomics examines the functional roles of the proteins that a species can make
    The entire collection of a species' proteins is its proteome
Genomic data can provide important information about the proteome
    RNA-Seq and DNA microarrays provide insight about the transcription of genes
    However, they may not provide an accurate measure of protein abundance
Genomic insights are often followed up with research that involves protein analysis directly

The Proteome Is Much Larger Than the Genome
Sequencing and analysis of an entire genome can identify all the genes that a given species contains
The proteome is larger, however, and its actual size is more difficult to determine
This larger size is rooted in a number of cellular processes
    Alternative splicing
    RNA editing
    Posttranslational covalent modification
All of these processes increase the number of potential proteins in the proteome

Alterations That Affect the Proteome 1
1. Alternative splicing (Chap 12)
    Most important alteration that occurs in eukaryotes
    A single pre-mRNA is spliced into more than one version
    Splicing is often cell specific or related to environmental conditions
2. RNA editing (Chap 12)
    Much less common than alternative splicing
    Leads to changes in the coding sequence of mRNA after the mRNA is made
Figure 23.3a

Alterations That Affect the Proteome 2
3. Posttranslational covalent modification
    Irreversible changes may be necessary to produce a functional protein
        Proteolytic processing; attachment of prosthetic groups, sugars, or lipids
    Reversible changes transiently affect the function of the protein
        Phosphorylation; acetylation; methylation
Figure 23.3b
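To make concrete why alternative splicing and covalent modification enlarge the proteome, the sketch below counts the mRNAs a single gene could give rise to if each of its optional (cassette) exons can be either included or skipped. The exon names, the assumption that exons are chosen independently, and the two modification sites are hypothetical; real genes usually produce far fewer isoforms than this upper bound.

```python
from itertools import product


def splice_variants(first_exon, cassette_exons, last_exon):
    """Enumerate mRNAs formed by including or skipping each cassette exon.

    Assumes (for illustration only) that every cassette exon is chosen
    independently and that the first and last exons are always present.
    """
    variants = []
    for choices in product([True, False], repeat=len(cassette_exons)):
        exons = [first_exon]
        exons += [exon for exon, keep in zip(cassette_exons, choices) if keep]
        exons.append(last_exon)
        variants.append("-".join(exons))
    return variants


# Hypothetical gene: exons E1 and E5 are constitutive, E2 to E4 are optional.
variants = splice_variants("E1", ["E2", "E3", "E4"], "E5")
print(len(variants), "possible mRNAs from one gene")

# If each protein also had two sites that may or may not be phosphorylated,
# the number of distinguishable protein forms would grow by another 2**2.
modification_sites = 2
upper_bound = len(variants) * 2 ** modification_sites
print(upper_bound, "possible protein forms (upper bound)")
```

The count doubles with every additional optional exon or modification site, which is one way to see why the size of the proteome is harder to determine than the number of genes.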
Two-Dimensional Gel Electrophoresis Is Used to Separate a Mixture of Different Proteins
Any given cell of a multicellular organism will produce only a subset of the proteins in its proteome
The subset that it makes depends primarily on the
    Cell type
    Stage of development
    Environmental conditions
A technique in the field of proteomics is two-dimensional gel electrophoresis
    It is a separation technique that can distinguish hundreds or even thousands of different proteins in a cell extract

Two-Dimensional Gel Electrophoresis 1
The technique involves two different gel electrophoresis experiments
    The first separates by pH/charge interactions (isoelectric focusing)
    The second separates by size (mass)
Figure 23.5a

Two-Dimensional Gel Electrophoresis 2
Specific spots may be of special interest
    Proteins that are very abundant in a cell type
        May be important for that cell's structure or function
    Spots present only in given circumstances
        Cells exposed to a hormone versus those that are not
    Spots present only in abnormal cells
        Very common in cancer cells
Figure 23.5b
(b): © SPL/Science Source

Mass Spectrometry Is Used to Identify Proteins
The next step is to correlate a given spot on a two-dimensional gel with a particular protein
    A spot is cut out from the gel
    The protein is eluted from the gel in a purified form
The amino acid sequence of the protein is revealed via a technique called tandem mass spectrometry
    Two spectrometers are used
    The first measures the mass of a given peptide (that was generated from protein digestion)
    The second analyzes the peptide after it has been digested into even smaller fragments, one amino acid at a time

Tandem Mass Spectrometry
1. Peptides are mixed with an organic acid and dried onto a metal slide.
2. The sample is subjected to a laser beam.
The peptides are ejected as an ionized gas in which each peptide carries one or more positive charges.
3. The charged peptides are then accelerated via an electric field and fly toward a detector.
The time they spend in flight is determined by their mass and net charge, and it reveals the mass of the peptide.
Each amino acid has its own characteristic mass
Figure 23.6

Determining Gene Sequence from Peptide Sequence
Once the amino acid sequence is obtained, the gene sequence is not far behind
    Genome sequences of many species have been determined
    Computer software allows amino acid sequences obtained by mass spectrometry to be used to search a database of protein sequences
    The program may locate a match between the identified sequence and a protein within a species
Mass spectrometry can also be used to identify protein covalent modifications
    For example, the mass of a phosphorylated protein increases by the mass of a phosphate

Protein Microarrays Are Used to Study Protein Expression and Function
The technology to make DNA microarrays is being applied to make protein microarrays
    Proteins rather than DNA are spotted onto a slide
The development of protein microarrays is a bit more challenging
    Proteins are much more easily damaged by the manipulations that occur during microarray formation
    Synthesis and purification of proteins tend to be more time-consuming compared to DNA
Nevertheless, the last few years have seen progress

Table 23.2 Some Applications of Protein Microarrays
Application: Description
Protein expression: An antibody microarray can measure protein expression because each antibody in a given spot recognizes a specific amino acid sequence. Such a microarray can be used to study the expression of proteins in a cell-specific manner.
It can also be used to determine how environmental conditions affect the levels of particular proteins.
Protein function: The substrate specificity and enzymatic activities of groups of proteins can be analyzed by exposing a functional protein microarray to a variety of substrates.
Protein-protein interactions: The ability of proteins to interact with each other can be determined by exposing a functional protein microarray to fluorescently labeled proteins.
Pharmacology: The ability of drugs to bind to cellular proteins can be determined by exposing a functional protein microarray to different kinds of labeled drugs. This type of experiment can help to identify the proteins within a cell to which a given drug may bind.

Types of Protein Microarrays
There are two common approaches to protein microarray analysis
1. Antibody microarrays
    Consist of a collection of antibodies that recognize short peptide sequences
    Used to assess the level of protein expression
2. Functional microarrays
    Consist of many different cellular proteins
    Used to probe the function of proteins
    For instance, are certain proteins phosphorylated by a set of kinases?

23.3 BIOINFORMATICS
The computer has become an important tool in genetic studies
The marriage between genetics and biocomputing has yielded an important branch of science: bioinformatics
Computer analysis of genetic sequences usually relies on three basic components:
    A computer
    A computer program
    Some type of data

Sequence Files Are Analyzed by Computer Programs
A computer program is a defined series of operations that can analyze data in a desired way
A first step in the computer analysis of genetic data is the generation of a computer data file
    This file is simply a collection of information in a form suitable for storage and manipulation by a computer
A file, for example, could contain the DNA sequence of the lacY gene from E. coli, as shown next

Figure 23.7

Creating Computer Data Files
Entering the data into the computer file is done
    Manually (that is, by typing)
    Usually by instruments
        Reading data directly from a sequencing ladder
The genetic sequence can be analyzed in many ways:
1. Does a sequence contain a gene?
2. Where are functional sequences, such as promoters and splice sites?
3. Does a sequence encode a polypeptide? If so, what is the amino acid sequence of that polypeptide?
4. Is a sequence homologous to other sequences?
5. What is the evolutionary relationship between two or more genetic sequences?

Example: Translating a DNA Sequence into an Amino Acid Sequence 1
Consider a program aimed at translating a DNA sequence
The geneticist (i.e., the user) has a DNA sequence that needs to be translated into an amino acid sequence
    DNA sequence of the lacY gene (shown in Figure 23.7)
The user is sitting at a computer that can run the TRANSLATION program
The program can translate in all three forward reading frames; the longest reading frame is shown here:

Example: Translating a DNA Sequence into an Amino Acid Sequence 2
The speed and accuracy of this program far surpass the abilities of a human to accomplish the same task
Moreover, a program such as TRANSLATION can translate a genetic sequence in six reading frames
This is useful if the user does not know
    Where the start codon is located
    Or the direction of the coding sequence
Many programs
    Translate into amino acids
    Locate introns within genes

Different Computational Strategies Can Identify Functional Genetic Sequences
Computer programs can be designed to locate meaningful features within very long sequences
Consider this alphabetic sequence file of 54 letters:
GJTRLLAMAQLHEOGYLTOBWENTMNMTORXXXTGOODNTHEQALLRTLSTORE
We will now see how three different computer programs can analyze the sequence to identify meaningful features

Search Strategies
The first computer program locates all the English words within this sequence:
GJTRLLAMAQLHEOGYLTOBWENTMNMTORXXXTGOODNTHEQALLRTLSTORE
The second locates a series of words that are organized in the correct order to form a grammatically logical sentence:
GJTRLLAMAQLHEOGYLTOBWENTMNMTORXXXTGOODNTHEQALLRTLSTORE
The third locates patterns of five letters that occur both in the forward and reverse directions:
GJTRLLAMAQLHEOGYLTOBWENTMNMTORXXXTGOODNTHEQALLRTLSTORE

Sequence and Pattern Recognition
In the three previous examples, we can distinguish between sequence and pattern recognition
Sequence recognition
    The program has the information that a specific sequence of symbols has a specialized meaning
    With information from a dictionary, the first program can identify sequences of letters that make words
Pattern recognition
    Does not rely on specialized sequence information
    Rather, programs like the third program look for a pattern of symbols that can occur within any group of symbol arrangements

Identification Strategies
Three general principles of identification strategies:
1. Locate specialized sequences (sequence elements) within a very long sequence
    Sequence elements are predefined and are contained explicitly within the computer program
    A genetic sequence with a particular function is called a sequence element or sequence motif
    Examples in Table 23.3
2. Locate an organization of sequences or sequence elements
3. Locate a pattern of sequences
Computers perform these types of analysis with great speed and accuracy on very long sequences

Table 23.3 Short Sequence Elements That Can Be Identified by Computer Analysis
Type of Sequence: Examples*
Promoter: Many E. coli promoters contain TTGACA (−35 site) and TATAAT (−10 site). Eukaryotic core promoters may contain CAAT boxes, GC boxes, TATA boxes, etc.
Response elements: Glucocorticoid response element (AGRACA), cAMP response element (GTGACGTRA)
Start codon: ATG
Stop codons: TAA, TAG, TGA
Splice site: GTRAGT, YNYTRAC(Y)nAG
Polyadenylation signal: AATAAA
Highly repetitive sequences: Relatively short sequences that are repeated many times throughout a genome
Transposable elements: Often characterized by a pattern in which direct repeats flank inverted repeats
*The sequences shown in this table are found in DNA. For gene sequences, only the coding strand is shown. R = purine (A or G); Y = pyrimidine (T or C); N = A, T, G, or C; U in RNA = T in DNA.

Approaches to Identify Genes
Gene prediction refers to the process of identifying regions of genomic DNA that encode genes
    Protein-encoding genes
    Genes for non-coding RNAs
Computer programs can employ different strategies to locate genes:
Search by signal
    The program tries to locate an organization of known sequence elements that are normally found within a gene (promoter, start/stop codons, etc.)
Search by content
    The program tries to identify sequences that differ significantly from a random distribution due to codon bias within protein-encoding genes
    Some codons are used more frequently than others for the same amino acid

Open Reading Frames 1
Another way to locate coding regions within a DNA sequence is to examine reading frames
In a DNA sequence, the reading of codons could begin with the first, second, or third nucleotide
    These are called reading frames 1, 2, and 3, respectively
An open reading frame (ORF) is a nucleotide sequence that does not contain any stop codons
    In prokaryotes, long ORFs are contained within the chromosomal gene sequences
    In eukaryotes, however, the chromosomal coding sequences may be interrupted by introns

Open Reading Frames 2
A computer program can translate a genomic DNA sequence in all three reading frames, seeking to locate a long ORF
A reading frame could also proceed from right to left
    Thus, six reading frames are possible in a newly discovered sequence
Figure 23.8

Homologous Genes Are Derived from the Same Ancestral Gene
When comparing genetic sequences, researchers sometimes find two or more similar sequences
    Example: DNA sequences of the lacY gene in E. coli and K. pneumoniae
    About 78% of the bases are a perfect match (Fig. 23.9a)

Formation of Homologous Genes in Two Species
In this case, the two sequences are similar because the genes are homologous to each other
These are orthologs
    They are found in different species and have been derived from the same ancestral gene
Figure 23.9b

Orthologs and Paralogs
When two homologous genes are found in different species, these genes are termed orthologs
When two homologous genes are found in a single organism, these genes are termed paralogs
A gene family consists of two or more copies of homologous genes within the genome of a single organism
Homologous genes are often analyzed and identified using computer databases, which are described next

Computer Databases
The amount of genetic information that is generated by researchers has become enormous
Large numbers of computer data files are collected and stored in a single location, a database
The files in a database typically contain annotations
    They contain the genetic sequence and a concise description of it
    In addition, they contain other features of significance
The scientific community has created several large databases containing data from thousands of labs

Table 23.4 Examples of Major Computer Databases
Type: Description
Nucleotide sequence: DNA sequence data are collected in three internationally collaborating databases: GenBank (a U.S. database), EMBL (European Molecular Biology Laboratory Nucleotide Sequence Database), and DDBJ (DNA Databank of Japan). These databases receive sequence and sequence annotation data from genome projects, sequencing centers, individual scientists, and patent offices. These databases are accessed via the Internet.
Amino acid sequence: Amino acid sequence data are collected in a few international databases including Swissprot (Swiss protein database), PIR (Protein Information Resource), Genpept (translated peptide sequences from the GenBank database), and TrEMBL (translated sequences from the EMBL database).
Three-dimensional structure: PDB (Protein Data Bank) collects the three-dimensional structures of biological macromolecules with an emphasis on protein structure. These are primarily structures that have been determined by X-ray crystallography and nuclear magnetic resonance (NMR). These structures are stored in files that can be viewed on a computer with the appropriate software.
Protein motifs: Prosite is a database containing a collection of amino acid sequence motifs that are characteristic of a protein family, domain structure, or certain posttranslational modifications. Pfam is a database of protein families with multiple amino acid sequence alignments.
Gene expression data: Gene Expression Omnibus (GEO) contains data regarding the expression patterns of genes within a data set, such as the data obtained from microarrays or ChIP-chip assays.
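Before a newly determined sequence is compared against databases like these, a program typically scans it for open reading frames, as described on the Open Reading Frames slides. The sketch below is one possible way to do that scan in all six reading frames, using the slides' definition of an ORF as a stretch with no stop codons and the stop codons TAA, TAG, and TGA. The function names and the short test sequence are hypothetical, and this is not the TRANSLATION program mentioned earlier.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def longest_orf_in_frame(dna, offset):
    """Return (start, end) of the longest stop-free run of codons in one frame."""
    best = (offset, offset)
    run_start = offset
    for pos in range(offset, len(dna) - 2, 3):
        if dna[pos:pos + 3] in STOP_CODONS:
            if pos - run_start > best[1] - best[0]:
                best = (run_start, pos)
            run_start = pos + 3  # the next run begins after the stop codon
    # Close the final run at the last complete codon of this frame.
    end = offset + ((len(dna) - offset) // 3) * 3
    if end - run_start > best[1] - best[0]:
        best = (run_start, end)
    return best


def six_frame_orfs(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to the longest ORF in each."""
    dna = dna.upper()
    results = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            start, end = longest_orf_in_frame(seq, offset)
            results[f"{strand}{offset + 1}"] = seq[start:end]
    return results


# Hypothetical 16-nucleotide sequence, used only to exercise the functions.
orfs = six_frame_orfs("ATGAAACCCGGGTAAT")
for frame, orf in orfs.items():
    print(frame, orf)
```

A practical gene finder would usually also require a start codon and a minimum length, and in eukaryotes it would need to account for introns; those refinements are deliberately omitted here.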
Searching Databases for Homologous Sequences
A strong correlation is typically found between homology and function
Homology between genetic sequences can be identified by computer programs and databases
    A powerful tool for predicting the function of genetic sequences
In 1990, Stephen Altschul, David Lipman, and their colleagues developed an approach called BLAST
    Basic local alignment search tool
The relationship between the query sequence and each matching sequence is given an E-value (Expect value)
    Represents the number of times that the match or a better one would be expected to occur purely by random chance in the entire database

BLAST
BLAST starts with a genetic sequence and then locates homologous sequences in a large database
    Homology among protein sequences is easier to identify than is DNA sequence homology
Table 23.5 shows the results of a database search
    The amino acid sequence of human phenylalanine hydroxylase was used as a "query sequence" by the BLAST program
    Within minutes, BLAST can search the entire database and determine which sequences are the closest matches
A small E-value indicates that similarity is unlikely to be due to random events; the genes are likely to be homologous
    E-values depend on the length of the query, the number of gaps in the alignment, and the database size

Table 23.5 Results from a BLAST Program Comparing Human Phenylalanine Hydroxylase with Database Sequences
Match*; Percentage of identical amino acids†; Species; Function of sequence‡; E-value
1; 100; Human (Homo sapiens); Phenylalanine hydroxylase; 0
2; 99; Orangutan (Pongo pygmaeus); Phenylalanine hydroxylase; 0
3; 92; Mouse (Mus musculus); Phenylalanine hydroxylase; 0
4; 92; Rat (Rattus norvegicus); Phenylalanine hydroxylase; 0
5; 83; Chicken (Gallus gallus); Phenylalanine hydroxylase; 0
6; 78; Western clawed frog (Xenopus tropicalis); Phenylalanine hydroxylase; 0
7; 75; Zebrafish (Danio rerio); Phenylalanine hydroxylase; 0
8; 72; Japanese pufferfish (Takifugu rubripes); Phenylalanine hydroxylase; 0
9; 62; Fruit fly (Drosophila melanogaster); Phenylalanine hydroxylase; 10^-154
10; 57; Nematode (Caenorhabditis elegans); Phenylalanine hydroxylase; 10^-141
*The 10 examples shown here were randomly chosen from the results of a BLAST program using human phenylalanine hydroxylase as the starting sequence.
†The number indicates the percentage of amino acids that are identical to the amino acid sequence of human phenylalanine hydroxylase.
‡In some cases, the function of the sequence was determined by biochemical assay. In other cases, the function was inferred due to the high degree of sequence similarity with other species.

Homologous Genes Can Be Compared with Each Other in a Multiple Sequence Alignment
The approach of aligning sequences was originally proposed by Saul Needleman and Christian Wunsch in 1970
    They demonstrated that whale myoglobin and human β-hemoglobin have similar sequences
Since then, this approach has been used to compare more than two genetic sequences
    This produces a multiple sequence alignment
Later, we will see this approach as it is applied to the globin gene family in humans

Globin Gene Family (1)
Let's apply the multiple sequence alignment approach to the globin gene family
In humans, 9 paralogs are functionally expressed
    The globin family also contains several pseudogenes that are not expressed and a myoglobin gene
The 9 paralogs fall into two categories
    α-chains: α1, α2, θ, and ζ
    β-chains: β, δ, γA, γG, and ε
Each hemoglobin protein is composed of two α-chains and two β-chains

Globin Gene Family (2)
The composition of hemoglobin changes during the course of development
    For example, the ζ and ε genes are expressed during early embryonic development, and the α and β genes are expressed in the adult
Comparing the sequences of the hemoglobin chains can give insight into their structure and function
The sequences of the globin polypeptides can be compared using multiple sequence alignment
    A conserved site is a site that is identical or similar across multiple homologs
    Conserved sites tend to be functionally important

Figure 23.10

www.mheducation.com
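The idea of a conserved site can be illustrated with a short sketch that scans the columns of an alignment that has already been built. The three aligned sequences below are invented for illustration and are not real globin sequences; gaps are written as "-", and only identical residues are counted as conserved, which is stricter than the "identical or similar" definition on the slide. The percent-identity helper mirrors the kind of figure reported in Table 23.5 and in the lacY comparison.

```python
def conserved_columns(alignment):
    """Return 0-based column indices where all sequences share one residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}
        # A column is conserved only if it holds a single residue and no gap.
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved


def percent_identity(seq_a, seq_b):
    """Percentage of identical residues, ignoring columns that contain a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical pre-aligned amino acid sequences (one-letter codes).
alignment = [
    "MVLSPADK",
    "MVLSGEDK",
    "MV-TPEEK",
]

sites = conserved_columns(alignment)
identity = percent_identity(alignment[0], alignment[1])
print("conserved columns:", sites)
print("identity of first two sequences:", identity)
```

Building the alignment itself, including where to place gaps, is the harder problem that programs based on the Needleman and Wunsch approach solve; this sketch only reads the result.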