Genome Annotation and Gene Finding PDF
2024
Dr. Parisa Shooshtari
Summary
This document provides lecture notes for a class on genome annotation and gene finding. It covers an introduction to genome annotation, mapping, gene finding, non-coding RNAs and regulatory regions, with associated review questions. It describes the process of taking raw DNA sequence and adding the layers of analysis and interpretation needed to relate it to biological processes.
Full Transcript
Genome Annotation and Gene Finding
Instructor: Dr. Parisa Shooshtari
MBI 3100 – Week 10
November 6, 2024

Introduction to Genome Annotation
The genome sequence of an organism is one of the most information-rich resources now available to biologists. However, a genome is only as valuable as its annotation. Annotation bridges the gap from the sequence to the biology of the organism. The primary aim of high-quality annotation is to identify the key features of the genome, in particular the genes and their products. The tools and resources for annotation have been developing rapidly, and the scientific community continues to rely on this information for all aspects of biological research.

Numerous whole-genome sequencing projects have either been completed or are substantially advanced:
- Microbial genomes
- Saccharomyces cerevisiae (yeast)
- Caenorhabditis elegans (worm)
- Drosophila melanogaster (fruitfly)
- Arabidopsis thaliana (mustard weed)
- Human genome
- Mouse
- Rat
- Zebrafish
- Non-human primates

At first glance, genome sequences may look like random strings of A/C/G/T nucleotides. However, there is a lot more to understand about the genome, and it will continue to surprise us. Among the things found in genomes are:
- Fragments of viral genomes that infected ancestral individuals
- Mobile elements
- Pseudogenes
- Repetitive elements

Interestingly, principal aspects of the basic organization of the genome are still not well understood:
- The regulation of alternative splicing
- The control of transcription
- The role of intergenic material
- The function of many non-coding RNAs
- The function of gene regulatory elements such as enhancers or promoters

What is Genome Annotation?
Genome annotation is the process of taking the raw DNA sequence generated by genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes.

Genome Annotation: A Multi-Step Process
- Nucleotide-level
- Protein-level
- Process-level
Image from: Lincoln Stein, "Genome Annotation: From Sequence to Biology", Nature Reviews 2001

Protein-Level Annotation
This stage of genome annotation seeks to:
- compile a definitive catalogue of the proteins of the organism
- name them
- assign them putative functions

Process-Level Annotation
This stage of annotation is focused on relating the genome to biological processes. How do the building blocks of genes and proteins relate to the cell cycle, cell death, metabolism, and the maintenance of health and disease?

Question: What is the purpose of annotating a genome?
Question: What are the three levels of genome annotation?

Nucleotide-Level Annotation
- Mapping
- Finding Genomic Landmarks
- Gene Finding
- Non-coding RNAs
- Regulatory Regions
- Transcription Factor Binding Sites

Mapping
The first step in genome annotation is to identify the punctuation marks:
- Where are the known genes?
- Where are the genetic markers?
- Where are the other landmarks previously identified by genetic, cytogenetic or radiation hybrid mapping?
- Where are the RNAs (tRNAs, rRNAs and other non-translated RNAs)?
- Where are the repetitive elements?
- Is there evidence for ancient duplications in the genome and, if so, where are the end points of the putative duplicated regions?
All these questions are really an extended form of physical mapping, attempting to convert the raw DNA sequence into a set of easily recognized landmarks and reference points.

Along with gene finding, the principal activity of this phase of annotation is identifying all known landmarks and placing them in the genome. For example, annotators search for known genetic markers and radiation hybrid markers* and place them on the sequence. This builds a bridge between the genomic sequence and pre-existing genetic, radiation hybrid and physical maps, and provides a path to connect the pre-genomic literature, which is often based on such landmarks, with post-genomic research.
* Radiation hybrid mapping is a genetic technique that was originally developed for constructing long-range maps of mammalian chromosomes. It is based on a statistical method that determines not only the distances between DNA markers but also their order on the chromosomes.

Question: In simple words, what is mapping in the context of nucleotide-level annotation?

Finding Genomic Landmarks
Finding landmarks is a relatively straightforward task.
- Short sequences, such as PCR-based genetic markers, can be identified using the Primer-BLAST program.
- Longer sequences, such as restriction-fragment length polymorphism markers, can be found using BLASTN, SSAHA or another rapid sequence-similarity searching algorithm.
Primer-BLAST link: https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi

Question: Name one tool/algorithm that can be used to find genomic landmarks in short sequences.
Question: Name one tool/algorithm that can be used to find genomic landmarks in long sequences.

Gene Finding
Gene finding is the most visible part of this phase of annotation.
- In small prokaryotic genomes, gene finding is largely a matter of identifying long open reading frames (ORFs)*.
- Even here, however, ambiguities arise if long ORFs overlap on opposite strands, and the true coding region must be sorted out.
- As genomes get larger, gene finding becomes increasingly tricky. The main issue is the signal-to-noise ratio.
* An open reading frame (ORF) is the part of a reading frame that has the ability to be translated. An ORF is a continuous stretch of codons that may begin with a start codon and that ends at a stop codon.

In a prokaryotic genome such as that of Haemophilus influenzae, 85% of the 1.8-Mb genome is in coding regions. The corresponding number in yeast is not much lower, at 70%. For these genomes, "calling genes" is an exercise in running a computer program that identifies all ORFs that are longer than a chosen threshold. But even in these small genomes, finding genes was not entirely effortless.
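
The ORF-threshold idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the program used for any of the genomes mentioned: the function name `find_orfs`, the `min_codons` parameter, the restriction to ATG as the start codon, and the coordinate convention are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames for ORFs (ATG ... stop).

    Returns (strand, frame, start, end) tuples, where start/end are
    0-based half-open coordinates on the scanned strand (for "-" they
    refer to the reverse-complemented sequence). An ORF is kept only if
    it has at least `min_codons` codons, counting the start codon but
    not the stop codon. `min_codons` is a hypothetical threshold.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


# Toy sequence: one 6-codon ORF (ATG + 5 x GCT) followed by a TAA stop.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))
```

Raising the threshold reduces spurious short ORFs at the cost of missing genuinely short genes, which is the trade-off behind the uncertain short ORFs discussed in the text.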
The number of predicted yeast genes, for example, took several years to settle down, and there are still several short ORFs that have an uncertain status as "real" genes.

In the fly and the worm, however, less than 25% of the genome is in coding regions, and the number falls to just a few per cent in humans. The process of finding genes is further complicated by the presence of splicing and alternative splicing. In the human genome, a typical exon is 150 bp and a typical intron is several kilobases, and there is no clear boundary between the intergenic regions that separate adjacent genes and the intragenic regions that separate exons. Defining the precise start and stop position of a gene and the splicing pattern of its exons among all the non-coding sequence is like finding a very small and indistinct needle in a very large and distracting haystack.

Traditionally, several software algorithms were devised to handle gene prediction in eukaryotic genomes. Examples include:
- GENSCAN
- Genie
- GeneMark.hmm
- Grail
- HEXON
- MZEF
- Fgenes
- GeneFinder
- HMMGene

These algorithms typically consist of one or more "sensors" that attempt to identify the presence of a gene feature from motifs or statistical properties of the DNA.
- For example, as transcribed regions are associated with (G+C)-rich regions, a sensor for transcriptional start sites might measure the G+C content of the region being scanned.
- A sensor for splice sites compares the current region to splice consensus sequences.

Some gene predictors stop with the prediction of a single feature, such as the exon predictors HEXON and MZEF. Others, however, attempt to use the output of several sensors to generate a whole gene model, in which a gene is defined as a series of exons that are coordinately transcribed.
Examples of whole-gene-model approaches include:
- Neural network-based methods (Grail)
- Rule-based systems (GeneFinder)
- Hidden Markov models, or HMMs (GENSCAN, Genie, HMMGene, GeneMark.hmm, Fgenes)
The HMM approach has the advantage of explicitly modelling how the individual probabilities of a sequence of features are combined into a probability estimate for the whole gene.
[Figures: neural network; hidden Markov model (HMM)]

Question: What proportion of a small genome (e.g. that of H. influenzae, a prokaryote, or of yeast) is in coding regions?
Question: What proportion of the genome of the fly and the worm is in coding regions?
Question: What proportion of the human genome is in coding regions?
Question: Can you name one or two categories of multi-"sensor" algorithms that are used to predict whole-gene models?

Despite great progress, gene prediction based entirely on DNA analysis was still far from perfect. In a comparison of gene-prediction programs, the authors of nearly a dozen algorithms were asked to predict genes in two well-annotated regions of the fruitfly genome. The best algorithms could achieve sensitivity* and specificity** of 95% and 90%, respectively, when asked to predict whether a particular nucleotide is in an exon.
* Sensitivity is a measure of the ability to detect true positives.
** Specificity is a measure of the ability to discriminate against false positives.

The accuracy dropped off rapidly if the criterion was changed to calling the boundaries of an exon correctly, and still further if the algorithm was required to predict the entire gene structure correctly. Under the latter requirement, the best gene predictors had a sensitivity of 40% and a specificity of 30%. This means that most of the genes predicted by these programs contain errors, ranging from an incorrect exon boundary to a missed exon. Between 5% and 15% of genes were missed entirely in this contest.
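
To make the nucleotide-level figures above concrete, the following sketch computes sensitivity and specificity for a predicted set of exons against an annotated set. It assumes the convention commonly used in gene-prediction evaluations, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); that convention, the function name `nucleotide_accuracy`, and the toy coordinates are assumptions for illustration and are not taken from the lecture.

```python
def nucleotide_accuracy(true_exons, predicted_exons, length):
    """Per-nucleotide sensitivity and specificity of an exon prediction.

    Exons are given as 0-based half-open (start, end) intervals on a
    sequence of `length` nucleotides. Assumed definitions:
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
    """
    def mask(intervals):
        # Boolean vector: True where a nucleotide lies inside an exon.
        m = [False] * length
        for start, end in intervals:
            for i in range(start, end):
                m[i] = True
        return m

    truth, pred = mask(true_exons), mask(predicted_exons)
    tp = sum(t and p for t, p in zip(truth, pred))       # exonic, called exonic
    fn = sum(t and not p for t, p in zip(truth, pred))   # exonic, missed
    fp = sum(p and not t for t, p in zip(truth, pred))   # non-exonic, called exonic
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


# Toy example: two annotated exons; the prediction clips the first
# exon by 2 nt and overextends the second by 5 nt.
sn, sp = nucleotide_accuracy([(10, 20), (40, 60)], [(12, 20), (40, 65)], 100)
print(f"sensitivity={sn:.3f} specificity={sp:.3f}")
```

In this toy case both exons have a wrong boundary, so the prediction would score zero at the exon-boundary level even though the nucleotide-level scores are high, which mirrors the drop in accuracy described in the text.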

Although an equivalent comparison of gene-prediction programs had not been performed on the human genome, it is safe to assume that these programs would have performed more poorly there because of the lower signal-to-noise ratio. One study showed that GENSCAN accuracy dropped rapidly as intergenic lengths in a simulated data set increased.

Fortunately, we do not have to rely completely on ab initio gene-prediction* programs. The similarity of a region of the genome to a sequence that is already known to be transcribed is a much more powerful predictor of whether a sequence is transcribed. A nucleotide match to a cDNA** or an expressed sequence tag (EST)***, or even a BLASTX**** match to a gene in another species, is good evidence that a region belongs to a gene.
* Ab initio gene prediction is an intrinsic method based on gene content and signal detection.
** Complementary DNA (cDNA) is DNA synthesized from a single-stranded RNA (e.g. mRNA or microRNA) template in a reaction catalyzed by the enzyme reverse transcriptase.
*** An expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence.
**** BLASTX is a program that searches protein databases using a translated nucleotide query.

However, the process of deriving a complete gene model from one or more sequence similarities is not nearly as straightforward as it might sound. There are many problems:
- Pseudogenes are a common feature of eukaryotic genomes.
- Many similarity-based gene-prediction algorithms require evidence that the gene is spliced and that the splices maintain an in-phase open reading frame (ORF). However, this criterion biases gene prediction against single-exon genes.
- cDNA sequences might contain repetitive elements that will cause false genomic matches.
- Similarities to proteins in other species might suffer from evolutionary divergence.
- The presence of alternative splicing considerably complicates the interpretation of alignments between genomic DNA and cDNAs.
- Similarity data is never complete. Even the most comprehensive EST projects will miss low-copy-number transcripts and transcripts that are expressed only under unusual conditions.

Question: Which approach for gene finding is more accurate?
A. Ab initio gene prediction based entirely on DNA analysis
B. Using similarity data (i.e. similarity of a region of the genome to a sequence that is already known to be transcribed)
Question: Can you list a few issues related to approach B (gene finding using similarity of a region of the genome to a sequence that is already known to be transcribed)?

One of the trends in gene prediction was to make as much use of sequence-similarity data as possible. The algorithms that took similarity data into account were generally more successful than those that did not. Some gene-prediction algorithms combine ab initio predictions with similarity data into a single probability model:
- GrailExp (http://pbil.univ-lyon1.fr/members/duret/cours/insa2004/exercise4/pgrail.html)
- Genie EST
- GenomeScan (http://hollywood.mit.edu/genomescan.html)

Several genome-wide gene-annotation systems have instead run sequence-similarity searches and ab initio gene predictors separately, then combined and reconciled the predictions afterwards. For the worm genome, this reconciliation was initially carried out by curators who manually examined each gene prediction in the context of matching ESTs and homologues from other species.
The process was later accelerated significantly by automated procedures for reconciling EST alignments with gene predictions, and by systematically PCR-amplifying a cDNA library using primer pairs that span predicted genes.

For the human working draft, an automated rules-based gene-prediction system was developed that attempted to mimic how a human annotator might examine a sequence. This system gives sequence similarity the highest priority, drawing evidence that a region is transcribed from sequence similarities found in the:
- RefSeq library of well-characterized human genes
- Unigene set of human ESTs
- SWISS-PROT and other protein databases
It then uses an algorithm such as GENSCAN to find and refine the splicing pattern of the predicted gene.

The Human Sequencing Consortium (based on the Ensembl gene annotation system) took almost the reverse approach:
- It begins with ab initio gene predictions from GENSCAN.
- It then strengthens the predictions using nucleotide and protein similarities.
- These predicted gene models are merged and reconciled with the output of Genie EST.
- Finally, the result is merged with the contents of the RefSeq library.

Although the two groups (the human working draft and the Human Sequencing Consortium) approached the gene-finding problem from different directions, both gave greater weight to cDNA and EST alignments than to ab initio gene prediction. So it is not too surprising that the two groups' estimates of the number of genes were very close: ~30,000.

Question: What are the two ways to combine ab initio programs and similarity data?
Question: Why were the gene annotations derived by the Human Sequencing Consortium (i.e. Ensembl) and the human working draft similar?

RefSeqGene, a subset of the NCBI Reference Sequence (RefSeq) project, defines genomic sequences to be used as reference standards for well-characterized genes.
These sequences serve as a stable foundation for establishing conventions for numbering exons and introns, and for defining the coordinates of other variations. RefSeq mRNA and protein sequences have long been used for this purpose, but have the obvious weakness of not providing explicit coordinates for flanking or intronic sequence. RefSeq chromosome sequences do provide explicit coordinates regardless of their relationship to any gene annotation, but have large coordinate values that will change when the sequence is updated because of a re-assembly.

Sequences of the RefSeqGene project address both of these drawbacks by providing a more stable, gene-specific genomic sequence for each gene, including upstream and downstream flanking regions. If modifications must be made to any RefSeqGene sequence, it will be versioned, and tools will be provided to facilitate conversion of coordinates. The RefSeqGene sequences are aligned to reference chromosomes, so both current and previous chromosome coordinates are available through that re-alignment.
Link: https://www.ncbi.nlm.nih.gov/refseq/rsg/about/

GENCODE (https://www.gencodegenes.org)
The goal of the GENCODE project is to identify and classify all gene features in the human and mouse genomes with high accuracy based on biological evidence, and to release these annotations for the benefit of biomedical research and genome interpretation.

GENCODE continues to improve the coverage and accuracy of the human and mouse gene sets by enhancing and extending the annotation of all evidence-based gene features at high accuracy, including protein-coding loci with alternatively spliced variants, non-coding loci and pseudogenes. The process used to create this annotation involves manual curation, computational analysis and targeted experimental approaches.
The human and mouse GENCODE resources will continue to be available to the research community through regular releases of the Ensembl genome browser and the UCSC genome browser, both of which will continue to present the current release of the GENCODE gene set.

Question: Provide the names of two large-scale resources (i.e. databases) that are commonly used for gene annotation.

Non-coding RNAs
There is much more to the genome than coding genes. On the cutting edge of nucleotide-level annotation is the search for non-coding RNAs and transcriptional regulatory regions. Non-coding RNAs include tRNAs*, rRNAs** and small nuclear RNAs.
- tRNAs can be predicted using algorithms that search for characteristic structural signatures.
- rRNAs can be found easily by similarity searching, but the rest are tricky, because of both their short length and their nucleotide diversity.
* Transfer RNA (tRNA) is a small RNA molecule that plays a key role in protein synthesis.
** Ribosomal RNA (rRNA) is found in the ribosomes and accounts for 80% of the total RNA present in the cell.

One of the most widely used tRNA prediction programs is tRNAscan-SE, which combines several algorithms to identify tRNAs with high accuracy and good running times. It can also distinguish active tRNAs from tRNA pseudogenes. This program was used during annotation of the public human sequence to identify 497 tRNAs and 324 pseudogenes.
Link: http://trna.ucsc.edu/tRNAscan-SE/

Other non-coding RNAs, such as telomerase RNA and the U1–12 series of spliceosome RNAs, can be identified by sequence similarity, but there are likely to be many non-coding RNAs that have not yet been identified. Some algorithms look for them by identifying characteristic patterns of mismatched base pairs in cross-species alignments, for example between mouse and human.
There might be hundreds of previously unrecognized non-coding RNAs in the genome. It will be interesting to learn the role and function of non-coding RNAs discovered in this way.

Identifying non-coding RNAs:
- Similarity searching
- Searching for characteristic structural signatures
- Identifying characteristic patterns of mismatched base pairs in cross-species alignments
- Combining several algorithms

Question: Can you name one software tool that can be used for tRNA prediction?

Regulatory Regions
Detecting regulatory sites has its own challenges, particularly because these sites are cell-type specific and vary across cell types. In the past decade there has been significant progress in annotating regulatory regions in the genome. Multiple large-scale projects have attempted to identify and annotate regulatory regions (such as promoters and enhancers) in the genome across different cell types.

ENCODE: Encyclopedia of DNA Elements (> 14,000 samples)
Link: https://www.encodeproject.org
Assays:
- Transcription factors
- Open chromatin regions
- Histone modifications
- DNA methylation arrays
- Whole-genome bisulfite sequencing
Cell types:
- Tissues (liver, heart, stomach, etc.)
- Cell lines
- Primary cells
- In vitro differentiated cells
- Organoids

Roadmap Epigenomics Project
Link: http://www.roadmapepigenomics.org
Assays:
- Histone modifications
- Open chromatin data
- Bisulfite sequencing
Cell types:
- Stem cells
- Brain cells
- Immune cells
- Heart, lung, skin, etc.

Blueprint Epigenome
Link: https://www.blueprint-epigenome.eu

International Human Epigenome Consortium (IHEC)
Link: https://epigenomesportal.ca/ihec/

ChromHMM (Chromatin State Discovery and Characterization)
Link: http://compbio.mit.edu/ChromHMM/
[Figure: ChromHMM output; labels: 98 cell types, genomic regions]

Question: Can you name a few databases that can be used to identify regulatory sites across different cell types?

Transcription Factor Binding Sites
TRANSFAC (Transcriptional Regulation, from Patterns to Profiles)
Link: https://genexplain.com/transfac/#section0
TRANSFAC considers interactions between transcription factors (TFs) and their DNA binding sites (TFBS).
- TFs are described in terms of their structural and functional features, extracted from the original scientific literature.
- Binding of a TF to a genomic site is documented by specifying the localization of the site, its sequence and the experimental method applied.
Source: https://en.wikipedia.org/wiki/TRANSFAC

JASPAR (https://jaspar.genereg.net)
The JASPAR CORE database contains a curated, non-redundant set of profiles, derived from published collections of experimentally defined transcription factor binding sites for eukaryotes. JASPAR data is open access, non-redundant and of high quality.

Question: Can you name two sequence-based transcription factor databases that can be used for transcription factor annotation?

Summary
- Mapping
- Finding Genomic Landmarks
- Gene Finding
- Non-coding RNAs
- Regulatory Regions
- Transcription Factor Binding Sites

References:
- Lincoln Stein, "Genome Annotation: From Sequence to Biology", Nature Reviews, 2001
- RefSeqGene: https://www.ncbi.nlm.nih.gov/refseq/rsg/
- GENCODE: https://www.gencodegenes.org
- ENCODE Project: https://www.encodeproject.org
- Roadmap Epigenomics Project: http://www.roadmapepigenomics.org
- Blueprint Epigenome: https://www.blueprint-epigenome.eu
- IHEC: https://epigenomesportal.ca/ihec/
- ChromHMM: http://compbio.mit.edu/ChromHMM/
- TRANSFAC: https://genexplain.com/transfac/#section0
- JASPAR: https://jaspar.genereg.net
- Primer-BLAST: https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi
- BLASTN: https://blast.ncbi.nlm.nih.gov/Blast.cgi
- SSAHA: https://www.sanger.ac.uk/tool/ssaha/
- Xu et al., "GRAIL: a multi-agent neural network system for gene identification", IEEE Xplore
- GENSCAN: http://hollywood.mit.edu/GENSCAN.html and https://www.genes.mit.edu/genscan.html
- Genie: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC310881/
- HMMGene: https://services.healthtech.dtu.dk/service.php?HMMgene-1.1
- GeneMark.hmm: http://exon.gatech.edu/GeneMark/
- GrailExp: http://pbil.univ-lyon1.fr/members/duret/cours/insa2004/exercise4/pgrail.html
- GenomeScan: http://hollywood.mit.edu/genomescan.html
- Ensembl: https://useast.ensembl.org/index.html