Lecture 7 - Genome Annotation PDF
McMaster University
Summary
This document is a lecture on genome annotation, covering various aspects including techniques like RNA-sequencing, gene identification, and functional annotation. It also discusses tools like MAKER and SNPEff, along with the role of deep learning in genome annotation.
Full Transcript
Lecture 7 – Genome Annotation
BIO4BI3 - Bioinformatics

Where are we going?
[Figure: course roadmap with the stages DNA Sequencing, DNA Sequencing Quality Control, Assembly, DNA Read Mapping, Genome Annotation, Expression Analysis, Polymorphism Discovery, Genotyping, Population Analysis, Marker-Trait Associations]

Learning Objectives
- Define genome annotation and explain its significance in understanding genome function.
- Explain structural and functional annotation processes, and their respective roles.
- Differentiate between the four classes of gene prediction with real-world examples.
- Describe the role of MAKER in gene annotation and how SNPEff is used for SNP functional impact assessment.
- Discuss Deep Learning and how it is used in annotation.
- Explore the Gene Ontology (GO) classification system and its application in functional annotation.

What is Annotation
Genome annotation is the process of identifying the locations of genes and other features in a genome and assigning functions to these elements. This can be broken down into:
- Structural Annotation: Identifying the precise locations of genes, regulatory elements, and repetitive sequences.
- Functional Annotation: Assigning biological functions to the identified genes (e.g., using databases like Gene Ontology or KEGG).
Genome annotation adds biological meaning to a DNA sequence and is critical for linking genotype to phenotype.

Why is annotation important?
Genome sequences without annotation are like a book in a language we don’t understand: annotation translates these sequences into something we can interpret.

Steps of Genome Annotation
1. Gene Identification (Structural Annotation): Identify genes, regulatory elements, and other genomic features based on DNA sequences and RNA-seq data.
2. Functional Annotation: Assign functional roles to identified genes using databases like Gene Ontology and KEGG pathways.
Focus for this lecture will be on structural annotation, with a brief look into how functional annotation adds deeper biological insight.

RNA-based Annotation
- RNA-seq (RNA sequencing) is the dominant method for discovering gene coding regions in the genome.
- RNA-seq helps identify expressed genes, including alternative splice isoforms, especially when used with long-read sequencing technologies.
Why is RNA-seq important?
- Gene expression varies across tissues and developmental stages.
- Long-read RNA-seq helps capture full-length transcripts, which short-read sequencing can miss.
Limitations:
- Not all genes are expressed in all tissues or at all times, making it difficult to annotate the full set of genes using RNA-seq alone.

Gene Prediction Approaches
1. ab initio (signal and context-based): Predicts genes based on sequence features (start/stop codons, splice sites). Tools: Augustus, GeneMark.
2. Homology-based: Compares genome sequences to known genes in other species. Tools: BLASTX, Exonerate.
3. Comparative genomics: Compares entire genomes to predict conserved genes. Tools: MUMmer, MAUVE.
4. Machine learning-assisted approaches: Algorithms learn patterns from existing datasets to predict gene models more accurately. Tools: DNABERT2, DeepGene.

Gene Prediction
- ab initio methods fall into the searching-by-signal and searching-by-context paradigms.
  - This is gene prediction without directly comparing the region of interest to other sequences.
  - Also called ‘template’ or ‘intrinsic’ gene prediction.
- Homology and comparative approaches involve using other DNA sequences to assist in gene prediction.
  - Also called ‘look-up’ or ‘extrinsic’ gene prediction.
- Eukaryotic gene prediction is much more difficult than prokaryotic gene prediction (why?)

Steps in Gene Prediction
- Identify splice sites and start/stop signals along the query sequence.
- Predict candidate exons based on the signals determined in the first step.
- Score exons as a function of both the signals used to detect the exon and the coding statistics of the exon.
  Note: in comparative and homology-based methods the quality of the alignment is factored into the exon score.
- Assemble a subset of the candidate exons into a predicted gene model. The choice of exons is made to maximize a scoring function that depends on the individual exons and the overall structure of the putative gene.

Defining an exon
Four basic signals are involved in defining an exon:
1. The translational start site
2. 5’ donor splice site
3. 3’ acceptor splice site
4. Translational stop codon

Signal                      Example Sequence
Translational start site    ATG
5’ donor splice site        (A/C)AG|GT(A/G)AGT
3’ acceptor splice site     (C/T)AG|G
Translational stop codon    TAA, TAG, TGA

These four signals are derived from position weight matrices built from known functional signals.

Scoring an Exon
- Besides the four signals used to search for exons, there are also important content-based features.
- Initial exons: ORFs delimited by a start site and a 5’ donor site
- Internal exons: ORFs delimited by a 3’ acceptor site and a 5’ donor site
- Terminal exons: ORFs delimited by a 3’ acceptor site and a stop codon
- Also consider upstream elements such as TATA boxes
- Also consider downstream elements such as poly-A signals

Composition Bias
- Organisms have a ‘preference’ for particular codons for an amino acid.
- Recall that most amino acids are coded in DNA by more than one triplet. Methionine (ATG) and tryptophan (TGG) are the exceptions, each coded by a single triplet.
- This bias can be observed by plotting the frequency of pairs of nucleotides in introns and exons.

Cumulative Frequency of Pairs
- From 500 exons and 500 introns, the frequency of the nucleotide pair A..A was computed, where the two nucleotides are separated by some distance k.
- In exons the frequency of A..A pairs varies periodically with k, reflecting the three-base codon structure, while in introns it does not.
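The A..A pair statistic described above can be sketched in a few lines of Python. This is only an illustration: the helper names `pair_frequency` and `pair_profile` and the toy sequence are assumptions made for this example, not the analysis behind the lecture figure, which was computed from 500 real exons and 500 real introns.

```python
def pair_frequency(seq: str, base: str, k: int) -> float:
    """Fraction of positions i where seq[i] and seq[i + k] are both `base`."""
    seq = seq.upper()
    n = len(seq) - k  # number of positions that have a partner k bases away
    if n <= 0:
        return 0.0
    hits = sum(1 for i in range(n) if seq[i] == base and seq[i + k] == base)
    return hits / n


def pair_profile(seq: str, base: str = "A", max_k: int = 9) -> dict:
    """Pair frequency for every separation k from 1 to max_k."""
    return {k: pair_frequency(seq, base, k) for k in range(1, max_k + 1)}


# Artificial sequence with an A at every third position (hypothetical data).
toy_sequence = "ACG" * 20

for k, freq in pair_profile(toy_sequence).items():
    print(f"k={k}: {freq:.3f}")
```

On this artificial sequence the frequency is non-zero only when k is a multiple of three. Real exons would show a much weaker version of the same periodic pattern, and the profile would normally be averaged over many sequences.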
- This also occurs with the other bases, and it is an indication of codon bias in exons.

The Human Codon Usage Table
(each entry: amino acid, codon, frequency per thousand codons, fraction among synonymous codons)

Gly GGG 17.08 0.23    Arg AGG 12.09 0.22    Trp TGG 14.74 1.00    Arg CGG 10.40 0.19
Gly GGA 19.31 0.26    Arg AGA 11.73 0.21    End TGA  2.64 0.61    Arg CGA  5.63 0.10
Gly GGT 13.66 0.18    Ser AGT 10.18 0.14    Cys TGT  9.99 0.42    Arg CGT  5.16 0.09
Gly GGC 24.94 0.33    Ser AGC 18.54 0.25    Cys TGC 13.86 0.58    Arg CGC 10.82 0.19
Glu GAG 38.82 0.59    Lys AAG 33.79 0.60    End TAG  0.73 0.17    Gln CAG 32.95 0.73
Glu GAA 27.51 0.41    Lys AAA 22.32 0.40    End TAA  0.95 0.22    Gln CAA 11.94 0.27
Asp GAT 21.45 0.44    Asn AAT 16.43 0.44    Tyr TAT 11.80 0.42    His CAT  9.56 0.41
Asp GAC 27.06 0.56    Asn AAC 21.30 0.56    Tyr TAC 16.48 0.58    His CAC 14.00 0.59
Val GTG 28.60 0.48    Met ATG 21.86 1.00    Leu TTG 11.43 0.12    Leu CTG 39.93 0.43
Val GTA  6.09 0.10    Ile ATA  6.05 0.14    Leu TTA  5.55 0.06    Leu CTA  6.42 0.07
Val GTT 10.30 0.17    Ile ATT 15.03 0.35    Phe TTT 15.36 0.43    Leu CTT 11.24 0.12
Val GTC 15.01 0.25    Ile ATC 22.47 0.52    Phe TTC 20.72 0.57    Leu CTC 19.14 0.20
Ala GCG  7.27 0.10    Thr ACG  6.80 0.12    Ser TCG  4.38 0.06    Pro CCG  7.02 0.11
Ala GCA 15.50 0.22    Thr ACA 15.04 0.27    Ser TCA 10.96 0.15    Pro CCA 17.11 0.27
Ala GCT 20.23 0.28    Thr ACT 13.24 0.23    Ser TCT 13.51 0.18    Pro CCT 18.03 0.29
Ala GCC 28.43 0.40    Thr ACC 21.52 0.38    Ser TCC 17.37 0.23    Pro CCC 20.51 0.33

Codon Usage Log-Likelihood
- Let $F(c)$ be the frequency (probability) of codon $c$ in the genes of the species under consideration (from the codon usage table above).
- Then, given a sequence of codons $C = C_1 C_2 \ldots C_m$, and assuming independence between adjacent codons, $P(C) = F(C_1)\,F(C_2) \cdots F(C_m)$ is the probability of finding the sequence of codons $C$ knowing that $C$ codes for a protein.
- For instance, if $S$ is the sequence $S$ = TCTACG, when read in frame 1 it results in the sequence of codons $C_1$ = TCT, $C_2$ = ACG.
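The definition of P(C) can be turned into a short Python sketch, together with the log-odds comparison against a uniform 1/64 background that the text works through next. The function names (`split_codons`, `coding_probability`, `log_odds`) are assumptions made for this example, and only the two codons needed for the TCTACG example are copied from the table.

```python
import math

# Per-thousand codon frequencies copied from the human codon usage table
# (only the two codons used in the TCTACG example are listed here).
PER_THOUSAND = {"TCT": 13.51, "ACG": 6.80}


def split_codons(seq: str, frame: int = 1) -> list:
    """Split seq into complete codons, reading in frame 1, 2 or 3."""
    start = frame - 1
    usable = len(seq) - (len(seq) - start) % 3  # drop a trailing partial codon
    return [seq[i:i + 3] for i in range(start, usable, 3)]


def coding_probability(codons: list, per_thousand: dict) -> float:
    """P(C) = F(C1) * F(C2) * ... * F(Cm), assuming independent codons."""
    p = 1.0
    for codon in codons:
        p *= per_thousand[codon] / 1000.0
    return p


def log_odds(codons: list, per_thousand: dict) -> float:
    """log10 of P(C) over a uniform background where every codon has 1/64."""
    p_coding = coding_probability(codons, per_thousand)
    p_random = (1.0 / 64.0) ** len(codons)
    return math.log10(p_coding / p_random)


codons = split_codons("TCTACG", frame=1)
print(codons)
print(coding_probability(codons, PER_THOUSAND))
print(log_odds(codons, PER_THOUSAND))
```

Because the code uses the unrounded table values (0.01351 and 0.0068), its numbers differ slightly from the hand calculation in the text, which rounds the frequencies to 0.014 and 0.007. The sign of the score, which is what matters for deciding whether the frame looks coding, is the same.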
- Then $P(S) = P(C) = F(\mathrm{TCT})\,F(\mathrm{ACG})$. Substituting the appropriate values from the codon usage table (13.51 and 6.80 per thousand, rounded), we obtain $P(S) = P(C) = 0.014 \times 0.007 = 0.000098$.

Codon Usage Log-Likelihood
- Let $F_0(c)$ be the frequency of codon $c$ in a non-coding sequence, so that $P_0(S) = P_0(C) = F_0(C_1)\,F_0(C_2) \cdots F_0(C_m)$.
- Assuming a random model for non-coding DNA, $F_0(c) = 1/64 = 0.0156$, so $P_0(C) = 0.0156 \times 0.0156 = 0.000244$.
- Therefore the TCT and ACG codons occur less frequently in coding regions than expected at random, and this frame is unlikely to be coding.
- This is expressed as a log-odds ratio: $LP(S) = \log_{10}\frac{P(S)}{P_0(S)} = \log_{10}\frac{0.000098}{0.000244} = \log_{10}(0.402) = -0.396$.
- $LP(S)$ rises above zero when $P(S)$ exceeds $P_0(S)$, that is, when the codons of the sequence are more frequent in coding regions than expected at random.

[Figure: codon usage log-likelihood of β-globin]

Hidden Markov Model in Gene Finding
- Markov models refer to a series of observations in which the probability of an observation depends on a number of previous observations.
- The order of the model is the number of previous observations required to compute the current probability.
- Markov chains of order 5 are most often used for DNA, where codon bias and codon adjacencies are important.
- HMMs are used to predict whether a base is in an intron, exon, or intergenic region.
- The model must take into account the known structures of genes.

HMM in Gene Finding
An HMM:
- Must ‘know’ about conserved sequence or compositional bias that exists for the regions
- Must take into account a controlled syntax of gene structure, i.e. promoters come before start codons
- Note: the elements are called ‘states’ in an HMM
How would we describe the elements of a gene?
- Promoter region
- Transcriptional start site
- 5’ untranslated region
- Start codons
- Exons
- Splice donors
- Introns
- Splice acceptors
- Stop codons
- 3’ untranslated region

HMM in Gene Finding
- The syntax mentioned above allows a transition probability to be assigned, that is, how likely it is that we move from an exon to an intron based on composition or bias.
- The transition probabilities are calculated from training sets: known genes with carefully defined regions.
- The parameters used to tune an HMM differ from organism to organism, so a proper training set is essential to success.
- GENSCAN, GENIE and HMMgene are all gene prediction programs which use hidden Markov models.

Homology-based Gene Prediction
- Using closely related sequences to search a DNA space for similar genes
- Programs such as BLASTX are well suited for this, where the DNA region of interest is compared to a protein database
- A limitation of using the BLAST suite of programs for gene finding is that they don’t define intron/exon boundaries well
- This limitation can be addressed by using the ab initio methods mentioned above
- ESTs from other species can be compared using BLASTN or TBLASTX. This can be very useful to identify conserved coding regions
- As with other BLAST methods, intron/exon boundaries may not be well defined
- Exonerate is a slower but very effective similarity-based prediction tool using the est2genome or protein2genome models

Homology-based Gene Prediction
- Historically, in annotation projects tremendous effort was put into generating cDNAs from the species under investigation
- Multiple tissues under multiple conditions need to be captured. With older sequencing methods this became very expensive
- With NGS, nearly the complete transcriptional state of a tissue under various conditions can be captured in a single run
- The high sampling allows for the capture of alternative splicing events
- NGS has changed the landscape of gene annotation, where ab initio methods are now used as a secondary approach

Comparative Prediction of Conserved Regions
- Much effort has been put into predicting gene regions
- Many important chromosome regions are not genic
- Comparing genomes of different species or ecotypes can reveal regions that are under some evolutionary pressure and therefore have not diverged as much as other regions
- microRNA can be coded in regions far from other genes.
- These small non-translated elements can be difficult to identify
- Conservation of miRNA across species is one method of identification
- The human genome has approximately 1900 miRNAs, likely controlling tens of thousands of genes

Mass Spec assisted Annotation

Protein Assisted Annotation

Annotation with Long-Reads
Challenges of Short-Read Sequencing:
- Short reads can miss large, complex regions and alternative splicing isoforms.
Why Long-Read Sequencing Matters:
- Captures full-length transcripts, improving gene model accuracy.
- Essential for resolving complex genomic regions, such as repeats, transposons, and structural variants.
Applications:
- Discovering new isoforms, lncRNAs (long non-coding RNAs), and transposable elements.
- Improving functional annotation by revealing novel regulatory regions.

SNPEff
- If you have re-sequenced a genome you may want to better understand the impacts of the polymorphisms you’ve discovered
- SNPEff is a very popular suite of software which predicts the functional impact of a polymorphism in a gene (https://pcingola.github.io/SnpEff/)
- SNPEff requires the gene annotations for the genome against which the re-sequencing reads were aligned
- SNPEff also requires a file of the polymorphisms detected and their positions in the reference genome

SNPEff Output

MAKER Genome Annotation
1. Repeat Masking: Identifies and masks repetitive sequences in the genome.
2. Evidence Alignment: Uses known genes, proteins, and RNA-seq data to align against the genome, providing a reference for gene models.
3. Gene Prediction (ab initio): Combines alignments with signal-based predictions to model gene structures.
4. Final Gene Model Integration: Integrates evidence-based models with ab initio predictions to produce the final set of gene models.
5. Functional Annotation: Uses BLAST, InterProScan, and other tools to assign functional information to genes.
Tools involved: RepeatMasker, BLAST, SNAP, Augustus, GeneMark.
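The integration idea in step 4 can be illustrated with a toy Python sketch that ranks competing gene models by how much of their exon sequence is supported by aligned evidence. This is not MAKER's actual algorithm or scoring scheme: the functions `covered_bases`, `evidence_support` and `pick_best_model`, the model names, and all coordinates are hypothetical and exist only for this example.

```python
def covered_bases(intervals: list) -> set:
    """Set of 1-based positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def evidence_support(model_exons: list, evidence_alignments: list) -> float:
    """Fraction of the model's exon bases that overlap any evidence alignment."""
    model = covered_bases(model_exons)
    if not model:
        return 0.0
    evidence = covered_bases(evidence_alignments)
    return len(model & evidence) / len(model)


def pick_best_model(candidates: dict, evidence_alignments: list) -> str:
    """Name of the candidate gene model with the highest evidence support."""
    return max(
        candidates,
        key=lambda name: evidence_support(candidates[name], evidence_alignments),
    )


# Hypothetical evidence alignments (for example, from RNA-seq or proteins).
evidence = [(100, 200), (300, 400)]

# Two hypothetical ab initio predictions for the same locus.
candidates = {
    "model_1": [(100, 200), (300, 400)],
    "model_2": [(100, 250), (300, 400)],
}

for name, exons in candidates.items():
    print(name, round(evidence_support(exons, evidence), 3))
print("best:", pick_best_model(candidates, evidence))
```

A set of positions keeps the sketch short and easy to follow. For genome-scale data an interval-based overlap calculation would be the more practical choice.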

MAKER continued
[Figure: MAKER workflow]
- Repeat Masking: Repeat Masker, Repeat Builder
- Evidence Alignment: BLASTX, TBLASTX, BLASTN, BWA, Exonerate
- Ab initio Prediction: SNAP, Augustus, GeneMark
- Final Model: MAKER
- Functional Annotation: BLASTN, BLASTP

MAKER continued
http://gmod.org/wiki/MAKER

Genome Browser Visualization

Deep Learning and Genome Annotation
What is Deep Learning?
- Deep Learning is a subset of machine learning that uses neural networks with multiple layers to automatically learn features from large datasets.
- It is particularly useful for complex data patterns that traditional methods struggle to identify.
Why Use Deep Learning for Genome Annotation?
- Genomes are massive and complex, requiring sophisticated models that can handle vast amounts of data.
- Traditional rule-based methods or simpler machine learning models rely heavily on manual feature engineering, while deep learning can automatically discover relevant patterns.

Applications of Deep Learning in Annotation
Gene Prediction
- The latest software uses transformer neural networks to predict gene coding elements by learning from DNA sequence features.
- Examples: DeepGene, DNABERT2
Alternative Splicing Detection
- Deep learning models can predict splice sites (intron-exon boundaries) with high accuracy, addressing the challenge of complex splicing events in eukaryotes.
- Examples: SpliceAI
Enhancer and Promoter Identification
- Deep learning has been employed to classify non-coding regulatory regions such as enhancers and promoters using sequence data and chromatin state information.
- Examples: DeepEnhancer, DeepCAPE

How Deep Learning Works in Annotation
1. Input Data: Raw DNA sequences are input into the model along with other data that provide context, such as existing annotations or information on chromatin states. These are the ‘labeled data’.
2. Feature Extraction: Neural networks use the input data to automatically detect features, such as motifs (e.g., start/stop codons, splice sites) or regulatory elements, through a process called ‘training’.
3. Training the Model: The model is trained on labeled data (known genes or regulatory elements) to learn patterns. It improves its predictions by minimizing error through backpropagation.
4. Prediction: Once trained, the model can predict features in unlabeled genomes, such as gene structures, splice sites, or regulatory regions.

The Future of Genome Annotation
Challenges:
- Data availability: Deep learning requires large, labeled datasets to train effective models.
- Computational resources: Training deep learning models is computationally intensive.
Future Directions:
- Transfer Learning: Models pre-trained on one genome can be fine-tuned for annotation of other species, reducing the need for massive datasets.
- Integration of multi-omics data: Deep learning models that combine DNA, RNA, epigenetic, and proteomic data can provide even more accurate annotations.
- Crowdsourcing and community-driven efforts: Combining human expertise with deep learning models to improve annotations in a continuous cycle.

Gene Ontology Consortium
- A systematic classification of gene function
- It defines a controlled vocabulary with standardized terms and the relationships among them
- Can be viewed as a dictionary and rules of syntax
- Three categories organize genes based on broad criteria
- A gene can exist in multiple categories
- Started in 1999 with the fruit fly project but has expanded to include many species including mammals and plants
- www.geneontology.org

GO Categories
- Molecular function – The function of the gene from a biochemist’s point of view. The description can be general or specific. Examples: zinc ion binding, ubiquitin-protein ligase activity
- Biological process – The function of the gene from a cell’s point of view. A component of the activities of a living system. Examples: cell division, histone methylation
- Cellular component – The region of activity in general or specific terms.
  Examples: nucleus, mitochondrion
[Figure: Molecular Function, Biological Process, Cellular Component]

Why Use GO
- Gives context to the identity of genes
- Allows a higher level of abstraction to identify trends in expression content
- Allows a common vocabulary to identify genes in a pathway
- Good to characterize genome content
- Excellent to identify gaps in metabolic pathways
- Excellent to identify trends in differential gene expression

BLAST2GO

GO Chart

AMIGO

Functional Annotation and Pathway Databases
- Gene Ontology (GO) helps categorize the function of a gene into molecular function, biological process, and cellular component.
- Pathway databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes) or Reactome map genes to known metabolic or signaling pathways.
- KEGG Pathways: Focuses on molecular interactions, reaction networks, and higher-order functional contexts.
- Reactome: Focuses on reactions and pathways in human biology but covers other species too.

KEGG Pathway Database

Why Use Pathway Databases
They help to visualize how a gene or a set of genes functions in the context of biological pathways. Linking genes to pathways allows researchers to:
1. Understand interactions between genes and gene products.
2. Identify disrupted pathways in disease states.
3. Perform network analysis to see how genes interact with each other in broader biological processes.

Network Analysis
Network Analysis is the study of biological networks, where nodes represent genes/proteins and edges represent interactions (physical, regulatory, or signaling).
Applications of Network Analysis:
1. Gene Co-expression Networks: Identifying groups of genes that are co-expressed under certain conditions, which often implies functional relationships.
2. Protein-Protein Interaction Networks (PPIs): Studying how proteins interact to form complexes and carry out functions.
3. Pathway Integration: Linking annotated genes to known pathways can reveal:
   1. Hub genes (genes that interact with many others, often key regulators).
   2. Bottleneck genes (genes critical to information flow in biological networks).

Co-Expression Network Example
Zhu, Xuan & Li, Tao & Niu, Xing & Chen, Lijie & Ge, Chunlin. (2020) 20. 10.3892/ol.2020.11903.
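
Returning to the hub genes described under Network Analysis, the idea of "genes that interact with many others" can be sketched by counting each node's distinct neighbours in an edge list. This is a minimal illustration: the function names `degree_by_gene` and `hub_genes` and the gene names are hypothetical and are not taken from the cited co-expression study.

```python
from collections import defaultdict


def degree_by_gene(edges: list) -> dict:
    """Number of distinct interaction partners for each gene in the network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {gene: len(partners) for gene, partners in neighbours.items()}


def hub_genes(edges: list, top_n: int = 1) -> list:
    """The top_n genes by degree; ties are broken alphabetically."""
    degrees = degree_by_gene(edges)
    ranked = sorted(degrees, key=lambda gene: (-degrees[gene], gene))
    return ranked[:top_n]


# Hypothetical undirected interaction network (nodes are genes).
toy_edges = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneD", "geneE"),
]

print(degree_by_gene(toy_edges))
print(hub_genes(toy_edges, top_n=2))
```

Degree is only the simplest way to identify hubs. Finding bottleneck genes would need a path-based measure such as betweenness centrality, which this sketch does not compute.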