Genome Annotation PDF

Document Details

Uploaded by HeroicAntigorite4836

Tags

genome annotation, gene prediction, biological information, molecular biology

Summary

This document provides an overview of genome annotation, covering topics such as gene identification, structure prediction, functional annotation, and methods used in the process. It includes detailed explanations of different approaches used in genome analysis, such as identifying open reading frames (ORFs), structural analysis, homology-based prediction, and use of ontologies.

Full Transcript

GENOME ANNOTATION

Genome annotation
Genome annotation is the process of identifying the locations of genes and coding regions in a genome and determining what those genes do. It means finding the structural elements of the genome and attaching the related function to each genomic location.

Genome annotation has two parts:
- Gene structure prediction: identifying elements (introns/exons, CDS, start, stop) in the genome.
- Gene function prediction: attaching biological information to these elements, e.g. which protein an exon codes for.

Steps in genome annotation
1. Identify repetitive sequences (mask these for subsequent steps).
2. Identify structural RNA-encoding genes (by comparison to known rRNA / tRNA sequences).
3. Identify protein-encoding genes (ORFs).
4. Identify the functions of these genes.

Identifying ORFs (1)
- Relatively easy in bacteria: the sequence is scanned for ORFs (sequences between a start codon and a stop codon) longer than a fixed length (e.g. 100 amino acids).
- More complicated in eukaryotes because of introns.

Identifying ORFs (2)
- Consensus sequences for splicing are short and vary among species.
- Gene prediction programs are trained using sequences of known genes.
- EST sequences / RNA-Seq data are often used as the training set.
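
The bacterial ORF scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular gene finder: the function name `find_orfs`, the choice of ATG as the only start codon, and the forward-strand-only scan are all assumptions made for brevity. A real annotation pipeline would also scan the reverse complement and allow alternative start codons.

```python
# Minimal sketch: scan the three forward reading frames for ORFs
# (start codon ... stop codon) longer than a fixed length.
# Assumptions: ATG is the only start codon; reverse strand is not scanned.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    An ORF is reported only if it has at least min_codons codons
    before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOPS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence with one short ORF (ATG AAA TTT GGG TAA) in frame 2.
example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example, min_codons=3))  # [(2, 17, 2)]
```

The `min_codons` threshold plays the role of the "fixed length" cut-off: raising it to the 100 codons mentioned above would filter out the toy ORF here.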
Genome annotation - workflow
1. Genome sequence
2. Map repeats (masked or un-masked)
3. Gene finding - structural annotation: ncRNAs, introns, protein-coding genes
4. Functional annotation
5. Viewed and released in a genome viewer

STRUCTURAL ANNOTATION

Structural annotation
Identification of genomic elements:
- Open reading frames and their localization
- Coding regions
- Location of regulatory motifs
- Start/stop
- Splice sites
- Non-coding regions/RNAs

Eukaryote genome annotation
The slide pairs each stage of gene expression with an annotation task:
- Genome: find locus
- Transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, ATG ... STOP, AAAn tail): find exons using transcripts
- Translation -> polypeptide: find exons using peptides
- Protein folding -> folded protein with enzyme/functional activity: find function

Prokaryote genome annotation
- Genome: find locus
- Transcription -> primary transcript -> RNA processing -> processed RNA (START ... STOP, START ... STOP): find CDS
- Translation -> polypeptide -> protein folding -> folded protein with enzyme/functional activity: find function

Genome repeats & features
- Polymorphic between individuals/populations
- Percentage of repetitive sequences in different organisms:

Genome              Genome size (Mb)   % Repeat
Aedes aegypti       1,300              ~70
Anopheles gambiae   260                ~30
Culex pipiens       540                ~50

- Microsatellite
- Minisatellite
- Tandem repeat
- Short tandem repeat
- SSR

Finding repeats as a preliminary to gene prediction
- Repeat discovery
  - Literature and public databanks
  - Homology-based approaches
  - Automated approaches (e.g. RepeatScout or RECON)
  - Tandem repeats: Tandem, TRF
- Use RepeatMasker to search the genome and mask the sequence

Masked sequence
- A repeat-masked sequence is an artificial construction in which regions thought to be repetitive are marked with X's.
- Widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of transposable elements (TEs) in the final annotation set.

>my sequence
atgagcttcgatagcgatcagctagcgatcaggct
actattggcttctctagactcgtctatctctatta
gctatcatctcgatagcgatcagctagcgatcagg
ctactattggcttcgatagcgatcagctagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctactattggctgatcttaggtcttctga
tcttct

>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct

Positions/locations are not affected by masking.

Methods
- Similarity: similarity between sequences, which does not necessarily imply any evolutionary linkage
- Ab initio prediction: prediction of gene structure from first principles using only the genome sequence

Genefinding: ab initio, similarity

Genefinding - ab initio predictions
Use compositional features of the DNA sequence to define coding segments (essentially exons):
- ORFs
- Coding bias
- Splice site consensus sequences
- Start and stop codons
Methods:
- Training sets are required
- Each feature is assigned a log-likelihood score
- Dynamic programming is used to find the highest-scoring path
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh

Genefinding - similarity
- Use known coding sequence to define coding regions
  - EST sequences
  - Peptide sequences
- Problem: handling fuzzy alignment regions around splice sites
- Examples: EST2Genome, exonerate, genewise

Gene-finding - comparative
- Use two or more genomic sequences to predict genes based on conservation of exon sequences
- Examples: Twinscan and SLAM
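
As a small illustration of the repeat-masking step covered earlier in this section, the sketch below replaces repeat intervals with `x` characters and checks that the coordinates of the remaining bases are unchanged. The function name `mask_repeats` and the repeat coordinates are hypothetical; this is not how RepeatMasker is invoked, only a picture of what its masked output looks like.

```python
# Minimal sketch of repeat masking: overwrite repeat intervals with a
# mask character so that sequence length (and coordinates) are preserved.
# The interval list below is hypothetical, chosen only for illustration.


def mask_repeats(seq, repeats, mask_char="x"):
    """Replace each half-open interval [start, end) with mask_char."""
    masked = list(seq)
    for start, end in repeats:
        # Clamp the interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(seq))):
            masked[i] = mask_char
    return "".join(masked)


seq = "atgagcttcgatagcgatcagctagcgatcaggct"
repeats = [(6, 15), (20, 25)]  # hypothetical repeat coordinates
masked = mask_repeats(seq, repeats)
print(masked)

# Masking substitutes characters in place, so the length is unchanged
# and unmasked bases keep their original positions.
print(len(masked) == len(seq), masked[15:20] == seq[15:20])  # True True
```

Because characters are substituted rather than deleted, any gene model predicted on the masked sequence can be reported directly in the coordinates of the original sequence.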
Gene-finding omissions
- Alternative isoforms
  - Currently there is no good method for predicting alternative isoforms
  - Only created where supporting transcript evidence is present
- Pseudogenes
  - Each genome project has a fuzzy definition of pseudogenes
  - Badly curated/described across the board
- Promoters
  - Rarely a priority for a genome project
  - Some algorithms exist but are usually not integrated into an annotation set

FUNCTIONAL ANNOTATION

Functional annotation
Attaching biological information to genomic elements:
- Biochemical function
- Biological function
- Regulation and interactions involved
- Expression
Apply known structural information to the predicted protein sequence.

Tools to identify functions of genes
- Sequence similarity (BLAST) searches.
- Protein family / domain analysis (Pfam).
- Predicted sub-cellular localisation (SignalP / PSORT).
- Transcriptomic analysis: under what conditions genes are expressed.
- Comparative genomics: comparison with genomes of other closely related organisms.

Functional classification
- Protein-coding genes are classified based on their function.
- Hierarchical gene ontologies are used, e.g. GO, MIPS, EC.

Functional annotation - homology based
- Predicted exons/CDS/ORFs are searched against the non-redundant protein database (NCBI, SwissProt) to look for similarities.
- Visually assess the top 5-10 hits to identify whether these have been assigned a function.
- Functions are assigned.

Functional annotation - other features
Other features which can be determined:
- Signal peptides
- Transmembrane domains
- Low-complexity regions
- Various binding sites, glycosylation sites etc.
- Protein domains
See http://expasy.org/tools/ for a good list of possible prediction algorithms.

Functional annotation - other features (ontologies)
Use of ontologies to annotate gene products:
- Gene Ontology (GO)
  - Cellular component
  - Molecular function
  - Biological process

Functional annotation - output
(Slide footer: Bioinformatics tools for Comparative Genomics of Vectors, August 2008)

Annotation - a summary
- Annotation accuracy is only as good as the supporting data available at the time of annotation; updating the information is necessary.
- Gene predictions will change over time as new data become available (ESTs, related genomes) that are more similar than the previous ones.
- Functional assignments will change over time as new data become available (characterization of hypothetical proteins).

Ontology
An ontology is a "formal, explicit specification of a shared conceptualization".
Two major formal ontology schemes:
- EC: Enzyme Commission number
- GO: Gene Ontology

Enzyme Commission (EC)
- A large-scale, comprehensive attempt to organize and classify enzymes according to their function.
- For inclusion in the list, direct experimental evidence of the claimed activity must be provided.
- Organizes the list of enzymes in four levels of hierarchy, starting with the 7 top-level classes:
1. Oxidoreductases
2. Transferases
3. Hydrolases
4. Lyases
5. Isomerases
6. Ligases
7. Translocases

Chronology: Enzyme Commission (EC)
Cons of EC:
- The hierarchy only provides parent-to-child relationships.
- Specific to enzymes only (does not cover all proteins).

What is the Gene Ontology (GO)?
- Molecular function
- Biological process
- Cellular component
Relations between the terms:
- 'is_a'
- 'part_of', 'has_part'
- 'regulates'

Structure of GO
Source: du Plessis L, Skunca N, Dessimoz C (2011). The what, where, how and why of gene ontology–a primer for bioinformaticians. Brief Bioinform. doi: 10.1093/bib/bbr002

Where Do Annotations Come From?
- Inferred from experiment
  - Most reliable
  - Base for computational methods
- Inferred from computational method
  - Sequence similarity, structural similarity, etc.
- Inferred from author statement
- Curator statement and obsolete evidence codes

Why use the GO?
The 'GO Consortium' consists of a number of large databases working together to define standardized ontologies and provide annotations to the GO. Uses:
- Search for interacting genes
- Reason across the relations
- Analyze the results of high-throughput experiments
- Infer the function of un-annotated genes and infer protein-protein interactions

Pros and Cons
- Homology is useful, but it is not the same thing as "same" function; it simply implies common ancestry.
- The quality of a prediction is only as good as the quality of annotation of the database.
- A eukaryotic function predictor cannot be used for prokaryotes, and vice versa.

Flowchart (27th Feb 2012)

Comparison of gene catalogues of eukaryotic genomes
Comparative genomics:
- Comparison of gene catalogues between different species.
- Can be used to identify groups of proteins that are specific to certain phylogenetic groups.
- Example: comparison of the human gene catalogue with other species (figure: proportion of human genes that have orthologues in other groups).
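
The gene-catalogue comparison above amounts to simple set arithmetic once orthologue assignments exist. The sketch below is a toy illustration only: the gene names, the group names, and the orthologue assignments are all hypothetical, and the function `orthologue_proportions` is not taken from any library. In practice the assignments would come from an orthology inference tool.

```python
# Minimal sketch: given, for each gene in one catalogue, the set of
# phylogenetic groups in which an orthologue was found, compute the
# proportion of genes with orthologues in each group.
# All gene names, group names and assignments below are hypothetical.


def orthologue_proportions(orthologues, groups):
    """Return {group: fraction of genes with an orthologue in that group}."""
    total = len(orthologues)
    if total == 0:
        return {g: 0.0 for g in groups}
    return {
        g: sum(1 for found in orthologues.values() if g in found) / total
        for g in groups
    }


human_orthologues = {
    "geneA": {"vertebrates", "invertebrates", "fungi"},
    "geneB": {"vertebrates"},
    "geneC": {"vertebrates", "invertebrates"},
    "geneD": set(),  # no orthologue detected in any group
}
groups = ["vertebrates", "invertebrates", "fungi"]

props = orthologue_proportions(human_orthologues, groups)
print(props)  # {'vertebrates': 0.75, 'invertebrates': 0.5, 'fungi': 0.25}

# Genes whose orthologues are restricted to a single group are candidates
# for proteins specific to that phylogenetic group.
vertebrate_specific = sorted(
    g for g, found in human_orthologues.items() if found == {"vertebrates"}
)
print(vertebrate_specific)  # ['geneB']
```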
