Bioinformatics BIOT8010 Lecture Notes PDF
University College Cork
Dr. Francesca Bottacini
Summary
These lecture notes cover the fundamentals of gene prediction, including the identification of predicted coding exons in a DNA sequence and the statistical scoring systems used to locate true exons. The notes discuss the challenges of gene prediction when considering diverse genetic differences between organisms and introduce the concept of open reading frames.
Full Transcript
Bioinformatics - BIOT8010 Week 4 (Lectures 7-8)
Dr. Francesca Bottacini

Gene prediction

Gene prediction is the first step of sequence annotation: the identification of predicted coding exons in an unannotated DNA sequence (typically a genome sequence obtained by NGS sequencing and assembly).

Gene finding is a predictive method that uses a statistical scoring system to determine the probability that a given sequence is a true exon.

Computational identification of gene models remains a challenging task, because of the differences in genetics and in gene encoding between organisms (e.g. prokaryotes, eukaryotes and viruses).

A SINGLE MODEL DOESN’T FIT ALL! The choice of an appropriate gene finding algorithm, designed for the particular target organism, is necessary to ensure optimal gene prediction.

Example sequence shown on the slides:

>My new sequence (5’ to 3’)
tgaccacgccggtcaagccggtggaattcaacgggcagcaggaagccgattccgatgaatcgatggactacc
gctacctgctggtgccgatgcgctttaacagctgacgcagaatctctcacatactgcacatgcatatctcccgtc
ttgcactcgaccactaccgctcatggagccaggtagtggtcgatttcgtaccaggagtcaatatcctggtcggc
aagaacggcttgggcaaaaccaatctggtggaggccgttgaagtgctctccaccggagctagtcatcgtgcct
ccagcatgctgccgcttatcgaacgcggccaaaccactgccactaggaggcagccaacgtatgcgacgacga
tgggcaaagcaccacgtatgcggcatcgattcatgctcgcggcgcgaatcgggcacgcatcaattcaggaacc
tcgctctatttacgcgatatcatcggccgcattcccagcgtttcgtttactcctgtagaccagagattggtatcgg
gtggtcctggtgcccggcgaacgttgctcaaccaagccggagccctgctggaacccggctatatgcagtcgttg
catcaattcacacgcatcggcaagcagcgcgccacattgctgtagcagcttggcgccagcgcgaatactgggc
aaccggtggatgccgtattaagcggtttggaaatatggaccggacagtttatcgaagcgggtgtggaactgcc
ccgtatgcgtgcgagagtcatcgacttgctggctgggccgtttgccgcgctgtacgctgggttgcctggcaacg
atgtcaccgtcagcctggcgtatgcgccgtcatttgccgaagtgctgctgcaagacgatccacgattgggcatc
agtggacatttccagcgcatctaccctggagaagtggctcgaggcgtcaatcttatcggccctcaacgtgccga
tttggccctgcatcttgctgctatgccagcccgcgaattcgcctcgaacggcgagatgtggaccatggcgctgg
ccttgtaaatggcgctattccaagtgatacgccagagattgggccttaggcccatcgttatccttgatgatgtgtt
cgcccagcttgacgacaaccgtcgtacccagattctggattttgcgcgccggcaagatcaggtgcttatcaccg
tggcgtcagaaggcgatgtgcccacgcatgaatcggacaatgtgcttgatattgcgcaattgatgccatccgcc
gcgcagtctgggggcgggagcgagacacaatcatgaagcctcccattgcctgtcagctgcatgtggacgaaa
ccaaactacccgcagaaatatttgaacgtctctctcaccgcggggcgatactgaaagaccgacgccgcagac
gcgaggaagcctttgagaacttcggcaagcccggccgtgatccggccgagttgggcagcgtgatgagctcaat
cgcaggcggtggcgtatgggctgcgaacatgaaattggcgcaattgcgcaaccattgggatcaggtggtaggc
gaggcaatcgccaatca

Open reading frame prediction

Finding genes in a nucleotide DNA sequence is the first step in understanding the function of an organism!

Open Reading Frame (ORF): a portion of a sequence (reading frame) with the potential of being translated into a polypeptide. It is a continuous stretch of codons not containing a stop codon.

Finding signals in a DNA sequence:
- RBS (5’AGGAGG3’): ribosomal binding site
- Start codon: ATG/GTG/TTG
- Stop codon: TGA/TAG/TAA

A gene is a region enclosed between a start codon and a stop codon, with an RBS signal positioned ~10 bp upstream of the gene start.

Signals are difficult to find, as their sequence may be degenerate (variable) and deviate from the consensus.

All possible polypeptides can be read from the six reading frames of a double-stranded DNA sequence, but only a few of them represent genuine coding regions. ORF prediction programs identify genuine coding regions based on the presence of the specific signals listed above: an RBS upstream of the start codon, a start codon and a stop codon.

For each sequence there is a corresponding reverse complement (genomic DNA is a double-stranded molecule)! There are six possible open reading frames in any given DNA sequence: three on the forward strand (+1, +2, +3) and three on the reverse strand (-1, -2, -3).

[Figure: predicted genes in the three reading frames of the forward strand and the three reading frames of the reverse strand. Annotated elements: translated amino acid sequence; nucleotide gene sequence; Ribosomal Binding Site (~10 bp upstream of the start); stop codons in all 6 reading frames; gene start (ATG codon); gene list with relative start and end base positions (genomic coordinates).]

[Figure: a region of continuous codons not containing a stop codon, shown in the forward (FW +1, +2, +3) and reverse (RV +1, +2, +3) reading frames.]

There is a potential gene encoded in both the forward and the reverse strand (but only one is true!)
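The six-frame search described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular ORF finder: the function names and the `min_len` parameter are made up for this example, and an ORF is taken here as the stretch from the first start codon in a frame to the next in-frame stop codon.

```python
# Minimal six-frame ORF scan (illustrative sketch; names are hypothetical).
STOP_CODONS = {"TGA", "TAG", "TAA"}
START_CODONS = {"ATG", "GTG", "TTG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30):
    """Return (frame, start, end, orf) tuples for all start-to-stop ORFs.

    start/end are 0-based positions on the strand being read (for the
    reverse frames, positions refer to the reverse complement); end is
    exclusive and the ORF includes its stop codon.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # first start codon after the previous stop
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((f"{strand}{offset + 1}", start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    for hit in find_orfs("CCATGAAATTTGGGTAACC", min_len=9):
        print(hit)
```

A scan like this only lists candidates. Choosing which candidate is a genuine gene still requires the additional signals (such as the RBS) and the statistical scoring discussed in these notes.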
The prediction software needs to choose which strand encodes the true ORF. Using a probabilistic approach, the ORF finding algorithm will search for:
- RBS (5’AGGAGG3’): ribosomal binding site ~10 bp upstream of the start codon
- Start codon: ATG/GTG/TTG
- Stop codon: TGA/TAG/TAA

[Figure: candidate ORFs in the forward (FW +1, +2, +3) and reverse (RV +1, +2, +3) reading frames.]

In the example, only the forward strand meets the criteria for encoding a potential true gene:
- RBS (5’AGGGGG3’): ribosomal binding site ~10 bp upstream of the start codon, with 1 mismatch
- Start codon: ATG
- Stop codon: TGA

This process is repeated for the full length of the forward and reverse complement sequences. Overlapping genes (encoded in both the forward and the reverse strand) are corrected.

Gene prediction methods

Prediction of protein-coding genes follows two approaches:

Ab initio methods: not based on previous knowledge, they rely on statistical parameters and “signals” in the DNA sequence for gene identification. They are also known as “intrinsic methods”, as they rely on the information contained in the DNA sequence itself.

Homology- and evidence-based methods: based on previous knowledge, they rely on finding, in a database, predicted and verified sequences that are homologous to the new sequence in order to identify its genes. These are “extrinsic methods”.

Successful algorithms rely on a combination of both!

Following genome assembly, the nucleotide sequence is used as an input to predict both protein-coding sequences and functional RNAs.

Genes and transcripts

Prokaryotic transcript: the prokaryotic gene includes the entire sequence represented in the mRNA. The gene consists of a continuous stretch of DNA which is collinear with its mRNA and polypeptide.

Eukaryotic transcript: coding regions (exons) are flanked by non-coding regions (introns), which do not carry coding information. The process of splicing removes introns from the pre-mRNA to generate an RNA that has a continuous open reading frame.
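Returning to the signal search used to choose between candidate ORFs: the check for an RBS roughly 10 bp upstream of the start codon, tolerating one mismatch (as in the AGGGGG example), might look like the sketch below. The function names and the 5 to 12 bp gap range are assumptions made for illustration, not values taken from any specific program.

```python
# Toy RBS search upstream of a start codon (illustrative; parameters are assumed).
RBS_CONSENSUS = "AGGAGG"


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_rbs(seq, start_pos, max_mismatches=1, min_gap=5, max_gap=12):
    """Search for an RBS-like hexamer upstream of the start codon at start_pos.

    gap = number of bases between the end of the motif and the start codon.
    Returns (motif_start, motif, mismatches) for the best hit, or None.
    """
    seq = seq.upper()
    k = len(RBS_CONSENSUS)
    best = None
    for gap in range(min_gap, max_gap + 1):
        motif_start = start_pos - gap - k
        if motif_start < 0:
            break  # ran off the 5' end of the sequence
        motif = seq[motif_start:motif_start + k]
        mismatches = hamming(motif, RBS_CONSENSUS)
        if mismatches <= max_mismatches and (best is None or mismatches < best[2]):
            best = (motif_start, motif, mismatches)
    return best


if __name__ == "__main__":
    # ATG starts at position 16; AGGGGG sits 8 bases upstream of it.
    print(find_rbs("TTAGGGGGCATTACGTATGAAATGA", start_pos=16))
```

Real gene finders weight RBS hits statistically rather than applying a fixed mismatch threshold, so this shows the idea only.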
Coding regions in prokaryotic DNA

Identification of protein-coding regions in prokaryotes is relatively straightforward, since prokaryotic genomes are gene dense with very little non-coding DNA.

Gene prediction is easier in prokaryotes:
- Prokaryotes lack introns.
- Promoter regions and ribosomal binding sites are generally more conserved than in eukaryotes.
- Gene density is higher in prokaryotes, so genes are generally easier to identify.
- The density of repeated regions is lower.

The greatest challenge in gene prediction for eukaryotic DNA is the prediction of splicing isoforms, which mainly relies on evidence-based methods.

GC skew index

[Figure: Ori (replication initiation), with the forward and reverse strands labelled.]

GC skew is an index representing the GC composition of a DNA molecule.
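GC skew is computed as (G − C)/(G + C), where G and C are the numbers of guanine and cytosine bases in the region analysed. A minimal sketch of the calculation, per sequence and over successive windows, is given here. The function names and the default window and step sizes are arbitrary choices for illustration.

```python
# GC skew = (G - C) / (G + C); window and step sizes below are arbitrary.
def gc_skew(seq):
    """GC skew of a sequence; returns 0.0 if it contains no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def windowed_gc_skew(seq, window=1000, step=1000):
    """Return (window_start, skew) pairs along the sequence."""
    return [(i, gc_skew(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    print(gc_skew("GGGC"))                                 # 0.5
    print(windowed_gc_skew("GGGGCCCC", window=4, step=4))  # [(0, 1.0), (4, -1.0)]
```

Plotting the windowed values along a bacterial chromosome is a common way to visualise where the sign of the skew changes.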
Many bacterial genomes have an asymmetric GC composition between the leading and lagging strands, which also influences gene positioning. The leading strand is characterised by a higher gene density and a positive GC skew (higher frequency of G bases) compared with the lagging strand.

GC skew = (G − C)/(G + C)

Cytosine bases are more prone to spontaneous mutation, and the lagging strand has a higher error rate during replication, so it is more convenient for a bacterium to encode genes in a G-rich leading strand.

GC skew can be used by predictive software to identify the leading and lagging strands, and to assign a higher probability score to finding a gene in the leading strand.

[Figure: leading strands on the forward and reverse strands, regions of negative and positive GC skew, and Ter (replication termination).]

Challenges in ORF finding programs

Open reading frame prediction programs are sequence analysis tools which find genes in all 6 reading frames of a DNA sequence.

Setting a minimum size for a predicted gene is an important parameter, as it avoids overprediction of small genes (100 bp is a commonly accepted size cut-off for a gene).

Gene prediction can be complicated if there are sequencing errors (1-3% error rate in short-read sequencing, ~15% in long-read sequencing). Problems which can occur include:
- Frameshift errors or incorrect stop codons, leading to abnormally short predicted proteins.
- Identification of the incorrect start codon if there are several candidates near the predicted gene start.

Frameshift mutation

A frameshift mutation occurs when a base is added or removed. As the DNA is translated in triplets, the addition or removal of a single base causes a shift in the coding frame, often resulting in a premature stop codon.

Frameshifts are often a problem in DNA regions rich in homopolymers: poly-A, poly-G, poly-C and poly-T.

If the frameshift is caused by a base erroneously added during sequencing, it is not a real frameshift, but it will still lead to an incorrect gene and protein sequence prediction.
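The effect of a single extra base can be seen with a small worked example. The sketch below translates a made-up gene before and after one extra A is inserted into a poly-A run. The sequence is invented for illustration, and the codon table is the standard genetic code.

```python
# Frameshift demonstration on an invented sequence (standard genetic code).
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGGCCAAAAAACCTGAGGGCTGGTAA"    # ATG GCC AAA AAA CCT GAG GGC TGG TAA
    shifted = gene[:12] + "A" + gene[12:]   # one extra A in the poly-A run
    print(translate(gene))     # MAKKPEGW
    print(translate(shifted))  # MAKKT (the shifted frame hits TGA early)
```

Here the inserted base turns the downstream codons CCT GAG into ACC TGA, so the predicted protein is truncated after five residues.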
Prokaryotes ORF finding – Prodigal

Prodigal is a well-known gene prediction program for prokaryotic genomes. Below are the steps taken by this algorithm:
1. Read the sequence and collect all start and stop codons.
2. Compute GC content and GC bias for the forward and reverse strands and remove overlapping genes.
3. Compute statistics on the predicted genes, based on start, stop and GC bias.
4. Correct the start codons.
5. Weight and correct the initial prediction based on the presence of an RBS upstream of the start site.
6. Remove short genes.
7. Print the output!

The result of the gene prediction is a series of non-overlapping predicted open reading frames. The ORFs, or “coding regions”, are separated by “non-coding regions”. A minimal overlap between predicted genes is usually tolerated, but the overlap between two genes should not extend over 50% of a whole gene length.

Codon usage

Codon usage is the frequency of occurrence of synonymous codons in a given organism. Organisms differ in how often they use each synonymous codon in their coding DNA, meaning that some codons are rarely used while others are frequently used.

Different organisms use a preferred combination of codons for translation; this is called codon bias. This is why a gene cloned from one organism may not be translated (and expressed) in another organism!

Deviations in codon usage bias may help to identify genes that have been acquired by horizontal gene transfer (they show a different codon composition compared to the housekeeping genes).

ORFs can be identified on the basis that a genuine coding region contains codons that are more “likely” to occur in the given organism.

Coding regions in eukaryotic DNA

Predicting coding regions in genomic DNA is much more difficult in higher eukaryotes:
- Presence of introns and exons.
- Exons of a few hundred bases may be separated by kilobases of intron.
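Returning to codon usage as a scoring signal: one simple way to turn it into a number is to estimate codon frequencies from a set of trusted coding sequences and then score a candidate ORF by how typical its codons are. The sketch below is a deliberately simplified illustration. It uses overall codon frequencies rather than frequencies among synonymous codons, and the function names, the pseudocount and the comparison against a uniform 1/64 background are all assumptions.

```python
# Toy codon usage score (illustrative; not the scoring used by any real tool).
from collections import Counter
from math import log

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codons(seq):
    """Split a sequence into complete in-frame codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from known coding sequences."""
    counts = Counter()
    for s in coding_seqs:
        counts.update(c for c in codons(s) if c in ALL_CODONS)
    total = sum(counts.values()) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def codon_score(seq, freqs):
    """Mean log-ratio of observed codon frequency versus a uniform 1/64."""
    cs = [c for c in codons(seq) if c in freqs]
    if not cs:
        return 0.0
    return sum(log(freqs[c] * 64) for c in cs) / len(cs)


if __name__ == "__main__":
    freqs = codon_frequencies(["ATGGCTGCTGCTAAATAA"])
    print(codon_score("GCTGCTGCT", freqs))  # positive: codons seen in training
    print(codon_score("CCCCCCCCC", freqs))  # negative: codon never seen
```

In practice the training set would be many genes from the target organism (or a close relative), and an ORF with an unusually low score could either be a false prediction or a horizontally acquired gene.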
Computational challenge: identify all the exons at their proper length, without missing any or predicting false ones.

Overcoming the challenge: use of combinations of three methods: (i) statistical information, including codon usage; (ii) splice sites and sequence similarity to previously identified proteins and genes; and (iii) experimental evidence of transcript-derived sequences of cDNAs or expressed sequence tags.

The methods used to find an ORF in prokaryotic DNA will not work when looking for coding regions in a genomic sequence from a higher eukaryote! The gene structure is much more complex than the final mRNA.

A typical multi-exon gene starts with the promoter region. This is followed by a transcribed but non-coding region called the 5' untranslated region (5' UTR). The initial exon, which includes the start codon, is followed by an alternating series of introns and internal exons. The terminating exon then contains the stop codon (e.g. TAA). This is followed by another non-coding region, the 3' UTR, and the polyadenylation (polyA) signal.

Exons and introns are transcribed into RNA in their linear order. Splicing then takes place, during which the intronic sequences are excised and discarded. The remaining RNA segments, which correspond to the exons, are ligated to form the mature RNA strand, which is then translated into protein.

Two types of DNA sensor (content and signal) are used to locate genes in the genomic sequence of higher eukaryotes:

(A) Content sensors classify DNA into coding regions (exons) and non-coding regions (introns, intergenic regions and untranslated regions).

(i) Extrinsic content sensors use homology searches (using FASTA and BLAST) to identify highly conserved exons in the databases. They are largely effective, but their success is limited to homologies within the database; if no homologs exist, no data can be extracted.
(ii) Intrinsic content sensors, on the other hand, focus on specific innate characteristics of the DNA sequence itself, which help to predict the likelihood that the sequence in question “codes” for a protein. Useful intrinsic content sensors include:
- nucleotide composition
- codon usage
- base occurrence periodicity

(B) Signal sensors detect the presence of functional sites specific to a gene. The signals are DNA motifs found in a sequence. Signal motifs relating to transcription, translation and splicing have all been employed to facilitate gene identification and structure prediction.

Transcriptional signal sensors (TSS) include:
- the initiator or cap signal located at the transcriptional start site
- the upstream TATA box promoter signal
- the polyadenylation signal (a consensus AATAAA hexamer) located 20 to 30 bp downstream of the coding region

Translational signals include the “Kozak signal” or translation initiation site (GCC[A/G]CCaugG[not U]), which spans the start codon (the lowercase aug in the consensus).

Splice signals: identification of splice site signals, specifically the donor and acceptor sites (GT-AG at the ends of the intron sequence) and branch points (CU[A/G]A[C/U], located 20-50 bp upstream of the AG acceptor).

Probabilistic models for gene prediction in eukaryotes

A probabilistic model for gene prediction incorporates prior knowledge about coding and non-coding regions to predict genes.

Prior knowledge:
- The translated region must have a length that is a multiple of 3.
- Some codons are more common than others.
- Exons are usually shorter than introns.
- The translated region begins with a start signal and ends with a stop codon.
- 5’ splice sites (exon to intron) are usually GT.
- 3’ splice sites (intron to exon) are usually AG.
- The distribution of nucleotides and dinucleotides is usually different in introns and exons.

The model reads the sequence and assigns a probability to every possible parse, based on signal location.
The parse that receives the highest probability is chosen as the predicted gene!

Hidden Markov Models for gene finding

A hidden Markov model (HMM) is a statistical model that can be used to describe the evolution of observable events that depend on internal factors which are not directly observable.

Example: discriminating coding and non-coding regions based on internal factors that are not directly observable (e.g. codon bias observed in certain organisms between coding and non-coding regions, or the presence of sequence signals that may not be conserved).

An HMM consists of two elements:
- an invisible process: the hidden states
- a visible process: the observable symbols

The model is first trained using a series of known observations (e.g. predicted genes from a known related organism). This initial training phase creates the model structure. From the training set, the model tries to extrapolate the internal factors responsible for the known observations (e.g. a gene always positioned in proximity to a specific signal). The model can then be applied to predict genes in an unknown sequence.

The more representative the training set, the more reliable the prediction!

Eukaryotes ORF finding – Augustus

Augustus is an example of a gene prediction program that uses an HMM. Augustus parameters have been trained on data available from known species. It can also be used for predicting genes in novel species.

tRNA gene prediction

tRNAscan-SE is a popular tool for the prediction of tRNA-encoding genes in a DNA sequence from various organisms (e.g. a genome, a plasmid, a mobile genetic element or a virus). Prediction is based on covariance models built from known frequent and unusual tRNA genes observed in various organisms. It is used for the detection and classification of tRNA genes in a new sequence.

rRNA gene prediction

RNAmmer is a popular tool for the prediction of rRNA-encoding genes in a DNA sequence from various organisms.
The HMM is trained for the detection of the various rRNA subunits present in prokaryotes (typically 5S/16S/23S) and of the additional subunits (5.8S/18S/28S) characteristic of eukaryotes.

Prediction of ncRNAs

Infernal is a popular tool for the prediction of functional RNA and ncRNA. It combines sequence and secondary structure consensus in order to predict functional RNAs that have a conserved secondary structure but only distant sequence homology.

Limitations in gene prediction

Prediction methods relying on known sequences are highly conservative and, as such, relatively inflexible.

Accuracy of gene prediction is highly dependent on database quality, especially in the case of extrinsic gene prediction (based on prior knowledge).

Intron size is highly variable (e.g. the human dystrophin gene consists of >99% introns, some of which are >100 kb). This can be particularly problematic when large introns flank short exons, which are then extremely difficult to detect. In this case a size cut-off would not work.

Unusual examples of eukaryotic gene structure and function continue to be identified, including overlapping genes. These are less relevant in prokaryotes, but have been reported in the genomes of both plants and animals.

As non-canonical cases continue to be uncovered, ever-increasing levels of sophistication (including the generation of new training sets) are constantly required.

Problem of data reproducibility: the same DNA sequence processed with different predictive methods will result in different genes being predicted.

Irrespective of the level of sophistication achieved or the reliability of the data obtained, gene prediction methods remain just that: predictions. In silico analysis must always be followed by in vitro and/or in vivo ‘wet lab’ experimentation to confirm the existence of a putative gene and the functionality of its predicted protein product.
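To make the hidden Markov model idea from the gene-finding section more concrete, the sketch below decodes the most probable sequence of hidden states ("N" for non-coding, "C" for coding) for a nucleotide sequence using the Viterbi algorithm. It is a toy model only: the two states, the transition probabilities and the emission probabilities are invented numbers, whereas real gene finders such as Augustus use far richer models trained on annotated genes.

```python
# Toy two-state HMM decoded with the Viterbi algorithm.
# All probabilities are invented for illustration, not trained values.
from math import log

STATES = ("N", "C")  # N = non-coding, C = coding (hidden states)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1},
         "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},   # AT-rich
        "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}   # GC-rich


def viterbi(seq):
    """Return the most probable hidden-state path for seq as a string."""
    seq = seq.upper()
    if not seq:
        return ""
    # Log-probability of the best path ending in each state, per position.
    scores = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    backpointers = []
    for base in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + log(TRANS[p][s]))
            row[s] = scores[-1][prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
            ptr[s] = prev
        scores.append(row)
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("ATATATATAT" + "GCGCGCGCGCGC"))
```

With these made-up parameters the AT-rich part of the example is labelled N and the GC-rich part C. In a trained model the parameters would come from the training set, which is why its quality matters so much.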