BMD311 Introduction To Bioinformatics PDF
Document Details
Uploaded by PoignantJasper525
2024
Dr. Mohammed A. Hassan
Tags
Summary
This document appears to be lecture notes for a bioinformatics course covering multiple sequence alignment. The document introduces the topic, outlining background, scoring functions, dynamic programming, and heuristic algorithms. The author is Dr. Mohammed A. Hassan and the date is 21/11/2024.
Full Transcript
BMD311 - INTRODUCTION TO BIOINFORMATICS (MULTIPLE SEQUENCE ALIGNMENT) Dr. Mohammed A. Hassan 2 MULTIPLE SEQUENCE ALIGNMENT BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A....
BMD311 - INTRODUCTION TO BIOINFORMATICS (MULTIPLE SEQUENCE ALIGNMENT) Dr. Mohammed A. Hassan 2 MULTIPLE SEQUENCE ALIGNMENT BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. 21/11/2024 Hassan BIOMEDICAL INFORMATICS PROGRAM LECTURE OUTLINE ▪Background ▪Scoring function ▪Dynamic programming ▪Heuristic Algorithms BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 3 BIOMEDICAL INFORMATICS PROGRAM BACKGROUND ▪ The multiple sequence alignment of a set of sequences may be viewed as an evolutionary history of the sequences. ▪ No sequence ordering is required. ▪ Natural extension of Pairwise Sequence Alignment ▪ Much more sensitive in detecting sequence 8 fragments from immunoglobulin sequences alignment highlights with (conserved residues and conserved regions) relationship and patterns BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 4 BIOMEDICAL INFORMATICS PROGRAM WHY DO MSA? ▪ Help prediction of the secondary and tertiary structures of proteins of new sequences. ▪ Help to find motifs or signatures characteristic of protein family. 8 fragments from immunoglobulin sequences alignment highlights with (conserved residues and conserved regions) BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 5 BIOMEDICAL INFORMATICS PROGRAM WHY DO MSA? Example: ▪ more sophisticated patterns, like the dominance of hydrophobic residues (V,L,I) at fragment positions 1 and 3. ▪ The alignment can also enable us to infer the evolutionary history of the sequences. ▪ It looks like the first 4 sequences and the last 4 sequences are derived from 2 different common ancestors, that in turn derived from a 8 fragments from immunoglobulin sequences alignment "root" ancestor. highlights with (conserved residues and conserved regions) ▪ But true phylo-gentic analysis is more complex BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 6 BIOMEDICAL INFORMATICS PROGRAM ▪ Give hints about the function and evolutionary history of a set of sequences ▪ Foundation for phylogenic tree construction and protein family classification ▪ Useful for protein structure prediction. BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 7 BIOMEDICAL INFORMATICS PROGRAM ▪MSAs can be global or local, as it happens with their pairwise counterparts. ▪We will here deal predominantly with global MSA ▪local MSA is strongly related to motif discovery BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 8 BIOMEDICAL INFORMATICS PROGRAM LECTURE OUTLINE ▪Background ▪Scoring function ▪Dynamic programming ▪Heuristic Algorithms BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 9 BIOMEDICAL INFORMATICS PROGRAM SCORING FUNCTION ▪Requirement of a good quality of alignment measure ▪Additive function ▪Function must be independent of order of arguments ▪Should reward presence of many equal or strongly related symbols (in the same column) and penalize unrelated symbols and spaces. BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 10 BIOMEDICAL INFORMATICS PROGRAM SUM-OF-PAIRS’ (SP) MEASURE ▪ The Sum of Pairs (SP) method is as follows: ▪ Given (1) a set of N aligned sequences each of length L in the form of a L x N MSA alignment matrix M and ▪ (2) a substitution matrix (PAM or BLOSUM) that gives a cost c(x,y) for aligning two characters x, y. ▪ The SP score 𝑆𝑃(𝑚𝑖 ) for the I-th column of M denoted by 𝑚𝑖 is calculated: 𝑆𝑃(𝑚𝑖 ) = 𝑐(𝑚𝑖𝑗 , 𝑚𝑖𝑘 ) 𝑗,𝑘 ▪ The SP score for M is 𝑀 = 𝑆𝑃(𝑚𝑖 ) 𝑖 21/11/2024 11 BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan BIOMEDICAL INFORMATICS PROGRAM FEATURE OF SP MEASURE ▪ Theorem: Let alpha (𝛼) be a multiple alignment of the set of sequences 𝑠1 , …, 𝑠𝑘 ; and 𝛼(𝑖, 𝑗) denote the pairwise alignment of 𝑠𝑖 and 𝑠𝑗 as induced by 𝛼. Then 𝑆𝑃_𝑠𝑐𝑜𝑟𝑒 𝛼 = σ𝑖,𝑗 𝑠𝑐𝑜𝑟𝑒(𝛼 𝑖, 𝑗 ) ▪ The above is only true if we have s(-,-) = 0. ▪ This is because in pairwise alignment the presence of two aligned spaces (–) in the two sequences are ignored. BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 12 BIOMEDICAL INFORMATICS PROGRAM Score = (sm(A,A)+2g)+ 3×sm(R,R)+ (sm(A,C)+2g)+ 3×sm(D,D)+ (2×g)+ (2×sm(S,A)+sm(A, A)) =(4-8)+3×5+(0-8)+3×6+1(-8×2)+(1+1+4)=11 BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 13 BIOMEDICAL INFORMATICS PROGRAM DISTANCE FROM CONSENSUS ▪ Consensus sequence: single sequence which represents the most common amino acid/base in that position D D G A V - E A L D G G - - - E A L E G G I L V E A L D - G I L V Q A V E G G A V V Q A L D G G A/I V/L V E A L ▪ distance from consensus: total number of characters in the alignment that differs from the consensus character of their columns BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 14 BIOMEDICAL INFORMATICS PROGRAM PROFILE DISTANCE ▪In the minimum entropy approach the score of multiple alignment is defined as the sum of entropies of the columns (the entropy of a column is related with the frequency of letter x in the column) ▪Information content ▪Profile score BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 15 BIOMEDICAL INFORMATICS PROGRAM LECTURE OUTLINE ▪Background ▪Scoring function ▪Dynamic programming ▪Heuristic Algorithms BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 16 BIOMEDICAL INFORMATICS PROGRAM DYNAMIC PROGRAMMING ▪ Multiple sequence alignment can be done with dynamic programming, using (n+1)k memory cells, where n is the length of each of the k sequences to be aligned BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 17 BIOMEDICAL INFORMATICS PROGRAM DYNAMIC PROGRAMMING ▪ Running time: ▪ (n+1)k number of entries in the table ▪ For each entry we need to find the maximum of 2k -1 elements ▪ Finding the SP-score corresponding to each element means adding O(k2) numbers ▪ Total = O(k22knk) i.e., NP- hard. BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 18 BIOMEDICAL INFORMATICS PROGRAM LECTURE OUTLINE ▪Background ▪Scoring function ▪Dynamic programming ▪Heuristic Algorithms BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 19 BIOMEDICAL INFORMATICS PROGRAM HEURISTIC ALGORITHMS ▪ Heuristic (also called approximation) algorithms, which are not able to assure optimal solutions, but have an acceptable complexity in terms of their running times. ▪ Broadly, heuristic algorithms for MSA can be classified as: ▪ Progressive – start by aligning two sequences and then iteratively add the remaining sequences to the alignment; ▪ Iterative – consider an initial alignment and then try to improve this alignment by moving, adding or removing gaps; ▪ Hybrid – can combine different strategies, and use complementary information (e.g. protein structural information, libraries of good local alignments). BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 20 BIOMEDICAL INFORMATICS PROGRAM ▪ Most successful implementation of progressive alignment: ▪ CLUSTAL - gives equal weight to all sequences ▪ CLUSTALW - has the ability to give different weights to the sequences ▪ CLUSTALX - provides a GUI to CLUSTAL ▪ http://searchlauncher.bcm.tmc.edu/multi-align/multi- align.html BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 21 BIOMEDICAL INFORMATICS PROGRAM CLUSTAL: A HEURISTIC PROGRESSIVE ALGORITHM ▪ Steps: 1. Calculate the pairwise alignments of all pairs of sequences in the set. 2. Create a matrix containing the similarities between each pair of sequences. 3. Create a guide tree from this matrix. 4. Create a consensus sequence or a profile BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 22 BIOMEDICAL INFORMATICS PROGRAM ▪Profile is the represent of the relative frequencies of the different characters. ▪Example: BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 23 BIOMEDICAL INFORMATICS PROGRAM MULTIPLE ALIGNMENTS ▪Analyse gene families ▪reveal (subtle) conserved family characteristics characters 1 2 3 4 5 6 7 8 9 10 S1 Y D G G A V - E A L sequences S2 Y D G G - - - E A L S3 F E G G I L V E A L S4 F D - G I L V Q A V S5 Y E G G A V V Q A L Consensus Y D G G AI VL V E A L sequnece BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 24 BIOMEDICAL INFORMATICS PROGRAM PROFILE (FREQUENCY MATRIX) characters 1 2 3 4 5 6 7 8 9 10 S1 Y D G G A V - E A L sequences S2 Y D G G - - - E A L S3 F E G G I L V E A L S4 F D - G I L V Q A V S5 Y E G G A V V Q A L Consensus Y D G G AI VL V E A L sequnece Y=.6 D=.6 G=1 G=1 A=.5 V=.5 V=1 E=.6 A=1 L=.8 Profile F=.4 D=.4 I=.5 L=.5 Q=.4 V=.2 (Can further weight the profile using PAM or BLOSUM matrices) BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 25 BIOMEDICAL INFORMATICS PROGRAM SEQUENCE LOGOS ▪ A graphic representation of an aligned set of binding sites. A logo displays the frequencies of bases at each position, as the relative heights of letters, along with the degree of sequence conservation as the total height of a stack of letters, measured in bits of information. Subtle frequencies are not lost in the final product as they would be in a consensus sequence BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 26 BIOMEDICAL INFORMATICS PROGRAM UPDATING AN EXISTING ALIGNMENT ▪Adding a new sequence to an existing alignment. Using an algorithm similar to pairwise alignment but in this case we align a sequence to a profile. ▪ Thus, the score will be a weighted average of the scores of the possible pairs. ▪ the moto “once a gap, always a gap” prevails, since when a gap appears in an alignment it will be maintained throughout the next steps. BMD311 - Introduction to Bioinformatics - Multiple Sequence Alignment - Dr. Mohammed A. Hassan 21/11/2024 27 THE END BMD311 - INTRODUCTION TO BIOINFORMATICS (MOTIF DISCOVERY ALGORITHMS) Dr. Mohammed A. Hassan BIOMEDICAL INFORMATICS PROGRAM LECTURE OUTLINE ▪ Motif ▪ Definition of Motifs ▪ Representation of Motifs ▪ Identifying Motifs ▪ Phylogenies ▪ Motivation ▪ Tree Basis and Assumptions ▪ Molecular clock ▪ Types of trees BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 2 BIOMEDICAL INFORMATICS PROGRAM DEFINITION OF MOTIFS ▪ The term motif refers to a non-trivial sequential pattern that is shared across multiple sequences, its main feature is its recurrence. ▪ With non-trivial, we mean that the motif has a minimum relevant length and captures a combination of symbols that differs from the underlying symbol distribution. ▪ In genomics, motifs share a certain biological property or are under the same regulatory control and play a role in the associated biological phenomenon. BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 3 BIOMEDICAL INFORMATICS PROGRAM DEFINITION OF MOTIFS ▪ For instance, in DNA sequences, motifs may occur in the promoter region of genes and indicate the presence of binding sites of single proteins or protein complexes that have a regulatory role in gene transcription. ▪ In protein sequences, motifs may indicate the existence of conserved domains, i.e. parts of the protein that play specific biological functions, such as enzyme binding sites for substrates or other molecules. BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 4 BIOMEDICAL INFORMATICS PROGRAM MOTIF TYPES ▪ There are two main classes of motifs: deterministic and probabilistic. ▪ Deterministic motifs are often captured by enhanced regular expression syntax, and as it name indicates are either present or absent in the input sequences. ▪ A deterministic sub-string motif is a continuous combination of symbols. ▪ A deterministic subsequence motif is a motif such that it may include parts that do not add any relevant information. ▪ Probabilistic motifs form a loose model that captures the underlying variability of the motif occurrences. ▪ When a given segment of a sequence is presented to the model, the probability of being part of the motif can be derived. BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 5 BIOMEDICAL INFORMATICS PROGRAM The occurrence of the strings along the input sequence and their alignment. BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 6 BIOMEDICAL INFORMATICS PROGRAM LECTURE OUTLINE ▪ Definition of Motifs ▪ Representation of Motifs ▪ Identifying Motifs BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 7 BIOMEDICAL INFORMATICS PROGRAM REPRESENTATION OF MOTIFS ▪ Possible representations of a motif 1. to store the alignment of all its instances in the input sequences. 2. to represent the motif with regular expression syntax. This is called a consensus motif. 3. to build a profile or matrix of the frequencies of the symbols at each position of the motif, represented as columns of the matrix. This representation only depends on the size of the alphabet and the length of the motif and is independent on the number motif. BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 8 BIOMEDICAL INFORMATICS PROGRAM ▪ Motif regular expression is a notation commonly used for representing motifs of amino acids or nucleotides. ▪ The following conventions are used for motif regular expression: ▪ Each character, written by itself, denotes a specific amino acid or nucleotide; ▪ {X} denotes that any amino acid or nucleotide may be used except for 'X'; ▪ [XY] denotes that either 'X' or 'Y' may be used; ▪ The question mark symbol ('?') denotes that the previous item may or may not appear; ▪ Thus, the amino acid pattern "AR[ND]C?E" encompasses the four protein strings "ARNCE", "ARDCE", "ARNE", and "ARDE"; ▪ the DNA pattern "CC{T}AG" encompasses the three DNA strings "CCAAG", "CCCAG", and "CCGAG". BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 9 BIOMEDICAL INFORMATICS PROGRAM REGULAR EXPRESSIONS: METACHARACTERS ▪ Metacharacters are characters with a special meaning: Character Description Example [] A set of characters "[a-m]" \ Signals a special sequence (can also be used to escape special characters) "\d". Any character (except newline character) "he..o" ^ Starts with "^hello" $ Ends with "planet$" * Zero or more occurrences "he.*o" + One or more occurrences "he.+o" ? Zero or one occurrences "he.?o" {} Exactly the specified number of occurrences "he.{2}o" | Either or "falls|stays" () Capture and group https://www.w3schools.com/python/python_regex.asp BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 10 A special sequence BIOMEDICAL is a \ followed INFORMATICS PROGRAMby one of the characters in the list below, and has a special meaning: REGULAR EXPRESSIONS: SPECIAL SEQUENCES ▪ A special sequence is a \ followed by one of the characters in the list below, and has a special meaning: Character Description Example \A Returns a match if the specified characters are at the beginning of the string "\AThe" Returns a match where the specified characters are at the beginning or at the end of a word r"\bain" \b (the "r" in the beginning is making sure that the string is being treated as a "raw string") r"ain\b" Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word r"\Bain" \B (the "r" in the beginning is making sure that the string is being treated as a "raw string") r"ain\B" \d Returns a match where the string contains digits (numbers from 0-9) "\d" \D Returns a match where the string DOES NOT contain digits "\D" \s Returns a match where the string contains a white space character "\s" \S Returns a match where the string DOES NOT contain a white space character "\S" Returns a match where the string contains any word characters (characters from a to Z, digits from 0-9, and the underscore _ \w "\w" character) \W Returns a match where the string DOES NOT contain any word characters "\W" \Z Returns a match if the specified characters are at the end of the string "Spain\Z" BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 11 https://www.w3schools.com/python/python_regex.asp BIOMEDICAL INFORMATICS PROGRAM REGULAR EXPRESSIONS: SETS ▪ A set is a set of characters inside a pair of square brackets [] with a special meaning: Set Description [arn] Returns a match where one of the specified characters (a, r, or n) is present [a-n] Returns a match for any lower case character, alphabetically between a and n [^arn] Returns a match for any character EXCEPT a, r, and n Returns a match where any of the specified digits (0, 1, 2, or 3) are present [0-9] Returns a match for any digit between 0 and 9 [0-5][0-9] Returns a match for any two-digit numbers from 00 and 59 Returns a match for any character alphabetically between a and z, lower case OR [a-zA-Z] upper case In sets, +, *,., |, (), $,{} has no special meaning, so [+] means: return a match for any + [+] character in the string https://www.w3schools.com/python/python_regex.asp BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 12 BIOMEDICAL INFORMATICS PROGRAM EXAMPLE ABOUT RE IN GENERAL https://www.novixys.com/blog/wp-content/uploads/2018/02/regex.png BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 13 BIOMEDICAL INFORMATICS PROGRAM ▪ Thus the regular expression for our simple open reading frame finder that starts with the codon "ATG", followed by one or more codons composed of three nucleotides (A, C, T, or G), and ending with a stop codon ("TAA", "TAG", or "TGA") https://open.oregonstate.education/app/uploads/sites/6/2016/10/I.11_5_regex_orf.png#fixme BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 14 BIOMEDICAL INFORMATICS PROGRAM RECURRENCE A MOTIF ▪ Frequency or support of a a G g t a c T t C c A t a c g t motif is a score measure a c g t T A g t equals the total number of a c g t C c A t C c g t a c g G matches of M in input set of _________________ sequences D (more than one A 3 0 1 0 3 1 1 0 count per sequence). C G 2 4 0 0 1 4 0 0 0 1 4 0 0 0 3 1 ▪ The frequent motif is a motif T 0 0 0 5 1 0 1 4 _________________ that has frequency higher or Consensus a c g t a c g t equal than a user defined threshold; and infrequent. Score 3+4+4+5+3+4+3+4=30 BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 15 BIOMEDICAL INFORMATICS PROGRAM atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa Implanting Motif AAAAAAAGGGGGGG BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 16 BIOMEDICAL INFORMATICS PROGRAM WHY FINDING (15,4) MOTIF IS DIFFICULT? atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag AgAAgAAAGGttGGG ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa..|..|||.|..||| cAAtAAAAcGGcGGG BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 17 BIOMEDICAL INFORMATICS PROGRAM REGULATORY REGIONS ▪ Every gene contains a regulatory region (RR) typically stretching 100-1000 bp upstream of the transcriptional start site ▪ Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor ▪ TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region - TFBS BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 18 BIOMEDICAL INFORMATICS PROGRAM TRANSCRIPTION FACTOR BINDING SITES ▪ A TFBS can be located anywhere within the Regulatory Region. ▪ TFBS may vary slightly across different regulatory regions since non- essential bases could mutate BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 19 BIOMEDICAL INFORMATICS PROGRAM MOTIFS AND TRANSCRIPTIONAL START SITES ATCCCG gene TTCCGG gene ATCCCG gene ATGCCG gene ATGCCC BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) gene 05/12/2024 20 BIOMEDICAL INFORMATICS PROGRAM BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 21 BIOMEDICAL INFORMATICS PROGRAM MOTIF LOGO TGGGGGA ▪ Motifs can mutate on non important bases TGAGAGA ▪ The five motifs in five different genes TGGGGGA have mutations in position 3 and 5 TGAGAGA ▪ Representations called motif logos TGAGGGA illustrate the conserved and variable regions of a motif BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 22 BIOMEDICAL INFORMATICS PROGRAM MOTIF LOGOS: AN EXAMPLE BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 23 BIOMEDICAL INFORMATICS PROGRAM LECTURE OUTLINE ▪ Definition of Motifs ▪ Representation of Motifs ▪ Identifying Motifs BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 24 BIOMEDICAL INFORMATICS PROGRAM IDENTIFYING MOTIFS ▪ Genes are turned on or off by regulatory proteins ▪ These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase ▪ Regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS) ▪ So finding the same motif in multiple genes’ regulatory regions suggests a regulatory relationship amongst those genes BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 25 BIOMEDICAL INFORMATICS PROGRAM ▪ We do not know the motif sequence ▪ We do not know where it is located relative to the genes start ▪ Motifs can differ slightly from one gene to the next ▪ How to discern it from “random” motifs? BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 26 BIOMEDICAL INFORMATICS PROGRAM THE MOTIF FINDING PROBLEM ▪ Given a random sample of DNA sequences: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc ▪ Find the pattern that is implanted in each of the individual sequences, namely, the motif BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 27 BIOMEDICAL INFORMATICS PROGRAM THE MOTIF FINDING PROBLEM (CONT’D) ▪Additional information: ▪ The hidden sequence is of length 8 ▪ The pattern is not exactly the same in each array because random point mutations may occur in the sequences BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 28 BIOMEDICAL INFORMATICS PROGRAM THE MOTIF FINDING PROBLEM (CONT’D) ▪ The patterns revealed with no mutations: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc acgtacgt Consensus String BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 29 BIOMEDICAL INFORMATICS PROGRAM ▪ The patterns with 2 point mutations: cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 30 BIOMEDICAL INFORMATICS PROGRAM DEFINING MOTIFS ▪ To define a motif, lets say we know where the motif starts in the sequence ▪ The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st) BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 31 BIOMEDICAL INFORMATICS PROGRAM MOTIFS: PROFILES AND CONSENSUS ▪ Line up the patterns by their a G g t a c T t start indexes Alignment C a c c A g t t a T c A g g t t a c g t C c A t s = (s1, s2, …, st) C c g t a c g G ▪ Construct matrix profile with _________________ frequencies of each A 3 0 1 0 3 1 1 0 nucleotide in columns Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 ▪ Consensus nucleotide in T 0 0 0 5 1 0 1 4 each position has the highest _________________ score in column Consensus A C G T A C G T BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 32 BIOMEDICAL INFORMATICS PROGRAM CONSENSUS ▪ Think of consensus as an “ancestor” motif, from which mutated motifs emerged ▪ The distance between a real motif and the consensus sequence is generally less than that for two real motifs BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 33 BIOMEDICAL INFORMATICS PROGRAM EVALUATING MOTIFS ▪ We have a guess about the consensus sequence, but how “good” is this consensus? ▪ Need to introduce a scoring function to compare different guesses and choose the “best” one. BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 34 BIOMEDICAL INFORMATICS PROGRAM DEFINING SOME TERMS ▪ t - number of sample DNA sequences ▪ n - length of each DNA sequence ▪ DNA - sample of DNA sequences (t x n array) ▪ l - length of the motif (l-mer) ▪ si - starting position of an l-mer in sequence i ▪ s=(s1, s2,… st) - array of motif’s starting positions BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 35 BIOMEDICAL INFORMATICS PROGRAM l=8 DNA cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc n = 69 s s1 = 26 s2 = 21 s3= 3 s4 = 56 s5 = 60 BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 36 BIOMEDICAL INFORMATICS PROGRAM SCORING MOTIFS l a G g t a c T t C c A t a c g t ▪ Given s = (s1, … st) and DNA: a c g t T A g t a c g t C c A t t C c g t a c g G _________________ Score(s,DNA) = A 3 0 1 0 3 1 1 0 C 2 4 0 0 1 4 0 0 l G 0 1 4 0 0 0 3 1 max count(k , i) T 0 0 0 5 1 0 1 4 _________________ i =1 k{ A,T ,C ,G} Consensus a c g t a c g t Score 3+4+4+5+3+4+3+4=30 BMD311 - Introduction to Bioinformatics (Motif Discovery Algorithms) 05/12/2024 37 THE END