2023 Fall Computational Molecular Microbiology Notes PDF

Document Details

University of Manitoba

2023

Abdullah Zubaer

Tags

phylogenetic trees, molecular microbiology, computational biology, evolutionary biology

Summary

These are lecture notes from a Computational Molecular Microbiology course at the University of Manitoba in Fall 2023. The document outlines methods for analyzing sequence data, phylogenetic trees, and ancestral reconstruction within a molecular biology context.

Full Transcript

Computational Molecular Microbiology (MBIO 4700)
ABDULLAH ZUBAER
UNIVERSITY OF MANITOBA

Working with sequences
Sequence databases
Sequence comparison
• Pairwise alignment
• Multiple sequence alignment
Phylogenetic tree

Tree – presentation and “treefiles”

Newick notation: examples
(A,B,(C,D));
Means: [Figure: the tree drawn with leaf nodes A, B, C and D]
http://getyourimage.club/resize-march-12.html
https://www.slideshare.net/bcbbslides/phylogenetics-tree-building
http://tree.bio.ed.ac.uk/software/figtree/

Treefile: Newick notation or New Hampshire tree format
(13Xerula:0.20742,(4Gymnopus:0.01799,3Gymnopus:0.01655):0.18973,
(11Marasmiu:0.13171,(((12Hemimyce:0.10126,(((9Hydropus:0.18929,
10Megacoll:0.16889):0.07927,5Clitocybu:0.10161):0.05273,(7Mycena:0.12563,
8Mycena:0.08566):0.06746):0.03878):0.06105,(((Macrolepio:0.07506,
Melanophyl:0.07776):0.07589,Laccaria-l:0.06483):0.03329,(((2Tephrocyb:0.03985,
(1Tephrocyb:0.02172,(2Lyophyllu:0.01327,1Lyophyllu:0.01128):0.01244):0.01147):0.03053,
(Lepista-ir:0.04695,Clitocybe-:0.02680):0.01156):0.01228,(Tricholoma:0.02023,
Hypsizygus:0.01816):0.01950):0.04363):0.03321):0.03312,((LH10:0.00000,
LH7:0.00000):0.00276,(LH6:0.00000,LH5:0.00000,LH2:0.01095):0.00254):0.12640):0.08161):0.12332);
-> leaf nodes and distances are provided
http://etetoolkit.org/treeview/

Treefile (output from bootstrap analysis: Seqboot --> DNApars)
(((((LH5:1000.0,(LH2:1000.0,LH6:1000.0):96.4):446.4,(LH10:1000.0,LH7:1000.0):876.7):731.6,
((((8Mycena:1000.0,7Mycena:1000.0):970.9,(5Clitocybu:1000.0,(9Hydropus:1000.0,
10Megacoll:1000.0):774.6):706.4):444.4,12Hemimyce:1000.0):917.0,((((((1Lyophyllu:1000.0,
2Lyophyllu:1000.0):949.7,1Tephrocyb:1000.0):643.1,2Tephrocyb:1000.0):939.2,(Lepista-ir:1000.0,
Clitocybe-:1000.0):454.5):465.2,(Hypsizygus:1000.0,Tricholoma:1000.0):934.8):889.5,
((Melanophyl:1000.0,Macrolepio:1000.0):974.7,Laccaria-l:1000.0):714.3):398.6):469.9):952.0,
11Marasmiu:1000.0):960.3,(4Gymnopus:1000.0,3Gymnopus:1000.0):968.7,13Xerula:1000.0);
-> leaf nodes and bootstrap support values are provided

[Figure: Majority rule consensus tree (species of small mushrooms), rDNA ITS region (parsimony analysis), with bootstrap support values shown at the nodes]

Newick notation (practice):
(X,((A,(B,C)),(D,E))); - Draw me?
[Figure: the tree drawn with leaf nodes B, C, A, D, E and X]

Remember:
, means a node
() means group(ing)
; means end of tree

Newick notation (practice):
[Figure: tree with leaf nodes A, B, C, D, E, F and G, to be written in Newick notation]

Visualization (various tools)
Interactive tree of life: ONLINE tool for tree display: https://itol.embl.de/
Letunic and Bork (2019) Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W256–W259, https://doi.org/10.1093/nar/gkz239
Figure 2. New dataset types in iTOL v4. Each tree in iTOL can be annotated with an unlimited number of datasets. ...

TreeView program (OLD) and rooting trees
Outgroup: “most distant member in your analysis”
- sometimes you need to estimate which sequence could be the most “distant” member
- with “distance methods”, branch length could be a clue, BUT branch length can be misleading
- sometimes a lineage evolves more rapidly
http://www.metagenomics.wiki/tools/phylogenetic-tree/viewer
http://tree.bio.ed.ac.uk/software/figtree/

FigTree – successor to TreeView
http://tree.bio.ed.ac.uk/software/
http://treegraph.bioinfweb.info/
http://etetoolkit.org/treeview/

Additional topics

Ancestral sequence reconstruction
The FASTML Server: http://fastml.tau.ac.il/
Server for computing Maximum Likelihood ancestral sequence reconstruction
Ashkenazy H, Penn O, Doron-Faigenboim A, Cohen O, Cannarozzi G, Zomer O, Pupko T.
2012. FastML: a web server for probabilistic reconstruction of ancestral sequences. Nucleic Acids Res. 40(Web Server issue):W580-4.
Randall, R. N. et al. An experimental phylogeny to benchmark ancestral sequence reconstruction. Nat. Commun. 7:12847 doi: 10.1038/ncomms12847 (2016).
https://en.wikipedia.org/wiki/Ancestral_reconstruction

Example: Ancestral sequence reconstruction
Might show what is ancestral, PLUS ML to “estimate” the more probable substitutions
https://en.wikipedia.org/wiki/Ancestral_sequence_reconstruction

Selection? (dN/dS) and sequence logos

What else can I do with my alignment?
- Examine data for evidence of “change”: change due to “drift = chance” or due to “selection” (purifying or positive selection)
- Extract a “consensus” sequence (visualization)

SYNSCAN analysis
- a program that allows one to estimate if a gene (protein) is selected for or “drifting”
- it compares synonymous and nonsynonymous substitutions
- synonymous: a change at the DNA level (codon) that leaves the amino acid unchanged
- nonsynonymous: a change at the DNA level (codon) that causes a change at the amino-acid level, i.e. in the sequence of the protein
SYNSCAN: http://hivdb.stanford.edu/pages/synscan.html
https://www.khanacademy.org/science/ap-biology/gene-expression-and-regulation/translation/a/the-geneticcode-discovery-and-properties

SYNSCAN analysis
Argument:
- if a gene is under strong (positive) selection pressure: dN > dS (i.e. get more changes, maybe adapting to a new function)
- if a gene is drifting (no selection): dN = dS (random change)
- if a gene is under functional constraint (= purifying selection, no room for change): dS > dN

In a nutshell: examine rates of synonymous substitutions (dS) and rates of non-synonymous substitutions (dN)
- If dS = dN: gene is drifting and maybe has no function, as there are no functional constraints
- If dS > dN: most changes are synonymous (as predicted by neutral evolution)
  • (functional constraint – cannot change ~ purifying selection)
- If dS < dN: i.e. “real change”, an indication that natural selection is acting on this gene

Neutral Theory of Evolution
Rates (of substitutions): the rate of neutral mutations varies
1. among genes and within segments of genes (“active sites”/domains vs linker regions etc.)
2. among codon positions (highest at third-position bases)
- functional constraint and translational efficiency (codon bias)

SYNSCAN OR MEGAX (Selection options)
Requirements: fasta format, but a codon-based alignment! (AliView)
Sometimes only part(s) of the gene/protein is/are analyzed
So great care is needed when assembling the alignment: it cannot have indels (insertions/deletions).
[Figure labels: Neutrality, Positive, Purifying]

SYNSCAN: dS > dN, i.e. “pressure to remain the same”: “purifying selection” or functional constraint. (Sethuraman et al. 2008)
So in the shown example: dS > dN, i.e. “pressure to remain the same” (purifying selection).
DNA-level changes are subject to functional constraints (changes are allowed as long as the amino-acid sequence stays the same or the changes are “conserved replacements”).

The program is useful if you suspect “your gene of interest” is nonfunctional (dN = dS, drifting) OR gaining a new function (dN > dS, seen as ratio values > 1). BUT this can be variable, as sometimes only a small part of the protein changes to gain a new function, so the dN value for the entire gene might only be a bit higher than the dS.

“Adaptive selection”: dN > dS
- Seen in genes involved in pathogen defense/recognition interaction
- Or conversely in “genes” in pathogens to overcome our innate (or adaptive) immune systems etc.
- BUT also vertebrate lysozymes (diet)

MORE Tools – re: Selection
https://www.hiv.lanl.gov/content/sequence/SNAP/SNAP.html
Korber B. (2000). HIV Signature and Sequence Variation Analysis. Computational Analysis of HIV Molecular Sequences, Chapter 4, pages 55-72. Allen G. Rodrigo and Gerald H. Learn, eds. Dordrecht, Netherlands: Kluwer Academic Publishers.
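The dN/dS argument from the SYNSCAN slides can be written down as a small decision rule. This is only a sketch of the reasoning, not how SYNSCAN, SNAP or MEGA X actually test for selection (those tools use statistical tests on a codon alignment). The function name `classify_selection` and the tolerance `tol` are hypothetical and chosen purely for illustration.

```python
def classify_selection(dn: float, ds: float, tol: float = 0.05) -> str:
    """Interpret precomputed dN and dS rates using the slide's rule of thumb.

    dn  -- rate of non-synonymous substitutions (dN)
    ds  -- rate of synonymous substitutions (dS)
    tol -- hypothetical tolerance for calling the ratio "equal to 1"
    """
    if ds <= 0:
        raise ValueError("dS must be positive to form the dN/dS ratio")
    ratio = dn / ds
    if abs(ratio - 1.0) <= tol:
        # dN = dS: changes look random, the gene may be drifting
        return "neutral (drift)"
    if ratio > 1.0:
        # dN > dS: more amino-acid changes than expected by chance
        return "positive (adaptive) selection"
    # dS > dN: amino-acid changes are being removed
    return "purifying selection (functional constraint)"


if __name__ == "__main__":
    for dn, ds in [(0.02, 0.20), (0.10, 0.10), (0.30, 0.10)]:
        print(dn, ds, classify_selection(dn, ds))
```

As the notes point out, a whole-gene ratio can hide positive selection that acts on only a small part of the protein, so a ratio near or below 1 does not rule it out.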
https://www.hiv.lanl.gov/content/sequence/CodonAlign/codonalign.html
http://www.datamonkey.org/
For more programs/tools:
http://www.datamonkey.org/busted
https://www.hiv.lanl.gov/content/sequence/HIV/links.html
– Example application: which parts of the virus are evolving to adapt to new hosts etc. vs changes by drift.

DNA polymorphisms (http://www.ub.edu/dnasp/)

Molecular-clock hypothesis
▪ DNA sequences (in a genome) change through spontaneous mutation at a constant rate
▪ Can estimate how long ago two species diverged from a common ancestor
https://tanguay.info/learntracker/page/lectureNotesItems?idCode=molecularclock

Sequence logos
A logo displays the frequencies of bases/a.a. at each position
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
“Alternative tool” to find or display (visually) conserved regions in DNA or a.a. sequences. (NICE way of showing consensus sequences)
http://weblogo.berkeley.edu/ -> create: http://weblogo.threeplusone.com/create.cgi
Schneider TD, Stephens RM. 1990. Sequence Logos: A New Way to Display Consensus Sequences. Nucleic Acids Res. 18:6097-6100

Weblogo
For sequence logos: Weblogo http://weblogo.threeplusone.com/create.cgi
Sequence logos need a multiple alignment in either fasta or CLUSTAL W(X) format
Excellent for showing conservation for binding sites, promoter regions, etc.
http://weblogo.berkeley.edu/examples.html
http://weblogo.threeplusone.com/examples.html#CAP
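To make concrete what a logo summarizes, the sketch below takes a small gap-free DNA alignment and computes, for each column, the consensus base and the information content in bits (log2 of the alphabet size minus the Shannon entropy of the column). It is a simplified illustration and not WebLogo's code: the function name `column_information` is hypothetical, gap characters are simply skipped, and the small-sample correction used for real logos is left out.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=4):
    """Return (consensus string, list of bits per column) for aligned sequences.

    alignment     -- list of equal-length strings (e.g. rows of a fasta alignment)
    alphabet_size -- 4 for DNA, 20 for protein (assumption: no ambiguity codes)
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences in an alignment must have the same length")

    max_bits = math.log2(alphabet_size)  # 2 bits for DNA
    consensus = []
    bits = []
    for i in range(length):
        # Collect the residues in column i, ignoring gaps (a simplification)
        column = [seq[i] for seq in alignment if seq[i] != "-"]
        if not column:
            consensus.append("-")
            bits.append(0.0)
            continue
        counts = Counter(column)
        total = len(column)
        # Shannon entropy of the observed frequencies in this column
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        consensus.append(counts.most_common(1)[0][0])
        bits.append(max_bits - entropy)
    return "".join(consensus), bits


if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "ACCA", "ACTA"]
    cons, info = column_information(seqs)
    print(cons)
    print([round(b, 3) for b in info])
```

For the toy alignment in the example, the first two columns are fully conserved and reach the 2-bit maximum for DNA, while the variable columns score lower. In a logo, this value sets the height of the letter stack at each position.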

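Returning to the Newick notation from the tree section: the three rules (comma, parentheses, semicolon) are enough to read a treefile by hand, and the same rules can be followed by a short recursive parser. The sketch below is a minimal illustration under stated assumptions: `parse_newick` and `leaf_names` are hypothetical helper names, quoted labels and comments are not supported, and viewers such as FigTree or iTOL use their own, more complete parsers.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length and children."""
    text = "".join(text.split())  # drop whitespace and line breaks
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":          # '(' opens a group
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":   # ',' separates members of the group
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        # Optional label (leaf name, or internal node label)
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional value after ':' (branch length, or support in consensus output)
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":              # ';' must close the tree
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """List the leaf nodes of a parsed tree from left to right."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


if __name__ == "__main__":
    tree = parse_newick("(A,B,(C,D));")
    print(leaf_names(tree))
    clade = parse_newick("(4Gymnopus:0.01799,3Gymnopus:0.01655):0.18973;")
    print(clade["length"], [c["length"] for c in clade["children"]])
```

The second call uses one clade copied from the treefile in these notes. The number after each colon is stored as `length`, which holds a distance in the first treefile and a bootstrap support value in the consensus output.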