Notions of Metagenomics 2024
Universidad de Chile
2024
Andres Marcoleta
Summary
Notions and computational tools for metagenomic analysis. Bioinformatics Course 2024, Universidad de Chile. Explains what a metagenome is and what metagenomics is.
Full Transcript
Facultad de Ciencias, Universidad de Chile
Bioinformatics Course 2024
Grupo de Microbiología Integrativa

Notions and computational tools for metagenomic analysis
Dr. Andres Marcoleta [email protected]
Grupo de Microbiología Integrativa, Departamento de Biología

What is a metagenome?
The collection of genomes from a community of microorganisms present in a defined environment.

So what do we mean by metagenomics?
The application of modern genomics to the study of communities of microorganisms directly in their natural environment, bypassing the need to isolate and culture individual species.
Handelsman et al., Chemistry & Biology, 1998

The Great Plate Count Anomaly
We can cultivate roughly "1%" of the microbes out there!
[Figure: tree of life, bacterial lineages, including the Candidate Phyla Radiation, the PVC superphylum, Firmicutes, Actinobacteria, Cyanobacteria, Bacteroidetes, and many candidate phyla.]
[Figure, continued: proteobacterial, archaeal (DPANN, TACK), and eukaryotic (Opisthokonta, Excavata, Archaeplastida, Chromalveolata) lineages. Legend: major lineages with an isolated representative are shown in italics; major lineages lacking an isolated representative are marked with a separate symbol. Scale bar: 0.4.]
A new view of the tree of life. Laura Hug, et al.
Measurement of in Situ Activities of Nonphotosynthetic Microorganisms in Aquatic and Terrestrial Habitats. Staley JT, Konopka A.

Whole Genome Sequencing vs. classical Metagenomics
- Genomic approach: cultivable microorganisms (~1-30%). Isolation, DNA extraction, sequencing. Result: whole genome sequence.
- Classic metagenomic approach: uncultured microorganisms (~70-99%). DNA extraction, fragmentation, cloning in vectors, cultivation, sequencing. Result: sequence of a genomic fragment (~1/100 of a whole genome).

Whole Genome Sequencing vs. shotgun Metagenomics
- Genomic approach: cultivable microorganisms (~1-30%). Isolation, DNA extraction, sequencing. Result: whole genome sequence.
- Metagenomic approach (current): uncultured microorganisms (~70-99%). DNA extraction, fragmentation, direct sequencing.

Bypassing Cultivation To Identify Bacterial Species. Luis M. Rodriguez-R, Konstantinos T. Konstantinidis
Figure 1. Schematic of the metagenomic pipeline to identify sequence-discrete populations. Reads from metagenomic sequencing of microbial community DNA can be assembled into consensus genomic sequences of cells belonging to the same population. Contigs originating from the same population can be identified and [...] based on their sequence [...]

Annotation-independent analysis of short metagenomic sequences

Metagenomic Analyses: Short-reads | Assemblies
[Figure: a metagenomic dataset drawn as a pool of short reads.]

What can we learn from un-assembled short metagenomic reads?
- Fraction of the metagenome covered
- Sequence diversity
- Similarity between metagenomic samples

Determining the diversity of a microbial community, and the amount of sequencing needed to cover its TOTAL diversity, are current challenges in metagenomic analysis.

Nonpareil examines the degree of overlap among individual sequence reads of a whole-genome shotgun (WGS) metagenome to compute the fraction of reads with no match, which is used to estimate the abundance-weighted average coverage.
[Figure: coverage (fraction of a genome covered by sequencing reads) as a function of sequencing effort (number of reads).]
Redundancy is defined here as the portion of reads that match at least one other read (redundant reads). Calculating this value exactly requires a number of pairwise comparisons that grows quadratically in the worst case, a prohibitive calculation for sequencing datasets composed of millions of reads. Instead, Nonpareil estimates the redundancy from a sample of query reads drawn from the entire dataset.
[Paper excerpt, partly legible: Nonpareil projection curves are proposed as a proxy for the diversity of the community, usable to rank natural communities by diversity.]

Fig. 1. Main steps in the construction of Nonpareil curves. (a) The construction of a Nonpareil curve starts with the calculation of a vector containing the number of matches for a randomly drawn query subset from the total dataset (1000 reads in this case). The function of the number of matches for each query read, ranked by decreasing number of matches, resembles a rank-abundance plot. The inset shows the histogram of matches for the same vector, i.e. the observed distribution of matches from which the rank-abundance plot is generated. (b) Next, the redundant portion is calculated for sub-datasets of different sizes. For each size, 1024 replicate datasets are generated. The inset shows the [...]
Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Luis M. Rodriguez-R and Konstantinos T. Konstantinidis

Nonpareil curves
[Figure 2: estimated average coverage (%), 0 to 100, versus sequencing effort, 1 Mb to 1 Tb. Samples: Posterior fornix [SRS023466], Anterior nares [SRS147950], Stool [SRS015540], Tongue dorsum [SRS055495], Lake Lanier [SRR948155], Baltic Sea (21m) [SRS291372], Tibet soil [SRR1023760], Peru tropical forest (Manu Park).]
Figure 2. Comparison of diversity and coverage in available metagenomic data sets using Nonpareil curves. The abundance-weighted average coverage is presented as a function of sequencing effort in the form of Nonpareil curves (Rodriguez-R and Konstantinidis, 2013) for selected available metagenomic data sets. Note that more diverse communities require larger sequencing efforts to achieve the same level of coverage, and are hence located rightward in the plot.
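The redundancy idea behind Nonpareil can be illustrated with a toy sketch. This is not the Nonpareil implementation: the function names are hypothetical, and the match criterion used here (two reads share at least one exact k-mer) is an assumption chosen for brevity, whereas the real tool applies its own read-overlap criteria. The sketch only shows the two steps described in the figure caption: estimating redundancy from a random query subset, and repeating this for sub-datasets of different sizes with several replicates.

```python
import random


def kmers(read, k):
    """Return the set of k-mers contained in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def redundancy(reads, n_query, k=8, seed=0):
    """Fraction of sampled query reads that match at least one other read.

    Assumption: a 'match' means sharing at least one exact k-mer.
    """
    rng = random.Random(seed)
    query_idx = rng.sample(range(len(reads)), n_query)
    # Index every k-mer to the reads that contain it.
    index = {}
    for i, read in enumerate(reads):
        for km in kmers(read, k):
            index.setdefault(km, set()).add(i)
    redundant = 0
    for qi in query_idx:
        hits = set()
        for km in kmers(reads[qi], k):
            hits |= index[km]
        hits.discard(qi)  # a read does not count as its own match
        if hits:
            redundant += 1
    return redundant / n_query


def redundancy_curve(reads, sizes, n_query=50, replicates=10, k=8, seed=0):
    """Mean redundancy for random sub-datasets of each requested size."""
    rng = random.Random(seed)
    curve = []
    for size in sizes:
        values = []
        for _ in range(replicates):
            sub = rng.sample(reads, size)
            values.append(
                redundancy(sub, min(n_query, size), k, rng.randrange(10**9))
            )
        curve.append((size, sum(values) / len(values)))
    return curve
```

Plotting the output of `redundancy_curve` against sub-dataset size gives a rough analogue of the rising curves in the figure; converting redundancy into an abundance-weighted coverage estimate is left to the actual tool.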
Four samples of the Human Microbiome Project are shown in the Nonpareil curves figure, representing communities in the human microbiome of varying diversity, all of which are less diverse than the selected environmental samples. Soil (Tibet soil and Peru tropical forest) and marine (Baltic Sea, 21 m depth) samples are the most diverse among those selected. The Sequence Read Archive identifier of each sample is provided within square brackets, except for the Peru tropical forest sample, obtained from [...]erer et al. (2012).
https://github.com/lmrodriguezr/nonpareil

Analysis of short metagenomic sequences using functional or taxonomic annotation

Annotation of short-reads from metagenomic datasets
Compare top vs. deep annotations
[Figure: subsystems compared between Deep and Surface samples. A callout highlights photolyase (DNA repair from UV exposure). Listed subsystems include oxygen and light sensor PpaA-PpsR, proteorhodopsin, terminal cytochrome oxidases, iron transport, and curli production.]
[Figure, continued: subsystem labels, with sample labels Sandy and Silt loam, including methanogenesis, hydrogenases, particulate methane monooxygenase (pMMO), and DMSP breakdown. A callout marks many archaeal subsystems. Threshold shown: p-adj, log2fold>1.]

16S rRNA gene fragments from soil metagenomes
[Supplementary Figure 3: relative abundance of 16S rRNA gene fragments (% of detected reads in metagenome) at 0-5 cm and 20-30 cm depth for the Havana and Urbana sites, sampled in Apr., Jun., Sept., and Nov. Taxa: Proteobacteria, Firmicutes, Chloroflexi, Nitrospirae, Chlamydiae, candidate division WPS-1, Acidobacteria, Bacteroidetes, Thaumarchaeota, Gemmatimonadetes, Candidatus Saccharibacteria, candidate division WPS-2, Actinobacteria, Planctomycetes, Verrucomicrobia, Euryarchaeota, Armatimonadetes, Other.]

Assembly of short metagenomic sequences

Sampling, total DNA extraction, sequencing
Read length: ~150 bp. Genome size: ~1-8 Mbp.

Metagenomic Analyses: Short-reads | Assemblies

Metagenomes - Assembly or not?
- Pros:
  - Can provide longer genetic contexts (contigs with multiple genes).
  - Larger proportion of complete genes.
  - Greatly facilitates taxonomic sequence binning/classification (as its accuracy is a function of sequence length).
- Cons:
  - Risk of creating chimeric sequences.
  - Not all short-reads can be assembled into contigs.
  - For large metagenomes: memory and computation intensive.

Nowadays
Strategies for Obtaining Genomic Sequences of Environmental Microbes (feature article)
Figure 1. Schematic of the metagenomic pipeline to identify sequence-discrete populations.
Reads from metagenomic sequencing of microbial community DNA can be assembled into consensus genomic sequences of cells belonging [...]

Why do we bin contigs?
http://merenlab.org

Telling sequences apart
Commonly used metrics:
1. Sequence composition (k-mer)
2. Abundance
3. GC
But how?

Quality
Quality check for MAGs

Metagenomic Analyses: Short-reads | Assemblies (MAGs)

What to consider?
- MAG fragmentation
- Read mapping (coverage patterns)
- Tetranucleotide composition
- Contig taxonomy

Completeness and contamination
- Relies on the presence or absence (completeness) and number of copies (contamination) of single-copy ('essential') genes.
- These essential genes typically account for less than 10% of all genes present in a single genome.
- Their functions are related to central cellular processes (e.g., replication, translation and transcription) or core genes.
- Different sets of proteins are commonly used to assess completeness and contamination:
  - 31 bacterial single-copy genes (AMPHORA)
  - 104 archaeal single-copy genes (AMPHORA2)
  - Lineage-specific single-copy genes (CheckM)
  - Others (anvi'o, enveomics, etc.).

CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage. Genome quality can also be assessed using plots depicting key genomic characteristics (e.g., GC, coding density), which highlight sequences outside the expected distributions of a typical genome.
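The presence/absence and copy-number logic described above can be sketched in a few lines. This is a simplified illustration, not CheckM's algorithm: it counts each marker independently rather than using collocated marker sets, and the function name, input format, and marker names in the usage note are hypothetical.

```python
from collections import Counter


def completeness_contamination(marker_hits, marker_set):
    """Estimate completeness and contamination (both in %) for one MAG.

    marker_hits: marker gene names detected in the MAG, one entry per copy.
    marker_set: the single-copy marker genes expected in a complete genome.
    Assumption: markers are treated independently (no collocated sets).
    """
    counts = Counter(hit for hit in marker_hits if hit in marker_set)
    # Completeness: share of expected markers found at least once.
    present = sum(1 for marker in marker_set if counts[marker] >= 1)
    # Contamination: extra copies beyond the first, relative to the set size.
    extra = sum(counts[marker] - 1 for marker in marker_set if counts[marker] > 1)
    total = len(marker_set)
    return 100.0 * present / total, 100.0 * extra / total
```

For example, with a hypothetical marker set `{"rpoB", "rpsC", "recA", "gyrB"}` and hits `["rpoB", "rpsC", "rpsC", "recA"]`, three of four markers are present and one is duplicated, so the function returns 75% completeness and 25% contamination.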
CheckM also provides tools for identifying genome bins that are likely candidates for merging based on marker set compatibility, similarity in genomic characteristics, and proximity within a reference genome tree.
http://ecogenomics.github.io/CheckM/
Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2014.

Identifying and comparing MAGs

A bit of background: the DNA-DNA hybridization (DDH) method
DDH general principle:
1. Isolate genomic DNA from strains A and B
2. Random fragmentation
3. Denature DNA
4. Mix and let renature
5. Quantify heteroduplex (relative to homoduplex)
>70% -> SAME species
Good correspondence with phenotypically coherent clusters of strains (Enterobacteriaceae), but...
- Difficult to implement
- Unclear how it relates to whole-genome relatedness
- Need to have isolates available... but only 1-2% of prokaryotic cells are cultivable!
Slide modified from Kostas Konstantinidis

ANI (average nucleotide identity) as a measure of relatedness

DDH vs. ANI
70% DDH corresponds to approximately 95% ANI
Goris, Konstantinidis, et al. IJSEM, 2007

fastANI

Back to MAGs
Identifying MAGs - Taxonomy
Even though CheckM offers a quick taxonomic annotation for MAGs, more accurate and refined annotations require additional analyses.
- Find and extract 16S rRNA gene sequences. However, 16S rRNA genes do not typically assemble.
- Use conserved ribosomal proteins or single-copy genes.

1) Resolving polyphyletic groups (e.g., Clostridium)
2) Standardizing taxonomic ranks (e.g., Nitrospiraceae)
https://www.nature.com/articles/nbt.4229
https://gtdb.ecogenomic.org/about

GTDB
The Genome Taxonomy Database (GTDB) is an initiative to establish a standardized microbial taxonomy based on genome phylogeny. The genome tree on which the taxonomy is based is inferred using FastTree from an aligned concatenated set of single-copy marker proteins. NCBI taxonomy was initially used to decorate the genome tree.
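The "aligned concatenated set of single-copy marker proteins" mentioned above can be pictured as a supermatrix: one row per genome, built by joining the per-marker alignments end to end. The sketch below is a minimal illustration under stated assumptions, not the GTDB pipeline: the function name and input layout are hypothetical, each marker is assumed to be aligned already, and a genome missing a marker is padded with gap characters.

```python
def concatenate_alignments(alignments, genomes):
    """Join per-marker protein alignments into one sequence per genome.

    alignments: dict mapping marker name -> dict of genome -> aligned sequence.
    genomes: list of genome identifiers to include in the output.
    Assumption: genomes lacking a marker are filled with '-' for that block.
    """
    concatenated = {genome: [] for genome in genomes}
    for marker in sorted(alignments):  # fixed marker order for reproducibility
        block = alignments[marker]
        lengths = {len(seq) for seq in block.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences for {marker} are not of equal length")
        width = lengths.pop()
        for genome in genomes:
            concatenated[genome].append(block.get(genome, "-" * width))
    return {genome: "".join(parts) for genome, parts in concatenated.items()}
```

The resulting sequences, written out as an alignment file, are the kind of input a tree-building program such as FastTree would take.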
Taxonomic ranks are normalized using PhyloRank, and the taxonomy is manually curated to remove polyphyletic groups.
Relative evolutionary divergence (RED)
https://www.nature.com/articles/nbt.4229
https://gtdb.ecogenomic.org/about

THANK YOU FOR YOUR ATTENTION!