Chapter 3: Bacterial Metagenomics
Dapeng Wang
Summary
This chapter provides an overview of bacterial metagenomics and its applications. It describes different sequencing techniques, such as amplicon and shotgun sequencing, and discusses various databases used to analyse and annotate metagenomics data.
Suparna Mitra (ed.), Metagenomic Data Analysis

1 Bacterial Metagenomics

Metagenomics is by definition the study of a collection of genomes in a sample in its entirety. It is a sequencing-based technology that takes advantage of next-generation sequencing (NGS) to explore the taxonomic and functional components of the microbial community, including bacteria and archaea, present in a sample or in a number of samples. The complexity and diversity of the microbiome are beyond our imagination, and the continued study of both new and known species is of great importance for the advancement of microbiology. Unlike traditional research methods that work on isolated bacterial strains, metagenomics can be applied directly to biological or environmental samples without the need to culture or isolate the bacteria in the laboratory. The primary aim of the invention and development of metagenomics is the sequencing and identification of each gene and the discovery of new species. More importantly, the newly sequenced species often broaden our understanding of the diversity of the microbial community beyond what can be observed directly through the microscope and help refine the tree of life with more pieces of molecular evidence. In this regard, the popularity of sequencing has shifted taxonomic analysis from the old-fashioned phenotype or morphology level to the high-resolution molecular level. Specifically, more and more fascinating evolutionary questions can be addressed, and a growing number of new biological hypotheses can be proposed based on those data. In the past few years, high-throughput and economical next-generation sequencing technology has made metagenomics a very powerful approach for surveying targeted or unbiased bacterial gene sequences in an affordable and rapid fashion.
It not only reveals the components of individual species in one sample but also compares samples from various conditions or treatments and consequently traces the changes in the key features of sequences by assigning the appropriate taxonomic and/or functional information to the sequences.

In general, modern metagenomic approaches can be divided into two categories, namely amplicon sequencing and shotgun sequencing. Amplicon sequencing uses a specific set of highly conserved primers to capture a variable region of a gene sequence from a wide variety of bacteria, and as a result, usually only one fragment of the gene is sequenced. Shotgun sequencing randomly fragments the sequences into small pieces and sequences them without any obvious biases, aiming to capture information from as many genes as possible and to explore the gene distribution and relative abundance. Because the two approaches differ in characteristics such as cost, sequencing depth, breadth of gene coverage, and tolerance of contamination, which one to use largely depends on the research objective, sample quality, and study design. Nevertheless, the best working strategy is to use amplicon sequencing to extract the taxonomic information for a large number of samples and to run shotgun sequencing on a relatively small number of selected samples that have shown the microbial community diversity of interest, for a deeper profiling of genes with a variety of metabolic functions. In this way, the entire study can make the most of the two complementary approaches to produce a more comprehensive and reproducible view of the microbiome. In most cases, such studies aim to obtain assignments of taxonomy, homology, and functional categories at an acceptable quality using mapping-based methods together with a few gold-standard databases, rather than to resolve the individual complete genomes from the metagenomics data.
Metagenomics Databases for Bacteria

Metagenomics has gradually become a routine tool for research groups with an interest in any aspect of microbiology, and several international consortia have been formed to investigate the fundamental and prominent microbiome questions in a collaborative way and to set the standards for the field [3, 4]. This technique has revolutionised our understanding of the microbial world through the analysis of clinical samples and environmental samples such as soil and ocean samples, and it has a broad range of applications such as the diagnosis of infectious diseases, the study of host-pathogen interaction, the investigation of ecological systems, and the discovery of novel biomolecules [5, 6]. The early-stage research on the isolation and naming of type strains has laid solid ground for the analysis of high-volume metagenomics data. In the last decade, continuous efforts on data curation and the accumulation of high-quality annotated sequence data, along with improved tools for sequence alignment, phylogenetic tree reconstruction, and classification algorithms in machine learning, have significantly boosted this field. In the "omics" era, cutting-edge multi-omics techniques have been successfully used in many domains, including medicine and ecology, and some new interdisciplinary subjects pertaining to metagenomics have emerged in order to view the microbes from various angles. For instance, metatranscriptomics enables the gene expression profiling of the microbial community, and metapangenomics permits the definition of core and accessory sets of genes. Using them in combination with metagenomics enriches the modalities of the data and helps elucidate the makeup of the genome and the gene regulatory network in a holistic and thorough way.
The breakthrough in third-generation sequencing technologies such as Pacific Biosciences and Oxford Nanopore sequencing has brought metagenomics into a new era and offered longer reads that have the potential to dissect complete genes from rare microbial genomes sampled from a highly complex microbial community. At this stage, long-read sequencing should be used in combination with short-read sequencing for metagenomics projects owing to some unsolved challenges in the new techniques, including a relatively higher error rate and greater cost.

2 Introduction to Bacterial Metagenomics Database

As large amounts of genomic data for microbes have been generated and accumulated at an unprecedented pace, raw reads as well as their corresponding genome assembly data have been deposited and stored in general nucleotide resources such as NCBI and Ensembl. As part of NCBI, NCBI Taxonomy provides a list of well-curated taxonomic classifications of both prokaryotes and eukaryotes, and NCBI-NR offers the non-redundant protein sequences for all forms of life. The need to further process and organise all the publicly available microbial genes or genomes in a universal and comparative way has led to a number of efforts on data curation and on the creation of databases dedicated to supplying high-quality reference genes for each species, together with alignments and taxonomic relationships for the reference genes that are well supported by the literature and knowledge in the field. Among these databases, Greengenes, SILVA, and Ribosomal Database Project (RDP) are the key players in this field, and they all provide an analytical framework for processing ribosomal RNA sequences that includes data collection, quality control, sequence alignment, phylogenetic tree inference, and taxonomy assignment. Interestingly, they make use of information from each other during the data curation process to improve the quality of their data.
The regular update of the databases is also important because new data, especially for understudied taxa, can significantly increase the accuracy of the whole analysis pipeline and lead to results of better quality. The direct outcome of the construction of these databases is a consensus taxonomy for the species that have sequence data available. It has also benefitted numerous studies that produce their own metagenomics data and need to analyse those data in order to make sense of them. A simple example is that a taxonomic classification can be assigned to new sequencing data by comparing those data with the reference sequences. Nowadays, the number of complete or near-complete microbial genomes has been growing exponentially, and this opens up the opportunity to use a set of genes, instead of a single gene, to understand the taxonomic relationships among species. For this purpose, Genome Taxonomy Database (GTDB) [15, 16] has established a whole set of new workflows that can efficiently and accurately take sequence information from over 100 genes and tackle the issues caused by the uneven distribution of genomes across different taxa. It is anticipated that more work will be done in this direction in the future. The following sections give a general introduction to these four most popular metagenomics databases and highlight the sources of data, data processing workflows, user interface and functionality, and the main advantages and limitations of each of them.

3 Greengenes

Greengenes is a 16S rRNA sequence database that was originally designed to address the need for the taxonomic definition of the enormous number of unclassified rRNA sequences from uncultured bacterial strains and for the identification of chimeric sequences in the public databases. It contains sequences and multiple-sequence alignments for 16S small subunit rRNAs from archaea and bacteria.
This study brings to our attention that chimeric events may have polluted the available rRNA sequences to some extent, which can lead to the erroneous design of probes or primers and the misclassification of taxonomical clades. Furthermore, it points out the inconsistent naming system at different taxonomical levels due to less studied sequences from environmental samples. The ultimate goal of this project is to leverage the information about the identified potential poor-quality sequences and make the inference about the taxonomical relationships as accurate as possible based on the high-quality sequences.

The initial step of the data processing workflow involves downloading sequences that are at least nearly full-length from NCBI. The detection of PCR-generated chimeras in the sequences from both uncultured and cultured organisms is done by a modified version of Bellerophon that contrasts the divergence between the two fragments of the chimeric sequence with that between the two parent sequences. Specifically, the nearest alignment space termination (NAST) aligner aligns the sequences against a core set of reference sequences in a partial manner that needs to be improved constantly, and the generated alignments can be readily visualised in external alignment visualisation software. For each sequence, the taxonomy information is derived from publicly available authoritative resources and from trees calculated by maximum likelihood inference.

4 SILVA

SILVA (https://www.arb-silva.de/) is a dedicated resource for ribosomal RNA analysis. It provides comprehensive and complementary sets of sequences and alignments with quality scores, delivered in formats that are compatible with other prevalent packages, as well as advanced search and download functions that meet the diverse requirements of the microbial research community.
At the time of writing, it hosts 9,469,124 small subunit and 1,312,534 large subunit sequences and relevant alignment data for three major domains, namely bacteria, archaea, and eukarya. SILVA provides a data analysis system with the SILVA incremental aligner (SINA) and a group of tools such as TestProbe and TestPrime that help with rRNA-related experimental and sequencing design and facilitate ongoing research projects at a large scale.

SILVA maintains a well-established analytical pipeline for processing the data. Based on a certain set of keywords, the data in SILVA are extracted from all rRNA-related sequences and associated descriptive information in the European Nucleotide Archive, as well as through rRNA prediction using a hidden Markov model. This strategy leads to the validation of known rRNAs and the identification of potential novel rRNAs at the same time. The database is regularly updated in accordance with each new release of the European Nucleotide Archive under a consistent release number. The two-step quality control procedure involves setting empirical thresholds for both sequence quality scores and alignment quality scores. In order to remove non-rRNA sequences and erroneous sequences introduced by the experiment, all sequences are filtered according to a number of essential parameters such as ambiguous bases, the length of homopolymeric stretches, and vector contamination, with consideration of their genic locations. Manually curated high-quality sequences that can be well aligned and have a broad taxonomic coverage are included in the SEED alignment. As part of the analysis pipeline, SINA was developed to align a great number of sequences with high accuracy by comparing the query sequence against a graph constructed from the most similar sequences in the SEED alignment until an optimal alignment is reached. The Pintail software checks for small subunit sequences that are problematic or chimeric.
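The sequence-level part of such a filter can be sketched in a few lines of Python. This is only an illustration of the kind of checks named above (ambiguous bases and homopolymer stretches); the function names, the thresholds, and the minimum length are assumptions made for the example and are not SILVA's actual parameters or code.

```python
import re

# IUPAC ambiguity codes, i.e. every nucleotide symbol other than A, C, G, T/U.
AMBIGUOUS_CODES = set("RYSWKMBDHVN")


def percent_ambiguous(seq):
    """Percentage of positions that carry an ambiguity code."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in AMBIGUOUS_CODES for base in seq) / len(seq)


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    runs = re.finditer(r"(.)\1*", seq.upper())
    return max((len(run.group(0)) for run in runs), default=0)


def passes_quality_filter(seq, max_ambiguous_pct=2.0,
                          max_homopolymer_len=8, min_length=300):
    """Hypothetical thresholds, chosen only for illustration."""
    return (len(seq) >= min_length
            and percent_ambiguous(seq) <= max_ambiguous_pct
            and longest_homopolymer(seq) <= max_homopolymer_len)
```

A caller would typically apply `passes_quality_filter` to every record of a FASTA file and keep the per-sequence metrics, so that a stricter or looser cut-off can be applied later without recomputing them.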
As a result of these checks, quality scores are assigned to each sequence, which gives users a chance to define the high-quality sequences based on their own criteria in line with their research interest and scenario. In earlier versions of SILVA, the taxonomy information was derived from a number of other sequence databases or rRNA resources. More recently, it has been based on in-house curation guided by widely accepted standards or literature, and a guide tree is used to resolve the conflicts caused by questionably classified taxa. Taken together, this opens up the possibility of assigning names to unknown and novel taxa during the analysis.

To balance the breadth and the quality of the sequences, SILVA offers three independent sequence datasets to address the various demands of end users. Firstly, Parc is the most comprehensive dataset and consists of full-length and partial sequences that have passed all quality control processes. Secondly, Ref only contains nearly full-length sequences, which are defined by stringent and flexible length parameters that treat sequences from different domains distinctly, after checks of alignment quality and of branch length in the constructed tree. Lastly, Ref NR (non-redundant) is a highly condensed version of Ref that only contains the representative sequences from each cluster and a guide tree, and it serves as a more uniformly distributed phylogenetic reference that can be used for more general microbiome data analysis.

The data underlying the browser are organised in two types of databases, namely small and large subunit rRNA sequences, as well as seven categories of taxonomies, including SILVA, SILVA Ref, SILVA Ref NR, LTP, EMBL-EBI/ENA, GTDB, and RDP.
This allows users to click on the names of each rank in a hierarchical way and finally arrive at an organism name of interest. The detail page displays useful information associated with this organism, such as general information, taxonomy, environment, features, and references, and a link directs users to the sequence page of the external ENA database where the original sequence is deposited. SILVA offers an elegant search functionality that can search for sequences in any of the Ref, Ref NR, and LTP datasets with a single keyword or a combination of keywords or criteria pertaining to metadata about organism, publication, sequence, alignment, and taxonomy. In combination with the cart function, users can perform complex operations between two or more rounds of searches, such as the union and difference of datasets, and the resulting sequences can be downloaded in a selection of popular formats such as ARB and FASTA for unaligned and aligned sequences.

Another useful entry point is the SINA aligner, which analyses user-provided multiple FASTA sequences against the built-in Ref NR databases, subject to limits on the minimum identity and the number of most similar sequences, and assigns a consensus classification name to each input sequence whenever possible. More importantly, a phylogenetic tree can be generated either for the queried sequences only or for the combination of the queried sequences and their most similar sequences present in the database, through two maximum likelihood phylogenetic tree construction approaches. All tasks or jobs are arranged in a task manager with an individual job name and the corresponding execution message, and all results are viewable and downloadable. In addition, two tools, TestProbe and TestPrime, facilitate the choice of an appropriate single primer or pair of primers for the taxonomic groups of interest based on the SILVA reference dataset.
In particular, both tools understand the IUPAC nucleotide code as part of the query primers, classify the sequences from the selected sequence collection and database into three categories, namely matched sequences, mismatched sequences, and those that are too short to make any decision, and produce the list of taxonomic units with the coverage as well as other matching metrics and the list of matching sequences with detailed information. In detail, TestProbe enables the definition of the maximum number of mismatches and the number of ambiguous nucleotides relaxed by the matching, and TestPrime supports the input of a pair of forward and reverse primers and gives a breakdown of each category, in the form of pie charts, by considering each primer separately and the paired primers as a whole.

5 Ribosomal Database Project

Ribosomal Database Project (RDP, https://rdp.cme.msu.edu/) is one of the mainstream resources in the field of rRNAs. It houses both small subunit rRNA sequences from archaea and bacteria and large subunit rRNA sequences from fungi, and the most recent release contains 3,356,809 bacterial and archaeal 16S rRNA sequences and 125,525 fungal 28S rRNA sequences. This project features the provision of versatile online and stand-alone tools for a broad range of use cases and analysis workflows for newly generated high-volume amplicon sequencing data.

RDP collects data from the European Nucleotide Archive for defined taxonomic clades and sequence features. The filtering step is performed by SeqMatch, which works on the shared 7-mer fragments between two sequences and retains the sequences that have the best match to core built-in sequences with an acceptable seqmatch score.
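The idea of scoring two sequences by their shared 7-mers can be illustrated with a short Python sketch. It is not the SeqMatch implementation: the function names are hypothetical, and the normalisation used here (shared unique 7-mers divided by the smaller of the two 7-mer sets) is an assumption made for the example.

```python
def kmers(seq, k=7):
    """Set of unique overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_score(query, reference, k=7):
    """Fraction of unique k-mers shared, relative to the smaller k-mer set.

    The normalisation is an illustrative assumption, not RDP's definition.
    """
    query_kmers = kmers(query, k)
    reference_kmers = kmers(reference, k)
    smaller = min(len(query_kmers), len(reference_kmers))
    if smaller == 0:  # one sequence is shorter than k
        return 0.0
    return len(query_kmers & reference_kmers) / smaller


def best_match(query, references, k=7):
    """Return (name, score) of the best-scoring reference.

    `references` is a hypothetical mapping of name -> sequence.
    """
    scored = ((name, shared_kmer_score(query, seq, k))
              for name, seq in references.items())
    return max(scored, key=lambda pair: pair[1])
```

A filter in this spirit would keep a candidate sequence only if its best score against the core set exceeds a chosen cut-off. Because the comparison uses sets of short words rather than an alignment, it is fast enough to screen large collections.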
Infernal is a highly efficient aligner that takes into account the secondary structure of rRNA sequences. It is pre-trained on full-length, well-representative sequences and optimised to deal with imperfect or incomplete sequences and highly variable regions of the sequences. All the sequences that pass quality control are scanned for chimeric potential using the UCHIME program. There are two collections of taxonomic assignments for RDP. The first one is extracted from the sequence description and annotation fields provided by the NCBI taxonomy ID system. The second one assigns a low-order taxon to each sequence using a naive Bayesian classifier trained on the taxonomy of type strains from the published guideline.

At the heart of RDP data access, the Hierarchy Browser allows users to browse the taxonomic hierarchies stored in the database from the higher order to the lower order in an expandable and collapsible fashion and to select or deselect the sequences that need to be further explored. The sequence file is downloadable in FASTA, GenBank, and Phylip formats according to the selection of the appropriate depth of ranks, strain type, source, size, quality, and taxonomy. In detail, there are two kinds of taxonomy collection to choose from, namely NCBI-based taxonomy data and RDP-compiled nomenclatural data. The investigation can be focused on a combination of the filters along with a number of keywords on the free text of the sequence annotation, for example, type strains, long sequences, good-quality sequences, and uncultured ones. The advanced search function also supports groupings of predefined fields with Boolean operators, wildcard matching, range queries, and awareness of the common characters. For each individual sequence, the sequence itself is viewable on the FASTA page, and the detailed description and annotation information can be explored on the GenBank page.
Interestingly, each sequence has been linked to the publications that have identified it. As a result, it is possible to see the taxonomic distribution among all ranks from a single publication and to show the count of the selected sequences for each publication.

Another advantage of RDP is that it makes available all the tools that have been used to prepare the data deposited in the database, so that users can make the most of the heavily tested pipeline and integrate their own data with the datasets from the database in a more reproducible and traceable way. The Classifier tool provides four taxonomic hierarchy models, including one for 16S rRNAs, one for fungal large subunit rRNAs, and two for fungal internal transcribed spacer regions, and places user-provided sequences into the pre-defined built-in taxonomic groups of the database. The Probe Match tool allows users to enter the nucleotides of one primer with a length of less than 64 bases or of two primers in a pair, and a certain degree of ambiguity in the nucleotides is tolerated. The query sequence is transformed to the correct form if the orientation of the targeted sequence is specified to fit an unusual experimental design, and the search can be restricted to a specific range of sequence regions in order to explore suspicious matches on partial sequences. The results are displayed in the Hierarchy Browser, where a number of typical parameters such as the depth of ranks, the number of mismatches, and other options can be refined. A table then lists the sequences matching the primers with sequence ID, edit distance, alignment region from the best match, and description.
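Primer or probe matching that tolerates ambiguity codes and a limited number of mismatches can be sketched as follows. This is a simplified illustration rather than the Probe Match or TestProbe algorithm: it counts substitutions only (a Hamming-style mismatch count, not an edit distance with insertions and deletions), and the function names are hypothetical.

```python
# Bases allowed by each IUPAC nucleotide code (U is treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement table that also covers the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement, e.g. to search with a reverse primer."""
    return seq.upper().replace("U", "T").translate(COMPLEMENT)[::-1]


def count_mismatches(primer, window):
    """Number of positions where the target base is not allowed by the primer code.

    An ambiguous base in the target (for example N) counts as a mismatch.
    """
    return sum(base not in IUPAC[code] for code, base in zip(primer, window))


def find_primer_sites(primer, target, max_mismatches=0):
    """Return (start, mismatches) for every window within the mismatch limit."""
    primer = primer.upper()
    target = target.upper().replace("U", "T")
    hits = []
    for start in range(len(target) - len(primer) + 1):
        window = target[start:start + len(primer)]
        mismatches = count_mismatches(primer, window)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits
```

For example, the degenerate primer `ACRT` matches both `ACGT` and `ACAT`, so `find_primer_sites("ACRT", "TTACGTACATGG")` reports two exact sites. Searching with `reverse_complement(primer)` covers the case where the target is given in the opposite orientation.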
The Library Compare tool compares two groups of sequences based on a given bootstrap confidence threshold and produces a tabular view that lists the rank, name, sequence counts in the two libraries, and the significance score for each affected taxon. The link on each taxon name leads to a further presentation of the difference between the two datasets in the form of bar plots and an organised hierarchical view. The assignment details for each query sequence, with the relevant confidence score at each taxonomic rank, are shown through the link next to the root node on each page. The Sequence Match tool returns a number of best-matched database sequences together with the comparison-related scores as well as their taxonomic ranks and names.

RDP has generously shared with the research community the core pipeline that is used to process and prepare the data for the database, in two ways, namely the webserver and stand-alone packages. It covers basic quality control, paired-end read assembly, chimera assessment, sequence alignment, unsupervised clustering, measurement of the variation of the microbiome within a sample and across samples, format conversion, and manipulation of sequences and alignments.

6 Genome Taxonomy Database

Genome Taxonomy Database [15, 16] (GTDB, https://gtdb.ecogenomic.org/) provides a high-resolution phylogenetic tree constructed from a concatenation of over 100 single-copy protein sequences from a great number of genomes and defines the species names in a quantitative way. This can overcome the polyphyletic issues and improve the unbalanced taxonomic ranks that are caused by phylogenetic tree inference based on a partial region of a single gene (e.g., 16S rRNA).
Furthermore, all the genomes have been organised into species clusters according to two metrics, namely average nucleotide identity (ANI) and alignment fraction (AF), and each cluster is named after the representative genome in that cluster. This has proved to be a highly efficient and automatic genome-based taxonomic classification workflow that can deal with the significantly increased number of bacterial and archaeal genomes from public resources as well as from other collective efforts to sequence microbial genomes around the world. The most up-to-date version of GTDB comprises 254,090 bacterial and 4,316 archaeal genomes from 45,555 bacterial species and 2,339 archaeal species, respectively.

GTDB acquires genome data from the NCBI Assembly Database, and those genomes are heavily examined through a number of quality metrics, including completeness, contamination, genome assembly quality, sequence ambiguity, and coverage of the protein sets among genomes, leading to a set of high-quality genomes. Whether a genome derives from a type strain is determined by comparing the NCBI annotation with several resources concerning strain identifiers and names, and afterwards, a single representative genome is selected for each species based on a few ordered principles. Using one genome per species, the taxonomy tree is deduced from multiple sequence alignments of 120 proteins for bacterial genomes and 122 proteins for archaeal genomes using the FastTree and IQ-Tree approaches, respectively. The tree is decorated with NCBI taxonomy supplemented by 16S rRNA-based taxonomy. More importantly, the polyphyletic groups have been identified and resolved into monophyletic groups whenever possible, and taxonomic rank normalisation based on RED (relative evolutionary divergence) is implemented in order to address the issue that genomes from different taxa have been studied to varying degrees.
In order to form the species clusters, any non-representative genome is either assigned to the closest representative genome, if it falls within the circumscription radius defined by ANI and satisfies a moderately relaxed AF cut-off, or grouped into a de novo species cluster using a greedy clustering approach. An unknown de novo species cluster is given a placeholder name. The robustness of GTDB and of its species definition has been extensively evaluated through several lines of methods that include comparison with a variety of tree inference methods and gene marker sets as well as random sampling.

GTDB taxonomy data are organised in a tabular format in the order of seven major taxonomic ranks, namely "Domain," "Phylum," "Class," "Order," "Family," "Genus," and "Species". Clicking on a record leads to the species cluster page, where detailed information about this species cluster is available, such as Genome ID, NCBI Organism Name, NCBI Taxonomy, GTDB Taxonomy, GTDB species representative, and NCBI type material. The genome ID allows further navigation to the genome page, which displays data for the genome such as taxonomic information, genome characteristics, NCBI metadata, and taxon history. Alternatively, GTDB provides a phylogenetic tree-based browser that enables the query of a genome through the structured taxonomic levels from the highest to the lowest. The flat files that encompass the sequences, alignments, taxonomic information, and trees generated during the database curation are all downloadable through the GTDB repository. Additionally, GTDB offers a tool called GTDB-Tk that works with the GTDB database to classify user-provided genome sequences in the same way in which GTDB has processed its own data.
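The assignment of genomes to species clusters described in this section can be sketched as a simple greedy procedure. The sketch below is a simplified illustration under stated assumptions, not the GTDB implementation: it uses one fixed ANI radius for every representative, the default values of 95% ANI and 0.5 AF are placeholders chosen for the example, and `similarity` is a hypothetical callable that returns precomputed (ANI, AF) values for a pair of genomes.

```python
def assign_species_clusters(genomes, representatives, similarity,
                            ani_radius=95.0, min_af=0.5):
    """Greedy species clustering sketch.

    genomes: genome IDs, assumed to be ordered from best to worst quality.
    representatives: genome IDs that already represent a species.
    similarity: callable (genome_a, genome_b) -> (ani_percent, alignment_fraction).
    Returns a dict mapping each representative to the members of its cluster.
    """
    clusters = {rep: [rep] for rep in representatives}
    for genome in genomes:
        if genome in clusters:  # already a representative
            continue
        best_rep, best_ani = None, 0.0
        for rep in clusters:
            ani, af = similarity(genome, rep)
            # The genome must fall inside the ANI radius and satisfy the AF cut-off.
            if ani >= ani_radius and af >= min_af and ani > best_ani:
                best_rep, best_ani = rep, ani
        if best_rep is None:
            # No representative is close enough: start a de novo cluster,
            # with this genome as its representative.
            clusters[genome] = [genome]
        else:
            clusters[best_rep].append(genome)
    return clusters
```

Because genomes are processed in order, the first genome that founds a de novo cluster becomes its representative, which is why the input order (here assumed to reflect genome quality) matters in a greedy scheme of this kind.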
7 Conclusions

Metagenomics databases have made significant contributions to the development and innovation of the metagenomics domain, and a substantial number of research groups have benefitted from these curation efforts by understanding their data in a better way. In turn, the data generated by end users continue to enrich the collection of data used for database curation and building. One of the main challenges the field faces is the inconsistent taxonomic classification and assignment between different databases. This might be because they use different sources or types of data, different tools for data analysis, and different strategies to control data quality, which indicates that more effort should be made to construct a consensus reference taxonomy for all microbes of importance on top of the individual databases. Furthermore, computing advances such as machine learning and other artificial intelligence techniques will be able to help with the classification tasks in metagenomics analysis, which in turn will boost the development efficiency and ease the maintenance burden of high-quality metagenomics databases.