Quality Control in Metagenomics Data PDF
Abraham Gihawi, Ryan Cardenas, Rachel Hurst, and Daniel S. Brewer
Summary
This chapter details quality control in metagenomics, including study design and data analysis. It outlines basic principles and practices for quality control of sequencing data, and provides a walkthrough of analyzing microbiome data using the R statistical language.
Full Transcript
Quality Control in Metagenomics Data

Abraham Gihawi, Ryan Cardenas, Rachel Hurst, and Daniel S. Brewer

Abstract

Experiments involving metagenomics data are becoming increasingly commonplace. Processing such data requires a unique set of considerations. Quality control of metagenomics data is critical to extracting pertinent insights. In this chapter, we outline some considerations in terms of study design and other confounding factors that can often only be realized at the point of data analysis. We outline some basic principles of quality control in metagenomics, including overall reproducibility and some good practices to follow. The general quality control of sequencing data is then outlined, and we introduce ways to process this data by using bash scripts and developing pipelines in Snakemake (Python). A significant part of quality control in metagenomics is analyzing the data to ensure you can spot relationships between variables and to identify when they might be confounded. This chapter provides a walkthrough of analyzing some microbiome data (in the R statistical language) and demonstrates a few ways to identify overall differences and similarities in microbiome data. The chapter concludes with remarks about considering taxonomic results in the context of the study and interrogating sequence alignments using the command line.

Key words: Metagenomics, Microbial bioinformatics, Contamination, Quality control, Data, Microbiome, Bacteria, Virus

1 Introduction

We define sequencing microbial communities as sequencing the nucleic acids from multiple organisms in a collection of samples. Sequencing microbial communities has several advantages: it is culture-free and so can identify organisms that are difficult to culture; it can be hypothesis-free; it has potential quantitative ability; it can capture information from across an organism's genome (in the case of shotgun sequencing); and prior knowledge of the expected microbial constituents is not necessarily required, although it can help validate results. It is possible to sequence one part of the organisms, such as with 16S rRNA amplicon sequencing (18S for eukaryotic organisms), or it is possible to sequence a greater range of the genomic regions with shotgun metagenomic sequencing. In this chapter, we will offer a practical guide to overcoming some of the challenges in analyzing microbiome sequencing data (particularly shotgun sequencing data) to obtain meaningful and insightful results.

Publications mentioning sequencing and the microbiome have been increasing exponentially in recent years, with no sign of plateauing (Fig. 1). The number of these publications that also mention contamination has been increasing exponentially as well, but with a much lower total number of studies (317 vs. 12,189).

Fig. 1 Cumulative number of publications (log scale) by date of publication. Search terms for each condition were 'sequencing' and 'microbiome' (purple, 12,189 total studies) and additionally 'contamination' (yellow, 317 total studies). Publications were identified using the easyPubMed R package (version 2.13) as of 26-Apr-2021.
Although far from the earliest study of the microbiome, one of the most significant studies in this field is the Human Microbiome Project. The early phase of this project focused on characterizing the bacteria detected in samples from various anatomical sites such as gastrointestinal, oral, urogenital, skin, and nasal sites. One of the major findings of this analysis was that the bacterial communities at each bodily site are quite distinctive. These data provide a solid foundation for future studies into the human microbiome by describing some of the microbial communities that can be expected. This has been built upon in the latest phase of the Human Microbiome Project (the Integrative Human Microbiome Project), which has begun to decipher the complex relationship between the microbiome and disease.

Quality control is critical in extracting robust and reproducible conclusions from metagenomic data. There are numerous considerations that can help maximize the quality of metagenomics data, including experimental design, the quality of sequence data, and the quality of the reported microbial community.

2 Considerations in Study Design and Methodology

The first step to achieving high-quality data is establishing a good study design and methodology. No amount of quality control can completely rectify imperfections in study design. It is always better to give some prior thought to the methods than to identify confounding variables at the analysis stage that can nullify any findings.

Metagenomics often demonstrates batch effects, which occur when non-biological factors in an experiment are linked with differences in the data. This can result in erroneous conclusions and obscure the extraction of biologically insightful conclusions. Minor variations in sample storage have a minimal effect, but multiple freeze-thaw cycles can lead to degraded nucleic acids. Care needs to be taken with formalin-fixed paraffin-embedded (FFPE) tissue, as it degrades nucleic acids and introduces genomic changes, including cytosine deamination. Many studies have obtained interesting metagenomic insights from FFPE tissue, but mixing tissue types can hinder analysis by introducing large batch effects [7, 8]. It is always good practice to replicate any preliminary findings on a separate cohort with an orthogonal technique to ensure that any biological results are reliable, robust, and reproducible. To a certain extent, it is possible to take batch effects into account in statistical analysis, but this reduces the power of the test, and it is the opinion of the authors that there is only so much correction that is possible before the data becomes unusable.

To reduce batch effects, one should keep the methodology as consistent as possible throughout the duration of the study. Ideally, the same person will follow the exact same protocols, on the same equipment, in as short a time frame as possible. Secondly, the experimental conditions should be mixed across the duration and batches of the project. For example, do not run all of condition A and then run all of condition B.

One context-specific example of batch effects is that of animal studies.
Mice are typically coprophagic, so mice caged together share similarities in their gastrointestinal microbial constituents. Comparing two cages of mice will reveal differences that can be inaccurately attributed to experimental differences if condition is aligned with cage. This needs to be considered alongside financial and ethical obligations when planning animal experiments. Similar concerns about confounding variables occur in human studies, where the microbiome of an individual might be affected by many factors, for example age, gender, seasonal changes, lifestyle, ethnicity, geographical location, socioeconomic status, and culture. Even pet ownership has been identified as a potential confounding variable in microbiome studies. Confounders can be better controlled in prospective studies, but even in retrospective studies, care must be taken to match the characteristics of the experimental groups as far as possible.

It is recommended that the preparation of samples for DNA extraction prior to any metagenomics study should involve a number of quality control checks and batch testing of reagents and kits prior to use with samples. This is to minimize contamination that can be present in molecular biology grade water and reagents, as well as in the DNA extraction and library preparation kits; the contamination present in the latter is well documented and termed the "kitome" [11, 12]. Where relevant for particular study designs, protocols and reagents designed to improve detection of low abundance bacteria may be used effectively, including working in a clean-room environment; the use of reagents that are guaranteed contamination free is particularly important. Care should be taken with the method of lysis of the bacteria prior to DNA extraction: for example, some enzyme preparations, including mixed enzyme cocktails and lysozyme, can contain bacterial DNA contaminants, and it is therefore appropriate to screen batches prior to use. The use of repeated bead-beating protocols has been shown to improve the quality and extraction of bacteria and other microbiota, including fungi, from certain samples [13–15]. Certified molecular biology grade nuclease-free beads are recommended, with the composition (sizes/material) of the beads adjusted to suit the tissue type; in addition, the speed and duration of bead beating can be tested and validated for effective lysis of microbiota to obtain yields of quality DNA suitable for metagenomics studies.

Protocols for human depletion of clinical samples prior to bacterial DNA extraction (bacterial enrichment protocols) can be useful for certain studies. However, it has been shown that under certain sample preparation conditions, susceptible bacteria can also be lysed during the initial human cell lysis steps and thus removed or depleted from the sample of interest, which may pose a critical issue. Testing and validation of protocols for recovery of key species of interest is important to check for effective recovery when using human host depletion during DNA extraction steps. Finally, one of the most important study design additions is to prepare and sequence negative control (blank) samples at each step: blanks for each set of samples extracted and for each library preparation batch, plus blank reagents and no-template controls to determine background contamination. One recent study investigating the microbiota in human cancer tissue used a ratio of ~50% controls to samples to enable effective sequencing data analyses.
Inclusion and analysis of mock community positive control samples can also aid interpretation and validation of study protocols. Criteria have recently been recommended to aid study design when investigating low microbial biomass microbiome samples, in order to avoid (i) contaminant DNA and (ii) sample cross-contamination; this "RIDE" study checklist includes several of the sample preparation criteria discussed above.

3 Solutions to Support Reproducibility

One of the successes of the bioinformatics community is the incredible range of openly available computational software for all manner of applications. When planning the analyses for a project, selection of the correct tools is crucial. It is necessary to trial and benchmark multiple computational tools using simulated or prototype data to find the optimal tool for your use. Some criteria for what makes a good tool include whether it: provides accurate and reproducible results (in a reasonable format); is suitable for your specific application (e.g., does a metagenomic assembly tool allow for co-assembly); is computationally efficient and scalable to your needs; is user-friendly and provides explanatory error messages; installs easily; is in widespread use in the community; is well-supported; and doesn't need to be adapted or updated too often.

One of the most laborious and time-consuming tasks a computational biologist will face is installing different computational software along with each of their dependent software and databases. Mangul et al. tested the installability of 98 different software tools and concluded that 51% were "easy to install", while 28% of tools failed to install altogether. Fortunately, there are a few solutions which are becoming best practice. Installation of software has become easier over the last few years through the use of package managers such as Conda (https://docs.conda.io/), which allow easy installation of computational software and their dependencies. These tools can install multiple software packages and are typically quite good at resolving any conflicts in dependencies, although they can take a considerable amount of time to do so. Additionally, it can be troublesome to update software to newer versions, and sometimes multiple dependencies are required. A newer alternative to package managers is containerized software.

Software containers such as Docker and Singularity have changed the way that analysis pipelines can be used and shared. They allow the exact same code to be applied on different machines, which aids the reproducibility of research. Container files typically hold a minimal Linux operating system with all of the necessary dependencies installed to run the analyses. Commands and workflows can be run within these containers with minimal intrusion from the user's operating system and environment. This means that an entire set of analyses can be repeated and investigated without having to spend time worrying about the exact versions of the software.

Another recommendation is to integrate your analyses in workflow software, which reduces development time, improves efficiency, aids reproducibility, is scalable, and allows effective monitoring and restarting of analyses. Although it is quite possible to create pipelines in a variety of coding languages, some of the easiest pipelines to read, implement, and adapt are written in workflow languages such as Snakemake (which is implemented in Python) and Nextflow.
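To make the software environment itself reproducible, the package-manager route mentioned above can be pinned and exported. The commands below are a minimal sketch assuming the conda-forge and bioconda channels are configured; the environment name and package list are illustrative and are not the containerized setup distributed with this chapter.

# Create an isolated environment with the QC tools used in this walkthrough
# (assumes the conda-forge and bioconda channels are reachable)
conda create -n metagenomics_qc -c conda-forge -c bioconda fastqc multiqc trimmomatic bbmap
# Activate the environment before running any analyses
conda activate metagenomics_qc
# Record the exact package versions so the environment can be rebuilt elsewhere
conda env export > environment.yml

Sharing the exported environment.yml alongside the analysis code lets collaborators recreate the same tool versions without re-resolving dependencies.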
There are still a few limitations of containers, however. Firstly, metagenomic databases can be too large to include in containers (particularly those built on remote servers), meaning that they have to be shared in addition to any container and loaded (using a bind argument, for example) before they can be used. Secondly, containers can be difficult to update and usually rely on sourcing the original build recipe if additional software is required. Thirdly, software and tools can often rely on Internet access. This can be difficult to spot within a particular tool without reading the source code, and it means that containers can run into issues on an offline machine. Additionally, as data on the Internet is subject to change, this might further hinder reproducibility. For example, the code to produce Fig. 1 downloads the data it plots, which will almost certainly have changed when the code is run at a later date. For this reason (and if feasible), it is often a good idea to save the data that is used to create plots.

4 Code Walkthrough Introduction

This chapter will provide a walkthrough re-analyzing some Illumina (MiSeq) sequencing data made publicly available from a study by Thomas et al. entitled "Metagenomic characterization of the effect of feed additives on the gut microbiome and antibiotic resistome of feedlot cattle". A basic understanding of the Linux command line and the R programming language will be beneficial, but we will introduce a lot of the basics here. Thomas et al. investigated the impact that antibiotics in animal feed have on the microbiome. To do so, the authors used an Illumina MiSeq to produce paired-end reads that are 250 nucleotides long. The sequence data is available on NCBI's Sequence Read Archive (SRA) under the bioproject PRJNA390551. Please note that this walkthrough will not use the same methods as the manuscript and may not be applicable to other technologies like long-read sequencing. Also, for simplicity, we will investigate a reduced selection of samples (N = 9) from three anatomical sites: colon, cecum, and rumen.

To facilitate the analysis in this chapter, we have containerized all necessary software using Singularity; therefore, only the Singularity software and access to the Unix command line are required. To access the Unix command line, simply open a terminal on Mac/Linux machines, or open an emulator on Windows machines (such as Cygwin). Instructions on how to install Singularity can be obtained from https://sylabs.io/guides/3.3/user-guide/installation.html

Once Singularity is installed, use the command line code below to pull the metagenomicsQC container from the Sylabs cloud library:

singularity pull library://r-cardenas/default/qcmetagenomics_v1:latest

Or access the link: https://cloud.sylabs.io/library/r-cardenas/default/qcmetagenomics_v1

The image can also be built using the recipe file within our GitHub page (https://github.com/UEA-Cancer-Genetics-Lab/Metagenomics_QC), which also contains a number of the scripts used in this chapter.
# Ensure the Metagenomics_QC directory is made
mkdir -pv Metagenomics_QC/
# Navigate to the newly created folder/directory
cd Metagenomics_QC/
# Note: the Singularity container can be built using the recipe file
# sudo singularity build metagenomics.simg singularity/metagenomicsQC.def
# Once you have the container, use the shell command to enter it, setting your home directory to your current working directory
singularity shell --home $PWD qcmetagenomics_v1_latest.sif

5 Downloading SRA Project Data

The code below uses the SRA toolkit to download a small number of the samples from the animals that were not treated with antibiotics. Type it one line at a time on the Unix command line. Lines starting with # are comments to improve readability and do not need to be run. This code will download the files from the SRA, reformat the data to FASTQ, rename the files to something a bit more understandable, and then compress them. This step can take a while depending on your Internet connection and how busy the servers are. Go grab yourself a beverage, go for a quick walk, do a quick bit of exercise, or something more productive like reading a bit of that manuscript you've had open in a tab for far too long now... Just keep the terminal open in the corner to make sure it's still running without any issues.

# First we will create a folder/directory structure to store the data
# For this we will use the mkdir command with -p and -v flags (-p makes parent directories if required, -v makes it verbose so it reports what it has done)
# At any time you can find out what folder/directory you are in with the 'pwd' command
mkdir -pv QC_Metagenomics_data/analysis
cd QC_Metagenomics_data/analysis/
mkdir -pv images processed_data raw_data results
cd raw_data/

# Download the data from the SRA
for x in 'SAMN07243779' 'SAMN07243778' 'SAMN07243777' 'SAMN07243769' 'SAMN07243768' 'SAMN07243767'; do ~/Downloads/sratoolkit.2.11.0-mac64/bin/prefetch $x; done

# Extract FASTQ files
for x in 'SRR5738067' 'SRR5738068' 'SRR5738074' 'SRR5738079' 'SRR5738080' 'SRR5738085' 'SRR5738092' 'SRR5738091' 'SRR5738094'; do fastq-dump -I --split-files $x; done

# Rename the files to something more illustrative
mv SRR5738068_1.fastq 03-Colon_1.fastq
mv SRR5738068_2.fastq 03-Colon_2.fastq
mv SRR5738067_1.fastq 04-Colon_1.fastq
mv SRR5738067_2.fastq 04-Colon_2.fastq
mv SRR5738074_1.fastq 05-Colon_1.fastq
mv SRR5738074_2.fastq 05-Colon_2.fastq
mv SRR5738079_1.fastq 03-Cecum_1.fastq
mv SRR5738079_2.fastq 03-Cecum_2.fastq
mv SRR5738080_1.fastq 04-Cecum_1.fastq
mv SRR5738080_2.fastq 04-Cecum_2.fastq
mv SRR5738085_1.fastq 05-Cecum_1.fastq
mv SRR5738085_2.fastq 05-Cecum_2.fastq
mv SRR5738092_1.fastq 03_Rumen_1.fastq
mv SRR5738092_2.fastq 03_Rumen_2.fastq
mv SRR5738091_1.fastq 04_Rumen_1.fastq
mv SRR5738091_2.fastq 04_Rumen_2.fastq
mv SRR5738094_1.fastq 05_Rumen_1.fastq
mv SRR5738094_2.fastq 05_Rumen_2.fastq

# Compress fastq files to save space
gzip *fastq

# Navigate up to the analysis/ directory
cd ../

6 Ensuring Data Integrity

After downloading or transferring files, it is good practice to ensure the files are an exact copy of the version on the server, i.e., to check the integrity of the data. A quick and useful way to do this is by finding a fingerprint (or checksum) for each file using the md5sum command (or md5). This is also called hashing: it takes a file of arbitrary size and condenses the data into a fixed-length string.
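If the provider supplies a list of expected checksums, the comparison can be automated rather than done by eye. The snippet below is a minimal sketch assuming GNU coreutils md5sum; the file name expected_md5sums.txt is hypothetical and would contain one "<checksum>  <filename>" line per file.

# Compute and save checksums for the downloaded files
md5sum raw_data/*fastq.gz > observed_md5sums.txt
# Re-hash each file listed in a provider-supplied checksum file and report OK or FAILED
md5sum -c expected_md5sums.txt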
If the md5 values are the same as what they are supposed to be (they are often supplied by the data provider), you can be confident the files are identical and that nothing went wrong in the download process. Below we provide the md5 values for each downloaded file. To obtain these, type "md5sum raw_data/*fastq.gz", which will run an md5 checksum on all of the files in the raw_data/ directory that end in fastq.gz.

MD5 (03-Cecum_1.fastq.gz) = afb204df6ae58af7da00cc78221da044
MD5 (03-Cecum_2.fastq.gz) = aec5b97a325f687604d43d18732a8486
MD5 (03-Colon_1.fastq.gz) = d0a129dddd7aea45b2b7481dd3e93146
MD5 (03-Colon_2.fastq.gz) = 808ace970c7c2aea59af4757cbcd26e3
MD5 (03_Rumen_1.fastq.gz) = 0c91bb2498439a965f8f30caae051ece
MD5 (03_Rumen_2.fastq.gz) = ed839b91ec20fa50c8cc38070ab6b9a2
MD5 (04-Cecum_1.fastq.gz) = e56de2d0a02c66d62ec67f79363600c6
MD5 (04-Cecum_2.fastq.gz) = f3b9c4725790e174f170696f5ab4faeb
MD5 (04-Colon_1.fastq.gz) = 53a0ae19e568a1406a32a6b028b8ba3a
MD5 (04-Colon_2.fastq.gz) = 09be6df77a6c3fc21d0363f7b6a78f47
MD5 (04_Rumen_1.fastq.gz) = 3246c89fa20bfd7e02b4ab76b6a11a0c
MD5 (04_Rumen_2.fastq.gz) = 95237571c6003704ab3afb4f9a1dffdf
MD5 (05-Cecum_1.fastq.gz) = 641770d9f3e824f7509f9967e3e6e15e
MD5 (05-Cecum_2.fastq.gz) = 82e7009e4c37ef11c3f1baee7bd7e918
MD5 (05-Colon_1.fastq.gz) = 465addb89806851caca724cf214506f0
MD5 (05-Colon_2.fastq.gz) = 01f69c06108177ba992df39e71b69010
MD5 (05_Rumen_1.fastq.gz) = 9c42ae670b201d2221f1c9d3049ba25c
MD5 (05_Rumen_2.fastq.gz) = 78eaa7c5f555b726abffd118c7029515

7 Quality Control Statistics

High-quality input data is a requirement for metagenomic taxonomy assignment and other subsequent analyses: poor quality in, poor quality out. Unfortunately, there is no consensus on the best way to remove low quality data. For example, there are mixed reports as to whether or not quality trimming is required for RNA sequencing gene quantification [26, 27]. In the case of microbe identification, it is generally advised to quality trim sequence data to avoid misclassification. In this section, we will walk through an example of how to filter and adapt sequencing reads to ensure high-quality data.

The first step in ensuring good quality data is to calculate quality metrics and visualize the quality. For this, we use a popular tool called FastQC. First ensure you are in the analysis/ directory (you can use the pwd command to find out where you are, and cd ../ or cd directory_x/ to navigate up to the parent directory or into a directory called directory_x, respectively).

# Run FastQC on all raw data; the asterisk matches any pattern, and by adding .fastq.gz to the end we ensure that only .fastq.gz files are matched
fastqc raw_data/*fastq.gz
# List all the files in the raw_data directory
# This should return all of the FastQC files that you have just created
ls raw_data/

Open up the first report in the raw_data directory using your preferred browser ("05-Colon_2_fastqc.html"). You may need to download the FastQC files locally to view them in a browser if you are carrying out this tutorial on a high performance computing cluster. This will open up some nice analysis of the quality of your sequences. A full description of each of the plots is available on the tool website, but we will summarize some of the important aspects below.
These graphs are summarized in the following sections.

7.1 Basic Statistics

This provides some basic statistics on the file, including the filename, file type, which quality score encoding was used (more on that below), the total number of sequences in the file, the number of sequences flagged as poor quality, the sequence length, and the percentage GC content.

7.2 Per Base Sequence Quality

This plot is arguably the most informative. A box plot shows the distribution of quality scores (y-axis) along individual read positions (x-axis). The mean quality score at each position is shown by the line running through the box plots. There are different ways to encode base quality scores in FASTQ files, but FastQC is good at determining the correct quality score system. The most common scoring system is the Phred system, where the probability of a base being incorrect is given by P = 10^(-Q/10). What this translates to in practice is that Q = 30 denotes a 1/1000 chance of a base being incorrect, Q = 20 denotes a 1/100 chance, Q = 10 denotes a 1/10 chance, and so on. Therefore, statistically speaking, Q = 20 means that approximately 1 base in a 100 bp read is incorrect, so a Q value above 30 is currently a good standard to aim for. It is entirely normal for Illumina sequencing data to start off with relatively lower quality, rise over the first few base pairs, and then drop off toward the end of the read. For this reason, it can be useful to remove either end of sequencing reads to leave just the high-quality middle portion. This example shows high-quality scores with low variation and a maximum read length of ~251 bp (Fig. 2).

Fig. 2 Per base sequence quality for the walkthrough file 05-Colon_2.fastq. Overall, this plot suggests high quality sequencing reads.

7.3 Per Tile Sequence Quality

This will produce a heatmap showing the deviation from the average quality split by position in the read (x-axis) and flow cell tile (y-axis). Blue colors are good and mean that there is little to no deviation from the average quality. Redder colors indicate that a particular tile had worse quality scores than other tiles at that base location. If there are areas of red, it may mean that your sequencer needs servicing, or it could be a much simpler issue, such as a bubble in the flow cell.

7.4 Per Sequence Quality Scores

A useful metric is the mean quality score across an entire read. This plot shows the distribution of these mean values across all sequencing reads. You would expect a sharp peak over a high quality value, with a small spread that rapidly falls back down again. Multiple peaks would suggest that a group of sequencing reads were of poorer quality.

7.5 Per Base Sequence Content

A line graph showing the percentage at which each nucleotide base (A, C, G, T) occurs at each position in a read. We expect the relative proportions to remain constant over the length of a read. The distribution can be skewed, particularly at the beginning of a read, if adapter sequences are present. The type of library preparation can also introduce bias.

7.6 Per Sequence GC Content

A plot showing the theoretical and observed distribution of percentage GC content in each read. Ideally, it should display a normally distributed bell curve that matches the theoretical distribution.
Abnormalities in the observed plot can indicate some form of contamination, or overrepresented sequences can skew the distribution. RNA sequencing is another factor that can result in some abnormal observations for this metric.

7.7 Per Base N Content

When a base cannot be called, it is replaced with "N". This plot shows the percentage of bases that are "N" across the length of the reads. Small proportions of N may be possible, especially where quality drops toward the end of the read. If there are many Ns, it may be associated with low quality.

7.8 Sequence Length Distribution

A plot demonstrating the distribution of sequence lengths. It should be expected to peak at a length similar to your expected read length.

7.9 Sequence Duplication Levels

A plot showing how many times each sequence is duplicated (x-axis) along with what percentage of reads this represents (y-axis). You should expect that most sequences are not duplicated, so there should be a peak at 1 which drops off.

7.10 Overrepresented Sequences

Overrepresented sequences can signify either some form of contamination or something that is biologically important. FastQC will automatically try to find the source of the overrepresented sequence, which can include some adapter sequences. This warning may also be triggered if the library fragmentation is a bit more selective.

7.11 Adapter Content

Adapter sequences are synthetic oligonucleotides that are joined to the ends of sequencing reads with the purpose of attaching them to the flow cell. In this section, Illumina adapter sequences, which might be alluded to in the overrepresented sequences or in the k-mer content, are detected. The plot shows the percentage of the library that is estimated to be attributed to adapter sequences at each position in a read. This graph typically increases in percentage toward the end of the read because FastQC assumes the rest of the read contains adapter sequence once one is detected in the middle of a read.

7.12 K-Mer Content

A k-mer is merely a string of nucleic acids of length k. This bit of analysis shows the presence of overrepresented k-mers (FastQC uses k = 7, or 7-mers) and what their distribution is like across the length of the read. On its own, this isn't particularly useful, but it can be used in conjunction with some of the other metrics above, like adapter content.

7.13 Scaling Quality Control to Multiple Samples

In the case of the first file (05-Colon_2_fastqc), overall the quality looks quite good. FastQC is a great tool for visualizing quality control statistics, but it can be unfeasibly laborious to investigate each report in depth and to compare samples or identify outliers. The tool MultiQC provides a great approach to summarizing quality control reports for all samples, from a range of tools in addition to FastQC. Running it from the command line is simple: it can be run with the command "multiqc ." from within the directory containing FastQC reports. The output will be a file named multiqc_report.html. This report contains much of the same information as the FastQC reports, but it presents the data in a way that makes it easy to visualize and compare all samples.

8 Trimming and Filtering Reads

It is quite ordinary to have raw sequencing data with imperfections. The real benefit of visualizing QC is to compare the quality metrics before and after applying trimming to remove inappropriate sequences. There are a few key questions following trimming that MultiQC can help answer.
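As a minimal sketch of that before-and-after comparison (assuming FastQC and MultiQC are installed and that the trimming script later in this section has written its output to processed_data/), reports for both directories can be aggregated into a single report:

# After trimming, generate FastQC reports for the trimmed reads
fastqc processed_data/*trimmed.fastq.gz
# Aggregate the raw and trimmed reports into one comparison report
# (-n sets the output report name)
multiqc raw_data/ processed_data/ -n trimming_comparison.html

Viewing raw and trimmed samples side by side in one report makes it easier to confirm that adapter content has been removed and that read length and quality distributions still look sensible.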
For each of the above quality metrics, you need to critically evaluate whether your trimming parameters have resolved these issues and have left a sufficient quantity of data for your analysis. To ensure the sequences are all of good quality, we will apply Trimmomatic, which is specifically designed for Illumina reads and widely used. Trimmomatic is a malleable tool that boasts a wide variety of parameters, such as removing the ends of reads below a defined quality value and setting a minimum read length. Additionally, Trimmomatic has been granted permission to use and distribute proprietary adapter sequences to search for and remove these from sequencing data.

To trim the reads, developing a shell script in Bash helps to avoid repeating commands. The first line, #!/bin/bash, tells your computer that you want to run the contents of the script in the bash shell. The line containing set -euxo pipefail is an incredibly useful bit of defensive programming for bash scripts. The -e and -o pipefail flags tell the script to terminate if any command (including one within a pipeline) returns an error. The -u flag stops the script from running commands with a variable that has not been set. The -x flag makes bash print each command before executing it, which makes it easier to debug exactly where an error occurs. Next, the script takes the inputs from the command line and assigns them to variables such as raw_fastq1 ($1 represents the first entry on the command line after the script name). Copy and paste the code below into a file called "quality_trim.sh" and make sure to save it in the analysis/ directory. Also ensure that you are in this directory.

#!/bin/bash
# This is a script to quality trim files
# To run the script use: ./quality_trim.sh R1.fastq.gz R2.fastq.gz output_paired1.fastq.gz output_unpaired1.fastq.gz output_paired2.fastq.gz output_unpaired2.fastq.gz
set -euxo pipefail

# Get the input and output files from the command line
raw_fastq1="$1"
raw_fastq2="$2"
out_paired1="$3"
out_unpaired1="$4"
out_paired2="$5"
out_unpaired2="$6"

# Ensure the adapter sequence file is downloaded (raw file from the Trimmomatic GitHub mirror); skip if already present
if [ ! -f NexteraPE-PE.fa ]; then
    wget -O NexteraPE-PE.fa https://raw.githubusercontent.com/timflutre/trimmomatic/master/adapters/NexteraPE-PE.fa
fi

# Run Trimmomatic in paired-end mode
trimmomatic PE $raw_fastq1 $raw_fastq2 \
    $out_paired1 $out_unpaired1 \
    $out_paired2 $out_unpaired2 \
    ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 HEADCROP:12 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50

The Trimmomatic call in the script will process each file by first finding and removing adapter sequences, removing the first 12 bases and any poor quality bases at either end of the read, then scanning the read four nucleotides at a time and trimming when the average quality falls below 20 (a 1/100 chance of an error). Lastly, Trimmomatic will only keep reads with a minimum length of 50 nucleotides. To run the script, you will need to make it executable, which can be done by typing chmod +x quality_trim.sh from the command line. Then try the following from the command line. Be sure to run the script on all sequencing files in the raw_data/ directory, including the negative control samples.

./quality_trim.sh raw_data/03-Cecum_1.fastq.gz raw_data/03-Cecum_2.fastq.gz processed_data/03-Cecum_R1_trimmed.fastq.gz processed_data/03-Cecum_R1_unpaired.fastq.gz processed_data/03-Cecum_R2_trimmed.fastq.gz processed_data/03-Cecum_R2_unpaired.fastq.gz

You could create a "for" loop in bash to make processing these files a bit simpler.
This loop simply says: for each of the sample identifiers denoted by "s", run the quality trimming script with the appropriate command line parameters.

for s in '03-Colon_' '04-Colon_' '05-Colon_' '03-Cecum_' '04-Cecum_' '05-Cecum_' '03_Rumen_' '04_Rumen_' '05_Rumen_'; do ./quality_trim.sh raw_data/${s}1.fastq.gz raw_data/${s}2.fastq.gz processed_data/${s}R1_trimmed.fastq.gz processed_data/${s}R1_unpaired.fastq.gz processed_data/${s}R2_trimmed.fastq.gz processed_data/${s}R2_unpaired.fastq.gz; done

Trimmed sequences will be present in the processed_data/ directory. Run FastQC on your trimmed sequencing files and examine the output. It should appear much more reasonable than the raw data.

9 Removing Host Derived Content

A large proportion of the DNA and RNA content in metagenomics samples obtained from host organisms is host-derived. It is advisable to deplete host sequences from samples: this limits taxonomic misclassification, reduces the size of files, and minimizes the required computing resources. This study used a laboratory-based method for microbial DNA enrichment, but it is still good practice to apply depletion approaches to the sequence data. There are many tools for this job, but we recommend that you use BBDuk from the BBTools suite. In this case, we will use a reference genome for cattle (Bos taurus) from NCBI and download it using the wget command:

wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Bos_taurus/latest_assembly_versions/GCF_002263795.1_ARS-UCD1.2/GCF_002263795.1_ARS-UCD1.2_genomic.fna.gz

Then run depletion with:

mkdir -pv depleted_data

# Run BBDuk on a single sample
bbduk.sh in=processed_data/03-Cecum_R1_trimmed.fastq.gz in2=processed_data/03-Cecum_R2_trimmed.fastq.gz out=depleted_data/03-Cecum_R1_depleted.fastq.gz out2=depleted_data/03-Cecum_R2_depleted.fastq.gz outs=depleted_data/03-Cecum_R3_depleted.fastq.gz ref=GCF_002263795.1_ARS-UCD1.2_genomic.fna.gz mcf=0.5

# For loop to process all samples
for s in '03-Colon_' '04-Colon_' '05-Colon_' '03-Cecum_' '04-Cecum_' '05-Cecum_' '03_Rumen_' '04_Rumen_' '05_Rumen_'; do bbduk.sh in=processed_data/${s}R1_trimmed.fastq.gz in2=processed_data/${s}R2_trimmed.fastq.gz out=depleted_data/${s}R1_depleted.fastq.gz out2=depleted_data/${s}R2_depleted.fastq.gz outs=depleted_data/${s}R3_depleted.fastq.gz ref=GCF_002263795.1_ARS-UCD1.2_genomic.fna.gz mcf=0.5; done

This command works by comparing each sequence to all of the unique k-mers in the reference genome. If 50% or more of a sequencing read is covered by k-mers found in the reference genome, then it is removed. The command returns paired-end reads as well as single-end reads whose pair has been removed. This step requires an amount of RAM larger than most personal computers have (~30-40 GB) and so can be omitted for the purposes of this tutorial. If you have access to that much RAM, feel free to run depletion using the code above.

10 Taxonomic Classification

Part of the art of microbial bioinformatics is knowing which computational tool to use for taxonomic classification. They each have strengths and weaknesses. Reading tool manuscripts is not always the best way to gauge their performance, and there have been numerous independent benchmarking approaches to deciphering which tool might be the best to use in particular circumstances [33, 34]. Kraken is a widely used tool for taxonomic classification. Usefully, this tool allows for user-defined metagenomic databases. It is a k-mer-based approach that is typically quite sensitive.
Drawbacks include false positive results, and databases containing a large number of sequences can prohibitively increase the memory requirements (although more recent versions have improved this aspect). Additionally, sequencing reads that do not contain enough unique k-mers to be mapped to species-level classifications are given a less specific classification at a higher taxonomic level (e.g., order or family level). These classifications can be redistributed to a more specific taxonomic level (e.g., species level) using a Bayesian approach (Bracken). Taxonomic profilers such as mOTUs2 and MetaPhlAn use pre-built databases of taxonomically informative sequences and are therefore not as malleable to user-defined sequences. By using smaller pre-built databases, they are less computationally intensive and often perform well at predicting relative abundance. Each taxonomic profiler uses a different rationale to develop its pre-computed database. mOTUs2, for example, uses a database of 40 widely expressed single-copy marker genes across the tree of life that were shown to be highly taxonomically informative.

In this walkthrough we will run MetaPhlAn on the quality-trimmed bovine metagenome data. To do so, we will use a Snakemake pipeline (implemented in Python) to make processing all samples easier. The pipeline will be kept simple by only analyzing the paired-end samples that have undergone quality trimming (another potential source of data could be the unpaired reads that were produced as a result of quality trimming and/or host depletion). Save the Snakemake code below to a file called "mph.snake" in the "analysis/" directory and launch it from the command line with "gunzip processed_data/*gz", followed by "snakemake --snakefile mph.snake --jobs 1". This is a Snakemake workflow with only one rule, which runs MetaPhlAn on all the files it finds. The output files are tab-separated files containing some header information about each run and the sample name, followed by the taxa, their NCBI taxonomic IDs, and their relative abundance. MetaPhlAn comes with some handy utility scripts. All of the reports can be merged with the following command: "merge_metaphlan_tables.py processed_data/*mph.tsv > all_metaphlan_genera.tsv".

#!/usr/bin/env python
# Import the modules needed throughout the script
import glob
import re

# Find all input files to process
input_files = glob.glob("processed_data/*_R1_trimmed.fastq")

# Prepare the list of output files expected based on our input list
output_files = [re.sub('_R1_trimmed.fastq', '_mph.tsv', file) for file in input_files]

rule all:
    input:
        [file for file in output_files]

rule run_metaphlan:
    input:
        R1="processed_data/{sample}_R1_trimmed.fastq",
        R2="processed_data/{sample}_R2_trimmed.fastq"
    output:
        out="processed_data/{sample}_mph.tsv",
        bwtout="processed_data/{sample}_bowtie.bz"
    shell:
        "metaphlan {input.R1},{input.R2} --input_type fastq --tax_lev g -o {output.out} --sample_id_key {wildcards.sample} --bowtie2out {output.bwtout}"

11 Data Handling, Visualization, and Comparative Analysis

The data in all_metaphlan_genera.tsv captures the relative abundance of all genera found in all of our nine samples. Now we can begin to carry out some analysis in the programming language R, which you can access by typing R from the command line. The code below will read in and "clean" our data so that it is in a format that can be analyzed.
It is good practice to visualize your data as you go along to ensure its integrity, and this is easy to do in an integrated development environment (IDE) like RStudio. You can also examine your data within R by typing commands such as "str(community_matrix)", which tells you the structure. From this, it should be clear that each of our columns is a genus and that they are all "num" variables. "head()" and "tail()" give you the first and last lines, respectively, which can be overwhelming if you have a large number of features. You can also type "colnames(community_matrix)" and "rownames(community_matrix)" to display the taxa and samples, respectively. You can view a particular column, such as the column for Bacteroides, as follows: "community_matrix$Bacteroides". Typing "rowSums(community_matrix)" will tell us the sum of the values in each sample; in this case, they all add up to 100, which is perfect because this data is a percentage. Another useful function is "is.na()", which can tell you how many cells have no assigned abundance in your dataset via the command "length(which(is.na(raw_data)))". Sometimes "NA" values can arise from merging reports together and usually represent the absence of a taxon in metagenomics. All "NA" values can be replaced with zero values with code such as "community_matrix[is.na(community_matrix)] <- 0".