Bioinformatics: DNA Sequencing Quality Control
Summary
This document is a lecture on quality control in bioinformatics, focusing specifically on DNA sequencing. The lecture covers the reasons for quality control, the types of errors found in sequencing experiments, and methods such as read quality assessment, adapter and contaminant filtering, and trimming of low-quality bases. It also covers k-mer analysis. The document is suitable for undergraduate students studying genetics or bioinformatics.
Full Transcript
BIOTECH 4BI3 - Bioinformatics
Lecture 3 – DNA Sequence Quality Control

Where are we going?
[Course roadmap figure listing the pipeline stages: DNA Sequencing, Sequencing Quality Control, Read Mapping, DNA Assembly, Genome Annotation, Expression Analysis, Polymorphism Discovery, Genotyping, Marker-Trait Associations, Population Analysis]

Learning Outcomes
- Be able to describe why quality control is an important step in bioinformatics analysis
- Understand the steps in sequence QC
- Review the types of errors most commonly found in DNA sequencing experiments and their sources
- Review common metrics of sequence quality
- Identify sources of sequencing read duplication in experiments and how we can control for them
- Understand k-mer frequency distributions, their uses in DNA sequence QC, and what information can be learned from them

Why is Quality Control Important?
- Ensures Accuracy of Results: Without QC, errors in sequencing data can lead to incorrect conclusions in downstream analyses.
- Minimizes Cost and Time: Detecting and addressing sequencing errors early in the process can save significant time and resources that would otherwise be spent on re-sequencing or correcting data.
- Prevents Misinterpretations: Poor-quality data can lead to misinterpretation in research studies or clinical applications. For example, false-positive or false-negative variant calls can mislead disease association studies.
- Improves Reproducibility: Standardized QC measures improve the reproducibility of experiments, which is critical in both academic research and clinical settings.

What Happens Without QC
- Incorrect Genome Assemblies: Poor-quality data can lead to fragmented or misassembled genomes, affecting evolutionary studies or functional genomics.
- False Variant Calls: Errors in sequencing data can cause false SNP or indel calls, impacting studies of genetic variation, disease association, or population genetics.
- Biased Expression Data: In RNA-Seq experiments, low-quality reads can introduce biases in gene expression quantification, leading to incorrect biological interpretations.
- Contaminated Data Interpretation: Without QC, contamination from other species or unintended samples can lead to spurious findings, especially in metagenomics or microbiome studies.

Key Types of Quality Control
- Read Quality Assessment: Use tools like FastQC to evaluate the quality of raw reads.
- Adapter and Contamination Filtering: Remove adapter sequences and check for any contamination that might skew results.
- Trimming Low-Quality Bases: Remove low-quality bases from the ends of reads to improve overall read quality.
- Error Correction: Implement error-correction methods to identify and correct sequencing errors.
- K-mer Analysis: Analyze k-mer frequency distributions to detect errors, contamination, or other anomalies.
- Duplication Level Analysis: Assess the level of duplication in the sequencing reads to identify potential over-sequencing or PCR artifacts.

Read Quality Assessment
Objective: Evaluate the quality of raw sequencing reads to identify potential issues that could affect downstream analyses.
Tools Used: FastQC, MultiQC.
Common Metrics Assessed:
- Per Base Sequence Quality: Identifies low-quality bases that may indicate errors, particularly at the ends of reads.
- Per Sequence Quality Scores: Evaluates the overall quality of each read; a bimodal distribution may suggest a subset of low-quality reads.
- Per Base N Content: High levels of 'N' bases can indicate low-quality reads or sequencing issues.
Outcome: Identify regions or reads that need trimming or correction to improve the dataset's overall quality.
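The metrics above are built on Phred quality scores: Q = -10 · log10(P), where P is the estimated probability that a base call is wrong, stored in FASTQ files as ASCII characters offset by 33 in the standard Sanger/Illumina 1.8+ encoding. A minimal sketch in pure Python, using an invented quality string, of how those characters decode:

```
PHRED_OFFSET = 33  # Sanger / Illumina 1.8+ FASTQ encoding

def quality_scores(qual_string):
    """Decode a FASTQ quality string into per-base Phred scores."""
    return [ord(c) - PHRED_OFFSET for c in qual_string]

def error_probability(q):
    """Phred score -> probability the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Invented quality string: high quality ('I' = Q40) decaying to Q2 ('#'),
# the pattern typically seen toward the 3' end of Illumina reads.
qual = "IIIIIIIIIIIIIIII???###"
scores = quality_scores(qual)
print(scores[0], scores[-1])   # 40 2
print(error_probability(40))   # 0.0001 (1 error in 10,000 calls)
print(error_probability(20))   # 0.01   (1 error in 100 calls; the common Q20 cutoff)
```

A Q2 base has roughly a 63% chance of being wrong, which is why low-quality 3' tails stand out so clearly in the Per Base Sequence Quality plot.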
Read Quality Assessment – FastQC
- A widely used piece of analysis software for quality control of raw sequencing data.
- It analyzes the sequencing reads from an experiment (or a subset of the reads) and creates a number of diagrams and tables summarizing different views of the information.
- These diagrams are not difficult to create, but FastQC offers a simple method of generating them and provides an HTML report.
- Commercial sequencing providers will often supply reports with much of the same information as that provided by FastQC.
(The following diagrams are from FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Per Base Quality Scores
- Shows the quality score distribution at all positions along the reads.
- The yellow box captures the inter-quartile range; the whiskers represent the top and bottom 10% of values.

Per Sequence Average Quality
- Illustrates the overall quality of the sequencing reads; we would prefer a tight distribution on the right side of the graph.
- A bimodal distribution reveals a subset of reads with overall lower quality, which we may want to filter out.

Per Nucleotide Sequence Content
- In a random library we would expect the average nucleotide abundance at each read position to be consistent, so we want to see roughly parallel lines.
- Some library preps (random hexamer priming and tagmentation) can result in bias, typically at the 5' end.

Per Sequence GC Content
- We expect a normal-shaped distribution centered on the average GC content of the sequenced organism.
- Deviations from a normal shape may indicate contamination in the library preparation (a mix of two distributions).
- A shifted distribution indicates a systemic bias that is not associated with base position.

Sequence Length Distribution
- Only useful for technologies creating variable-length sequencing reads.
- Significant differences from what was expected may indicate a bias in the sequencing run.

Duplicated Sequences
- In data from a sequenced genome we expect most reads to have a low level of duplication.
- A moderate level of duplication across the graph can indicate over-sequencing.
- It is not uncommon to see spikes on the right, which indicate the presence of high-copy-number elements in the genome.

Adapter and Contaminant Filtering
Objective: Remove adapter sequences and contaminants that can interfere with downstream processing.
Why It's Important:
- Adapter sequences from library preparation can lead to false-positive variant calls or incorrect read mapping.
- Contaminants such as human DNA or bacterial sequences can skew results, particularly in metagenomics or microbiome studies.
Tools Used: Trimmomatic, Cutadapt, BBMap, Fastp, fastq_screen.
Outcome: Cleaner datasets with fewer artifacts, leading to more accurate alignments, assemblies, and variant calls. (A sketch of the adapter-matching idea follows below.)

Contamination Filtering – fastq_screen
https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
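Adapter read-through occurs when the insert is shorter than the read length, so the read runs off the fragment and into the adapter. The core matching idea in trimmers such as Cutadapt and Trimmomatic is to find a read suffix that matches a prefix of the adapter; the following is only a minimal sketch of that idea (the adapter and read sequences are invented), not the actual algorithm of either tool:

```
def trim_3prime_adapter(read, adapter, max_mismatch_rate=0.1, min_overlap=3):
    """Find the longest read suffix matching an adapter prefix (within the
    mismatch budget) and cut the read there. Returns the trimmed read."""
    for start in range(len(read) - min_overlap + 1):
        suffix = read[start:]
        if len(suffix) > len(adapter):
            continue  # a suffix longer than the adapter cannot be pure adapter
        prefix = adapter[:len(suffix)]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch_rate * len(suffix):
            return read[:start]  # first hit = longest acceptable overlap
    return read  # no adapter found; leave the read unchanged

# Invented example: a short insert followed by the start of an adapter.
adapter = "AGATCGGAAGAGC"            # illustrative adapter sequence
read = "ACGTACGTTTGGCCAAGATCGGAAG"   # insert + partial adapter read-through
print(trim_3prime_adapter(read, adapter))  # -> "ACGTACGTTTGGCCA"
```

Note the trade-off built into min_overlap: very short overlaps will occasionally match by chance, so real tools weigh overlap length against the probability of a random hit.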
Trimming Low-Quality Bases
Objective: Remove low-quality bases from the ends of reads to enhance the overall quality of the sequence data.
Why It's Important:
- Sequencing technologies like Illumina often show decreased quality at the 3' end of reads; trimming these regions can reduce error rates.
- Trimming preserves the high-quality central regions of the reads, which are more reliable for downstream analyses.
Tools Used: BBDuk, Trimmomatic.
Outcome: Increased read quality scores and reduced downstream error propagation, leading to more reliable results.

Error Correction
Objective: Identify and correct sequencing errors to improve data accuracy.
Types of Errors:
- Substitution Errors (incorrect base calls)
- Indel Errors (insertions or deletions of bases)
Methods for Error Correction:
- K-mer Based Correction: Identify low-frequency k-mers likely caused by errors and correct them using high-frequency k-mers.
- Consensus-Based Correction: Utilize overlapping paired-end reads to generate a consensus sequence that corrects errors.
Outcome: Higher-quality data that is more accurate for applications such as genome assembly or variant calling.

Why do our DNA sequence reads have errors?
- Each DNA sequencing technology we've covered has its own characteristic errors.
- Errors can be attributed to either the library preparation step or the sequencing process itself.
- Most sequencing technologies have a bias, meaning that the errors created are not random.
- Because the errors are not random, we cannot always detect or fix them in our sequencing data.

Sources of Errors – Library Prep
PCR-Based Errors:
- Transitions: More common than transversions; these are substitutions between purines (A ↔ G) or between pyrimidines (C ↔ T).
- Transversions: Less frequent; substitutions between a purine and a pyrimidine (A ↔ C, A ↔ T, G ↔ C, G ↔ T).
PCR Bias:
- Impact on Amplification: Some sequences are preferentially amplified, leading to biased results. High-GC sequences are harder to amplify than AT-rich sequences.
- Self-Annealing Fragments: DNA fragments with high self-annealing propensity may amplify unevenly, causing deviations that are not strictly errors but still affect data quality.

Sources of Errors – Instrument
Amplification-Based Technologies:
- PCR-Based Errors: Instruments that amplify DNA using polymerases can introduce errors similar to those seen in PCR library preparation.
Signal Amplification Errors:
- Bridge Amplification (Illumina): Errors occur when nucleotide incorporation cycles fall out of phase, resulting in mixed fluorescent signals and increased errors at the 3' end.
- Rolling-Circle Amplification: Similar out-of-phase issues can occur, leading to confounded base calling.
Platform-Specific Error Types:
- Oxford Nanopore: Prone to errors in homopolymers (runs of a single base, such as AAAAA or TTTTT).
- PacBio SMRT Cells: Known for random errors throughout the reads, making error rates less predictable.

Handling Sequencing Errors
Goal: Improve the accuracy and reliability of DNA sequencing data by addressing errors.
Common Approaches:
- Masking Low-Quality Bases
- Trimming Low-Quality Ends
- Error Correction Techniques

Masking Low-Quality Bases
Concept: Use the IUPAC standard to mask low-confidence bases with 'N' (any nucleotide).
Why Mask?: Ignoring low-quality bases improves downstream analysis by avoiding misleading alignments or variant calls.
How It Works: Replace bases with low quality scores (e.g., below Q20) with 'N', allowing downstream software to ignore these uncertain bases. (A sketch follows below.)
Trade-off: May lose some correct bases, but the overall data quality improves.
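Masking is the simplest of the three approaches to implement. A minimal sketch in pure Python, assuming the standard Phred decoding shown earlier and an invented read, that replaces every base below Q20 with 'N':

```
def quality_scores(qual_string, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores."""
    return [ord(c) - offset for c in qual_string]

def mask_low_quality(seq, qual_string, min_q=20):
    """Replace every base whose Phred score falls below min_q with 'N'."""
    return "".join(
        base if q >= min_q else "N"
        for base, q in zip(seq, quality_scores(qual_string))
    )

# Invented read with two low-confidence calls ('#' decodes to Q2).
seq  = "ACGTACGTAC"
qual = "IIII#III#I"
print(mask_low_quality(seq, qual))  # -> "ACGTNCGTNC"
```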
Trimming Low-Quality Ends
Concept: Remove stretches of low-quality bases from the ends of reads.
Why Trim?: Sequencing-by-synthesis technologies (e.g., Illumina) accumulate errors toward the 3' end of reads.
How It Works: Identify and remove low-quality tails that fall below a set quality threshold.
Trade-off: Potential loss of some high-quality bases, but the benefits to overall data integrity outweigh this loss.

Error Correction Techniques
Concept: Replace erroneous bases with the correct base call, or adjust quality scores.
Methods:
- K-mer-Based Correction: Use high-frequency k-mers to correct low-frequency erroneous k-mers.
- Consensus-Based Correction: Use paired-end read overlap to generate a consensus sequence for higher accuracy.
Why Correct?: Reduces the error rate, making the data more reliable for assembly, variant calling, and other analyses.
Tools: BFC, BayesHammer, SPAdes error correction.

Error Correction with K-mers
[Figure: a read containing a single erroneous base, with high-count and low-count k-mers marked. Any k-mer that includes the erroneous base has a lower count than the k-mers that do not include the error.]

Paired-End Consensus Correction
Objective: Correct errors at the 3' ends of sequences in short-read data.
Description: Use overlapping low-quality bases from forward and reverse read pairs to increase the confidence of shared calls. Where the two reads disagree, the call with the higher quality score can be used. This approach was first used with Sanger sequencing.
Usage:
- Common in Illumina data, since the 3' ends of Illumina reads tend to be error prone.
- Only works when the forward and reverse reads overlap.
Outcome: Dramatically improves the quality of a sequence and therefore improves follow-on analyses like genome assembly. (A sketch of the consensus step follows below.)

Correcting PCR Library Errors
Purpose: Error correction in PCR-amplified libraries aims to remove or mitigate biases and errors introduced during the amplification process, ensuring more accurate downstream analyses.
When to Use:
- Non-Quantitative Applications Only: Correction is suitable for applications where accurate quantification is not required, as correction methods may skew quantification results.
- Requires Paired-End Reads: This approach relies on overlapping paired-end reads to generate a consensus sequence.
- Optimal for Larger Inserts: Larger insert sizes increase the likelihood of overlap between paired-end reads, improving the effectiveness of error correction.
Assumption: Each template fragment in the library is unique, so reads mapping to the same position are considered PCR duplicates.

Correcting PCR Library Errors – Step-by-Step Approach
1. Identify Duplicate Reads: Find read pairs where both forward and reverse reads map to the same genomic position. These are likely to have arisen from PCR amplification rather than independent sampling.
2. Utilize Unique Molecular Identifiers (UMIs) If Available: UMIs are short, random sequences added to each DNA fragment before PCR amplification. They help distinguish true duplicates (identical UMIs) from independent reads.
3. Generate a Consensus Sequence: For each set of duplicate reads, generate a single consensus sequence by aligning the overlapping regions of the paired-end reads and using the higher-quality base calls where they differ.
Quality Improvement: Consensus sequences reduce the impact of random sequencing errors and provide a more accurate representation of the original DNA fragments.
Outcome: Reduces errors and biases introduced by PCR, leading to more accurate downstream analyses, such as variant calling and genome assembly.

Correcting PCR Library Errors – Example
Four duplicate reads of the same template, two carrying single-base errors, collapse to one consensus read:
AAAAAAAAAAAAAAAAAAAAA
AAAAAAATAAAAAAAAAAAAA
AAAAAAAAAAAGAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAA
consensus read: AAAAAAAAAAAAAAAAAAAAA
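Both paired-end overlap correction and PCR-duplicate collapsing come down to the same consensus step: where stacked calls for a position agree, keep the call; where they disagree, keep the call with the higher quality score. A minimal sketch under that rule (the calls are invented; real tools must first align the reads and also handle indels and quality recalibration):

```
def consensus(calls):
    """Build a consensus from stacked (base, phred_quality) calls covering
    the same positions, e.g. the overlap of a read pair or a set of PCR
    duplicates. Keeps the highest-quality base at each position."""
    seq, quals = [], []
    for column in zip(*calls):                    # one column per position
        best = max(column, key=lambda bq: bq[1])  # highest-quality call wins
        seq.append(best[0])
        quals.append(best[1])
    return "".join(seq), quals

# Invented forward/reverse overlap; the reads disagree at position 2,
# where the reverse read is far more confident (Q36 vs Q12).
fwd = [("A", 38), ("C", 35), ("T", 12), ("G", 30)]
rev = [("A", 30), ("C", 33), ("G", 36), ("G", 34)]
seq, quals = consensus([fwd, rev])
print(seq)    # -> "ACGG"  (the low-quality T is overruled)
print(quals)  # -> [38, 35, 36, 34]
```

A fuller implementation would also boost the quality score where calls agree, since two independent observations of the same base imply much higher confidence than either alone; that is why overlap consensus improves 3' end quality so dramatically.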
Read Duplication Analysis
Objective: Assess the level of duplicate reads to identify over-sequencing, PCR artifacts, or optical duplicates.
Types of Duplicates:
- PCR Duplicates: Caused by amplification of the same fragment during library preparation.
- Optical Duplicates: Caused by physical proximity of identical sequences on the sequencing flow cell.
Why It's Important:
- High duplication levels can skew quantification results (e.g., RNA-Seq) and affect variant-calling accuracy.
- Identifying duplicates can help optimize sequencing depth and data usage.
Tools Used: Picard, SAMtools.
Outcome: Cleaner data with minimized duplication bias, resulting in more accurate quantification and variant analysis.

Addressing PCR Duplicates
- These duplicates are controlled by the use of Unique Molecular Identifiers (UMIs): a short random sequence, usually 6-8 bases long, introduced 3' of the sequencing primer binding sites (P5 and P7).
- UMIs allow the identification of sequencing reads that originated from the same template molecule during PCR.
- This is very important for sequence-counting assays like RNA-Seq and genotyping, where PCR duplicates must be controlled for.

Addressing Optical Duplicates
- This problem was not well understood by Illumina when first discovered.
- Identical sequences coming from nearby coordinates on the flow cell indicate optical duplicates: it is unlikely that a small fragment of DNA would be independently sampled from a genome twice and land close to itself on a flow cell that can hold billions of fragments. Optical duplicates instead originate from the same initial fragment that bound to the flow cell.
- The solution is to use UMIs when available.
- If UMIs are not available, you can identify optical duplicates as fragments with the same sequence generated from physically close positions on the flow cell; this requires setting a maximum distance between identical sequences.

K-mers
- Definition: A k-mer is a subsequence of length k extracted from a longer DNA sequence. For example, for k = 4, the sequence "ATCGG" contains the k-mers "ATCG" and "TCGG".
- Importance in Bioinformatics: K-mer analysis is a foundational technique used for many applications, including genome assembly, error correction, contamination detection, and genome size estimation.
- What K-mers Can Tell Us: By analyzing the frequency of k-mers, we can learn about the underlying sequence quality, complexity, and characteristics, such as repetitive regions, sequencing errors, and coverage.

Selecting K-mer Size
- A small k (e.g., k = 5) may result in too many repetitive or non-unique k-mers, especially in complex genomes, leading to ambiguity in analysis.
- A large k increases the uniqueness of k-mers in the genome, but too large a value can reduce sensitivity and make the analysis computationally expensive.
Considerations:
- Larger genomes with higher complexity require larger k-mers to differentiate between unique and repetitive regions.
- Higher sequencing depth allows the use of larger k-mers, because better coverage reduces the chances of missing low-abundance k-mers.
- For most genome assembly projects, a k-mer size of 21 to 31 is a good starting point.
- For error correction or genome size estimation, a k-mer size around 19 to 25 is often optimal.

Optimal K-mer Size for Analysis
[Figure: percentage of paired k-mers with a uniquely assignable location versus k-mer length (8-20 bp) for E. coli and human; uniqueness rises with k. Credit: Dr. Jianhua Ruan and Jay Shendure.]

K-mer Frequency
- The process of identifying all fragments of a given length k and counting how many times each k-mer is found in your read population.
- We want a k-mer size at which most fragments are unique in the genome.
- The result is presented as a histogram of count values.
- Useful for many applications, including: prediction of genome size; identifying contamination in sequencing libraries; error correction of reads; quality control across sequencing runs.

Understanding a K-mer Frequency Distribution
- x-axis: k-mer abundance (the number of times a k-mer appears in the dataset).
- y-axis: the number of distinct k-mers with that abundance.
- Error k-mers: a peak at low abundance values represents sequencing errors (low-frequency k-mers).
- True k-mers: the main peak represents correctly sequenced k-mers; this is the unique genome content.
- Repetitive k-mers: higher-abundance k-mers indicate repetitive regions in the genome.

K-mer Frequency Analysis
[Figure: an example k-mer frequency histogram annotated with the error peak, the main unique-content peak, repeated sequence, and contamination.]

Genome Size Estimation
- K-mer analysis can be used to estimate the total length of the genome you sequenced.
- Identify the peak representing unique k-mers from your sample on the plot, i.e., where the curve hits its maximum.
- Determine the total number of k-mers in the distribution.
- Genome size = (total k-mer count) / (peak depth). (A sketch follows below.)
Example: GS = 43,099,162,547 / 25 ≈ 1,724 Mb
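The counting and size-estimation steps above are easy to express directly. A minimal pure-Python sketch, fine for toy inputs (real datasets need dedicated counters such as Jellyfish or KMC); the short sequence is invented, and the final numbers come from the lecture's worked example:

```
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping k-mer of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def estimate_genome_size(histogram, peak_depth):
    """Genome size = (total k-mers in the distribution) / (unique-peak depth).
    `histogram` maps abundance -> number of distinct k-mers at that abundance."""
    total_kmers = sum(abundance * n for abundance, n in histogram.items())
    return total_kmers / peak_depth

# Toy counting example (k = 4): "ATCGGATCGG" yields 7 overlapping k-mers,
# two of which are the repeated "ATCG".
print(count_kmers("ATCGGATCGG", 4)["ATCG"])  # -> 2

# The lecture's worked example: 43,099,162,547 total k-mers, peak at 25x.
print(43_099_162_547 / 25 / 1e6)  # -> ~1724 (Mb)
```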
Transcriptome K-mer Frequency
http://www.homolog.us/blogs/blog/2011/10/26/k-mer-distribution-of-a-transcriptome/

Popular K-mer Tools
- Jellyfish: Fast k-mer counting and analysis; suitable for large datasets.
- KMC (K-mer Counter): Highly efficient tool for counting and manipulating k-mers.
- GenomeScope: Utilizes k-mer spectra models to estimate genome size, heterozygosity, and repeat content.
- BFC: Error-correction tool that uses k-mer frequencies to correct sequencing errors in Illumina reads.
Considerations When Choosing a Tool:
- Speed and Memory Usage: Important for large genomes or deep sequencing projects.
- Output Format and Visualization: Look for tools that provide clear and interpretable plots and statistics.
- Ease of Integration: Tools that can be integrated into existing workflows or pipelines (e.g., Snakemake, Nextflow).

Genome Composition Estimation
Mixture-model k-mer frequency analysis with GenomeScope (http://qb.cshl.edu/genomescope/).

Summary
- Read Quality Assessment: detects low-quality reads.
- Adapter and Contamination Filtering: removes artifacts and contaminants.
- Trimming Low-Quality Bases: improves read reliability.
- Error Correction: corrects sequencing errors.
- K-mer Analysis: provides insights into errors, contamination, and genome characteristics.
- Duplication Level Analysis: reduces duplicate reads and biases.
Outcome: High-quality, reliable datasets that enhance the accuracy of downstream bioinformatics analyses.