Questions and Answers
What is the primary purpose of read mapping?
A reference genome is typically of low quality and poorly annotated.
False
Name one limitation of read mapping.
Sequencing errors or significant divergence from the reference genome.
Read mapping involves finding the best match for DNA sequencing reads in the ______ genome.
Which of the following is NOT a reason why mapping reads is important?
Match the process with its description:
All DNA sequencing reads can find an acceptable match in the reference genome.
What influences mapping quality in read mapping?
What is the space efficiency of the Hash Table approach in read mapping?
The Burrows-Wheeler Transform is primarily a compression algorithm.
Name one read mapping algorithm that allows for indels and mismatches.
The Burrows-Wheeler Transform is particularly effective for _____ large datasets in bioinformatics.
Match the following read mapping methods with their corresponding characteristics:
What is the primary purpose of the Burrows-Wheeler Transform?
The last column of the Burrows-Wheeler Matrix is used to generate the Burrows-Wheeler Transform.
What property is used to reverse the Burrows-Wheeler Transform?
The Burrows-Wheeler Matrix exhibits similarities to the _____ generated using the same sequence.
Match the following components of the Burrows-Wheeler Transform with their descriptions:
In the context of LF Mapping, what does the ith occurrence of a character in L correspond to?
The order of characters in the left column (L) changes after the Burrows-Wheeler Transform is applied.
What is the final output of the Burrows-Wheeler Transform for the string 'ATAATA$'?
What is the primary benefit of using Burrows-Wheeler Transform (BWT) in read mapping?
The first column of the BWT is necessary for reconstructing the original sequence.
What does SAM stand for in the context of file formats for storing mapping information?
The __________ is a compressed binary format that saves space and offers computational efficiencies.
During the backward searching process, what is the first character investigated?
Match the following file formats with their characteristics:
What is the role of aligners like BWA and Bowtie in relation to BWT?
The backward search requires scanning the entire genome to find matches.
What information does the FLAG field convey in the BAM file structure?
The CIGAR string only indicates matches in the read alignment.
What does MAPQ stand for, and why is it important?
The observed length of the template DNA fragment sequenced is referred to as ______.
Which of the following is NOT included in the BAM file header?
SAM files only contain read sequences but no mapping data.
Match the following components to their descriptions in the BAM file structure:
What does the software 'samtools' primarily do?
What does a MAPQ score of 0 indicate?
A higher MAPQ score indicates that the read alignment is less reliable.
What is the formula for calculating MAPQ?
If a read maps to multiple locations equally well, the MAPQ score is set to ______.
Match the MAPQ score to its description:
Which factor does NOT contribute to the determination of MAPQ scores?
Longer reads usually lead to lower MAPQ scores because they are less specific.
What is the primary reason for a low MAPQ score?
Study Notes
Lecture 6 - DNA Read Mapping
- Lecture topic: DNA read mapping in bioinformatics (specifically, BIOTECH 4B13)
Where are we going?
- The lecture places read mapping in a workflow that runs from DNA sequencing through quality control and assembly. Read mapping then supports genome annotation, expression analysis, marker-trait associations, population analysis, genotyping, and polymorphism discovery.
Learning Outcomes
- Define read mapping and applications
- Identify limitations and common issues in read mapping
- Understand 4 read mapping algorithms
- Interpret the contents of a SAM/BAM file
- Interpret mapping quality and its influencing factors
What is Read Mapping?
- Researchers are often limited to a small selection of high-quality reference genomes.
- A reference genome is a highly contiguous and representative genome of a species with excellent annotation plus chromosome-scale assembly and scaffolding.
- Read mapping is the process of aligning short DNA sequencing reads against a reference genome to determine the best match for each read.
- Not all reads align perfectly to the reference genome. This can result from sequencing errors or significant divergence between the sequenced individual and the reference.
What is Read Mapping (visual representation)
- A set of sequenced reads is aligned against a reference genome.
- The mapping process identifies specific locations on the reference genome where the reads align, providing insights into the genomic sequence of the sample.
Read Mapping Example
- Displays a software visualization of read mapping.
Why do we want to map reads?
- Constructing a high-quality reference genome requires significant resources.
- Read mapping allows comparison of individual genomes (or multiple genomes) against a reference genome.
- This has multiple purposes including:
- Identifying polymorphisms (DNA sequence variations) among individuals.
- Quantifying gene expression (measuring the abundance of gene transcripts).
- Mapping genome structure and interactions (Hi-C, ATAC-Seq)
- Investigating genomic modifications (epigenetic analysis).
Read Mapping Considerations
- A single read should ideally align to a single location within the reference genome.
- Factors impacting mapping quality: quality of the reference genome, DNA/RNA quality during re-sequencing, and the relatedness of samples.
- A significant percentage of reads (>20%) may fail to map to a reference genome; when this happens, the cause needs to be investigated.
Duplication in the Genome
- A single genomic sequence can sometimes appear repeatedly in the reference genome.
- This can confuse read mapping, especially with short sequencing reads and polyploid genomes (genomes with multiple sets of chromosomes).
Read Mapping Algorithms
- Different algorithms exist; a naïve approach compares every read against every position in the reference genome.
- The computational cost is high, with a complexity usually noted as O(m²n).
- More efficient approaches index the reference genome and use sub-sequences of each read to narrow the search to potential matches.
Read Mapping Algorithms (list of paradigms)
- Hash Table: Fast, but limited by the requirement for exact matches, so it handles mismatches poorly.
- Array Structures: Can accommodate mismatches but are not as fast as the hash table, and their space requirements are proportional to the reference genome size.
- Smith-Waterman: Handles indels and mismatches and always produces the best possible alignment, but is computationally expensive (slow, O(mnN)).
- Burrows-Wheeler Transform/FM Index: Fast and memory-efficient (O(m+N)).
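To make the contrast between the naive scan and an indexed lookup concrete, here is a minimal sketch for exact matches only. The function names, the toy reference string, and the seed length `k=3` are illustrative assumptions, not taken from the lecture, and real aligners are considerably more elaborate.

```python
from collections import defaultdict


def naive_map(read, reference):
    """Slide the read along the reference and report every exact match (0-based)."""
    m = len(read)
    return [i for i in range(len(reference) - m + 1) if reference[i:i + m] == read]


def build_kmer_index(reference, k):
    """Hash every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def indexed_map(read, reference, index, k):
    """Look up the first k bases of the read as a seed, then verify each candidate."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "ACGTACGTTAGCACGTTA"  # toy reference, made up for illustration
read = "ACGTTA"
index = build_kmer_index(reference, k=3)
print(naive_map(read, reference))
print(indexed_map(read, reference, index, k=3))
```

Both functions report the same positions here; the indexed version only inspects the places where the seed occurs, which is why it fails as soon as the seed itself contains a mismatch.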
Burrows-Wheeler Transform (BWT)
- BWT is a fundamental algorithm for data compression and read alignment.
- BWT enables efficient storage and searches of large datasets such as genomic sequences.
- Tools like BWA and Bowtie use BWT for aligning millions of short reads to a reference genome rapidly and with limited memory usage.
- As a preprocessing step, BWT groups identical characters into runs, which makes the data easier to compress.
Burrows-Wheeler Transform (BWT) - further details
- A data compression and read alignment algorithm.
- Significantly speeds up genome searching and alignment of short reads to a longer reference.
- Essential for tools like BWA (Burrows-Wheeler Aligner).
Burrows-Wheeler Transform (BWT) - technical description
- Builds every rotation (cyclic permutation) of the string and sorts them, which groups similar characters together so they compress well.
- The sorted rotations form the Burrows-Wheeler Matrix, which closely resembles the suffix array generated from the same sequence.
- The last column of the matrix is the BWT; the ranking of characters in that column provides the basis of compression.
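The construction described above can be sketched in a few lines. This is a naive version that materialises every rotation, so it is only suitable for short teaching strings; it assumes the text ends with a unique `$` terminator that sorts before every other character. The function names are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler Transform: last column of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def suffix_array(text):
    """Start positions of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text):
    """Same transform read off the suffix array: the character before each suffix."""
    return "".join(text[i - 1] for i in suffix_array(text))


text = "ATAATA$"
print(bwt(text))                    # ATTA$AA
print(suffix_array(text))           # [6, 5, 2, 3, 0, 4, 1]
print(bwt_from_suffix_array(text))  # ATTA$AA
```

The two routes give the same string because, with a unique terminator, sorting rotations and sorting suffixes produce the same row order.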
Burrows-Wheeler Transform (BWT) - Last/First Mapping
- Reversing the BWT relies on ranking characters: the i-th occurrence of a character in the last column (L) corresponds to the i-th occurrence of the same character in the first column (F) of the matrix.
- This allows for faster reconstruction of the original sequence.
LF Mapping
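- LF mapping follows the same character-ranking rule as the BWT: each occurrence of a character in the last column is paired with the equally ranked occurrence in the first column.
- Occurrences of the same character keep their relative order in both columns because the rows are sorted by right-context.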
LF Mapping - further details
- Allows re-ranking characters in the matrix for more efficient processing
- Simplifies the underlying mathematics of BWT greatly.
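As an illustration of how LF mapping reverses the transform, the sketch below rebuilds the original string from its last column alone. It assumes a single `$` terminator that sorts before all other characters; the function names are hypothetical.

```python
def lf_mapping(last):
    """For each row i, return the row whose first-column character is the
    same occurrence as last[i]."""
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    # start[ch] = first row of the block of rows beginning with ch
    start, total = {}, 0
    for ch in sorted(counts):
        start[ch] = total
        total += counts[ch]
    seen = {}
    lf = []
    for ch in last:
        rank = seen.get(ch, 0)        # how many times ch appeared above this row
        lf.append(start[ch] + rank)   # i-th ch in L maps to i-th ch in F
        seen[ch] = rank + 1
    return lf


def inverse_bwt(last):
    """Walk the LF mapping from the '$' row, collecting the text right to left."""
    lf = lf_mapping(last)
    row = 0  # row 0 is the rotation that starts with '$'
    chars = []
    for _ in range(len(last) - 1):
        chars.append(last[row])
        row = lf[row]
    return "".join(reversed(chars)) + "$"


print(lf_mapping("ATTA$AA"))   # [1, 5, 6, 2, 0, 3, 4]
print(inverse_bwt("ATTA$AA"))  # ATAATA$
```

Only the last column is stored; the first column is implied because it is simply the same characters in sorted order.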
BWT in Read Mapping
- Transforms the reference genome to be searchable, enabling the use of backward searches.
- Allows for efficient exact matching of short reads.
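A backward search can be sketched as follows. The tables `C` and `occ` are the usual FM-index ingredients, but their layout here (a full occurrence table for every prefix) is a simplifying assumption chosen for readability rather than memory efficiency, and the function names are illustrative. The search starts from the last character of the read and never scans the genome itself.

```python
def bwt(text):
    """Naive BWT via sorted rotations; text must end with a unique '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def build_fm_index(last):
    """C[ch]: number of characters in the text that sort before ch.
    occ[ch][i]: number of occurrences of ch in last[:i]."""
    alphabet = sorted(set(last))
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += last.count(ch)
    occ = {ch: [0] * (len(last) + 1) for ch in alphabet}
    for i, ch in enumerate(last):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return C, occ


def count_matches(pattern, last, C, occ):
    """Count exact occurrences of pattern, processing it right to left."""
    lo, hi = 0, len(last)  # half-open range of matrix rows still matching
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


last = bwt("ATAATA$")
C, occ = build_fm_index(last)
print(count_matches("ATA", last, C, occ))  # 2
print(count_matches("AA", last, C, occ))   # 1
print(count_matches("TT", last, C, occ))   # 0
```

Each step narrows a range of rows using two table lookups, so the work depends on the read length rather than on the size of the reference.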
SAM/BAM Files
- SAM (Sequence Alignment/Map) files store read mapping information, usually tab-delimited.
- BAM (Binary Alignment/Map) files are the compressed binary equivalent of SAM, saving storage space and offering computational efficiencies.
- The header section contains essential metadata (format version, sorting info, read groups, and used programs).
- Each line (after the header) in a SAM file provides information on how a read aligns with the reference genome.
- 'samtools' is a suite of programs for interacting with and extracting data from SAM/BAM files.
BAM File Structure
- QNAME: Unique read name (corresponds to the FASTQ header).
- FLAG: Bitwise flags denoting alignment (example: paired-end, reverse complement).
- RNAME: Reference sequence name (usually a chromosome).
- POS: Position in the reference where alignment begins.
- MAPQ: Mapping quality score (confidence measure).
- CIGAR: Describes details of the alignment (matches, insertions, deletions, etc.).
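The FLAG and CIGAR fields are the two that most often need decoding by hand. The sketch below uses the bit values and operation letters defined in the SAM specification; the helper names and the short labels attached to each bit are illustrative choices.

```python
import re

# Bit values follow the SAM specification; the labels are shorthand.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """List the properties whose bits are set in the FLAG integer."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(cigar):
    """Reference bases covered by the alignment: M, D, N, = and X consume reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(99))
print(parse_cigar("8M2I4M1D3M"))
print(reference_span("8M2I4M1D3M"))  # 16
```

Note that `M` means an aligned position, which may be a match or a mismatch, so a CIGAR string alone does not reveal how many bases are identical to the reference.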
BAM File Structure - further details
- Additional fields (RNEXT, PNEXT) describe the mate, that is, the next read, in a paired-end sequencing experiment.
- Also included are the template length (TLEN, the fragment length), the actual read sequence (SEQ), and the quality scores for the sequence (QUAL).
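Putting the fields together, a single alignment line can be split into its eleven mandatory tab-separated columns. This is a minimal sketch: the `SamRecord` type and `parse_sam_line` helper are hypothetical names, optional tags after the eleventh column are ignored, and the example record is a toy line used only for illustration. For real work a dedicated library or samtools is the safer choice.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    q, f, r, p, m, c, rn, pn, t, s, ql = fields[:11]
    return SamRecord(q, int(f), r, int(p), int(m), c, rn, int(pn), int(t), s, ql)


# Toy record for illustration only.
line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
record = parse_sam_line(line)
print(record.qname, record.rname, record.pos, record.mapq, record.cigar)
```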
BAM File Header
- Includes header information (format version, details of reference sequences such as chromosomes, read groups, and the programs used).
SAM/BAM Example
- Shows formatted data in a SAM/BAM file.
Mapping Quality (MAPQ)
- MAPQ is a metric representing the quality of read mapping given a reference genome.
- Higher scores indicate greater certainty about the mapping of the read to a specific location.
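In the SAM specification, MAPQ is defined as -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer. The sketch below converts in both directions. The cap of 60 is an assumption modelled on the highest score listed in these notes; the maximum value actually reported differs between aligners.

```python
import math


def mapq_from_error_probability(p_wrong, cap=60):
    """MAPQ = -10 * log10(P(mapping position is wrong)), rounded and capped."""
    if p_wrong <= 0:
        return cap  # a zero error probability would give an infinite score
    return min(cap, round(-10 * math.log10(p_wrong)))


def error_probability_from_mapq(mapq):
    """Invert the formula: P(wrong) = 10 ** (-MAPQ / 10)."""
    return 10 ** (-mapq / 10)


print(mapq_from_error_probability(0.01))   # 20
print(mapq_from_error_probability(0.001))  # 30
print(error_probability_from_mapq(20))
```

On this scale a score of 20 corresponds to a 1 in 100 chance that the read is placed incorrectly, and every additional 10 points divides that chance by ten.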
Mapping Quality (MAPQ) - factors impacting score
- Repeat regions in the genome make mapping more complex, leading to lower values.
- Shorter reads are more likely to match several regions equally well, which decreases certainty.
- Sequencing errors can cause incorrect mappings and decrease the certainty of matches.
- Genome complexity (many paralogous genes or conserved regions) introduces potential multimapping, decreasing mapping certainty.
Interpreting MAPQ Scores
- MAPQ = 0: Indicates a read maps equally well to several locations (usually discarded).
- MAPQ < 20: High chance of misalignment.
- MAPQ between 20 and 40: Moderate confidence, reads compete with multiple possible locations.
- MAPQ > 40: High confidence.
- MAPQ = 60: Very high confidence, a unique genomic location.
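The bands above can be turned into a small lookup for labelling or filtering reads. How the boundary values 20 and 40 are assigned is an assumption, since the notes do not say which band they belong to, and the function names and the default threshold of 20 are illustrative.

```python
def interpret_mapq(mapq):
    """Map a MAPQ score to the confidence bands listed in the notes."""
    if mapq == 0:
        return "maps equally well to several locations"
    if mapq < 20:
        return "high chance of misalignment"
    if mapq <= 40:
        return "moderate confidence"
    if mapq < 60:
        return "high confidence"
    return "very high confidence"


def filter_by_mapq(scores, minimum=20):
    """Keep only the scores at or above the chosen threshold."""
    return [s for s in scores if s >= minimum]


for score in (0, 10, 30, 50, 60):
    print(score, interpret_mapq(score))
print(filter_by_mapq([0, 10, 30, 50, 60]))
```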
Factors Influencing MAPQ
- Repeat Regions: Reads mapping to repetitive genomic sequences tend to have lower scores due to ambiguous mapping.
- Read Length: Longer reads are more specific, so they tend to map with higher certainty and receive higher MAPQ scores.
- Sequencing Errors: Inaccurate bases decrease the overall read certainty, thus reducing the MAPQ.
- Genome Complexity: Complex genomes (e.g. many paralogs, conserved regions) produce more multimapping, so MAPQ scores tend to be lower.
Description
This quiz covers the essential concepts of DNA read mapping in bioinformatics, as outlined in Lecture 6 of BIOTECH 4B13. It includes definitions, applications, algorithms, and practical file interpretations relevant to read mapping. Explore the challenges and techniques involved in aligning DNA sequences for genomic studies.