Lecture 6 - DNA Read Mapping PDF

Lecture 6 – DNA Read Mapping BIOTECH 4BI3 - Bioinformatics Where are we going? DNA Sequencing DNA DNA Read Sequencing Quality Assembly Mapping...

Lecture 6 – DNA Read Mapping BIOTECH 4BI3 - Bioinformatics Where are we going? DNA Sequencing DNA DNA Read Sequencing Quality Assembly Mapping Control Genome Expression Annotation Analysis Marker-Trait Population Polymorphis Genotyping Associations Analysis m Discover Learning Outcomes Define what read mapping is and some of its applications Limitations of read mapping and common issues Explain 4 algorithms used in read mapping software Understand the contents of a SAM/BAM file and its purpose Speak to mapping quality, what influences it, and how to interpret it What is Read Mapping? Researchers are often limited to a few quality reference genomes A reference genome is (hopefully) a highly contiguous representative genome for a particular species The reference genome should have high quality annotation in addition to chromosome-scale assembly and scaffolding Read Mapping is the process of taking a collection of DNA sequencing reads (typically short read technology) and for each of the reads finding the best match for the read in the reference genome. Not all reads can find an acceptable match against the reference genome. This can be because of sequencing errors or significant divergence between the individual from which the reads were generated the individual used to create What is Read Mapping https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/ tutorial.html Read Mapping Example Milne I, et al. Tablet--next generation sequence assembly visualization. Bioinformatics. 2010 Feb Why do we want to map reads? Constructing a reference quality genome is expensive in terms of human effort and consumables Read mapping allows us to compare the genome of one or more individuals against a reference genome This can be done for a number purposes To identify polymorphisms among individuals To generate count data for gene expression analysis For exploration of the structure of the genome with applications like Hi-C and ATAC-Seq Genome interaction studies with applications like ChIP-Seq Identify regions and patterns of epigenetic modification of the genome Read Mapping Considerations A single read should hopefully be assigned a single ‘best match’ in the reference genome. This can be considered where in the genome the read originated from. The quality of a read mapping depends on the quality of the reference genome, the quality of the re-sequenced DNA/RNA, and the relatedness among the samples. It is often the case that >20% of reads do not ‘map’ to a reference genome. Why? Low quality DNA sequencing resulting in error-prone reads Low quality library prep - > perhaps a contamination Significant divergence between the re-sequenced sample and reference genome. The re-sequenced genome may have genome regions that are unique to it Duplication in the genome. We are typically interested in read Duplication in the genome A particular genome sequence may occur more than once in the reference genome. It is possible that a DNA sequencing read may have equally good mapping to multiple positions in the genome. This becomes more an issue with short sequencing reads and complex polyploidy genomes Image from Dr. Jianhua Ruan, University of Texas at San Antonio Read Mapping Algorithms A naïve approach to read mapping could be to perform a Needleman-Wunsch or Smith-Waterman alignment between the read and the genome This would be computationally very expensive as you’d need to compare the read to all positions in the genome. The best SW implementations have O(mn) compute complexity (not very good). The original algorithm was O(m 2n) A better way to perform read alignment would be to index the reference genome and use that to identify genome locations that have the potential to have a high quality read mapping with a given read We likely wouldn’t use the whole read for identifying potential genome locations. We would use sub-sequences. A genome position that shares a number of sub-sequences with a read could a potential match. This has some similarity to how BLAST uses shared words between the query and the database to focus in on similar sequences. In this way we are eliminating all of the places not to look for read mapping Read Mapping Algorithms There are a number of approaches used by read mapping software to find the position in the genome a read may map to. Most of them avoid comparing the sequencing read to all locations of the genome There are four main paradigms for mapping reads to a reference genome 1. Hash Table: Fast but requires perfect matches. Not space efficient at O(mn+N) 2. Array structures: Able to accommodate mismatches but not as fast as a hash table. Space requirements are O(mN) 3. Smith Waterman: Reads can have indels and mismatches relative to the reference. It will always provide the mathematically best solution. SLOW O(mnN) 4. Burrows-Wheeler Transform/FM Index: FAST. Memory efficient at O(m+N) Burrows-Wheeler Transform The Burrows-Wheeler Transform (BWT) is a fundamental algorithm widely used in data compression and read alignment. It is particularly well-known for enabling the efficient storage and searching of large datasets, including genomic sequences. In bioinformatics, BWT plays a crucial role in tools like BWA (Burrows-Wheeler Aligner) and Bowtie, making it possible to align millions of short reads to a reference genome quickly and with low memory usage. The Burrows-Wheeler Transform is not a compression algorithm by itself, but a preprocessing step that makes data easier to compress. It rearranges the characters in a string to form clusters of similar characters, which compression algorithms can then handle more efficiently. Burrows-Wheeler Transform A rotation permutation of string, S, that was developed in the 1980’s for string compression Burrows-Wheeler Matrix ATAATA$ $ATAATA TAATA$A Sort A$ATAAT Rotation columns s AATA$AT AATA$AT ATAATA$ BWT(S) = ATTA$AA ATA$ATA ATA$ATA TA$ATAA ATAATA$ A$ATAAT TA$ATAA $ATAATA TAATA$A  The BWT(S) tends to group the characters. This is the basis of compression BWT The Burrows-Wheeler Matrix (BWM) has a striking similarity to the suffix array generated using the same sequence $ATAATA 6$ A$ATAAT 5A$ AATA$AT 2AATA$ ATA$ATA 3ATA$ ATAATA$ 0ATAATA$ TA$ATAA 4TA$ TAATA$A 1TAATA$  The BWT are the characters in the last column of the BWM BWT – Last/First Mapping To reverse the BWT we rely on a property of the BWM involving the ranking of the characters The first ‘A’ in the left column and the first ‘A’ in the last column represent the same character. F L This holds true for all character positions. $ ATAAT A This is called T-ranking 3 A $ATAA T 3 1 S = A0T0A1A2T1A3$ A1 A T A $ A T0 A2 T A $ A T A1 A0 T A A T A $ T1 A $ A T A A2 T0 A A T A $ A0 LF Mapping LF Mapping: The ith occurrence of a character, c, in L and the ith occurrence of c in F correspond to the same occurrence in S Because of this we can rank the characters however we like, the order stays the same in F and L Order is preserved because occurrences of c both F and L are sorted by right-context. LF Mapping Let’s re-rank the characters in BWT(S) $ A T A A T A3 $ A T A A T A0 A3 $ A T A A T1 A0 $ A T A A T0 A1 A T A $ A T0 A1 A T A $ A T1 A2 T A $ A T A1 A2 T A $ A T A1 Ascending rank A0 T A A T A $ A3 T A A T A $ T1 A $ A T A A2 T0 A $ A T A A2 T0 A A T A $ A0 T1 A A T A $ A3 This greatly simplifies BWT(S) BWT Using this structure we can F L reconstruct the original sequence $ A0 In fact, we don’t even NEED the A0 T0 first column A1 T1 Step: A2 A1 1. Start at $ and look to L to see the character that precedes it. S = A0$ A3 $ 2. Go to the row starting with A0 and look to L to see the character T0 A2 that precedes it. S = T0A0$ T1 A3 2. T0 is preceded by A2, A2 by A1, A1 by T1, T1 by A3 and A3 by $ S = A3T1A1A2T0A0$ BWT in Read Mapping Burrows-Wheeler Transform enables fast alignment by transforming the reference genome into a more searchable format. The basic idea is to reduce the problem of searching for a sequence in the reference genome to a series of backward searches over the BWT. Backward Search: The backward search allows aligners like BWA and Bowtie to efficiently find exact matches of short reads in the reference genome. Instead of scanning the entire genome, the BWT allows the mapper to trace the alignment backward, starting from the last character of the read and extending toward the first. Exact Matching Using BWT: When you perform an exact match, the backward search checks one character at a time (starting from the last character of the read) and narrows down the possible locations in the reference where the read could match. SAM/BAM Files The Sequence Alignment/Map (SAM) file is a standardized way to store information about reads mapping to a reference The SAM is tab-delimited and in a format users can understand. The BAM file is a compressed binary format that saves space and offers some computational efficiencies. The SAM file starts with the ‘header’ section which contains information about the read mapping such as the command used to generate the SAM file and the format version Each line of the SAM file (after the header) contains information about how a read aligned/mapped to the reference genome SAM files contain the read information as well as the mapping data The software ‘samtools’ is a collection of programs used to interact and extract information from a SAM file BAM File Structure QNAME (Query Name): The name of the read or query, which typically corresponds to the FASTQ file header. FLAG: A bitwise flag that conveys alignment information. You can show an example of how to interpret the FLAG field: Example Interpretation: A FLAG of 83 means: Read is mapped in reverse complement (bit 0x10 set). Read is part of a paired-end read (bit 0x1 set). The read is the second in the pair (bit 0x80 set). RNAME: The reference sequence name where the read is aligned, usually the chromosome. POS: The 1-based position on the reference sequence where the alignment starts. MAPQ (Mapping Quality): A score that reflects the confidence in the alignment. Higher values indicate more reliable alignments. CIGAR (Compact Idiosyncratic Gapped Alignment Report): Describes how the read aligns BAM File Structure CIGAR (Compact Idiosyncratic Gapped Alignment Report): Describes how the read aligns with the reference, showing matches, insertions, deletions, and clipping. M: Match I: Insertion D: Deletion S: Soft clipping (part of the read was not aligned) H: Hard clipping (part of the read is excluded from alignment) RNEXT and PNEXT: These fields provide information about the next read in a paired-end alignment, helping to identify the mate in the pair. TLEN (Template Length): The observed length of the template (DNA fragment) sequenced. SEQ: The actual sequence of the read. QUAL: The quality score of the sequence, indicating the confidence in each base call. BAM File Header The BAM file header contains metadata about the alignment, including: @HD (Header): Contains format version and sorting information. @SQ (Reference Sequences): Lists reference sequences (chromosomes) with lengths and other details. @RG (Read Group): Information about the read group, which is useful for keeping track of different samples or batches. @PG (Program): Lists the programs used during SAM/BAM Example Mapping Quality (MAPQ) Mapping quality (MAPQ) is a key metric in genomic analysis that indicates the confidence level of a read being mapped to a specific location in the reference genome. It is especially important for distinguishing between accurately and inaccurately mapped reads in regions where sequences are similar, repetitive, or ambiguous. The higher the MAPQ score, the more confident the algorithm is that the read is mapped correctly. Mapping Quality (MAPQ) Definition: The MAPQ score is a numerical value assigned to each read to represent how confident the read aligner is that the read is mapped to the correct location in the reference genome. Range: Most tools assign MAPQ scores on a scale of 0 to 60, where: A MAPQ score of 0 means the read could not be mapped confidently, or it is mapped to multiple locations with equal probability. A MAPQ score of 60 means the read is mapped with a very high degree of confidence, usually indicating that this is the only possible location. Mapping Quality (MAPQ) Mapping quality is derived from several factors, and the method used to calculate MAPQ depends on the read aligner. Typically, MAPQ scores are computed as follows: Base Formula: MAPQ is often calculated based on the likelihood that a read maps to a given location versus mapping to another equally probable location. MAPQ = -10 x log10(P) where P is the probability that the read is mapped to an incorrect position. Uniqueness of the Mapping: The more unique the mapping location, the higher the MAPQ score. If a read maps to multiple locations, the probability that it’s incorrectly mapped increases, resulting in a lower MAPQ score. Mapping Quality (MAPQ) Alignment Score: The overall alignment score (how well the read aligns to the reference) also contributes to the MAPQ score. If there are many mismatches, indels, or gaps in the alignment, the MAPQ score is lower. Read Length: Longer reads usually lead to higher MAPQ scores because they are less likely to map to multiple locations due to their increased specificity. Ambiguity in Mapping: If the read maps to multiple locations equally well (i.e., multimapping), the MAPQ score is typically set to 0, indicating uncertainty. Interpreting MAPQ Scores MAPQ = 0: The read maps equally well to multiple locations. This usually indicates ambiguity, and such reads are often discarded in downstream analyses. MAPQ < 20: The read may have mapped to the correct location, but the probability of misalignment is still significant. MAPQ between 20 and 40: There is a moderate confidence that the read is correctly mapped, but it may be competing with other possible locations. MAPQ > 40: High confidence in the mapping, with a very low probability of incorrect alignment. MAPQ = 60: The highest score, representing extremely high confidence in a unique mapping. Factors Influencing MAPQ 1. Repeat Regions: Reads that map to repetitive regions of the genome (e.g., centromeres, telomeres, or transposons) are more likely to receive lower MAPQ scores due to the ambiguity of mapping. 2. Read Length: Longer reads typically have higher MAPQ scores since they are more likely to map uniquely to a specific genomic location. 3. Sequencing Errors: If a read contains sequencing errors (e.g., incorrect base calls), it might reduce the overall alignment score, which in turn can lower the MAPQ score. 4. Genome Complexity: In complex genomes with many paralogous genes or conserved regions, the probability of multimapping increases, leading to lower MAPQ scores.

Lecture 6 - DNA Read Mapping PDF

Document Details

Tags

Related

Summary

Full Transcript