Lecture 5 - DNA Assembly, BIOTECH 4BI3
McMaster University
This document is a presentation on DNA assembly, including the Lander-Waterman equation, sequencing coverage examples, and various assembly strategies.
Lecture 5 – DNA Assembly
BIOTECH 4BI3 - Bioinformatics

Where are we going?
[Course pipeline figure: DNA Sequencing → Read Quality Control → DNA Assembly → Mapping → Genome Annotation → Expression Analysis → Genotyping → Polymorphism Discovery → Population Analysis → Marker-Trait Associations]

Learning Objectives
- Be able to describe what a genome assembly is and why it is done
- Discuss the different algorithms commonly used for assembling DNA sequencing reads
- Understand what 'coverage' is and how it impacts an assembly
- Discuss why genome assemblies are fragmented and what strategies can be used to help with this
- Describe what a DNA minimizer is and what role minimizers play in modern DNA assembly

What is Genome Assembly?
When we sequence a genome we do not have the technical ability to sequence a chromosome from end to end. Instead, the genome's chromosomes are fragmented into many smaller pieces, and then some subset of those pieces is sequenced. Conceptually we are sequencing a single genome, but practically we are fragmenting and sequencing millions of copies of identical genomes. The more sequencing we do, the higher the likelihood we will sequence the whole genome. We use the Lander-Waterman equation to help determine how much sequencing should be done to achieve our goal.

Lander-Waterman Equation
An equation developed to help estimate the depth of sequencing coverage required to sequence a genome to a minimum level of confidence:

C = LN / G

- C = the genome coverage; the average number of times you would expect to sequence any particular base
- L = the length of the sequencing reads used
- N = the number of sequence reads generated
- G = the haploid genome size of the organism being sequenced

Sequencing any particular base follows a Poisson probability distribution:

P(Y = y) = (C^y × e^(−C)) / y!
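As a minimal numeric sketch, both formulas can be evaluated in a few lines of Python; the read length, read count, and genome size below are hypothetical values chosen only for illustration:

```python
import math

def coverage(read_length, num_reads, genome_size):
    """Lander-Waterman equation: C = L*N/G."""
    return read_length * num_reads / genome_size

def p_depth(c, y):
    """Poisson probability that a base is sequenced exactly y times at coverage c."""
    return (c ** y) * math.exp(-c) / math.factorial(y)

# Hypothetical run: 30 million reads of 150 bp against a 3 Gb genome
c = coverage(150, 30_000_000, 3_000_000_000)
print(c)                  # 1.5 (i.e., 1.5X coverage)

# At 10X coverage, the probability a base is sequenced exactly once
print(p_depth(10, 1))     # ~0.000454
```

The second value matches the worked example in this lecture (P(Y = 1) ≈ 0.00045 at 10X coverage).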
- y = the number of times the base was sequenced
- C = the genome coverage

Sequencing Coverage Example
Probability of sequencing a base at a particular level of coverage. Assume coverage is 10X; what is the probability of sequencing a base only once?

P(Y = y) = (C^y × e^(−C)) / y!
P(Y = 1) = (10^1 × e^(−10)) / 1!
P(Y = 1) = 0.00045

The probability of sequencing a base 3 times or less is P(Y ≤ 3) = P(0) + P(1) + P(2) + P(3) ≈ 0.01.

String Graph
An overlap edge from read a to read b is redundant when there are edges a→c and c→b in such a way that the overlap of a and b is implied by the overlaps with c. Transitive reduction removes these redundant edges from the graph, making it easier to resolve repeats. This leaves only the information required to solve the assembly and moves this to an Eulerian path problem, which is easier to solve efficiently.

Idealized String Graph
(Myers EW. The fragment assembly string graph. Bioinformatics. 2005 Sep 1;21)

String Graph Continued
Once reduced, the nodes with an in-degree and out-degree of one are compressed into compound edges. These represent unique stretches and are called 'unitigs'. Repeat edges can also be compressed to have multiple in/out-degrees.

Example String Graph
(Myers EW. The fragment assembly string graph. Bioinformatics. 2005 Sep 1;21)

De Bruijn Graph Assembly
- The nodes represent sequences of a fixed size (k-mers)
- Edges represent reads that exist containing the adjacent k-mers
- The distance between any adjacent k-mers is 1 nucleotide, where the (k−1) suffix of node 1 is the (k−1) prefix of node 2
- Traversal through the graph generates the sequence with no alignment necessary
- Because nodes represent k-mers and not reads, this assembly paradigm makes assembly of NGS reads possible
- Travel each edge once while visiting each node possibly multiple times (an Eulerian path)

De Bruijn Graphs and K-mer Size Impact
Too small (low k value):
- Over-connectivity: a small k-mer size results in more shared k-mers between non-adjacent sequences, leading to many false overlaps.
- Collapsing of repeats: repetitive sequences are more likely to share the same small k-mers, causing repeats to collapse and misrepresenting the genome structure.
- More ambiguity: a higher chance of ambiguous paths in the graph, making it harder to resolve unique genome regions.

Too large (high k value):
- Fewer overlaps: as k increases, fewer k-mers will overlap between reads due to sequencing errors and random mutations, leading to fragmentation of the graph.
- Lost data: larger k-mers require perfect matches between sequences, so even small sequencing errors or mutations can prevent overlaps, disconnecting parts of the assembly.
- Insufficient coverage: high k-mer values may result in insufficient overlap, especially in low-coverage regions, which can lead to fragmented assemblies.

De Bruijn Graph Assembly
[Figure: a de Bruijn graph built from the 4-mers ATAG, TAGA, AGAC, GACC, ACCC, CCCA, CCAG, CAGA, ACCT, with single-nucleotide edge labels. Sequence: ATAGACCCAGACCTA]

DBG Advantages
1. The number of nodes is a function of the k-mer size, not the amount of sequence
2. No sequence-to-sequence overlap has to be calculated, as the reads are encoded in the DBG. This results in substantial computational savings
3. There is no construction of a final sequence from the aligned reads; the final sequence is generated by traversing the graph

Why Are Assemblies Fragmented?
- Weaknesses of the sequencing technology
- The complexity of the genome
The major issue is how repeats are represented and resolved during assembly. In a DBG, repeats result in fragmentation because the path through the graph is ambiguous. This can be mitigated through a larger k-mer size, but there is a practical limit to the k-mer size because of the error rate of the sequencing technology: if set too high, few reads will share identical k-mers.

Hamiltonian vs Eulerian
Hamiltonian – pass through each node exactly once; may not pass through each edge.
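As a minimal sketch of the de Bruijn approach, the example sequence ATAGACCCAGACCTA from this lecture can be cut into 4-mers, assembled into a graph whose nodes are 3-mers, and then spelled back out by an Eulerian walk (here via Hierholzer's algorithm). In this toy case the AGAC/GACC repeat is short enough that the walk still recovers the original sequence:

```python
from collections import defaultdict

def build_dbg(seq, k):
    """De Bruijn graph: nodes are (k-1)-mers; each k-mer occurrence is one edge."""
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i+k]
        graph[kmer[:-1]].append(kmer[1:])   # prefix node -> suffix node
    return graph

def eulerian_path(graph, start):
    """Hierholzer's algorithm: traverse every edge exactly once."""
    graph = {u: list(vs) for u, vs in graph.items()}  # local copy we can consume
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph.get(u):
            stack.append(graph[u].pop())    # follow an unused edge
        else:
            path.append(stack.pop())        # dead end: record and backtrack
    return path[::-1]

seq = "ATAGACCCAGACCTA"
k = 4
nodes = eulerian_path(build_dbg(seq, k), seq[:k-1])
assembled = nodes[0] + "".join(n[-1] for n in nodes[1:])
print(assembled)   # -> ATAGACCCAGACCTA
```

Note that in general an Eulerian path is not unique: a longer repeat would make several spellings of the graph valid, which is exactly the fragmentation/ambiguity problem described above.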
It is computationally very difficult to determine whether there is a path through the graph that visits each node exactly once.
Eulerian – travel each edge once without repeating, but a node can be visited more than once. Efficient algorithms exist to determine whether it is possible to travel each edge in the graph once.

Phased Assembly
A phased assembly is one that separates the maternally and paternally inherited chromosomes into haplotypes. As sequencing technologies allow us to achieve longer and longer reads, phasing genomes becomes easier. There are two approaches to phasing:
- Assembly-based phasing – generates a de novo assembly for each set of chromosomes
- Alignment-based phasing – uses a reference genome to identify heterozygous positions and determine which are physically associated on a chromosome
Using trio sequencing data (maternal, paternal, offspring) can help with phasing; this is common in animal species. Using genetic mapping populations can help with phasing; this is common in plant species.

Genome Assembly
Modern assemblies of genomes have 4 core steps:
1. Data pre-processing
2. Contig construction
3. Scaffolding
4. Gap closure
Most assemblies perform these steps with different pieces of specialized software.

The Perfect DNA Assembly?
The ability to perfectly reconstruct a genome from DNA sequencing data is not possible given today's technology. In 2013 Gene Myers proposed the following theorem over Twitter:
"Thm: Perfect assembly possible iff a) errors random b) sampling is Poisson c) reads long enough 2 solve repeats.
Note e-rate not needed."

a) Errors Random
Work by Churchill and Waterman in 1992 implies that as you add more coverage to a genome, the error rate of the assembly moves to zero. This is NOT true when the sequencing technology has reproducible (non-random) errors → there is a ceiling on the quality of an assembly. Random errors allow one to achieve an arbitrarily accurate consensus by sampling enough sequence.

b) Sampling is Poisson
A 1988 paper from Lander and Waterman implies "...for any minimum target coverage level k, there exists a level of sequencing coverage c that guarantees that every region of the underlying genome is covered k times." This means that if you sequence enough from a Poisson sampling distribution, you will eventually cover every base.

c) Reads Long Enough to Solve Repeats
If our read length is longer than any repeat in the genome then, with high enough quality, it is possible to resolve the number and position of repeats in a genome assembly.

Can PacBio Achieve Perfection?
Myers believes it is possible, but a few things need to be addressed:
- Diploids and polyploids add complications, but these are addressable
- We need an assembler that can deal with assembling long, noisy reads in a time- and space-efficient manner
- Improving error rates would lower the cost of performing an assembly

Assembly with Long Reads
The use of long-read sequencing technologies in DNA assemblies is the focus of much research. There are three general approaches:
1. Direct – use only SMS reads and a single step of read overlap to generate the assembly
2. Hybrid – use SMS reads to map the structure of the genome but use high-quality reads for accurate base calling
3.
Hierarchical – use only SMS reads, but use multiple rounds of alignment and correction of the input sequence before the final assembly

Challenges of SMS de novo Assembly
- K-mer based assembly methods are difficult to use because of the high error rates, but become possible as quality improves
- OLC is now the preferred method as error rates fall, resulting in improved assembly times
- The first whole-genome assemblies took >500,000 CPU hours to complete (CELERA assembler)
- Recent assemblers cut this time by orders of magnitude
- Many software choices – DALIGNER, CANU, and HiFiasm are popular choices
- We are seeing sub-specialization (error correction, overlap detection, assembly graph layout)

DNA Minimizers
Determining whether an overlap exists between sequences is a key step in OLC and string graph assembly approaches. One approach is to see if reads share seed sequences (k-mers) and, if they do, use these seeds as anchoring points for sequence alignment. The concept of using 'minimizers' as a lower-memory method to determine if sequences overlap was first presented in 2004. It took about 10 years before minimizers began to see use, as de Bruijn graph assemblers dominated the space. With the arrival of long-read technologies, minimizer approaches began to appear in software.

DNA Minimizers
Minimizers are a set of subsequences (k-mers) that are selected from a read to represent the read. Instead of keeping track of all of the k-mers for a sequence, one only keeps a small subset. Reads that share minimizers likely have overlapping regions and should be aligned. Methods like the BWT-FM k-mer search function can quickly find where minimizers are located in sequences to facilitate the seed alignment and extension process. The concept of minimizers evolved to include methods to predict whether error-prone sequences had overlapping regions (MinHash).

DNA Minimizers: the (5,3) Minimizer
- 5 adjacent k-mers of length three are identified
- The k-mer that sorts first (numerically/alphabetically) is selected as the minimizer for that stretch
Sets of
minimizers instead of all of the k-mers are used to predict overlap between sequences (Roberts et al., 2004).

DNA Minimizer Example (Roberts et al., 2004)
[Figure: a (4,3)-minimizer is shown – choosing the smallest 3-mer from every 4 adjacent 3-mers. The minimizer set is 032, 012, 123, 101]

Minimizer DBG
A recent publication illustrates how the application of existing ideas can be brought together to create new methods of analysis. As the fidelity of long-read sequencing technologies improves, new techniques for assembly and genome comparison are being developed. Ekim et al. (2021) combined DNA minimizers and de Bruijn graphs for rapid, computationally efficient assembly of human-sized genomes in a fraction of the usual time. Their approach can be used for assembly, error correction, and pan-genome assembly.

Minimizer DBG
The input reads must have an error rate
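The (w,k)-minimizer scheme described above can be sketched in a few lines: from every window of w adjacent k-mers, keep only the lexicographically smallest one. The two reads below are hypothetical, chosen to share a region so that they share minimizers:

```python
def minimizers(seq, w, k):
    """(w,k)-minimizers: the smallest k-mer (lexicographically) from each
    window of w adjacent k-mers, as in Roberts et al. (2004)."""
    kmers = [seq[i:i+k] for i in range(len(seq) - k + 1)]
    selected = set()
    for i in range(len(kmers) - w + 1):
        selected.add(min(kmers[i:i+w]))   # one representative per window
    return selected

# Hypothetical reads sharing the region CCCAGACCTA
r1 = "ATAGACCCAGACCTA"
r2 = "CCCAGACCTAGGTTA"
m1 = minimizers(r1, 5, 3)
m2 = minimizers(r2, 5, 3)
print(sorted(m1 & m2))   # -> ['ACC', 'AGA']
```

Because adjacent windows usually select the same k-mer, each read is represented by far fewer seeds than its full k-mer set, and only these small sets need to be compared to nominate candidate overlaps – the memory saving the lecture describes.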