Genome Assembly PDF

Genome Assembly
SMBB 4713

De novo whole-genome shotgun assembly

Assembly
Whole-genome “shotgun” sequencing starts by copying and fragmenting the DNA. (“Shotgun” refers to the random fragmentation of the whole genome, as if it were fired from a shotgun.)
Input:
GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT
Output (each copy fragmented at random):
GGCGTCTA TATCTCGG CTCTAGGCCCTC ATTTTTT
GGC GTCTATAT CTCGGCTCTAGGCCCTCA TTTTTT
GGCGTC TATATCTCGGCTCTAGGCCCTCATTTTTT
GGCGTCTAT ATCTCGGCTCTAG GCCCTCA TTTTTT

Assembly
Assume sequencing produces such a large number of fragments that almost all genome positions are covered by many fragments...
From these:
CTAGGCCCTCAATTTTT
CTCTAGGCCCTCAATTTTT
GGCTCTAGGCCCTCATTTTTT
CTCGGCTCTAGCCCCTCATTTT
TATCTCGACTCTAGGCCCTCA
TATCTCGACTCTAGGCC
TCTATATCTCGGCTCTAGG
GGCGTCTATATCTCG
GGCGTCGATATCT
GGCGTCTATATCT
Reconstruct this:
GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT

Assembly
...but we don’t know what came from where.
From these:
CTAGGCCCTCAATTTTT
GGCGTCTATATCT
CTCTAGGCCCTCAATTTTT
TCTATATCTCGGCTCTAGG
GGCTCTAGGCCCTCATTTTTT
CTCGGCTCTAGCCCCTCATTTT
TATCTCGACTCTAGGCCCTCA
GGCGTCGATATCT
TATCTCGACTCTAGGCC
GGCGTCTATATCTCG
Reconstruct this:
GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT

Assembly
Key term: coverage. Usually it is short for average coverage: the average number of reads covering a position in the genome.
Reads (the 10 reads above): 177 nucleotides in total
Genome: 35 nucleotides
Average coverage = 177 / 35 ≈ 5x

Assembly
Coverage could also refer to the number of reads covering a particular position in the genome.
[Figure: the same reads stacked over the genome, with one position marked]
Coverage at this position = 6

Computational tools for assembly
- ABySS
- 3D-DNA
- Celera
- Falcon
- etc.
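The average-coverage arithmetic from the coverage slide can be checked with a short Python sketch. The reads and genome are copied from the slide; the function names are illustrative assumptions, and the (start, end) placements passed to coverage_at are hypothetical, because the read positions drawn on the slide are not recoverable from the text.

```python
# Reads and genome copied from the coverage slide.
READS = [
    "CTAGGCCCTCAATTTTT",
    "CTCTAGGCCCTCAATTTTT",
    "GGCTCTAGGCCCTCATTTTTT",
    "CTCGGCTCTAGCCCCTCATTTT",
    "TATCTCGACTCTAGGCCCTCA",
    "TATCTCGACTCTAGGCC",
    "TCTATATCTCGGCTCTAGG",
    "GGCGTCTATATCTCG",
    "GGCGTCGATATCT",
    "GGCGTCTATATCT",
]
GENOME = "GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT"


def average_coverage(reads, genome_length):
    """Total number of sequenced bases divided by the genome length."""
    return sum(len(read) for read in reads) / genome_length


def coverage_at(position, placements):
    """Count reads whose half-open interval [start, end) contains position."""
    return sum(1 for start, end in placements if start <= position < end)


if __name__ == "__main__":
    total_bases = sum(len(read) for read in READS)
    print(total_bases, len(GENOME))
    print(round(average_coverage(READS, len(GENOME)), 2))
    # Hypothetical placements of three reads on a genome (not from the slide).
    print(coverage_at(5, [(0, 8), (3, 10), (6, 12)]))
```

For the slide's reads this prints 177 and 35, then an average coverage of about 5.06. The last line prints 2, since only the first two hypothetical intervals contain position 5.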
Assembly for Short and Long Reads
Long reads (e.g. PacBio)
- > 10,000 bp
- Higher error rate (5-15%)
- Challenge: overcoming the high error rate
Short reads (e.g. Illumina)
- ~150 bp
- Higher accuracy
- Challenge: assembling large numbers of short reads

Long Reads Assembly Pipeline

Can you identify which of these is a graph?
[Figure: four candidate pictures labelled A to D]
Answer: B
[Figure: a graph with an edge and a vertex (node) labelled]
In mathematics, a graph is a collection of vertices connected by edges.

Pipeline (long reads)
- Overlap: build the overlap graph
- Layout: bundle stretches of the overlap graph into contigs
- Consensus: pick the most likely nucleotide sequence for each contig

Overlaps
Finding all overlaps is like building a directed graph where directed edges connect overlapping nodes (reads).
How to find an overlap: the suffix of one read is similar to the prefix of another. For example:
CTCGGCTCTAGCCCCTCATTTT
   |||||||| ||||||||||
   GGCTCTAGGCCCTCATTTTTT

Finding overlaps by building a graph
Example overlap graph with l = 3, k = 7
[Figure: overlap graph whose nodes are 7-base reads such as GCGTACG, ATTATAT, ATATTGC, TATATTG, CGCCGCT and GCCGCTA; each edge label is the overlap length, from 3 to 6]
l = minimum number of matching bases for an overlap
k = number of bases in each read

Layout
l = 4, k = 7
[Figure: part of the overlap graph for to_every_thing_turn_turn_turn_there_is_a_season; nodes are 7-character reads such as ry_thin, y_thing and _thing_, and edges are labelled with overlap lengths 4, 5 and 6]
The overlap graph is big and messy. Contigs don’t “pop out” at us.

Layout
The picture gets clearer after removing some transitively-inferrible edges.
[Figure: abc -> bcd -> cde, each with overlap 2, plus an edge abc -> cde with overlap 1; the abc -> cde edge can be inferred from the other two and is removed]

Layout
Remove transitively-inferrible edges, starting with edges that skip one node.
[Figure: the overlap graph before and after this removal]

Layout
Even simpler: remove transitively-inferrible edges, starting with edges that skip one or two nodes.
[Figure: the overlap graph after this removal]

Layout
Emit contigs corresponding to the non-branching stretches.
Contig 1: to_every_thing_turn_
Contig 2: turn_there_is_a_season
Unresolvable repeat (the branching region between the two contigs).

Consensus
Take the reads that make up a contig and line them up (- marks a gap):
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
Take the consensus, i.e. a majority vote:
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
At each position, ask: what nucleotide (and/or gap) is here?

Why doesn’t this pipeline work for short reads?
With long reads, we can find long overlaps between reads.
With short reads, we have to use short overlaps (~50-60 bp). These can produce false-positive overlaps because of repeats.

Short reads pipeline
- Error correction
- Graph construction
- Graph cleaning
- Contig assembly
- Scaffolding
- Gap filling

K-mer correction
- Break the reads into k-mers
- Count how many times each k-mer in a read occurs across all the data
- Replace rare k-mers with common k-mers to correct errors
- Many k-mer-based correctors are available (Quake, SGA, SOAPdenovo, etc.)
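The k-mer correction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Quake, SGA or SOAPdenovo implement correction: the function names, k = 4 and the threshold min_count = 2 are all hypothetical choices, and only single-base substitutions are tried. The four reads are taken from the earlier assembly slides.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def correct_read(read, counts, k, min_count):
    """Return the single-base substitution of read with the fewest rare k-mers.

    A k-mer is 'rare' when it occurs fewer than min_count times.
    The read is returned unchanged if no substitution reduces the rare count.
    """
    def n_rare(seq):
        return sum(1 for kmer in kmers(seq, k) if counts[kmer] < min_count)

    best, best_rare = read, n_rare(read)
    if best_rare == 0:
        return read
    for i in range(len(read)):
        for base in "ACGT":
            if base == read[i]:
                continue
            candidate = read[:i] + base + read[i + 1:]
            rare = n_rare(candidate)
            if rare < best_rare:
                best, best_rare = candidate, rare
    return best


if __name__ == "__main__":
    reads = [
        "GGCGTCTATATCTCG",
        "GGCGTCTATATCT",
        "GGCGTCGATATCT",  # differs from the read above at one base
        "TCTATATCTCGGCTCTAGG",
    ]
    counts = count_kmers(reads, k=4)
    print(correct_read("GGCGTCGATATCT", counts, k=4, min_count=2))
```

With these reads, the third read is changed to GGCGTCTATATCT, because the four k-mers spanning its odd base each occur only once. A fixed threshold like this would also flag correct k-mers from thinly covered regions, which is why real correctors estimate the threshold from the data.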
Challenges in K-mer correction
1. Data-Related Challenges:
Coverage Variation:
- Problem: Uneven coverage affects k-mer frequency
- Solutions: Adaptive threshold selection; Local coverage estimation; Variable k-mer sizes
Repeat Regions:
- Problem: Similar k-mers from different regions
- Solutions: Longer k-mer sizes; Paired-end information; Multiple k-mer lengths

Challenges in K-mer correction
2. Platform-Specific Issues:
Illumina: errors are mainly substitutions
- Solutions: Quality score integration; Position-specific correction; Base-specific error models
Ion Torrent: homopolymer errors
- Solutions: Flow signal analysis; Context-specific correction; Modified k-mer counting
PacBio/Oxford Nanopore: high indel rate
- Solutions: Longer k-mers; Modified alignment algorithms; Platform-specific error models

Challenges in K-mer correction
3. Implementation Strategies:
Memory Management:
- Bloom filters
- Compressed data structures
- Disk-based algorithms
Speed Optimization:
- Parallel processing
- GPU acceleration
- Efficient data structures
Quality Control:
- Error rate estimation
- Correction validation
- Performance metrics tracking

Graph Construction (de Bruijn graph)
- Similar to the overlap graph used for long reads
- Use a de Bruijn graph: break the reads into k-mers and connect consecutive k-mers (which overlap by k-1 bases) with an edge

Graph Cleaning
Need to clean graph artefacts (tips) from the graph.
Tip = a small branch that diverges from the main branch
Tips arise from sequencing errors that error correction could not fix.
How to clean it?
>> Trim the branches by removing the sequence

Graph Cleaning
Need to clean graph artefacts (bubbles) from the graph.
Bubble = a branch that diverges from the main path and then rejoins it
Bubbles occur in diploid organisms that have heterozygous genotypes.
How to clean it?
>> Trim the branches by removing the sequence

Graph Cleaning
[Figure: bubble removal; data generated from Illumina short reads, shown before and after bubble removal]

Contig Assembly
Combine all the DNA pieces to form contigs.

Scaffolding
Combine the contigs to form scaffolds.

Gap Filling
Scaffolds contain gaps (“NNNN”).
>> Can use a local assembler to fill these in (e.g. GapCloser from SOAPdenovo)
>> Deploy another sequencing technology to resolve the issue (e.g. PacBio)

Quality of Assemblies
Bacterial Genomes
> Short reads: Hundreds of contigs
> Long reads: Few contigs
Large Genomes
> Short reads: ~10,000 bp contigs
> Long reads: ~1,000,000 bp contigs
Long read data is more expensive
> The right technology depends on your research question and budget

What makes assembly difficult?
- Repetitive sequences
- High heterozygosity
- Low coverage
- Biased sequencing
- High error rate
- Sequencing adapters in the data
- Sample contamination
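The first difficulty in the list, repetitive sequences, can be made concrete with the de Bruijn graph from the graph construction step. The Python sketch below is an illustration under assumptions: the function names and k = 5 are hypothetical, each node is a k-mer with an edge to the k-mer that follows it, and the example sentence from the Layout slides is treated as one error-free read.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            # Consecutive k-mers overlap by k-1 characters.
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def branching_nodes(graph):
    """Return the k-mers that have more than one distinct successor."""
    return sorted(node for node, successors in graph.items()
                  if len(successors) > 1)


if __name__ == "__main__":
    # Example sentence from the Layout slides, used here as a single read.
    text = "to_every_thing_turn_turn_turn_there_is_a_season"
    graph = de_bruijn_graph([text], k=5)
    print(branching_nodes(graph))
    print(sorted(graph["urn_t"]))
```

This prints ['urn_t'] and then ['rn_th', 'rn_tu']: inside the repeated turn_ the graph cannot tell whether another turn_ or there follows, so a contig has to stop at that node.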
