Lecture 2 - Sequence Assembly PDF
Document Details
Uploaded by ConvincingOak
Imperial College London
Derek Huntley
Tags
Summary
This document is lecture notes on sequence assembly. It covers various algorithms and concepts related to the process of assembling biological sequences, including greedy approaches, overlap graphs, and de Bruijn graphs. The notes highlight the challenges of assembling increasingly complex genomes.
Full Transcript
Lecture 2 – Sequence Assembly Scaling Assembly Finding overlapping reads and assembling them becomes disproportionately complex as the number as the number of reads increases. For n reads there are 2n2 – 2n possible overlaps. Assembly Algorithms There are several approaches to sequence assembly,...
Lecture 2 – Sequence Assembly Scaling Assembly Finding overlapping reads and assembling them becomes disproportionately complex as the number as the number of reads increases. For n reads there are 2n2 – 2n possible overlaps. Assembly Algorithms There are several approaches to sequence assembly, mainly depending on the length of the available sequence reads: - Greedy Approach - Overlap Graph (OLC) Common to these algorithms is the use of paired end - De Brujin reads/mate pairs to produce the final assembly product. - String Graoh Greedy Approach This is the simplest assembly method and is used by Phrap. It finds the two sequences with the largest overlap and merges them, and repeats until there is no further assembly possible. The choices made by the assembler are local and do not take into account the global relationship between reads. However, this means that it is limited to simple assemblies due to read lengths and the local assembly method, as it cannot easily use global information such as paired end reads/mate pairs, which can help resolve repetitive genomes. NOTE: Phrap uses the crossmatch program which is a full implementation of the Smith Waterman algorithm. Overlap Graph (Overlap Layout Consensus – OLC) This finds the best match between the su:ix of one read and the prefix of another, with mismatches allowed in overlaps for sequencing errors. Then a filtration method is applied to filter out pairs of fragments that do not share a significantly long common substring. The path is determined through reads to create layout. Then local multiple alignments are created from the overlapping reads, allowing for gaps. Then a consensus is derived from the alignments. Overlap Identification K-mer - k-mers are substrings of length k contained within a biological sequence. K-mers in the read are sorted and indexed. Then identify pairs of reads that share a k-mer. This is extended to the full alignment and discarded if it’s not >95% similar. This technique drastically reduces the search space, as it is not searching at every position and has been widely used. Even with this improvement compared to the greedy approach, the computational requirement to identify all possible overlaps from next-gen short reads is a significant limitation. NOTE: OLC is suitable for Sanger sequencing reads (1kb) and long PacBio reads (tens of Kb). Simple Assembly In traditional Sanger sequencing algorithms, reads were represented as nodes in a graph, and edges represented alignments between reads. Walking along a Hamiltonian cycle by following the edges in numerical order allows one to reconstruct the circular genome by combining alignments between successive reads. At the end of the cycle, the sequence wraps around to the start of the genome. However, this does not scale for the millions of reads from next gen genome sequencing. Assembly Improvements For any genome we can use the same approach to reconstruct it, for assembly ideally, we need: - All k-mers present in the genome to be assembled. - Each k-mer should appear at most once in the genome. The genome can then theoretically be assembled by following the graph through the k-mers, where the larger the genome, the larger the required k-mer. This is the basis of de Bruijn graph assembly. De Bruijn Graphs De Bruijn graphs are computational data structures that enable us to essemble the genome. Reads are split into all possible k-mers, removing redundancy in the reads. Then the Hamiltonian cycle is followed in which each successive node (k- mer) is shifted by one nucleotide. There is only one path that will touch every node once, and this path will resemble the genome. The use of k-mers means that even though an individual k-mer may overlap with more than one other, there is only one overlap that provides a path through the graph that passes through each k-mer (node) only once. Hamiltonian Graph The Hamiltonian graph approach is used by numerous assemblers: SOAPdenovo, SGA, and ABySS. Transversing all nodes once leads to the nondeterministic polynomial time (NP), this is a problem as the number of nodes increases. As the size of the genome increases, the computation time required to solve this graph problem increases infinitely. To compensate for this, assembly programs adjust and simplify the graphs, for example reducing branching nodes. The mention of NP implies that finding a Hamiltonian path in a graph is an NP problem. This means that there is currently no known e=icient algorithm to solve this problem in polynomial time for all cases, and it may require significant computational resources. An alternative approach is to use a Eulerian Path. Eulerian Graph All k-mer prefixes and su:ixes are represented as nodes, and each prefix and su:ix can only occur once in the graph. Edges represent k-mers having particular prefixes and su:ixes. NOTE: k-mer edge ATG has a prefix AT and su:ix TG. Once the Eulerian cycle is performed through the graph it will visit every edge of the graph exactly once to assemble the genome. The Eulerian graph is more e:icient as Hamiltonian requires every node to be traversed, meaning a traversable graph can still have redundant edges. This can lead to dead ends when traversing the graph and all possible routes need to be tested. Eulerian requires every edge to be traversed, meaning that a traversable graph cannot have redundant edges. Thus, there are no dead ends and only one path. In summary, the Eulerian graph approach is considered more e:icient in genome assembly because it focuses on traversing edges (representing overlaps) rather than nodes. This reduces redundancy, eliminates dead ends, and typically results in a more straightforward and computationally e:icient assembly process compared to the Hamiltonian graph approach. Assembly Requirements Hamiltonian or Eulerian have the same requirements in order to assemble a complete genome : - Contains all k-mers in the genome, ensuring the graph is balanced. Number of edges in matches the number of edges out. - This is unlikely to occur. - All k-mers are error free. - Next-gen sequences contain errors. - Each k-mer occurs at most once in the genome. - Problem with repeats, but paired end reads can help overcome this. Assembly programs adapt the method to compensate for these issues e.g. removing branches. Removing branches can lead to low coverage areas in the assembly. Low coverage areas are regions where there is not enough sequencing data to confidently determine the correct sequence. Low coverage areas can result in the assembly program producing multiple contigs (contiguous sequences) instead of a single, continuous sequence. This is because the lack of su:icient data in certain regions may prevent the program from confidently connecting or resolving those areas into a single, complete sequence. Size of k-mer Assembly requires the presence of all k-mers in the genome. Illumina reads are approx. 100-200bp = k- mer of 100+. However, reads will not contain all possible 100-mers present in the genome, however deep the coverage. So, assemblers will break each read into overlapping k-mers e.g. 46 overlapping 55- mers, ensuring that nearly all 55-mers in the genome are detected. The k-mer size can be set when running the assembly so di:erent options can be tried as optimum option depends on the genome sequence. ABySS Genome Assembly Program ABySS assembly is a multistage process with 3 stages: 1. Unitig The initial assembly of sequences using de Bruijn graph approach. 2. Contig Paired-end reads aligned to the unitigs and the paur information is used to orient and merge overlapping unitigs. 3. Sca:old Align mate-pair reads to the contigs to orient and join them into sca:olds. “N” characters are inserted at any gaps in coverage and for unsolved repeats. Unitig Assembly The most resource demanding stage of the de Bruijn graph assembly, including memory requirement. All k-mers from the sequence reads are stored in a hash table. Additional information is also stored: the number of k-mer occurrences in the reads, and the presence or absence of possible neighbour k-mers in the de Bruijn graph. Bloom Filter A bloom filter is a compact data structure for representing a set of elements that support two operations: - Inserting an element into the set. These are k-mers. - Querying for the presence of an element in the set. This is used by ABySS and reduces the memory requirement. The Bloom filter structure consists of a bit vector and one or more hash functions. The hash functions map each k-mer to a corresponding set of positions within the bit vector. This is the bit signature. A k- mer is added to the Bloom filter by setting its bit value to one, and queried by testing if all positions of its bit signature are one. ABySS Assembly 1. K-mers from each input sequencing read are loaded into the Bloom filter by computing the hash values of each k-mer sequence. 2. A path in the de Bruijn graph is traversed by repeatedly querying for possible successor k-mers and advancing to the successor(s) that are found in the Bloom filter. 3. Builds unitig sequences by extending solid reads left and right within the de Bruijn graph. 4. Extension of a solid read is halted when either a branching point or a dead end in the de Bruijn graph is encountered. NOTE: A look-ahead algorithm is employed to detected and ignore short branches caused by Bloom filter false positives and/or recurrent read errors. Filtering To filter out the majority of k-mers caused by sequencing errors, all k-mers with an occurrence count below a user-specified threshold are discarded, the optimum minimum is typically 2-4. Retained k- mers are called solid k-mers, and these are the ones that are used to extend. In the second pass through the reads, those that consist entirely of solid k-mers are extended left and right within the de Bruijn graph to create unitigs. During the read extension phase of assembly, it’s possible for multiple solid reads to result in the same unitig. This is avoided by suing an additional tracking Bloom filter to record k-mers included in previous unitigs. A solid read is only extended if it has at least one k-mer that is not already in the tracking Bloom filter. String Graph Longer reads have enabled return to overlap graph approach. A string graph is obtained from an overlap graph by removing redundancy. 1. First, reads that are substrings of some other read, contained reds, are removed. 2. Second, transitive edges are removed from the graph. 3. The resulting graph, called a string graph, shares many properties with the de Bruijn graph without the need to break the reads into k-mers. FM Index Theoretical work on e:iciently constructing the string graph using the FM index led to memory-e:icient assemblers for large genomes. The FM index is based on the Burrows-Wheeler transform and the su:ix array. Resequencing Assembly Resequencing is a simpler problem to resolve as there is a reference sequence to assemble against but mapping requires e:icient indexing. Identified Unknown Sequence Identified Applications of Mapping - Genome Assembly - RNA splicing studies - Gene expression studies - SNP discovery - Methylation studies - Transcription factor binding site discovery