Genome Assembly Flashcards PDF

Document Details

Uploaded by UnaffectedElbaite

Università degli Studi Suor Orsola Benincasa - Napoli

Ola Żyto

Tags

genome assembly flashcards bioinformatics biology

Summary

These flashcards cover genome assembly, including sequencing technologies such as Illumina, Oxford Nanopore, and PacBio. They also address quality scores and the high error rates associated with certain sequencing techniques.

Full Transcript

23/10/24, 15:02 Genome assembly Flashcards by Ola Żyto | Brainscape

https://www.brainscape.com/flashcards/genome-assembly-12674193/packs/21207531

1
Q Why do we sequence?
A We are still sequencing new genomes. The target can also be a new individual, a DNA-protein interaction, or a metagenomic sample.
- Sequence a new genome (no previous version)
- Sequence new individuals: how do they differ from the reference?
- Sequence a population: look at variation across the population
- Sequence tumour cells and compare to 'normal' tissue: where are the cancer mutations? Possibly as a time course
- Sequence transcripts: survey gene-space, also relative quantification by tissue / time / condition
- Sequence as a read-out to identify DNA-protein interactions (e.g. chromatin immunoprecipitation)
- Metagenomic sequencing of mixed-organism, co-habiting populations: genome fragments, transcripts or rRNAs to identify identity and relative abundance

2
Q What are the next-gen sequencing technologies?
A
- Illumina
- Oxford Nanopore
- PacBio

3
Q How do you get high quality in Illumina?
A The reads are short, but the volume of reads you can get through is very large

4
Q What is the length of the reads in PacBio?
A Shorter than Nanopore but longer than Illumina

5
Q How do you deal with high error rates in PacBio?
A PacBio has a very high error rate. To solve that you sequence multiple times; because the errors are random, you can align the sequences and obtain high accuracy

6
Q What are quality scores particularly important for?
A If you are trying to find SNPs, you need the quality score to tell whether you have a sequencing error or an actual variation

7
Q What do we need quality scores for?
A
- Quality scores are assigned to estimate confidence in a given base call
- Phred scores: aim for a quality score of 30 or higher
- The quality scores are used for filtering and trimming of reads
- Also used for assembly
- Base quality scores are essential for variant calling, to distinguish a true variant from a sequencing error

8
Q Where does the quality deteriorate?
A Quality deteriorates towards the ends of reads

9
Q What do AT and GC content do?
A High AT or GC content reduces complexity and can lead to higher error rates

10
Q What is the formula for QV?
A The quality value (QV) is related to the base call error probability by the formula QV = -10 x log10(Pe), where Pe is the probability that the base call is an error. For example, QV 30 corresponds to Pe = 0.001

11
Q What is base calling?
A In Illumina, base-calling algorithms turn raw intensities into A, T, C, G or N base calls

12
Q What is the Chastity filter?
A The usual method for base calling in Illumina systems is known as the Chastity filter. It calls a base if the highest intensity divided by the sum of the highest and second-highest intensities is no less than a threshold (usually 0.6). Otherwise the base is marked as N

13
Q What is the FASTQ format?
A The standard output format for next-gen sequencing; all the programs now rely on that format

14
Q What do they use for quality scores in FASTQ?
A They use ASCII characters for quality scores, so each base character is paired with exactly one quality character

15
Q Describe the standard output
A 4 lines per sequence:
- Line 1 begins with the @ character, followed by a sequence ID and an optional description
- Line 2 is the sequence
- Line 3 begins with the + character, optionally followed by the same sequence ID and description
- Line 4 encodes the quality values for the sequence letters in line 2 and must contain the same number of characters

16
Q What is depth of coverage useful for?
A Sequencing errors are eliminated by the depth of coverage of overlapping sequence fragments

17
Q What was the depth of coverage in the Human Genome Project?
A For the Human Genome Project, most of the genome was sequenced at 12x or greater coverage, i.e. each base was present in 12 reads on average. Even with 12x coverage, approximately 1% of the genome was not accurately assembled

18
Q Describe paired-end sequencing
A
- You sequence from both ends, so you get two reads per fragment
- The reads are shorter than the fragment
- This tells you how far apart the two reads are in the genome

19
Q What do we do with repeats in paired-end sequencing?
A
- It is quite tricky to assemble a genome that contains repeats, because you can't tell which copy of the repeat a read came from
- To solve that, you anchor the reads using other sequences that overlap with them
- If one read is unmappable because it falls in a very repetitive region but the other is unique, you can use the distance information to map both reads
- One read can be mapped and the second can then be positioned within the repeat
- With enough paired-end reads the entire repeat can be mapped
- With large repeats (LINEs etc.) paired ends won't be able to map the entire repeat

20
Q Describe mate-pair sequencing
A
- Mate pairs are similar to paired ends but the insert size is much greater
- Paired ends are a few hundred bp apart, but mate pairs are kilobases apart
- DNA is fragmented into 2-5 kb fragments and the ends are repaired with biotin-labelled dNTPs
- The fragments are then circularised and fragmented again
- Biotin-labelled fragments are captured, adapters are added, and the fragments are sequenced from both ends; as with paired-end reads, the distance between reads is known

21
Q What do you need for scaffolding?
A
- contig
- scaffold

22
Q What are contigs?
A
- Contiguous sequence where the base order is known
- Assembled from sequence reads

23
Q What are scaffolds?
A
- Genome sequence reconstructed from contigs and gaps
- Gaps are where reads (paired end or mate pairs, depending on gap length) from the two sequenced ends of at least one fragment overlap with other reads in two different contigs
- The approximate length of the fragments is known, so the number of bases between contigs can be estimated

24
Q What is de novo sequencing?
A
- The genome is sequenced and assembled for the first time, so there is no reference
- When the human genome was first sequenced, it was de novo
- De novo is the more difficult and challenging of the two methods
- De novo projects may use multiple technologies to sequence the full genome

25
Q What is reference sequencing?
A
- The genome has already been sequenced, so a reference is available
- For subsequent re-sequencing the reference can be used as a scaffold for the assembly

26
Q How many overlaps do you get per n reads?
A For n reads there are 2n^2 - 2n possible overlaps

27
Q Describe the Greedy Approach - Phrap
A
- The simplest assembly method
- Finds the two sequences with the largest overlap and merges them
- Repeats until no further assembly is possible
- The choices made by the assembler are local and do not take into account the global relationship between reads
- Limited to simple assemblies due to read lengths and the local assembly method
- Cannot easily use global information such as paired-end reads / mate pairs, which help resolve repetitive genomes
- Phrap uses the cross_match program, which is a full implementation of the Smith-Waterman algorithm

28
Q Describe the Overlap Graph (OLC - Overlap Layout Consensus)
A
- Find the best match between the suffix of one read and the prefix of another
- Mismatches are allowed in overlaps, to account for sequencing errors
- Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
- Determine a path through the reads to create the layout
- Create local multiple alignments from the overlapping reads
- The consensus is derived from the alignments

29
Q What is a k-mer?
A K-mers: all the possible substrings of length k that are contained in a string

30
Q How do you identify overlap?
A
- Sort all k-mers in the reads (k is typically 16-24 bases) and index them
- K-mers: all the possible substrings of length k that are contained in a string
- Identify pairs of reads that share a k-mer
- Extend to a full alignment and discard if not >95% similar
- This technique drastically reduces the search space and has been widely used
- Even with this improvement, the computational requirement to identify all possible overlaps from next-gen short reads is a significant limitation
- OLC is suitable for Sanger sequencing reads (1 kb) and long PacBio reads (up to a few tens of kilobases)

31
Q Describe simple assembly
A
- With Sanger sequencing, reads are represented as nodes in a graph and edges represent alignments
- Following a Hamiltonian cycle, the genome can be constructed by concatenating each read
- Note this forms a circular genome
- A Hamiltonian cycle visits all nodes (reads) once only and returns to the start position
- However, this does not scale to the millions of reads from next-gen genome sequencing

32
Q How do k-mers improve assembly?
A
- For any genome we can use the same approach to reconstruct it
- For assembly we ideally need all k-mers present in the genome to be assembled
- Each k-mer should appear at most once in the genome
- The genome can then theoretically be assembled by following the graph through the k-mers
- The larger the genome, the larger the required k-mer
- This is the basis of de Bruijn graph assembly

33
Q de Bruijn Graph
A
- Split reads into all possible k-mers; this removes redundancy in the reads
- Follow a Hamiltonian cycle in which each successive node (k-mer) is shifted by one nucleotide
- Use of k-mers means that even though an individual k-mer may overlap with more than one other, there is only one overlap that provides a path through the graph that passes through each k-mer only once

34
Q What is a Hamiltonian graph?
A
- The Hamiltonian graph approach is used by numerous assemblers: SOAPdenovo, SGA and ABySS among others
- Finding a path that visits every node exactly once is a nondeterministic polynomial time (NP)-complete problem
- As the size of the genome (and so the number of nodes) increases, the computation time required to solve the graph problem grows prohibitively fast
- To compensate for this, assembly programs adjust and simplify the graph, for example by reducing branching nodes
- An alternative approach used by other assemblers (Velvet, EULER, SPAdes etc.) is to use an Eulerian path. This scales better to larger genomes

35
Q Eulerian Graph:
A
- All k-mer prefixes and suffixes are represented as nodes
- Each prefix and suffix can occur only once in the graph.
  (Note they will be much larger than 2 nucleotides in a full genome assembly graph)
- Edges represent k-mers having particular prefixes and suffixes
- The k-mer edge ATG has prefix AT and suffix TG
- Perform an Eulerian cycle through the graph, which visits every edge of the graph exactly once

36
Q Assembly requirements
A Hamiltonian and Eulerian approaches have the same requirements in order to assemble a complete genome. A path through the graph that visits each edge once is possible if:
- The graph contains all k-mers in the genome (unlikely to occur). This ensures the graph is balanced: in a directed graph, the number of edges into a node is the same as the number out
- All k-mers are error free (next-gen sequences contain errors)
- Each k-mer occurs at most once in the genome (a problem with repeats, but paired-end reads help to overcome this)
Assembly programs adapt the method to compensate for these issues, e.g. by removing branches
Low coverage areas will lead to multiple contigs
The final stage of assembly is scaffolding, using paired-end reads to join contigs

37
Q What is the significance of k-mer size in genome assembly?
A
- Assembly requires the presence of all (or nearly all) k-mers in the genome
- Illumina reads are approximately 100-200+ bp, which would suggest a k-mer of 100+
- Reads will not contain all possible 100-mers etc. present in the genome, however deep the coverage
- Assemblers break each read into overlapping k-mers, e.g.
  46 overlapping 55-mers (for a 100 bp read)
- This ensures that nearly all 55-mers in the genome are detected
- The k-mer size can be set when running the assembly, so different options can be tried, as the optimum depends on the genome sequence

38
Q What are the stages of ABySS?
A
- Unitig: the initial assembly of sequences using a de Bruijn graph approach
- Contig: paired-end reads are aligned to the unitigs and the pair information is used to orient and merge overlapping unitigs
- Scaffold: mate-pair reads are aligned to the contigs to orient and join them into scaffolds. "N" characters are inserted at any gaps in coverage and for unresolved repeats

39
Q Describe the unitig stage of assembly
A
- The most resource-demanding stage of the de Bruijn assembly, including memory requirement
- All k-mers from the sequence reads are stored in a hash table
- Additional information for each k-mer is also stored:
  - Number of k-mer occurrences in the reads
  - Presence or absence of possible neighbour k-mers in the de Bruijn graph

40
Q What is a Bloom filter?
A A Bloom filter is a compact data structure for representing a set of elements that supports two operations: (1) inserting an element into the set.
These are the k-mers (2) querying for the presence of an element in the set
- Used by ABySS; reduces the memory requirement
- The Bloom filter structure consists of a bit vector and one or more hash functions
- The hash functions map each k-mer to a corresponding set of positions within the bit vector: its bit signature
- A k-mer is added to the Bloom filter by setting every position of its bit signature to one
- A k-mer is queried by testing whether all positions of its bit signature are one

41
Q Describe the filtering process of k-mers
A
- To filter out the majority of k-mers caused by sequencing errors, all k-mers with an occurrence count below a user-specified threshold are discarded
- The optimum minimum is typically 2-4
- Retained k-mers are called solid k-mers
- In the second pass through the reads, those that consist entirely of solid k-mers (solid reads) are extended left and right within the de Bruijn graph to create unitigs
- During the read extension phase of assembly it is possible for multiple solid reads to result in the same unitig
- This is avoided by using an additional tracking Bloom filter to record k-mers included in previous unitigs
- A solid read is only extended if it has at least one k-mer that is not already in the tracking Bloom filter

42
Q What does the string graph give us?
A
- Longer reads have enabled a return to the overlap graph approach
- The string graph uses the same methodology as the overlap graph, but simplified
- First, contained reads (reads that are substrings of some other read) are removed
- The resulting graph, called a string graph, shares many properties with the de Bruijn graph without the need to break the reads into k-mers

43
Q What is the FM index?
A Theoretical work on efficiently constructing the string graph using the FM index led to memory-efficient assemblers for large genomes.
The FM index is based on the Burrows-Wheeler transform and the suffix array
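
Worked example for card 43. The card states that the FM index is based on the Burrows-Wheeler transform (BWT) and the suffix array. The Python sketch below illustrates that relationship on a toy string; it is not how any particular assembler implements it. The function names (suffix_array, build_fm_index, count_occurrences), the "$" end-of-text sentinel and the naive sort-based construction are all assumptions made for illustration, and real tools use far more compact structures.

```python
def suffix_array(text):
    """Start positions of all suffixes in sorted order (naive sort, demo only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_fm_index(sequence):
    """Build a toy FM index: the BWT plus the two lookup tables used for search.

    Assumes '$' does not occur in `sequence`; it is appended as a sentinel
    that sorts before A, C, G and T.
    """
    text = sequence + "$"
    sa = suffix_array(text)
    # BWT: the character that precedes each sorted suffix.
    # For the suffix starting at 0, text[-1] wraps around to the sentinel.
    bwt = "".join(text[i - 1] for i in sa)

    alphabet = sorted(set(text))
    # c_table[c]: number of characters in text that sort strictly before c.
    c_table = {}
    total = 0
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return bwt, c_table, occ


def count_occurrences(pattern, bwt, c_table, occ):
    """Backward search: count how often `pattern` occurs in the indexed text."""
    lo, hi = 0, len(bwt)  # half-open range of rows in the sorted suffix list
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    bwt, c_table, occ = build_fm_index("ACGTACGTAC")
    print("BWT:", bwt)
    print("occurrences of AC:", count_occurrences("AC", bwt, c_table, occ))
```

The search never touches the original text: each pattern character only narrows a range of rows using the two tables, which is why the approach stays memory-efficient when the tables are stored compactly.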

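
Worked example for cards 33, 35 and 36. Those cards describe splitting reads into k-mers, treating each k-mer as an edge from its prefix to its suffix, and walking every edge exactly once. The Python sketch below shows that idea under the idealised requirements of card 36: error-free reads, every k-mer of the genome present, and no k-mer repeated. The function names, the three toy reads and the choice k = 4 are assumptions made for illustration, and the path search uses Hierholzer's algorithm as one common way to find an Eulerian path; real assemblers add error correction and graph simplification on top.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes it leads to.

    Every distinct k-mer contributes exactly one edge, so duplicate k-mers
    from overlapping reads are collapsed (the redundancy removal of card 33).
    """
    distinct = set()
    for read in reads:
        distinct.update(kmers(read, k))
    graph = defaultdict(list)
    for kmer in sorted(distinct):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes an Eulerian path or cycle exists."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    nodes = set(graph) | set(in_degree)
    # A linear genome starts at the node with one more outgoing than incoming edge.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_degree[n] == 1),
        None,
    )
    if start is None:  # balanced graph (circular genome): any node will do
        start = min(graph)

    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell out the sequence along the Eulerian path through the graph."""
    path = eulerian_path(de_bruijn_edges(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
    print(assemble(reads, 4))  # ATGGCGTGCA
```

With k = 3 the same reads give a graph in which the nodes TG and GC each have two outgoing edges, so more than one Eulerian path exists and the reconstruction is ambiguous. That is the repeat problem card 36 mentions, and why a larger k (or paired-end information) is needed.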