Illumina vocab:

Adapter: Known primer sequences added to the ends of DNA fragments in the Illumina workflow.
Barcode: Enables simultaneous sequencing of multiple samples on Illumina. Short DNA sequences added to individual fragments that act as a molecular tag for tracking each fragment. Contained in the adapters during amplification.
Flow cell: Glass slide with oligos that attach to the adapters.
Cluster: Formed after bridge amplification; multiple copies of the same DNA strand on the flow cell. Color determines the base each cycle. Each cluster generates one single-end read or two paired-end reads.
Cycle: In Illumina sequencing, one round of imaging in which the base is identified from the color of each cluster on the flow cell.

Base calling from raw data:
The identity of the base is interpreted from the color of a cluster (a "spot") on the flow cell during each cycle.
The certainty of the base call is determined by how distinct and clear the spot is.

FASTQ files:
Raw images are processed into FASTQ files.
The format is a text file with four lines per read.
There can be millions of reads per text file.
Line 1: Starts with the @ character, followed by the sequence identifier for the read (also referred to as the name).
Line 2: The nucleotide sequence for that read.
Line 3: Starts with the + character and is optionally followed by the read's sequence identifier. You can often ignore this line.
Line 4: Has the quality values in ASCII characters for the read, in the same order as the bases in line 2.

Phred quality score:
A measure of the quality of the identification of the nucleobases generated by automated DNA sequencing.
Can be used to compare the efficacy of different sequencing methods.
A score of 30 is a common target: it corresponds to a 0.10% probability (1 in 1,000) that the base call is wrong.
Higher Phred values indicate greater confidence in a sequenced base.
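The four-line FASTQ layout and the Phred score can be illustrated with a short Python sketch. It assumes the common Phred+33 ASCII encoding (an assumption; the notes do not say which offset is used) and uses the standard relation P = 10^(-Q/10). The names `FastqRead`, `parse_fastq`, and `phred_to_error_prob` are made up for this example.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRead(NamedTuple):
    name: str             # line 1 without the leading "@"
    sequence: str         # line 2
    qualities: List[int]  # line 4 decoded to Phred scores


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def parse_fastq(handle, offset: int = 33) -> Iterator[FastqRead]:
    """Yield one FastqRead per four-line record.

    offset=33 assumes Phred+33 encoding (hypothetical default for this sketch).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the parser
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the four-line FASTQ layout")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        scores = [ord(ch) - offset for ch in quality]
        yield FastqRead(header[1:], sequence, scores)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for read in parse_fastq(example):
        print(read.name, read.sequence, read.qualities)
    print(phred_to_error_prob(30))
```

In this toy record the characters `I`, `5`, and `!` decode to Phred 40, 20, and 0 under the assumed offset, so the last base of the read would be a natural candidate for trimming.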
The quality score of a base, also known as a Phred or Q score, is an integer value representing the estimated probability that the base call is incorrect.
Figure: ASCII quality characters in left-to-right increasing order of quality.

Assessing read quality:
Generate quality reports: distribution of read length and quality score.
Trim/remove reads based on chosen criteria:
Does downstream software use quality data?
How sensitive is your analysis to errors?
Shorter sequences to trim: barcodes, adapters, primers, fragments of vectors.
For shorter sequences: for <15 bp, BLAST can fail to find matches. Use blastn-short, or a simple match script, to trim reads.
Longer sequences to remove: vectors, host bacteria, contamination from pathogens, mtDNA, chloroplasts.
For longer reads: search/map reads against a database of contaminating sequences and REMOVE them.
In some cases, you should REMOVE DUPLICATE sequences that arise during the PCR stage of DNA library prep.

Applications for instruments:

Sanger: older-style sequencer
Sequencing of specific gene segments in 10s-100s of samples
Microsatellite genotyping (forensics, population structure)

Illumina (NGS): shorter reads, lots of sequence, higher accuracy
De novo genome assembly
Genotyping by sequencing (RADseq, target selection, amplicons): RADseq is a method for SNP detection in genomes; it identifies polymorphic variants adjacent to restriction enzyme digestion sites. It can also be used in association mapping, genetic mapping, and estimation of allele frequencies.
Copy number variation/structural differences: CNV describes a molecular phenomenon in which sequences of the genome are repeated, and the number of repeats varies between individuals of the same species.
Transcriptome and differential expression: The transcriptome reflects the expressed part of an organism's genome. It is the set of RNA molecules (mRNA, tRNA, rRNA, and other noncoding RNA molecules) that are present in cells.
Differential expression: the process by which different genes are activated in a cell, giving that cell a specific purpose that defines its function.
Metagenomes: Metagenomics is the study of the structure and function of all the nucleotide sequences isolated and analyzed from all the organisms in a bulk sample.
Barcoding

PacBio (NGS): longer reads, less sequence, higher error rates
De novo bacterial/viral genome assembly
BAC sequencing: A BAC (bacterial artificial chromosome) is an engineered DNA molecule used to clone DNA sequences in bacterial cells.
Re-sequencing to improve de novo assembly of eukaryotic genomes

Approach to de novo genome assembly:
De novo sequencing refers to sequencing a novel genome where there is no reference sequence available for alignment. Sequence reads are assembled as contigs, and the coverage quality of de novo sequence data depends on the size and continuity of the contigs (i.e., the number of gaps in the data).
The chromosome is cut into small 400-500 bp pieces.
Reconstructing the sequence from short reads: overlap reads to reconstruct the sequence of the chromosomal segment. Reads from the same part of the chromosome align together.
Read depth, or coverage, needs to be built on accurate contigs.
Calculating coverage (C): the likelihood that you will not have any reads for a specific site is $P(\text{site not covered}) = e^{-C}$.

String graph: In de novo assembly, the graph whose overlapping path is followed to construct the full sequence contig.
Nodes are reads.
Edges are overlaps.
Weights are lengths of the non-overlapping prefix.
Find an overlapping path to get the full sequence contig.
A string graph is constructed by adding an edge for every pair of overlapping reads. A single node corresponds to each read, and reaching that node while traversing the graph is equivalent to reading all the bases up to the end of the read corresponding to the node.

Sequence obtained from consensus: In de novo assembly, the sequence is obtained from the contigs. Each consensus base is obtained by weighted voting, which incorporates quality scores.
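Quality-weighted voting for a consensus base can be sketched as follows. This is a minimal illustration, not how any particular assembler does it: it assumes each read contributes its Phred score as the weight of its vote, and the function names are hypothetical.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick the base with the largest summed quality at one position.

    column: non-empty list of (base, phred_quality) pairs, one per read
    covering this position. Using the raw Phred score as the vote weight
    is an assumption made for this sketch.
    """
    votes = defaultdict(int)
    for base, quality in column:
        votes[base] += quality
    # sorted() makes ties resolve alphabetically, so the result is deterministic
    return max(sorted(votes), key=votes.get)


def consensus_sequence(columns):
    """Build a consensus string from a list of per-position columns."""
    return "".join(consensus_base(column) for column in columns)


if __name__ == "__main__":
    columns = [
        [("A", 30), ("A", 20)],
        [("C", 10), ("T", 40)],
        [("A", 10), ("A", 10), ("G", 35)],  # two weak A calls vs one strong G
    ]
    print(consensus_sequence(columns))
```

The third column shows why the weighting matters: a simple majority vote would call A, while the quality-weighted vote calls G because the single G has more total support (35) than the two low-quality A calls (20).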
For example, let's say you have a heterogeneous tumor and you want to confirm that the short reads you obtained are accurate. You can design primers for that section, amplify the region by PCR, and then sequence it on a different platform. If you see the same sequence again, you can confirm it.

Assembly and coverage: low coverage vs. high coverage

Ways to assess how good your assembly is:
% error: compare with a subset of loci from a known reference.
Number of contigs: the more contigs, the more fragmented the assembly.
Length of contigs and scaffolds: N50 and max.
N50 contigs: If you line up ALL contigs from LONGEST to SHORTEST and add them up until you have 50% of the total assembly length, the N50 is the length of the shortest contig in that set.

Collapsed contig: Reads from different parts of the chromosome that have the same sequence are mapped/aligned together in one location of the contig, merging the reads up to the potential repeat boundaries.

Resolving collapsed contigs: Repeats cause contigs that occur in multiple paths. There are 2 possible solutions:
1. Sequence through the repeat: if the read length is longer than the repeat, the conflicting node is removed from the graph. SINEs: 100-700 bp. LINEs: 7,000 bp.
2. Determine the distance between unique contigs: try to see how far apart the two unique sequences are from each other.

Resolving repeats with mate-pair DNA libraries: Fragment the DNA (spanning contigs 1 and 2), circularize the fragments so that the ends join together, fragment again, then select and sequence the pieces that contain the ligated ends.

Assembly using short reads: combine standard libraries at high coverage and large mate-pair libraries at low coverage to assemble larger contigs.

Comparisons between assembly programs: When comparing assembly programs, consider speed, error, the number of contigs, and contig length.
Velvet: presented here as the best assembly program. It constructs whole-genome sequences from short DNA sequence reads, such as those generated by NGS platforms.
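The N50 definition given in these notes translates directly into a few lines of Python. The sketch below follows that definition literally (sort longest to shortest, accumulate until half the total length is reached, report the last contig added); the function name `n50` and the example lengths are illustrative.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    Sort contigs from longest to shortest, add lengths until the running
    total reaches at least 50% of the total assembly length, and return
    the length of the last (shortest) contig that was added.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


if __name__ == "__main__":
    # Contig lengths in Mb: total is 100, and 40 + 30 = 70 passes the halfway mark
    print(n50([5, 25, 40, 30]))  # prints 30
```

Because N50 depends only on the length distribution, it says nothing about whether contigs are correct, which is why the notes list % error against a known reference as a separate check.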
A genome assembly program that is the fastest and one of the most accurate short-read assemblers.

Genome re-sequencing:
Goal: To get data on genetic variation by mapping reads to a known reference sequence: SNPs, INDELs, structural variants.
Structural variation: The distance and orientation between paired reads provide information on structure.

Bridge amplification: On the Illumina flow cell, bridge amplification copies each DNA fragment many times so that it is abundant enough to be sequenced. Each cluster is where the sequence is called, and each cluster comes from a single DNA molecule.

Coverage: C is the mean number of aligned reads that stack at a particular site in the reference sequence.

If you have a genome assembly with contigs of the following lengths, what's the N50?
Contig 1: 5 Mb
Contig 2: 25 Mb
Contig 3: 40 Mb
Contig 4: 30 Mb
Add up the lengths of all contigs: 100 Mb. Then start adding from the longest: 40 + 30 = 70, which already passes half of the total (50 Mb). The N50 is 30 Mb.

What are the two best ways you can generate large scaffolds in genomes with many repeats?
Sequence mate-pair libraries.
Sequence with a method that generates very long reads.

Whole genome vs. enrichment:
Whole genome sequencing is straightforward: you sequence the entire genome.
In enrichment, a subset of genes or regions of the genome is isolated and sequenced.
Target enrichment works by capturing genomic regions of interest by hybridization to target-specific biotinylated probes, which are then isolated by magnetic pulldown.
For most studies, you need to obtain allele frequency information. Sequencing entire genomes is often not necessary, so the goal is to sequence the SAME regions with INFORMATIVE VARIANTS in many individuals.
The cost of whole genome sequencing for populations of many organisms is falling.
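The coverage C and the probability that a site is not covered, e^(-C), can be tied together in a small Python sketch. It assumes reads land uniformly at random (a Poisson model) and estimates C as reads × read length / genome size; that estimate is an assumption added for the example, since the notes define C only as the mean number of reads stacking at a site.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Estimate C as total sequenced bases divided by genome size (assumed model)."""
    return num_reads * read_length / genome_size


def prob_site_not_covered(coverage):
    """Probability that a given site has zero reads: e^(-C)."""
    return math.exp(-coverage)


def expected_uncovered_sites(coverage, genome_size):
    """Expected number of sites with no reads across the whole genome."""
    return genome_size * prob_site_not_covered(coverage)


if __name__ == "__main__":
    # Hypothetical run: 1 million reads of 100 bp on a 10 Mb genome
    c = mean_coverage(1_000_000, 100, 10_000_000)
    print(c, prob_site_not_covered(c))
    for coverage in (0.5, 1, 5, 30):
        print(coverage, round(prob_site_not_covered(coverage), 6))
```

Under this model a site is missed about 61% of the time at 0.5x and about 37% of the time at 1x, while at 30x the chance is negligible, which is one way to see why low-coverage data from a single individual leaves many sites without reads.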
Primary cost-saving approaches for genome variation at the population level:
Low-coverage whole genome sequencing of individuals: 0.5x-5x genome coverage of 10s of individuals. Can be done by pooling them into 1 library.
Reduced-representation sequencing: covers a smaller proportion of the genome.
Targeted enrichment: 30x-300x coverage of <1% of the genome in 10-50 individuals.
Restriction enzyme digestion and sequencing: >100x coverage of 0.1% of the genome in 100s of individuals.

Things to consider when it comes to sequencing:
Capture amount: How much of the genome do you target?
Specificity: How many reads will map to the targets?
Variability: How will coverage vary across targets?
Reproducibility: How will data from different library preps and sequencing runs compare?
Cost and ease: How many samples can you sequence with your resources?
Input DNA: How much DNA do you need vs. have?

Calculating coverage (a key parameter for successful data generation):
Step 1: Add all the base pairs of the segments that you want to sequence.
Step 2: Multiply the coverage you want by the total segment length.
Step 3: Multiply by the number of samples you want.
Step 4: Identify a platform that gives you roughly that much data.

Advantages of low/ultra-low coverage sequencing:
You have data for the entire genome for multiple individuals.
You have a greater chance of detecting all the alleles.

Genome sequencing of pooled DNA:
This is very efficient if your analysis only requires comparing populations.
Combine 10-100 samples into one library pool.
Sequence at 0.5X per sample.
Get population allele frequency information: the number of reads per allele is proportional to its frequency.
Can detect rare alleles, but sequencing error is a bigger issue.
Advantages: Save time and money on library preps. Can sequence more samples.
Disadvantage: No individual information.

Targeted enrichment:

PCR: a form of targeted enrichment
PCR is used to amplify target regions in all samples.
3 general strategies:
Uniplex PCR
10-20 kb long-range PCR of region
Multiplex PCR
Separate PCRs for each locus, 10-400 bp

Hybridization: another form of targeted enrichment
Solid platform: Well-established approach developed for microarrays.
Hybridization of DNA to oligonucleotide probes.
Requires specialized equipment and is limited to roughly 24 samples.
Needs a lot of DNA.

Identification of the mutation causing Miller syndrome:
Exome capture via hybridization to a microarray: 27.9 Mb, spanning 160,000 exons.
2 affected siblings and 2 affected unrelated individuals, compared with human HapMap.
Filter for functional rare variants found in affected individuals that fit the inheritance pattern.
Identified DHODH as a gene carrying the causal mutation.

Restriction enzyme prepared libraries (also called RADseq):
Digest DNA with a restriction enzyme, which cuts chromosomal segments at specific sites.
Ligate adapters and barcodes to the restriction sites.
Sequence 100-150 bp adjacent to the restriction sites.
Results in reads generated for the same sections, randomly distributed in the genome.
Then map reads to the reference genome, get consensus sequences, and genotype variable SNPs at each locus ("RAD tag").
For whole genome coverage, the goal is to genotype 30-40k SNPs.
Steps:
1. Cut extracted DNA with restriction enzymes.
2. Ligate adapters and barcodes.
3. Pool samples, size select, PCR amplify, purify.
4. Sequence on Illumina, single end or paired end.
5. Map reads to the reference and genotype. Reads stacking on 1 cut site are one locus.

How can you identify a region with phenotypic effect?
Calculate diversity estimates on sliding windows across chromosomal regions.
Look for low, or high, levels of diversity:
Heterozygosity
Nucleotide diversity
Private alleles
Regions of linkage disequilibrium (haplotype blocks)
Divergence between populations (Fst)
Identify structural rearrangements/CNVs
Selective sweeps: A selective sweep refers to the rapid increase in frequency of a specific genetic variant (allele) due to positive natural selection.
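Calculating a diversity estimate on sliding windows can be sketched in Python as below. This is a simplified illustration: it uses expected heterozygosity 2p(1-p) at biallelic SNPs as the diversity measure and averages it over the SNPs in each window. The function names, window size, and SNP data are all hypothetical.

```python
def expected_heterozygosity(allele_freq):
    """Expected heterozygosity 2p(1-p) for a biallelic SNP with allele frequency p."""
    return 2 * allele_freq * (1 - allele_freq)


def sliding_window_diversity(snps, chrom_length, window_size, step):
    """Average heterozygosity of SNPs in overlapping windows along a chromosome.

    snps: list of (position, allele_frequency) pairs.
    Returns a list of (window_start, mean_heterozygosity); the mean is None
    for windows that contain no SNPs.
    """
    results = []
    start = 0
    while start < chrom_length:
        end = start + window_size
        values = [
            expected_heterozygosity(freq)
            for position, freq in snps
            if start <= position < end
        ]
        mean_value = sum(values) / len(values) if values else None
        results.append((start, mean_value))
        start += step
    return results


if __name__ == "__main__":
    # Hypothetical SNP positions and allele frequencies on a 2,000 bp segment
    snps = [(100, 0.5), (150, 0.5), (1200, 0.1), (1900, 0.0)]
    for window_start, diversity in sliding_window_diversity(snps, 2000, 1000, 500):
        print(window_start, diversity)
```

Windows whose diversity is unusually low or high compared with the rest of the chromosome are the candidates to examine further, for example as possible selective sweeps.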
SbfI and ApeKI: Both are restriction enzymes that cut DNA at specific sites.
GPR158: Allele related to metabolism in bobcats; reduced energy expenditure and affects obesity; more variable in the North.
LECT2: Allele related to bobcat metabolism; abdominal fat storage and lipid metabolism; completely different in North vs. South bobcats.
TRPM: Allele related to noxious heat detection in bobcats; variable between North vs. South.
Figure: Schematic difference between low-coverage whole genome sequencing and targeted enrichment.
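The difference between low-coverage whole genome sequencing and targeted enrichment can also be put in numbers with the four-step coverage calculation from these notes (sum the target lengths, multiply by coverage, multiply by samples, pick a platform). The sketch below is illustrative only: the genome size, sample counts, and platform yields are hypothetical values, not figures from the notes or from any vendor.

```python
def required_bases(segment_lengths, coverage, num_samples):
    """Steps 1-3: total bases of sequence data needed for a project."""
    total_target = sum(segment_lengths)   # step 1: add up the target segments
    per_sample = total_target * coverage  # step 2: multiply by desired coverage
    return per_sample * num_samples       # step 3: multiply by number of samples


def smallest_sufficient_platform(needed_bases, platform_yields):
    """Step 4: pick the lowest-yield platform that still delivers enough data.

    platform_yields: dict mapping a platform name to bases per run.
    Returns None if no platform in the dict is large enough.
    """
    candidates = [
        (yield_bases, name)
        for name, yield_bases in platform_yields.items()
        if yield_bases >= needed_bases
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    genome_size = 2_500_000_000  # hypothetical 2.5 Gb genome
    # Hypothetical run yields, in bases
    platforms = {"small_run": 15_000_000_000, "large_run": 120_000_000_000}

    # Low-coverage whole genome: 2x coverage of 20 individuals
    low_cov = required_bases([genome_size], 2, 20)
    # Targeted enrichment: 100x coverage of 1% of the genome in 30 individuals
    targeted = required_bases([genome_size // 100], 100, 30)

    print(low_cov, smallest_sufficient_platform(low_cov, platforms))
    print(targeted, smallest_sufficient_platform(targeted, platforms))
```

With these made-up numbers the two designs need a similar amount of raw data (100 Gb vs. 75 Gb), but they spend it differently: shallow reads across the whole genome versus deep reads over a small fraction of it.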