Lecture on Genomic Data Sets
Document Details

Uploaded by QuietBohrium4494
Trent University
Summary
This document provides an overview of genomic datasets, including whole genome sequencing and the technologies used. Topics covered include Illumina, Ion Torrent, and different types of genetic markers like SNPs and RFLP. The document also examines concepts like Probability of Identity and discusses various applications within the field of genomics.
Full Transcript
What is a “genomic” data set?
Defined as high-throughput sampling of the genome
Whole genome sequencing
○ Provides information on recombination
○ Provides information on methylation

Know the pros and cons and key differences between the HTS/NGS technologies/platforms (e.g. sequence length)
Roche 454
○ Titanium, ~500 bp
Illumina
○ HiSeq2000, ~150 bp but improving
○ Pro: high accuracy with good coverage
○ Con: long run time with phasing difficulties
ABI SOLiD sequencing
○ Ligation based (100 bp)
Pacific Biosciences
○ Single molecule real-time (SMRT) sequencing
○ Pros: very long reads can help resolve ambiguities; no DNA amplification required; comparatively faster turnaround time
○ Cons: expensive; higher error rate
Ion Torrent
○ Ion semiconductor chips (~100-200 bp)

Probability of Identity
Def: Likelihood of two people having the same profile (genotype)
PI = sum of each genotype probability squared
○ Ex.

Important concepts

Protein electrophoresis & allozyme markers
Early protein electrophoresis studies revealed that the extent of genetic variation is much higher than previously thought
Proteins move in the electric field (in gel medium)
Relative speed (distance) depends on the charge, size, and shape of the protein

RFLP
What are they
○ Restriction Fragment Length Polymorphisms
○ Use restriction enzymes that cut at specific sequences, on average once every 4^n bp, where n is the length of the motif
○ How they were discovered: detected by hybridizing radioactively labelled probes to DNA transferred from a gel to a filter (“Southern blotting”)
Restriction enzyme examples
○ BamHI
○ HpaI
○ KpnI
Method
○ Restriction enzymes recognize sequence motifs and cut DNA at the motif
Applications
○ Experimental design
○ Genome coverage
○ Parentage using Southern blotting gels
○ REs can be leveraged for downstream applications

AFLP (and RAPDs)
What are they
○ Random Amplified Polymorphic DNA (RAPD): no enzyme; instead uses a random set of primers and PCR.
○ Produces a profile for small fragments on a regular gel

Minisatellites
What are they
○ Repeat unit >9 bp, usually ~30 bp
○ Among the first markers to be used for DNA fingerprinting

Microsatellites
What are they
○ Arise through polymerase slippage during DNA replication, with a higher rate of slippage with more repeats
Applications
○ Paternity
○ Forensic profiling
○ Population genetics
○ Conservation genetics
Pros
○ Abundant
○ Simple and cheap
○ Highly variable
○ We think they are neutral, but some studies have shown they are not
Cons
○ Mutation processes are hard to model
○ Null alleles

SNPs
Def: A SNP is a single nucleotide variation at a specific location in the genome
Pros
○ They are abundant
○ They can be genotyped in a high-throughput manner
○ The mutation mechanism is well established
Cons
○ SNP discovery phase introduces ascertainment bias due to sequencing a small pool of people
○ Less power per locus for individual identification compared to microsatellites, so more SNPs are generally needed
○ No large databases for SNPs
○ Cannot detect mixtures

Mitochondrial sequence
Haploid and maternally inherited
Approximately 17,000 bp (37 genes)
All SNPs (alleles) are linked and thus inherited together
Uses cytochrome b for species identification and haplotypes of species region

Chloroplast sequence
dsDNA
Plants & algae
120-247 kbp
Most are small
100+ genes
IR (inverted repeat) sequences
○ Repetitive regions
○ Vary in size and position between species

Process of Illumina sequencing: bridge amplification
1. Uses a flow cell coated with a lawn of oligos
2. The adaptor region on the DNA fragment hybridizes to the complementary oligo on the flow cell
3. Polymerase creates a complement of the hybridized fragment
4. The double-stranded molecule is denatured and the original single-stranded DNA fragment is washed away
5. The strand is clonally amplified through bridge amplification, as follows:
6. The strand folds over and the adaptor region on the end of the single-stranded DNA fragment hybridizes to the second type of oligo on the flow cell
7. Polymerase generates the complementary strand, which forms a double-stranded bridge
8. The double-stranded DNA bridge is denatured, which yields two single-stranded copies of the molecule tethered to the flow cell
9. Bridge amplification is repeated
10. After this is done, the reverse strands are washed off
11. The forward strands are blocked to prevent unwanted priming

Process of Illumina sequencing
1. The first primer is extended by adding fluorescently tagged nucleotides
2. Each nucleotide that is added produces a fluorescent signal, which is tracked
3. This is repeated until the first read is completed
4. The first read product is washed away and the index 1 read primer is added and sequenced to completion
5. The index read product is washed off and the 3’ end of the template is deblocked
6. The 3’ end of the template folds over and binds to the second oligo on the flow cell
7. The process is then repeated

Ion Torrent sequencing process
1. A sample of DNA is cut into fragments and each fragment attaches to its own bead
2. The beads flow across the chip and each deposits into a well
3. The chip is flooded with one of the 4 DNA nucleotides
4. If the nucleotide is the correct (complementary) one, it is incorporated, releasing a hydrogen ion
5. This changes the pH of the well, which is tracked and converted to voltage
6. This is repeated until completion

PacBio sequencing process: Continuous Long Read sequencing
1. High quality double-stranded DNA is isolated
2. Ligate SMRTbell adaptors and size select
3. Anneal primers and bind DNA polymerase
4. Molecules are loaded into zero-mode waveguides (ZMWs)
5. As the molecule is sequenced, light is emitted, which allows measurement in real time

PacBio sequencing process: Circular Consensus Sequencing
1. Circularized DNA is sequenced in repeated passes
2. Polymerase reads are trimmed of adapters to yield subreads
3. Consensus is called from subreads

Lecture summaries
Direct assessment of nucleotide variation:
○ SNPs - Single Nucleotide Polymorphisms
○ Mitochondrial sequence
○ Chloroplast sequence
The critical difference between Sanger sequencing and HTS is sequencing volume
○ Sanger method only sequences a single DNA fragment at a time
○ HTS is massively parallel, sequencing millions of fragments simultaneously per run, thus sequencing many more genes

Supplementary reading
Illumina sequencing - MiSeq compared to Oxford Nanopore approaches

Feature | Illumina MiSeq | Oxford Nanopore (ONT)
Data Quality | High-quality, short reads | Lower quality, improving with new chemistry
Read Length | Short (~300 bp amplicons) | Long (~1400 bp amplicons)
Taxonomic Resolution | Genus-level | Potential species-level resolution
Misclassification | Over- and under-representation of some taxa | Some species misclassification due to sequencing errors
Performance on Low Biomass Samples (e.g., Gills) | Better | Poorer due to length filtering
Multiplexing Capacity | High (up to 100+ samples) | Limited (typically 12 samples per flow cell)
Accessibility & Cost | Expensive instrument, often requires institutional access | More affordable and portable
Bioinformatics Support | Well-established pipelines (QIIME2, mothur) | Less standardized, ONT's EPI2ME software lacks flexibility
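
Worked example for Probability of Identity. The definition earlier in these notes (PI = sum of each genotype probability squared) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: genotype probabilities are derived from allele frequencies assuming Hardy-Weinberg equilibrium, loci are assumed independent so per-locus PI values multiply, and the allele names and frequencies are made up for the example (they are not from the lecture).

```python
from itertools import combinations_with_replacement
from math import prod


def genotype_probabilities(allele_freqs):
    """Genotype probabilities for one locus.

    Assumption (not from the notes): Hardy-Weinberg equilibrium, so a
    homozygote has probability p^2 and a heterozygote 2pq.
    """
    probs = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        p, q = allele_freqs[a], allele_freqs[b]
        probs[(a, b)] = p * p if a == b else 2 * p * q
    return probs


def probability_of_identity(allele_freqs):
    """PI for one locus: sum of each genotype probability squared."""
    return sum(g ** 2 for g in genotype_probabilities(allele_freqs).values())


def multilocus_pi(loci):
    """Combined PI across loci, assuming the loci are independent."""
    return prod(probability_of_identity(freqs) for freqs in loci)


# Hypothetical allele frequencies, chosen only for illustration.
locus_a = {"A1": 0.5, "A2": 0.5}
locus_b = {"B1": 0.6, "B2": 0.3, "B3": 0.1}

print(probability_of_identity(locus_a))  # 0.375
print(probability_of_identity(locus_b))
print(multilocus_pi([locus_a, locus_b]))
```

For locus_a the genotype probabilities are 0.25, 0.5 and 0.25, so PI = 0.0625 + 0.25 + 0.0625 = 0.375. Adding loci, or loci with more evenly distributed alleles, drives the combined PI down, which is why highly variable markers such as microsatellites need fewer loci than SNPs for individual identification.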
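
Worked example for RFLP. The RFLP notes earlier state that a restriction enzyme cuts at its motif roughly every 4^(length of the motif) bases, and that fragment lengths differ between individuals. A rough sketch of both ideas follows. Assumptions to keep in mind: the 4^n spacing holds only for a random sequence with equal base frequencies; the two short "allele" sequences are invented for the example; and, as a simplification, the cut is placed at the start of the motif rather than at the enzyme's true cut position inside it.

```python
# Recognition sequences of the three enzymes named in the notes.
RECOGNITION_SITES = {"BamHI": "GGATCC", "HpaI": "GTTAAC", "KpnI": "GGTACC"}


def expected_spacing(motif):
    """Average distance between sites in a random sequence (equal base frequencies)."""
    return 4 ** len(motif)


def find_sites(sequence, motif):
    """Return the 0-based start positions of every occurrence of the motif."""
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start)
        start = sequence.find(motif, start + 1)
    return sites


def fragment_lengths(sequence, motif):
    """Fragment lengths after digestion.

    Simplification: the cut is placed at the start of each motif.
    """
    boundaries = [0] + find_sites(sequence, motif) + [len(sequence)]
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Hypothetical alleles: allele_2 has a single base change that destroys
# the second BamHI site, so it yields a different fragment pattern.
allele_1 = "AAGGATCCTTTTGGATCCA"
allele_2 = "AAGGATCCTTTTGGATTCA"

bam = RECOGNITION_SITES["BamHI"]
print(expected_spacing(bam))            # 4096
print(fragment_lengths(allele_1, bam))  # [2, 10, 7]
print(fragment_lengths(allele_2, bam))  # [2, 17]
```

The differing fragment lists are the "length polymorphism" that shows up as different band patterns on a Southern blot.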