Lecture on Genomic Data Sets PDF

Summary

This document provides an overview of genomic datasets, including whole genome sequencing and the technologies used. Topics covered include Illumina, Ion Torrent, and different types of genetic markers like SNPs and RFLP. The document also examines concepts like Probability of Identity and discusses various applications within the field of genomics.

Full Transcript

What is a “genomic” data set?
● Defined as high-throughput sampling of the genome
● Whole genome sequencing
● Provides information on recombination and on methylation

Know the pros and cons and key differences between the HTS/NGS technologies/platforms (e.g. sequence length)
● Roche 454
  ○ Titanium, ~500 bp
● Illumina
  ○ HiSeq2000, ~150 bp but improving
  ○ Pro: high accuracy with good coverage
  ○ Con: long run time with phasing difficulties
● ABI SOLiD sequencing
  ○ Ligation based (100 bp)
● Pacific Biosciences
  ○ Single molecule (SMRT) sequencing
  ○ Pros
    ■ Very long reads can help resolve ambiguities
    ■ No DNA amplification required, comparatively faster turnaround time
  ○ Cons
    ■ Expensive
    ■ Higher error rate
● Ion Torrent
  ○ Ion semiconductor chips (~100-200 bp)

Probability of Identity
● Def: Likelihood of two people having the same profile (genotype)
● PI = sum of the squared probability of each genotype
  ○ Ex.
  ○ Important concepts

Protein electrophoresis & allozyme markers
● Early protein electrophoresis studies revealed that the extent of genetic variation is much higher than previously thought
● Proteins move in the electric field (in a gel medium)
● Relative speed (distance) depends on the charge, size, and shape of the protein

RFLP
● What are they
  ○ Restriction Fragment Length Polymorphisms
  ○ Use restriction enzymes to cut at specific sequences, on average once every 4^n bp, where n is the length of the recognition motif
  ○ How they were discovered: detected by hybridizing radioactively labelled probes to DNA transferred from a gel to a filter (“Southern blotting”)
● Restriction enzyme examples
  ○ Bam HI
  ○ Hpa I
  ○ Kpn I
● Method
  ○ Restriction enzymes recognize sequence motifs and cut DNA at the motif
● Applications
  ○ Experimental design
  ○ Genome coverage
  ○ Parentage using Southern blotting gels
  ○ REs can be leveraged for downstream applications

AFLP (and RAPDs)
● What are they (see slide)
  ○ RAPD (Random Amplified Polymorphic DNA): no enzyme; instead uses a random set of primers and PCR
  ○ Produces a profile of small fragments on a regular gel

Minisatellites
● What are they
  ○ Repeat unit >9 bp, usually ~30 bp
  ○ Among the first markers to be used for DNA fingerprinting

Microsatellites
● What are they
  ○ Arise through polymerase slippage during DNA replication; the rate of slippage is higher with more repeats
● Applications
  ○ Paternity
  ○ Forensic profiling
  ○ Population genetics
  ○ Conservation genetics
● Pros
  ○ Abundant
  ○ Simple and cheap
  ○ Highly variable
  ○ Generally assumed to be neutral, although some studies have shown otherwise
● Cons
  ○ Mutation processes are hard to model
  ○ Null alleles

SNPs
● Def: A SNP is a single nucleotide variation at a specific location in the genome
● Pros
  ○ They are abundant
  ○ They can be genotyped in a high-throughput manner
  ○ The mutation mechanism is well established
● Cons
  ○ SNP discovery phase introduces ascertainment bias due to sequencing a small pool of people
  ○ Less power per locus for individual identification compared to microsatellites, so more SNPs than microsatellites are generally needed
  ○ No large databases for SNPs
  ○ Cannot detect mixtures

Mitochondrial sequence
● Haploid and maternally inherited
● Approximately 17,000 bp (37 genes)
● All SNPs (alleles) are linked and thus inherited together
● Uses cytochrome b for species identification and haplotypes of species region

Chloroplast sequence
● dsDNA
● Plants & algae: 120-247 kbp; most are small, 100+ genes
● IR (inverted repeat) sequences
  ○ Repetitive regions
  ○ Vary in size and position between species

Process of Illumina sequencing: bridge amplification
1. Uses a flow cell coated with a lawn of oligos
2. The adaptor on the DNA fragment hybridizes to the complementary oligo on the flow cell
3. Polymerase creates a complement of the hybridized fragment
4. The double-stranded hybridized fragment is then denatured and the original single-stranded DNA fragment is washed away
5. The strand gets clonally amplified through bridge amplification:
6. The strand folds over and the adaptor region on the end of the single-stranded DNA fragment hybridizes to the second type of oligo on the flow cell
7. Polymerase generates the complementary strand, which forms a double-stranded bridge
8. The double-stranded DNA bridge is denatured, which forms two single-stranded copies of the molecule that are tethered to the flow cell
9. Bridge amplification is repeated
10. After this is done, the reverse strands are washed off
11. The forward strands are blocked to prevent unwanted priming

Process of Illumina sequencing
1. The first primer is extended by adding fluorescently tagged nucleotides
2. Each nucleotide that is added produces a fluorescent signal, which is tracked
3. This is repeated until the first read is completed
4. The first read product is washed away and the index 1 read primer is added and sequenced to completion
5. The index read product is washed off and the 3’ end of the template is deblocked
6. The 3’ end of the template folds over and binds to the second oligo on the flow cell
7. The process is then repeated

Ion Torrent sequencing process
1. The DNA sample is cut into fragments and each fragment attaches to its own bead
2. The beads flow across the chip and each deposits into a well
3. The chip is flooded with one of the 4 DNA nucleotides
4. The correct nucleotide gets incorporated, causing a hydrogen ion to be released
5. This changes the pH of the well, which is tracked and converted to voltage
6. This is repeated until completion

PacBio sequencing process: Continuous long read sequencing
1. High-quality double-stranded DNA is isolated
2. Ligate SMRTbell adaptors and size select
3. Anneal primers and bind DNA polymerase
4. Molecules are loaded into zero-mode waveguides (ZMWs)
5. As the molecule gets sequenced, light is emitted, which allows you to measure in real time

PacBio sequencing process: Circular Consensus Sequencing
1. Circularized DNA is sequenced in repeated passes
2. Polymerase reads are trimmed of adapters to yield subreads
3. Consensus is called from subreads

Lecture summaries
Direct assessment of nucleotide variation:
● SNPs - Single Nucleotide Polymorphisms
● Mitochondrial sequence
● Chloroplast sequence

● The critical difference between Sanger sequencing and HTS is sequencing volume
  ○ Sanger method only sequences a single DNA fragment at a time
  ○ HTS is massively parallel, sequencing millions of fragments simultaneously per run and thus covering many more genes

Supplementary reading
Illumina sequencing: MiSeq compared to Oxford Nanopore approaches

Feature | Illumina MiSeq | Oxford Nanopore (ONT)
Data Quality | High-quality, short reads | Lower quality, improving with new chemistry
Read Length | Short (~300 bp amplicons) | Long (~1400 bp amplicons)
Taxonomic Resolution | Genus-level | Potential species-level resolution
Misclassification | Over- and under-representation of some taxa | Some species misclassification due to sequencing errors
Performance on Low Biomass Samples (e.g., Gills) | Better | Poorer due to length filtering
Multiplexing Capacity | High (up to 100+ samples) | Limited (typically 12 samples per flow cell)
Accessibility & Cost | Expensive instrument, often requires institutional access | More affordable and portable
Bioinformatics Support | Well-established pipelines (QIIME2, mothur) | Less standardized, ONT's EPI2ME software lacks flexibility
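
Worked example for Probability of Identity. The notes define PI as the sum of each genotype probability squared but leave the example blank, so here is a minimal sketch of one way to compute it. The notes do not say how genotype probabilities are obtained; this sketch assumes Hardy-Weinberg proportions, and it assumes independent loci when multiplying across loci. The function names are made up for illustration.

```python
from itertools import combinations_with_replacement
from math import prod


def genotype_frequencies(allele_freqs):
    """Expected genotype frequencies at one locus.

    Assumption (not stated in the notes): Hardy-Weinberg proportions,
    i.e. p*p for homozygotes and 2*p*q for heterozygotes.
    """
    indexed = list(enumerate(allele_freqs))
    freqs = {}
    for (i, p), (j, q) in combinations_with_replacement(indexed, 2):
        freqs[(i, j)] = p * p if i == j else 2 * p * q
    return freqs


def probability_of_identity(allele_freqs):
    """PI for one locus: sum of each genotype probability squared."""
    return sum(f ** 2 for f in genotype_frequencies(allele_freqs).values())


def multilocus_pi(loci):
    """PI across several loci, assuming the loci are independent."""
    return prod(probability_of_identity(freqs) for freqs in loci)


if __name__ == "__main__":
    snp = [0.5, 0.5]             # biallelic SNP
    microsatellite = [0.1] * 10  # 10 equally common alleles
    print("SNP PI:", probability_of_identity(snp))
    print("Microsatellite PI:", probability_of_identity(microsatellite))
    print("Ten such SNPs:", multilocus_pi([snp] * 10))
```

With these made-up frequencies the SNP gives PI = 0.375 and the microsatellite roughly 0.019, which matches the point in the SNP cons that a single SNP has less power per locus than a microsatellite, so more SNPs are needed.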
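
Worked example for the RFLP section. A rough sketch of the "cuts about every 4^(motif length)" rule and of how a lost restriction site changes fragment lengths. The recognition sequences and cut positions for BamHI, HpaI and KpnI are the standard ones; the test sequences are invented, the 4^n figure assumes the four bases are equally frequent, and the digest treats the DNA as a linear molecule read on one strand.

```python
# Recognition motif and cut offset within the motif (top strand).
SITES = {
    "BamHI": ("GGATCC", 1),  # G^GATCC
    "HpaI": ("GTTAAC", 3),   # GTT^AAC (blunt)
    "KpnI": ("GGTACC", 5),   # GGTAC^C
}


def expected_spacing(motif):
    """Average distance between sites, assuming equal base frequencies."""
    return 4 ** len(motif)


def digest(seq, motif, cut_offset):
    """Return fragment lengths after cutting a linear sequence at every motif."""
    cuts = []
    start = seq.find(motif)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(motif, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    allele_a = "AAGGATCCTTTTGGATCCA"  # two BamHI sites (invented sequence)
    allele_b = "AAGGATCCTTTTGGTTCCA"  # second site lost by a point mutation
    print(expected_spacing("GGATCC"))         # one site per 4096 bp on average
    print(digest(allele_a, *SITES["BamHI"]))  # fragment lengths for allele A
    print(digest(allele_b, *SITES["BamHI"]))  # fragment lengths for allele B
```

The two alleles give different fragment length patterns on a gel, which is the polymorphism that RFLP detects.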
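
Worked example for the Ion Torrent sequencing process. A toy simulation of steps 3 to 6 (flood with one nucleotide, record a signal if it is incorporated, repeat). This is only a sketch: the flow order "TACG" and the idea of an idealized, noise-free signal equal to the number of bases incorporated are assumptions for illustration, not the instrument's actual flow order or base caller.

```python
def flowgram(read, flow_order="TACG", max_flows=400):
    """Simulate idealized flow signals for the strand being synthesized.

    Each flow floods the well with one nucleotide. The signal is the number
    of consecutive bases incorporated (0 if the nucleotide does not match),
    standing in for the pH change that is converted to voltage.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(read) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        incorporated = 0
        while pos < len(read) and read[pos] == base:
            incorporated += 1
            pos += 1
        signals.append((base, incorporated))
        flow += 1
    return signals


def call_bases(signals):
    """Turn flow signals back into a sequence."""
    return "".join(base * count for base, count in signals)


if __name__ == "__main__":
    signals = flowgram("TTAG")
    print(signals)
    print(call_bases(signals))
```

A homopolymer such as "TT" is incorporated in a single flow, so its length has to be inferred from the size of the signal rather than from separate flows.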
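
Worked example for PacBio Circular Consensus Sequencing. Step 3 (consensus is called from subreads) can be illustrated with a simple per-position majority vote. This is a deliberately simplified stand-in: it assumes the subreads are already aligned and of equal length, whereas real consensus calling has to handle insertions and deletions. The subreads shown are invented.

```python
from collections import Counter


def majority_consensus(subreads):
    """Call a consensus by majority vote at each position.

    Assumption: subreads are pre-aligned and all the same length.
    """
    if not subreads:
        raise ValueError("need at least one subread")
    if len({len(s) for s in subreads}) != 1:
        raise ValueError("subreads must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*subreads)
    )


if __name__ == "__main__":
    # Three passes over the same circular molecule, each with one random error.
    passes = ["ACGTAC", "ACCTAC", "ACGTAA"]
    print(majority_consensus(passes))
```

Because errors in individual passes fall at different positions, repeated passes over the same molecule can outvote them, which is why the consensus is more accurate than any single subread.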