BPS3101 C15-Rev2-L8-12 F2024 Molecular Biology Revision Lecture Notes PDF
Document Details
2024
Summary
Revision lectures 8-12 in preparation for the second midterm in BPS3101 include topics like Sanger sequencing, nucleotide reactions, gaps and assembly, and comparisons of sequencing techniques. The lectures also discuss genotype-phenotype relationships of CYP2D6 polymorphisms and drug responsiveness.
Full Transcript
Revision Lectures 8-12
In preparation for the second midterm - next class (C16)

Test 2
Content: lectures 8-12, summarized in this slideshow, plus Sanger sequencing and the reaction of two nucleotides together as discussed.
- 3-4 essay questions (5-7 pts. each): mostly questions about general principles discussed in lectures, which you explain using text and drawings. Some (20-25%) may be about using these principles to address questions that will require imagination or problem solving.
- 5 short answer or multiple-choice questions (1 pt. each)

General philosophy
Few or no trick questions:
- Identify the most important principles, understand and explain them using text and drawings
- Then, meet with friends to explain and discuss the topics with one another

Critical notions to revise
- Examples of SNPs and pharmacogenomics
- Sanger sequencing
- Reaction of two nucleotides together and how the resulting products are harnessed for sequencing
- Gaps and assembly
- Description and comparison of 2nd and 3rd generation sequencing techniques
- OLC and de Bruijn graphs for seq. assembly
- Gene annotation
- Protein and gene sequence similarity search

Examples of drug responsiveness linked to genetic variation
Cytochrome P450 metabolizing enzymes (isozyme family)
- in liver, activate or inactivate different drugs
- known SNPs which affect enzyme activity
CYP2D6 gene (on chr 22)
- e.g. G-to-A mutation in exon 4 affects splicing, so no protein
- cannot activate opioid analgesics (e.g. codeine), so different form of pain relief needed
CYP2C9 gene (on chr 10)
- e.g. C-to-T SNP causing Arg-to-Cys mutation at position 144 of protein (R144C)
- “poor metabolizers” of warfarin (anticoagulant), so higher risk of internal bleeding
Ann. Rev. Genomics Hum. Genet. 2: 9-39, 2001
Sistonen Pharmacogenet Genom.
19:170, 2009

Genotype-phenotype relationships of CYP2D6 polymorphisms
- drug uptake into cell by receptor
- drug clearance from circulation
- drug activity & metabolism
Meyer Nature Rev Genet 5:645, 2004

Overview of important consequences of genetic polymorphisms in CYP
Ahmed Genom Prot Bioinformat 2016 in press

AmpliChip CYP450 test (Roche)
“The world's first microarray-based pharmacogenomic test cleared for clinical use.”
Powered by Affymetrix technology
From Roche website: Comprehensive detection of gene variations for the CYP2D6 and CYP2C19 genes, which play a major role in the metabolism of an estimated 25 percent of all prescription drugs.
“...contains more than 15,000 different oligomers”
- Detects up to 33 CYP2D6 alleles and 3 CYP2C19 alleles
- Detects CYP2D6 gene duplication and deletions

“Global distribution of individuals carrying duplication of the CYP2D6*1 or CYP2D6*2 genes (presented as percentage of the population).”
*1 and *2 are different alleles of isoform 6 in the 2D family of CYP450
Ingelman-Sundberg Pharmacol & Therapeutics 116:496, 2007

Sanger sequencing
Sanger chain termination method (Fred Sanger, 1977)
- enzymatic synthesis of DNA strand complementary to “template” of interest
- … but it stops when a dideoxynucleotide is incorporated
Nobel Prizes: Sanger 1958 (protein structure), 1980 (DNA sequencing)

Deoxy- versus dideoxyribonucleotide
[Figure: deoxynucleotide (natural), with a 3’-OH, versus dideoxynucleotide (artificial), lacking the 3’-OH]

Draw the reaction of two nucleotides during DNA synthesis

Three types of gaps
1. Sequence gaps: a clone containing the gap sequence between two contigs is present in the library, but was not sequenced originally
2. Physical gaps: a clone containing the gap sequence between two contigs is absent from the library.
3. Repetitive sequence gaps: failure to assemble fragments with repeated sequences.
They will be discussed separately (& over several lectures)

1. Sequence gaps
Approach to resolve them
ASSEMBLING INFO FROM CLONES INTO CONTIGS
1.
CHROMOSOME WALKING by hybridization
- sequence from one clone is used as probe to screen library of clones to find an overlapping one
- repeat to “walk along” genome
2. CHROMOSOME WALKING by PCR
- design primer pairs based on sequence at the end of clone
- use other clones in the library as template DNA
- will get PCR amplicon for any new clones with that sequence
- reactions can be carried out as pools for more rapid screening (combinatorial screening), e.g. 50 clones together & if you get a positive signal, then screen smaller pools or individual clones to find the clone of interest. Explain?
3. CLONE FINGERPRINTING
To identify overlapping clones by finding features that they share
- Restriction profile fingerprint: this could suggest these 2 clones overlap
- or clones having an STS in common

2. Physical gaps
Approach to resolve them
What if there is a “physical gap”, i.e. a particular region of genome is not represented in the library?
- can use a different vector to prepare a second clone library (maybe region was unstable in the first vector)
- then use probes (e.g. oligomers) mapping to ends of contigs from first library to screen the second library

Example of closing a “physical gap”
You have 9 contigs & design oligomers mapping close to their ends; screen by hybridization.
Draw! Which contigs are adjacent?

What if a “physical gap” is very short (< 10 kb or so)?
- could use oligomers mapping to ends of contigs in PCR reactions with an uncloned DNA template
[Figure: primers at the contig ends annealed to the two strands of the uncloned template]
- Screen by PCR
- Sequence PCR product directly
- Can also be used to find overlapping clones

3.
Repetitive sequence gaps
Approach to resolve them: an introduction to this topic. It will also be discussed in the lectures about Next Generation Sequencing.

Assembly of reads
- Contig: set of overlapping DNA segments that together represent a consensus region of DNA
- Scaffold: contigs separated by gaps of known length

The assembly problem
- Assemble overlapping reads
- Conceptually very simple
- Relies on greedy algorithm … sequences with most overlap get combined 1st

The assembly problem
Repetitive sequences cause significant problems with assembly
AGCTTTTCATTCTG
CTTTTCATTCTGA
TCATTCTGACTGC
CTGACTGCAACG
TGCAACGGGC
ACGGGCAATATGT
CGGGCAATATGTCTC
TATGTCTCTGTGTGGA
TCTGTGTGGATTA
Which read gets added?
GATTAAAAAAACCGAGT
GATTAAAAAAAGAGTG
(repetitive sequence)
Simple overlapping read assembly cannot solve repetitive sequences

Overlap layout consensus graph
Sentence to assemble: It was the best of times, it was the worst of times, it was the age
Fragments placed so far:
- It was the best
- was the best of
- the best of times,
- best of times, it
- of times, it was
- times, it was the
- then either “it was the worst” or “it was the age”
Fragments left to place:
- it was the worst
- was the worst of
- the worst of times,
- worst of times, it
- of times, it was
- times, it was the
- it was the age
Use a directed graph to solve: nodes = sequences, an edge indicates overlap

Overlap layout consensus graph
- Enables assemblies to be generated even with some repetitive sequence data
- Can lead to rearrangements in the sequence if not scrutinized carefully

When the repeated sequences are very large, OLC and other sequence assembly methods cannot assemble the full sequence.
If some fragments cannot be integrated into the contig, they are discarded. This may also lead to gaps.
In some whole genome sequencing projects, these gaps may remain unresolved for years.
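The greedy idea on the assembly slide (combine the sequences with the most overlap first) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `overlap` and `greedy_assemble` and the `min_len` cut-off are hypothetical choices made here, not part of the lecture, and real assemblers also tolerate sequencing errors and reverse complements.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest suffix/prefix overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: the remaining reads stay as separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads


# First five reads from the slide
reads = ["AGCTTTTCATTCTG", "CTTTTCATTCTGA", "TCATTCTGACTGC", "CTGACTGCAACG", "TGCAACGGGC"]
contigs = greedy_assemble(reads)

# The two candidate reads at the repeat overlap the preceding read equally well
tie_a = overlap("TCTGTGTGGATTA", "GATTAAAAAAACCGAGT")
tie_b = overlap("TCTGTGTGGATTA", "GATTAAAAAAAGAGTG")
```

With the first five reads the sketch returns a single contig. At the repeat, both candidate reads share the same 5-nt overlap (GATTA) with the preceding read, so overlap length alone cannot decide which read gets added.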
Gaps that remained unresolved for years, for example in the human genome:
- 2000: draft human genome, 90% sequenced, 150,000 gaps
- 2003: “complete human genome”, 92% sequenced, 400 gaps
- 2022: “first truly complete human genome”

New Sequencing Technologies
Second-generation sequencing
- Large number of short reads are simultaneously performed
- Solid phase
- Ex: pyrosequencing, ion semiconductor sequencing, and Illumina
Third-generation sequencing*
- One long read is performed
- Liquid and solid phase are mixed
- Ex: PacBio and nanopore

Sequencing library construction overview
Sheared DNA has both blunt and sticky ends, which limits ligation of adapters!
- DNA is blunted first
- 5’ phosphates are added
- blunt DNA is A-tailed (3’-A is added)
- ligation of adapter with a 3’-T overhang: needs 5’-phosphate & 3’-hydroxyl
Why is the A-tailing important?

Adapters (Illumina)
- P5 and P7: immobilization and DNA colony generation in the flow cell
- Index 1 (i5) and 2 (i7): barcode identifier allowing the sequencing of multiple DNA libraries (multiplexing or pooling)
- Rd1 SP and Rd2 SP: sequencing primer annealing sites
- DNA insert: contains the region of interest (e.g. gDNA, …), the library fragment to be sequenced

Adapters (Illumina) - details
Example of an Illumina adapter: the TruSeq Universal and Indexed primers forming an adapter
5’ AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T (ligate here)
3’ GTTCGTCTTCTGCCGTATGCTCTA-index-CACTGACCTCAAGTCTGCACACGAGAAGGCTAG-P
The last 12 nt of the adapters anneal. They are named forked adapters or Y adapters.
Why is the adapter forked?

DNA library amplification step
Amplification is required to obtain enough of each library member for detection
In Sanger sequencing, this is done by cloning and replication/amplification in E. coli
Single plasmid per E.
coli colony (clone) ensures each library member is amplified without cross contamination
In NGS: the challenge is to isolate and keep separate all library members in a chemical in vitro vessel
Amplification step is platform specific
- Ion semiconductor sequencing*: uses an emulsion PCR based amplification step on a bead
- Illumina: uses bridging PCR based amplification on the surface of a flow cell

Ion sequencing relies on bead attachment for library amplification
1. The sequencing DNA (sDNA) library with the proper fragment size is obtained
2. The sDNA library fragments are clonally attached and amplified on beads using emulsion PCR
3. The emulsion is broken to release beads
4. Beads with DNA are purified (to remove DNA-less beads)
5. The DNA-containing beads are sequenced

Emulsion PCR (emPCR) for DNA amplification in ion semiconductor sequencing
- Must have 1 DNA strand and 1 bead in each droplet
- Approximately 1/3 of droplets will meet this requirement

Amplification of DNA on bead by emPCR
- Bead is tethered to emPCR primer A
- In solution is emPCR primer B
- DNA is attached to bead at emPCR A site with 1 x 10^6-fold amplification

Enrichment of beads with amplified DNA
- Most beads will not have amplified DNA (60-80%)
- Must be removed prior to sequencing, otherwise sequencing efficiency drops
Separation of templated beads:
1. non-magnetic glycerol gradient centrifugation
2.
magnetic separation
- dsDNA is denatured: bead has ssDNA fragment
- Polystyrene beads with complementary primer to emPCR B are added
- Templated sequencing bead is linked to low density or magnetic beads
- Templated sequencing beads are separated from untemplated beads
Whole process = BEAMing: Beads Emulsions Amplification Magnetic separation

Illumina relies on the formation of clusters for library amplification
- DNA to be sequenced is annealed to a surface and amplified in a single spot in a process similar to PCR using the principle of bridge amplification
- Each colony arises from a single template strand
- Neighbouring colony is from a different template sequence
- High colony density on a surface: lots of different sequences
- Enables isolation and amplification of …

Prepare DNA library (from input DNA to sequencing library)
- Fragment DNA of interest into smaller strands that are able to be sequenced: sonication, nebulization, enzyme digestion
- Ligate adapters
- 4 cycles of PCR to install adapters
- Denature dsDNA into ssDNA by heating to 95 °C

Attachment of ssDNA to the flow cell
1. Dense lawn of oligonucleotides on the surface of each lane, which are complementary to the adapter
2. ssDNA anneals to the complementary oligo at surface of flow cell lanes
3. DNA polymerase and attached oligo are used to synthesize the reverse strand
4. Denaturation leaves only the newly synthesized DNA strand attached to the surface
[Figure: flow cell, steps 1-4] Several lanes per flow cell (8x, here)

Bridge Amplification
1. The free 3’-end of the tethered reverse strand anneals with the complementary oligo
2. The DNA polymerase forms a dsDNA bridge where both strands are attached to the surface
3. Denaturation generates ssDNA of both strands
4. This process is repeated (e.g. 35 cycles) to form a cluster of >1000 identical DNA sequences forming a 1-2 µm spot (i.e. a DNA colony).

Preparation for sequencing
1. The result of bridge amplification is a mix of forward and reverse strands
2.
The forward strand is removed by specific base cleavage (mechanism is not public)
3. The 3’ ends are blocked to prevent priming interference that would be detrimental to the sequencing reaction
NB: the orange circles represent the sequencing primer.

Ion Torrent Sequence by Synthesis
- Each well with a bead containing a single DNA fragment yields one ionogram
- Most common error with this technique occurs in stretches of the same nucleotide

Ion semiconductor sequencing by Synthesis details
- PGM, Ion Proton, and Ion Torrent are different kinds of apparatus based on the principle of ion semiconductor sequencing.
- The reads per run is directly related to the number of wells that produce an ionogram

Illumina sequencing by synthesis
Cycle: Incorporate, Detect, Deblock 3’, Cleave fluorophore
Reminder: the complementary strand is cleaved and the 3’ end of purple oligos is blocked.

Illumina sequencing by synthesis
- Color of signal indicates nucleotide
- Blocking group at 3’ prevents extension > 1 nt
- … So, intensity does not indicate the number of nt but confidence in the base calling

Analysis overview
- Demultiplexing (if needed)

Indexing for multiplexing
- Indexing reads are key to multiplexing
- The reading of both index sequences is completed separately using unique primers after reading the first strand.
- Sequences associated with the same index belong to the same library and are pooled together during computer analyses of the sequencing.

Pair-ended reads
Reminder: 1st read (reverse strand) - recap slide 17
- Through bridging and cleavage of the reverse strand, the forward strand can then be sequenced
- Since the forward strand will locate at the same position in the flow cell, it is possible to directly compare the sequences of the reverse and forward strand.
2nd
read (forward strand), after bridge amplification
- To do this, the process in slides 16-17 is repeated, except now the reverse strand is cleaved and a primer for the forward strand is used for sequencing
- Pair-ended reads help correct for errors and resolve ambiguous alignments

Illumina Sequencing by Synthesis details
Read length is typically 150 bp
Much shorter than Sanger sequencing. WHY? The answer is on the next slide!

The chemistry limits read length
Cycle: Incorporate, Detect, Deblock 3’, Cleave fluorophore
Even with high-yielding chemistry for cleavage of fluorophore and deblocking of 3’, after 100 bp there is very little correctly elongated DNA remaining. The fraction remaining after n cycles is the per-cycle yield raised to the power n (e.g. 0.99^100 ≈ 0.37).

Deblocking and fluorophore    % correctly elongated DNA    % correctly elongated DNA
removal yield                 remaining after 50 bp        remaining after 100 bp
95%                           7.7%                         0.6%
99%                           61%                          37%
99.9%                         95%                          90%
99.99%                        99.5%                        99%

Illumina sequencing
Advantages
- High throughput: this allows high coverage of genome and reduces errors
Disadvantages
- Read length is short, complicating the resolution of gaps, particularly those due to repeated sequences
- The technology is expensive: both the equipment and consumables

With 2nd generation sequencing, assembly is challenging due to shorter reads
- Overlap Layout Consensus (OLC) graph assembly becomes computationally more demanding with shorter sequences
- Alternative strategy needed for shorter NGS reads: de Bruijn graph-based strategy

Concept: k-mer
… a k-mer is a substring of length k that is derived from a larger DNA fragment of length n (e.g. a sequence read)
ATGGCGTGCA
Convert this sequence into all the possible k-mers of 3

de Bruijn graph for genome assembly
OLC assembly becomes too computationally demanding with shorter sequences
Therefore, an alternative strategy was sought for shorter NGS reads: de Bruijn graph-based strategy (its solution, an Eulerian path, is efficient, i.e.
solvable in a reasonable time)
Graph: edges = k-mers, nodes = (k-1)-mers
SPAdes, a popular DNA read assembler, uses a default setting with k-mers of size 21, 33, and 55

How does the de Bruijn graph differ from OLC?
Overlap layout consensus graph
- Nodes = reads
- Edge = overlap
- Genome sequence defined by nodes
- Must find a path through each node to get sequence of contig
- Called a Hamiltonian path
De Bruijn graph
- Reads converted to k-mers
- Nodes = (k-1)-mers
- Edge = k-mer
- Genome sequence defined by edges!
- Must find a path through each edge to get sequence of contig
- Called an Eulerian path

Construct a de Bruijn graph from k-mers
Example with k-mers of 3: ATG GCG TGC AAT, sequence of n=12, so how many k-mers of length k=3?

de Bruijn graph briefly explained
k-mer:  ATGCAGCTATATAGCGGATG
Prefix: ATGCAGCTATATAGCGGAT
Suffix: TGCAGCTATATAGCGGATG
Successive k-mers of the chosen size are made across the different reads. This creates a series of suffix and prefix sequences ((k-1)-mers) that overlap.

In this example with k-mers of length k=3, there are no repeated sequences. Thus, the de Bruijn graph is linear, and it is straightforward to deduce the master sequence: TGACCGCAGTTA

When there are repeated sequences, the redundant k-mers are removed to keep only one representative; the repeated k-mers are installed at branching points; the removal of the repeated k-mers reduces the computing cost. The algorithm solves the assembly problem by visiting each edge only once (i.e. the Eulerian path).
Master sequence: TGACCGCACCTA (repeated sequence highlighted in red)
Which one looks simpler?
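The k-mer and de Bruijn ideas above can be tried out with a small Python sketch. It is a teaching illustration under stated assumptions, not how SPAdes or any real assembler is implemented: the helper names are hypothetical, reads are assumed error-free and from one strand, and repeated k-mers are kept as parallel edges rather than collapsed to a single representative as described on the slide. The Eulerian path is found with Hierholzer's algorithm, which is one standard way to visit every edge exactly once.

```python
from collections import defaultdict


def kmers(seq, k):
    """All k-mers of seq, in order; a sequence of length n yields n - k + 1 of them."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(kmer_list):
    """Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: a walk that uses every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one
    start = next((n for n in graph if len(graph[n]) - in_deg[n] == 1), next(iter(graph)))
    edges = {n: list(v) for n, v in graph.items()}  # working copy of unused edges
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(kmer_list):
    """Spell the sequence along the Eulerian path: first node plus last base of each next node."""
    path = eulerian_path(de_bruijn(kmer_list))
    return path[0] + "".join(node[-1] for node in path[1:])


n_kmers_slide = len(kmers("ATGGCGTGCAAT", 3))     # the n=12, k=3 question
linear = assemble(kmers("TGACCGCAGTTA", 3))       # no repeated k-mer
with_repeat = assemble(kmers("TGACCGCACCTA", 3))  # ACC occurs twice
```

Both master sequences from the slide are recovered from their 3-mers. In the second case the node AC is visited twice, which is the branching point created by the repeated k-mer ACC.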
Pair-ended sequencing for contig assembly
- Single read sequencing: read only one end (one primer)
- Paired-end sequencing: read both ends (two primers)
- Paired-end reads can also help resolve ambiguity in contig construction

The limitations of 2nd generation addressed by the 3rd
Reminder: Next generation sequencing or 2nd generation is massively parallel sequencing
[Figure: workflow from DNA source through library generation step and amplification step to sequencing, annotated with the limitations addressed by 3rd generation sequencing]
Limitations:
- PCR required: introduces bias
- Short reads (… 99% …)

PacBio workflow: circular library enables multiple passes to improve accuracy (2)
- Q = -10 log10(error)
- Single-pass sequencing read errors are rapidly washed out with increasing number of passes (intra-molecular coverage)
- 1 pass → ~90% accuracy
- 2 passes → ~96% accuracy
- 3 passes → ~99% accuracy
- HiFi read gold standard is 99.9% accuracy or Q30
[Figure: predicted read quality (Phred) versus number of passes. Predicted read quality for CCS reads* with different numbers of passes for a GIAB HG002 human sample.]

PacBio workflow (continued)

Nanopore sequencing by threading of DNA
Sensitivity relies on measuring current through a membrane pore as the DNA transits through it
- When a small voltage (~100 mV) is imposed across a nanopore in a membrane separating two chambers containing aqueous electrolytes, the ionic current through the pore can be measured
- The membrane is a lipid bilayer with high electrical resistance
- Molecules going through the nanopore cause disruption in the ionic current, and by measuring the disruption, molecules can be identified.
- Much like patch clamping, a well-known single-molecule method

Pore can distinguish different bases as DNA is pulled through
- Different bases give different changes in current and dwell time
- The nanopore is formed by a protein
- The processivity is improved by a motor protein to yield more consistent data peaks
- Two types of proteins
Nanopore sequencing: no polymerase needed!
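The Phred relation quoted on the PacBio slide, Q = -10 log10(error), can be checked against the accuracies listed there with a short Python sketch. The function names are hypothetical and chosen only for this illustration.

```python
import math


def phred(accuracy):
    """Phred quality score for a given per-base accuracy (error = 1 - accuracy)."""
    return -10 * math.log10(1 - accuracy)


def accuracy_from_phred(q):
    """Inverse relation: per-base accuracy for a given Phred score."""
    return 1 - 10 ** (-q / 10)


# Accuracies quoted for 1, 2 and 3 passes, plus the HiFi standard
scores = {acc: round(phred(acc), 1) for acc in (0.90, 0.96, 0.99, 0.999)}
```

Under this relation, ~90% accuracy corresponds to about Q10, ~96% to about Q14, ~99% to Q20, and 99.9% to Q30, which matches the HiFi figure on the slide.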
Nanopore sequencing has long reads but very high error rate
[Figure: motor protein and nanopore]
- Capture and translocation of lead adapter
- Sequencing of 5’ and 3’ strands enables some error correction … today replaced by independent sequencing reads

Principles of nanopore sequencing
- The library can be designed to contain large fragments matching the long read lengths
- Library workflow is simple and similar to other methods
- The system is composed of two proteins:
1. A protein nanopore that reads the sequence
2. A motor protein that initially associates with the DNA molecule to increase sequencing processivity

2nd versus 3rd gen sequencing
- The error rate is for a single pass
- 3rd gen has longer read length … but higher error rates

How to find genes within a DNA sequence?
1. Bacterial genomes - genes tightly packed, no introns...
Scan for ORFs (open reading frames) in silico
- check all 6 reading frames (both strands)
- look for significant distance between potential start and stop codon (e.g. 100 codons)
- … but if examining very short regions of genome, start codon (or stop codon) might be located further upstream (or downstream)
Using computer programs to search for ORFs: Query: 3 kb sequence
The longer the open reading frame, the lower the probability of it occurring by random chance.

2. Eukaryotic genomes (such as human) - genes usually far apart, long introns & short exons
Would an ORF scan work here?

Can also use algorithms to look for:
1. Exon-intron boundaries
- “GT-AG” rule, but consensus sequences very short
2. Regulatory motifs
- upstream promoters, downstream polyA addition signals…
- but consensus sequences usually very short
3. Codon bias patterns
- synonymous codons are not all used equally
- patterns differ among organisms
- If ORF has same codon bias as known genes in a genome, supports view that it might encode a protein

Can also use algorithms to look for:
4.
Homologous sequences in other organisms
- comparative sequence analysis (BLAST searches) to find related genes
- usually advantageous to search using amino acid sequence rather than nucleotide sequence of gene, because homologous genes from divergent organisms typically show greater similarity at amino acid level than at nt level:
  - Degeneracy of genetic code
  - Codon bias among organisms
  - Probability of a specific stretch of nucleotides occurring by random chance (“spurious hits”) is higher than for the same length of amino acids: 4 different nucleotides vs. 20 different amino acids

Tools to search for homologous sequences in databanks
BLAST searches www.ncbi.nlm.nih.gov/BLAST/
Basic Local Alignment Search Tool
- search programs to look for similarity between your sequence of interest (protein or DNA) and entries in global data banks

Program    Query (search)            Subject database (retrieve)
BLASTN     nucleotide                nucleotide
BLASTP     protein                   protein
BLASTX     translated nucleotides    protein
tBLASTN    protein                   translated nucleotides

Example of BLASTP search with wheat mitochondrial ATP8 protein as query
- … obtained strong “hits” (significant matches) for closely-related organisms (i.e. plant & algal mitochondria): Physcomitrella ATP8 = 174 aa
- … but for more distant relatives (protist mito & bacteria), lower sequence similarity & no “hit” for C-terminal region: Reclinomonas ATP8 = 133 aa, ATP8 = 160 aa
- E-values: statistical measure of likelihood that sequences with this degree of similarity occur randomly, i.e. reflects number of hits expected by chance
- Seed plants were excluded in this search to show outcome with distantly-related sequences

But if this search was done at nucleotide (instead of protein) level:
BLASTN
Query: wheat mitochondrial ATP8 gene (468 bp)
- Homologous ATP8 genes were identified only in closely-related organisms in nt search
- … many short “spurious hits” in data bank

“Lack of homology between two sequences is often more apparent when comparisons are made at the amino acid level”
To illustrate the power of amino acid level searches, text shows 2 sequences with 76% nt identity … but only 28% aa identity
- conclude that sequences are not homologous
But it’s a rather artificial example to illustrate the statement at the top of this slide, because if 2 DNA stretches of 300 bp or so (normal default length in ORF Finder) and unbiased base composition showed 76% nt identity, it’s improbable that such similarity occurred by chance

Homologous genes - genes that share a common evolutionary origin
This is the definition given in text
1. Orthologs - homologous genes in different organisms (e.g. β-globin genes from mouse and human)
2. Paralogs - homologous genes in same organism (e.g. multi-gene family members, α-globin and β-globin from mouse)
But note α-globin from mouse and β-globin from human are also paralogs
NB: two genes are either evolutionarily related or they are not … so instead of “…% homologous”, use “…% identity”

Are two sequences homologous or independent in origin?
Factors to consider:
1. Length of sequence
- short sequences more likely to occur by chance
2. Base composition
- highly biased (e.g. if only AT) more likely to occur by chance: “low complexity regions”
3.
Similarity at amino acid level (if protein-coding region)
- high % identity is strong argument for homology
- usually implies common protein function
- nt changes such that minimal effect on aa sequence

RNA-seq is often used to map introns and identify coding sequences in eukaryotes
RNA-seq: uses NGS (usually Illumina) to sequence cDNA; gives transcript abundance and helps define exon boundaries
[Figure: mRNA, fragmented cDNA reads, reads aligned to genome]
- Identifies exonic sequences (5’ UTR, 3’ UTR, coding sequences)
- Lower read # for exon B suggests there may be a splice variant lacking that exon in th