BIOTECH4BI3 - Bioinformatics Lecture Notes PDF
Document Details
Uploaded by EfficientHurdyGurdy
McMaster University
Tags
Summary
These lecture notes cover the basics of bioinformatics, focusing on Python review, BioPython modules, and different DNA sequencing technologies. The notes provide examples of code and demonstrate how to work with sequences using Python libraries.
Full Transcript
BIOTECH4BI3 – BIOINFORMATICS Lecture 1 – Python review and introduction to BioPython Things to Remember Python has a large standard library Python is built on the idea that new functionality should be easy to add The way your code is structured MATTERS Designed to be not clu...
BIOTECH4BI3 – BIOINFORMATICS Lecture 1 – Python review and introduction to BioPython Things to Remember Python has a large standard library Python is built on the idea that new functionality should be easy to add The way your code is structured MATTERS Designed to be not cluttered and uses English language keywords “…clever is not considered a compliment…” Hello World! #! /usr/bin/python print(“Hello world!”) Each program starts with a declaration of where to find the python interpreter (not necessary in Windows) The ‘\n’ signifies the newline character Words to print are surrounded by quotation marks Save this to a text file with extension ‘.py’ and run Use the Python 3.12 installation from www.anaconda.com Types of Data Python has a number of built in types like numbers and strings We assign data to variables in python as follows; myNumber=1 mySentence=“You will love my class” Operators allow us to manipulate data; ▪ +,-,*,/,%,** (numbers) ▪ ,++,=,!= Built-in Types in Python There are many built-in data types but the following are the most commonly used Integers myNumber=1 int(x) Floats myFloat=2.1 float(x) Lists myList=[1,2,3,’four’] Range myRange=range(0,10,2) Strings myString=“biotech” Dictionaries myHash={‘Joe’:123,’John’:456} Flow Control Like many other programming languages Python has familiar constructs to control the flow of a program if/elif/else for statements while statements break continue If/elif/else A condition statement that offers the program a choice of actions Has the following structure if True/False: operation 1 elif True/False: operation 2 else: operation 3 If/elif/else Example #! /usr/bin/python var=5 if var>5: print("value is too high\n") elif var’ Multiple FASTA records can be put in a single file Do not mix DNA and protein records in the same file Reading in a file is called ‘parsing’ FASTA Example >gene1 disease response gene CCTATTAAAGATAACAAGGAAGAAGATGATATAGATGAAGATGCTTTTGAGGCACTGTTTCAGCTA GAGGAGGACCTCGCGAAGCTTGAACGTGAATTGGAAGAGGCACTGAAGGATGATGAACTATTAGGA GGGGTAGTAGAAGATGATAACGAAGAAGAAGAAGAAGAAGAATTACCCGTGAATTTGAAAAATTGG AGTTGTGATCTCTATGTTAGTCTTCTCAAATTAAGATGTTGCAATCGCACTACCAGATACTCATTA AGCACGAAGCAAAGAACATCAATCCATGTGTCTAAAATTTTAAATTGCAGTCCGTATGTAGCTTCA AAAAAACTGGACATCTACTTTATGCTTGCACCATTTTCTTGGCATTCATTAGCTAAAACTATATAT TTCTAGATAAAGTCATCATAAGTATAGTCGGAAGTTTCAGAACCTGTGGGCCTGTGGCTGTAACAT ATTTAAAGAGAATATTTCTACCACTGCTAATTCTGTATCTGTAAATTCTATGTTTCCTTCCAGATA FASTQ Format A common format for data generated by high throughput DNA sequencing instruments Incorporates base call information as well as information about the probability about the base calls. These values are called ‘Phred’ scores or ‘quality’ scores A FASTQ entry has 4 lines; – Line 1 – starts with the ‘@’ symbol followed immediately by the sequence name. Option descriptive information can follow the name – Line 2 – the DNA base calls – Line 3 – the plus symbol (‘+’) – Line 4 – the Phred scores for the base calls Multiple FASTQ records can be present in a file FASTQ Example @SEQ001 TACGGTAGCTAAGTGAGTAGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC + AAAAAEEEEEEEEEEEEEEAEEEEEEAEEEE6EEEAEEEE/EEEEEEEEEEEEEEEEEEEE @SEQ002 CGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA + A6AAAAAEEAEEEEEEEEAEEEEEEEEAEEEEEEEEEEAEEEEEEEEEEEAEEEEEEEEEE @SEQ003 GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA + EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE @SEQ004 ATAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG + E6EEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEAEEEEEEEEEEEEAEEEEEEEEEEEE Genbank Format LOCUS AY069118 1502 bp mRNA linear INV 17-DEC-2001 for sequence accuracy, presence of a polyA tail and contiguity within 100 kb in the genome. Thus we believe the sequence to DEFINITION Drosophila melanogaster GH13089 full length cDNA. reflect accurately this particular cDNA clone. However, there are ACCESSION AY069118 artifacts associated with the generation of cDNA clones that may VERSION AY069118.1 GI:17861571 have not been detected in our initial analyses such as internal KEYWORDS FLI_CDNA. priming, priming from contaminating genomic DNA, retained introns SOURCE Drosophila melanogaster (fruit fly) due to reverse transcription of unspliced precursor RNAs, and reverse transcriptase errors that result in single base changes. ORGANISM Drosophila melanogaster For further information about this sequence, including its location Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota; and relationship to other sequences, please visit our Web site Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; (http://fruitfly.berkeley.edu) or send email to Ephydroidea; Drosophilidae; Drosophila. [email protected]. REFERENCE 1 (bases 1 to 1502) FEATURES Location/Qualifiers source 1..1502 AUTHORS Stapleton,M., Brokstein,P., Hong,L., Agbayani,A., Carlson,J., /organism="Drosophila melanogaster" Champe,M., Chavez,C., Dorsett,V., Farfan,D., Frise,E., George,R., /strain="y; cn bw sp" Gonzalez,M., Guarin,H., Li,P., Liao,G., Miranda,A., Mungall,C.J., /db_xref="taxon:7227" Nunoo,J., Pacleb,J., Paragas,V., Park,S., Phouanenavong,S., Wan,K., /map="39B3-39B3" Yu,C., Lewis,S.E., Rubin,G.M. and Celniker,S. gene 1..1502 /gene="E2f2" TITLE Direct Submission /note="alignment with genomic scaffold AE003669" JOURNAL Submitted (10-DEC-2001) Berkeley Drosophila Genome Project, /db_xref="FLYBASE:FBgn0024371" Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA COMMENT Sequence submitted by: Berkeley Drosophila Genome Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This clone was sequenced as part of a high-throughput process to sequence clones from Drosophila Gene Collection 1 (Rubin et al., An Example Bio.SeqIO ‘Bio.SeqIO provides a simple uniform interface to input and output assorted sequence file formats’ The design of the this module was based on Perl’s Bio::SeqIO The Bio.SeqIO object allows one to iterate over all of the sequence records in a file Create the object as follows – seqIOObj=SeqIO.parse(FILEHANDLE,FORMAT) Example - SeqIO Example - SeqIO Bio.SeqIO – Format Conversion It is easy to use BioPython to convert DNA sequences among differenct formats OR to use BioPython formatting features to clean up your sequences The approach is to open a source file in one format and a sink file in another format As you iterate over the source sequences you write the objects to the sink file Example – Format Conversion Bio.SearchIO There are a number of bioinformatics tools that are used to search one sequence against a database of others The goal of this is to find similarities between known sequences and unknown sequences with the hope of assigning a putative function Basic Local Alignment Search Tool (BLAST) is one of the most commonly used bioinformatics tools and revolutionized what came before it. BLAST reports can be very detailed and very long. BioPython makes the easy to work with Example – Perform a BLAST Search Example – Parse BLAST Result Example – Parse BLAST Result BLAST Record BIO4BI3 - Bioinformatics Lecture 2 – DNA Sequencing Technologies and Applications Where are we going? DNA Sequencing DNA Read DNA Assembly Sequencing Quality Control Mapping Genome Expression Annotation Analysis Marker-Trait Population Polymorphism Genotyping Associations Analysis Discover Learning Outcomes History of DNA sequencing technology Understand the positives and negatives of current sequencing technology Learn about innovations in sequencing library preparations Use a case study to understand how technology choices influences the outcomes of genome sequencing projects Understand the considerations when choosing sequencing technology Sanger Sequencing Referred to as dideoxy sequencing or chain termination sequencing DNA extension occurs through the polymerization of dATP, dCTP, dTTP, dGTP using an existing DNA strand as a template Radiolabelled primer, dNTPs plus nucleotides unable to continue the polymerization (dideoxy-NTPs) are mixed in four separate sequencing reactions, one for each dNTP The ddNTPs are randomly incorporated into the new strand terminating extension resulting in radioactive fragments of varying length. If enough template is used chain termination will happen at every nucleotide position The fragments are run on a gel to separate them by size and the sequence is read from an autoradiogram Autoradiogram Four separate reactions resolved on an acrylamide gel Smaller fragments run towards the ‘bottom’ Fluorescent Dye-Termination Instead of radioactivity a fluorescent dye is attached to each ddNTP Each nucleotide is a different colour Allows all four reactions to be run in a single lane Capillary Electrophoresis Sanger cont’ The standard against which other types of sequencing are measured First sequenced human genome was generated from Sanger sequences Read lengths of 800-1000 today Still widely used today for sequencing small regions or fragments of DNA Used in combination with next-gen sequencing technology because of its length, high fidelity and well understood error model Quality Scores A measure of confidence in a base-call Originated from the Phred software used to make base-calls from dye- terminator fluorescent sequencing (Ewing and Green, 1998) q=-10log10(p) where p is the probability that the base has been correctly called At q=20 there is a 99% probability that the base has been correctly called. Useful in removing low quality ends of sequence reads Useful in generating consensus sequence from aligned reads Useful in polymorphism discovery Next-gen platforms are trying to develop measures of quality that are compatible with the Phred q-value. Next-gen Sequencing Technology Next-generation sequencing refers to platforms which sequence DNA in a highly parallel manner Sanger -> 1 run = 1 sequence Next-gen -> 1 run = billions of sequences No need for bacterial colonies to enable single clone selection Input library construction is fast and not complex Orders of magnitude faster and cheaper than Sanger sequencing 4 main technologies in the market; Illumina, MGI, Oxford Nanopore and Pacific Biosciences each with different approaches Next-gen Sequencing Technology NGS can be broadly grouped into two types – short read and long read Characteristics of each type to keep in mind short read technologies typically generate much more data per dollar short read technologies use a strategy of signal amplification for detecting base incorporation Short reads are good at SNP detection while long reads excel at identifying structural variations among individuals Long read tech is primarily used in de novo genome sequencing and full- length transcript sequencing, genome structure analysis Short reads are used mainly in re-sequencing, genotyping, genome structure analysis, transcript abundance Illumina Sequencing by synthesis Currently the absolute market leader at > 80% market share Nucleotides are added in a step-wise fashion to the single strand template. Nucleotide determination is based on the wavelength of the fluorescence Polymerization is interrupted after each nucleotide is added. Therefore fluorescence always represents a single base Bridge Amplification http://www.pasteur.fr/ip/easysite/go/03b-00002o-00a/recherche/plates-formes-technologiques/technopole-de-l-institut-pasteur/ genopole/pf/pf1-genomique/sequencage-tres-haut-debit-gaii-illumina-solexa Sequencing by Synthesis http://www.pasteur.fr/ip/easysite/go/03b-00002o-00a/recherche/plates-formes-technologiques/technopole-de-l-institut-pasteur/ genopole/pf/pf1-genomique/sequencage-tres-haut-debit-gaii-illumina-solexa Illumina Sequencing Applications include whole genome re-sequencing, small RNA identification, RNA-seq, digital gene expression, ChIP-Seq, ATAC-Seq, Hi-C, genotyping, Skim-Seq, TILLING Benefits include very large quantities of data generated (latest offering generates multiple TBs of sequence per run), no issues with sequencing homopolymers A major limitation of this technology was the short lengths of the reads it generates. Read lengths of 150 are common but 300 bp are possible. Probability of errors increases at greater read positions It has a non-random error model Overview of library construction. Hirst M , and Marra M A Briefings in Functional Genomics 2010;9:455-465 © The Author 2011. Published by Oxford University Press. All rights reserved. For permissions, please email: [email protected] http://nextgen.mgh.harvard.edu/CustomPrimer.html Paired-end Sequencing https://galaxyproject.github.io/training-material/topics/introduction/tutorials/galaxy-intro-ngs-data-managment/tutorial.html Illumina Sequencing Libraries A sequencing library is a collection of DNA fragments with have been modified to enable sequencing The range of Illumina applications is due to the different types of libraries which are created Third party companies develop novel library preps to extend how Illumina sequence can be applied The innovation is usually associated with processing DNA fragments such that they received a unique combination of barcodes The barcodes allow reads to be associated with individual samples or fragments of DNA 10X Genomics An Illumina library construction technique Allows the assignment of individual sequencing reads to a particular starting template molecule(s) Want to isolate a single cell (or large DNA fragment) with the components for a library preparation Conceptually it is millions of independent Illumina library preparations each with its own unique barcode (Gel Bead in Emulsion – GEM) Benefit is that during sequence assembly you know which reads came from the same starting piece of DNA Has extensive use in single-cell RNA sequencing and single- cell ATAC-Seq No longer offered for whole genome sequencing due to patent disputes. MGI’s single-tube long fragment read (stLFR) is an equivalent offering 10X Genomics – Single Cell Zheng, G., Terry, J., Belgrader, P. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 8, 14049 (2017) 10X Genomics Linked Reads https://wheaton5.github.io/projects/tenx Hi-C What is Hi-C Hi-C is an extension of the Chromosome Conformation Capture (3C) method. It captures the three-dimensional organization of genomes by identifying the physical proximity of chromosomal regions in the nucleus. Purpose of Hi-C: To understand how chromosomes fold and interact within the nucleus. To identify topologically associating domains (TADs), enhancer-promoter interactions, and other chromatin interactions. Key Reference: First described in Lieberman-Aiden et al., Science, 2009 Hi-C Workflow Step 1 Step 2 Step 3 Step 4 Crosslinking and Ligation DNA Purification Data Analysis Digestion The sticky ends of and Sequencing Mapping of reads to Cells are the fragmented The crosslinks are the reference crosslinked with DNA are filled in reversed, and the genome. formaldehyde to with biotin-labeled ligated DNA is Construction of a preserve DNA- nucleotides and purified. contact matrix that protein ligated. High-throughput shows interaction interactions. This step joins DNA sequencing is frequencies The DNA is then fragments that are performed to between different digested with a physically close in identify ligated DNA regions. restriction enzyme, 3D space but may fragment pairs. cutting it at specific be far apart in linear sites. genomic distance. Hi-C Overview Lieberman-Aiden et al., Science, 2009 Applications of Hi-C Genome Assembly: Hi-C data helps in scaffolding contigs during de novo genome assembly by providing long-range contact information. Identification of Topologically Associating Domains (TADs): TADs are contiguous regions of the genome that preferentially interact with themselves. These domains are fundamental units of chromosome folding and gene regulation. Enhancer-Promoter Interactions: Hi-C can identify interactions between enhancers and promoters that are critical for understanding gene regulation. Cancer Genomics: Helps in understanding chromosomal rearrangements and aberrations that lead to oncogene activation. Epigenetics and Gene Regulation: Provides insights into how the 3D structure of the genome impacts gene expression and epigenetic regulation. Illumina Complete Long Reads A method from Illumina that allows pseudo-long reads to be created from overlapping short-reads Released in 2023 so it is uncertain whether this technology will have any significant up take The overall goal of this library prep is to identify series of short reads which come from the same DNA template fragment 5 – 7 kb in length These long fragments have superior base call accuracy than true long read sequencing There is criticism that this technology was too long to market and is inferior to long read sequencing technologies whose read accuracy has improved significantly in the last number of years Illumina Complete Long Reads It first randomly inserts primer binding sites into the genome using a transposable element based system they call ‘tagmentation’ These sites will be where PCR primers bind to amplify the genome. The distance between the primer bind sites is the size of the long read fragments. Keep in mind that there are millions of identical copies of the genome being ‘tagged’ so you will end up with PCR fragments that will overlap in an assembly In addition to the primer sites, the genome is enzymatically modified with ‘landmarks’. Illumina hasn’t been clear on the details of this but the purpose of land marks are to allow us to identify Illumina short reads which come from the same fragment Following PCR amplification, the long read fragments with landmarks are prepared as standard Illumina sequencing libraries In addition to the landmarked sequences, a library preparation on is performed on an unmodified genomic DNA sample. These reads are used to removed the landmarks which were introduced into the long reads Illumina Complete Long Reads https://www.illumina.com/science/technology/next-generation-sequencing/long-read-sequencing.html MGI DNA Nanoball Sequencing A newer sequencing platform based on technology from Complete Genomics It has an innovative library preparation that uses rolling circle amplification to create a long oligo with tandem repeats of the fragment to be sequenced (300-500 copies) This single molecule, called a ‘DNA Nanoball’ is loaded onto a flowcell through interactions between the negative phosphate backbone of the DNA and positively charged spots on the flowcell The method of determining the DNA bases is called combinatorian probe anchor synthesis (cPAS) MGI DNA Nanoball Sequencing The details of cPAS haven’t been disclosed but based on a 2018 patent it is similar to sequencing-by-synthesis With cPAS the nucleotide bases are modified with a molecule (sugar?). Each nucleotide has it’s own moiety attached Fluorescently labelled monoclonal antibodies against the moieties are used to detect the base incorporated. Like SBS, the moiety is cleaved from the DNA leaving a 3’OH and sequencing can continue Instruments create 150 GB – 6 TB of sequence per run MGI – DNA Nanoball Sequencing https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-5569-5 Illumina vs MGI MGI and Illumina technologies were found perform very closely Either is appropriate for whole genome variant analysis PacBio Sequencing by Binding New method of short read sequencing released in late 2022 from a company leading the long- read sequencing market. It is called Sequencing by Binding (SBB) They claim that the accuracy of their technology is 15X greater than Illumina (error rate of 1 in 1000 to 1 in over 10,000) The main innovation seems to be in the way they detect and incorporate bases in the sequencing process. They still have a cluster generation step like other technologies to amplify the base detection step. The break their sequencing process into 4 steps; initiate, interrogate, activate, incorporate Like Illumina, it is adds once DNA base to the sequenced read in each cycle Unlike Illumina, the incorporated bases are blocked at the 3’ end to prevent chain extension Fluorescent labelled nucleotides flood the flow cell and are allowed to base pair with the template strand but not to form a phosphodiester bond. The labeled nucleotides are then washed away, the 3’ hydroxyl group is unblocked and new unlabelled nucleotides are flooded on the flow cell and allowed to be incorporated The incorporated base is block at the 3’ end which allows only one base to be added at a time Incorporating unlabelled bases reduced something they call ‘scarring’, which is the short linker that is left behind during Illumina sequencing. This leads to increased accuracy PacBio - SBB https://www.pacb.com/blog/sbb-sequencing/ Why do companies create new short read sequencing technologies? It is clear that long read sequencing is the future for whole genome assembly Why are companies continuing to invest in new short read sequencing approaches? Short reads are good for re-sequencing, that is comparing short dna sequences against a high-quality reference genome. This is how we find SNPs and short Indels Short reads are good for counting assays like RNA expression analysis or structural analysis like Hi-C DNA extraction for long-read sequencing can be quite difficult (expensive) while short read technologies have a easier time with various types of samples As long as long-read sequencing is more expensive than short read sequencing will remain in demand PacBio – RT-Sequencing PacBio (Pacific Biosciences) sequencing uses Single Molecule Real-Time (SMRT) technology to read long DNA fragments with high accuracy. Key Features: Single-molecule sequencing: Reads individual DNA molecules without amplification. Long reads: Capable of reading tens of thousands of bases, up to 100 kb. High accuracy with HiFi reads: Combines the length of long reads with accuracy comparable to short-read sequencing. Applications: Useful for genome assembly variant detection Epigenomics transcriptomics PacBio – RT Sequencing PacBio – RT Sequencing http://decodingdna.yolasite.com/single-molecule-real-time-sequencing.php PacBio – RT Sequencing http://decodingdna.yolasite.com/single-molecule-real-time-sequencing.php PacBio - HiFi HiFi (High-Fidelity) sequencing combines the benefits of long reads and high accuracy (>99.9%). Circular Consensus Sequencing (CCS): HiFi reads are generated by sequencing the same DNA molecule multiple times to build consensus sequence. Key Features: Long read lengths (10-25 kb) with high accuracy (Q30 or above) Reduced error rates compared to older long-read technologies Applications: de novo genome assembly detecting structural variants haplotype phasing resolving complex genomic regions PacBio – HiFi Sequencing https://www.pacb.com/technology/hifi-sequencing/ Pacific Biosciences – HiFi https://www.pacb.com/technology/hifi-sequencing/how-it-works/ PacBio Error Rate https://www.pacb.com/blog/understanding-accuracy-in-dna- sequencing/ PacBio Unique Characteristics Poisson distribution of read lengths Random error model PacBio – Pros and Cons Advantages: High-accuracy long reads (HiFi) Ability to capture large structural variants and complex regions Comprehensive epigenetic information Limitations: Higher cost per sample compared to some short-read platforms Longer library preparation time and more extensive computational requirements Requires very high-quality DNA to achieve acceptable read lengths Nanopore Sequence Oxford Nanopore Technologies (ONT) devices reads DNA and RNA sequences by measuring changes in electrical current as nucleic acids pass through a biological nanopore. Unique Features: Real-time sequencing: Data is generated in real time, allowing immediate analysis. Long reads: Capable of sequencing very long fragments (up to hundreds of kilobases). Portable devices: Platforms range from handheld (e.g., MinION) to high- throughput (e.g., PromethION). Applications: whole-genome sequencing targeted sequencing Epigenetics Metagenomics transcriptomics Oxford Nanopore Sequencing https://www.nextgenerationsequencing.info/ngs-products/ngs-technologies/ oxford-nanopore-technologies ONT Sequencing https://nanoporetech.com/uploads/Technology_New/Nanopore_Sensing/filemanager-1.jpg?mw=720 ONT – Raw Data ONT Sequencing https://nanoporetech.com/blog/news-blog-kilobases-whales-short-history-ultra-long-reads-and-high-throughput-genome ONT Applications Whole Genome Sequencing De novo assembly of genomes from high quality ONT reads Genome scaffolding using long ONT reads to position contigs Real-Time Pathogen Detection: Used in field diagnostics (e.g., Ebola, COVID-19). Provides rapid sequencing of pathogens to guide public health interventions. Metagenomics: Sequencing of complex microbial communities without the need for culturing. Used in environmental microbiology, human microbiome studies, etc. Structural Variant Detection: Long reads enable the detection of large structural variants, repeat expansions, and phasing. Epigenetics: Direct detection of base modifications such as methylation without additional library preparation. Full-Length RNA Sequencing: Enables sequencing of entire RNA molecules for isoform discovery and gene expression studies. ONT Pros and Cons Advantages: Portable and scalable devices. Real-time data generation. Ability to produce ultra-long reads (>100 kb). Direct detection of nucleotide modifications such as methylation Limitations: Lower raw read accuracy compared to some short-read technologies (but improving with new chemistries and software). High error rate in homopolymeric regions. Higher cost per base for some applications compared to short-read platforms. Case Study - Cherry Tree Assembly Purpose – Solidify IP protection of a commercialized cherry variety. A genome sequence was requested to unambiguously assign chromosome locations to existing gene-based genetic markers Parameters Limited funds Limited DNA Short time-frame Assembly Considerations The goal was to assign chromosome locations to gene-based markers which are unique sequences in the genome Complete genome characterization of repetitive regions not necessary High quality contig assembly needed for unique chromosome regions Had to achieve long scaffolds for anchoring on an existing genetic map of ~2200 markers Technology Choices Assembly Stage Technology Rational Contig Construction 10X Genomics 10X Genomics gives high quality contig sequence in non-repetitive regions PacBio cost was too high Nanopore cost was too high Scaffold Generation 10X Genomics, 10X Genomics achieves excellent Proximo Hi-C initial scaffold lengths Hi-C low cost with in-house library construction Optical mapping was not readily available Pseudochromosome Genetic Map Use an existing genetic map to Building place scaffolds into linkage groups Cherry 10X Genomics Input Region From Average Region To [bp] Conc. [ng/µl] Molarity % of Total Color [bp] Size [bp] Comment [nmol/l] 20000 >60000 - 13.5 0.322 82.83 48500 >60000 - 12.1 0.241 74.17 Kmer Frequency Analysis Genome Length: 338 Mbp Unique Content: 53.8% Heterozygosity: 0.37% 10X Genomics Assembly Summary Number of Reads 150,000,000 Estimated Coverage 57X Median Insert Size 393 bases Repetitive Fraction 28% Assembled GC Content 38% Mean Molecule Size 66,400 bases Mean Distance Between Heterozygous SNPs 624 bases Contig N50 46,660 bases Scaffold N50 > 10,000 bp 2,760,000 bases Hi-C Scaffolding 10X Assembly Evaluation of the Hi-C Scaffolds Evaluation of the Hi-C Scaffolds Evaluation of the Hi-C Scaffolds Hi-C Scaffolding Results 10X Assembly 10X Assembly Before Hi-C after Hi-C Number of Contigs 16,584 7576 N50 1,894,700 23,810,154 Count >= N50 31 5 Longest Scaffold Length 10,855,963 44,294,128 Total Scaffold Length 292,931,652 288,746,240 Cherry Genetic Map Pseudomolecule Construction Pseudochromosome Linkage Map Scaffolding (bases) chr1 44,294,128 chr2 34,408,754 chr3 26,824,391 chr4 22,649,137 chr5 18,400,590 chr6 27,262,132 chr7 23,810,154 chr8 22,114,774 Total 219,764,060 Assembly Evaluation Number of BUSCO Percent of BUSCO Genes Genes Complete Single- 1337 92.8 Copy Complete 34 2.4 Duplicated Fragmented 32 2.2 Missing 37 2.6 Total 1440 100.0 Assembly Evaluation Choosing a Technology Application dependent Considerations; Total cost Cost per sample Number of reads generated per $ Number of bases generated per $ Length of sequence Quality of the sequence Availability BIOTECH 4BI3 - Bioinformatics Lecture 3 – DNA Sequence Quality Control Where are we going? DNA Sequencing DNA Read DNA Assembly Sequencing Quality Control Mapping Genome Expression Annotation Analysis Marker-Trait Population Polymorphism Genotyping Associations Analysis Discover Learning Outcomes Be able to describe why quality control is an important step in bioinformatics analysis Understand what are the steps in sequence QC Review the types of errors most commonly found in DNA sequencing experiments and their sources Review common metrics of sequence quality Identify sources of sequencing read duplications in experiments and how we can control for them Understand k-mer frequency distributions, their uses in DNA sequence QC, and what information can be learned from them Why is Quality Control Important Ensures Accuracy of Results: Without QC, errors in sequencing data can lead to incorrect conclusions in downstream analyses. Minimizes Cost and Time: Detecting and addressing sequencing errors early in the process can save significant time and resources that would otherwise be spent on re-sequencing or correcting data. Prevents Misinterpretations: Poor quality data can lead to misinterpretation in research studies or clinical applications. For example, false-positive or false- negative variant calls can mislead disease association studies. Improves Reproducibility: Standardized QC measures improve the reproducibility of experiments, which is critical in both academic research and clinical settings. What Happens Without QC Incorrect Genome Assemblies: Poor quality data can lead to fragmented or misassembled genomes, affecting evolutionary studies or functional genomics. False Variant Calls: Errors in sequencing data can cause false SNP or indel calls, impacting studies of genetic variation, disease association, or population genetics. Biased Expression Data: In RNA-Seq experiments, low-quality reads can introduce biases in gene expression quantification, leading to incorrect biological interpretations. Contaminated Data Interpretation: Without QC, contamination from different species or unintended samples can lead to spurious findings, especially in metagenomics or microbiome studies. Key Types of Quality Control Read Quality Assessment: Use tools like FastQC to evaluate the quality of raw reads. Adapter and Contamination Filtering: Remove adapter sequences and check for any contamination that might skew results. Trimming Low-Quality Bases: Remove low-quality bases from the ends of reads to improve overall read quality. Error Correction: Implement error correction methods to identify and correct sequencing errors. K-mer Analysis: Analyze k-mer frequency distributions to detect errors, contamination, or other anomalies. Duplication Level Analysis: Assess the level of duplication in the sequencing reads to identify potential over-sequencing or PCR artifacts. Read Quality Assessment Objective: Evaluate the quality of raw sequencing reads to identify potential issues that could affect downstream analyses. Tools Used: FastQC, MultiQC. Common Metrics Assessed: Per Base Sequence Quality: Identifies low-quality bases that may indicate errors, particularly at the ends of reads. Per Sequence Quality Scores: Evaluates the overall quality of each read; a bimodal distribution may suggest a subset of low-quality reads. Per Base N Content: High levels of 'N' bases can indicate low-quality reads or sequencing issues. Outcome: Identify regions or reads that need trimming or correction to improve the dataset's overall quality. Read Quality Assessment - FastQC A piece of analysis software that is widely used for quality control of raw sequencing data It analyzes the sequencing reads from an experiment (or a subset of the reads) and creates a number of diagrams and tables summarizing different view of the information These diagrams are not difficult to create but FastQC offers a simple method of generating them and provides an HTML output Commercial sequencing providers will often provide reports with much of the same information as that provided by FastQC Per Base Quality Scores Shows the quality score distribution at all positions along the reads The yellow box captures the inter- quartile range The whiskers represent the top and bottom 10% of values https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Per Sequence Average Quality This diagram illustrates the overall quality of the sequencing reads We would prefer a tight distribution on the right side of the graph The bi-modal distribution in this image reveals a subset of reads with overall lower quality. We may want to filter those out https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Per Nucleotide Sequence Content In a random library we would expect the average nucleotide abundance at each read position to be consistent We want to see roughly parallel lines Some library preps (random hexamer and tagmentation) can result in bias, typically at the 5’ end https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Per Sequence GC Content We expect a normal distribution shape centered on the average GC content for the sequenced organism Deviations from a ‘normal’ shape may indicate a contamination in the library preparation (mix of 2 distributions) A shifted distribution indicates a systemic bias that is not associated with base position https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Sequence Length Distribution Only useful for technologies creating variable length sequencing reads Significant differences from what was expect may indicate a bias with the the sequencing run https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Duplicated Sequences In the data from a sequenced genome we expect most reads to have a low level of duplication A moderate level of duplication across the graph can indicate over- sequencing It is not uncommon to see some spikes on the right which indicate the presence of high copy number elements in the genome https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Adapter and Contaminant Filtering Objective: Remove adapter sequences and contaminants that can interfere with downstream processing. Why It’s Important: Adapter sequences from library preparation can lead to false-positive variant calls or incorrect read mapping. Contaminants such as human DNA or bacterial sequences can skew results, particularly in metagenomics or microbiome studies. Tools Used: Trimmomatic, Cutadapt, BBMap, Fastp, fastq_screen Outcome: Cleaner datasets with fewer artifacts, leading to more accurate alignments, assemblies, and variant calls. Contamination Filtering – fastq_screen https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/ Contamination Filtering – fastq_screen https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/ Trimming Low-Quality Bases Objective: Remove low-quality bases from the ends of reads to enhance the overall quality of the sequence data. Why It’s Important: Sequencing technologies like Illumina often show decreased quality at the 3' end of reads. Trimming these regions can reduce error rates. Preserves high-quality central regions of the reads, which are more reliable for downstream analyses. Tools Used: BBDuk, Trimmomatic. Outcome: Increased read quality scores and reduced downstream error propagation, leading to more reliable results. Error Correction Objective: Identify and correct sequencing errors to improve data accuracy. Types of Errors: Substitution Errors (incorrect base calls) Indel Errors (insertions or deletions of bases) Methods for Error Correction: K-mer Based Correction: Identify low-frequency k-mers likely caused by errors and correct them using high-frequency k-mers. Consensus-Based Correction: Utilize overlapping paired-end reads to generate a consensus sequence that corrects errors. Outcome: Higher-quality data that is more accurate for applications such as genome assembly or variant calling. Why do our DNA sequence reads have errors? Each DNA sequencing technology we’ve covered has its own type of errors Errors can be attributed to either the library preparation step or the sequencing process itself Most sequencing technologies have a bias meaning that the errors created are not random Errors that aren’t random mean that we can’t always detect or fix errors in our sequencing data Sources of Errors – Library Prep PCR-Based Errors: Transitions: More common than transversions; these are substitutions between purines (A G) or between pyrimidines (C T). Transversions: Less frequent; substitutions between purine and pyrimidine (A C, A T, G C, G T). PCR Bias: Impact on Amplification: Some sequences are preferentially amplified, leading to biased results. High GC content sequences are harder to amplify than AT-rich sequences. Self-Annealing Fragments: DNA fragments with high self-annealing properties may lead to uneven amplification, causing deviations that are not strictly errors but still affect data quality. Sources of Errors – Instrument Error Amplification-Based Technologies: PCR-Based Errors: Instruments that amplify DNA using polymerases can introduce errors similar to those seen in PCR library preparation. Signal Amplification Errors: Bridge Amplification (Illumina): Errors occur when nucleotide incorporation cycles fall out of phase, resulting in mixed fluorescent signals and increased errors at the 3' end. Rolling-Circle Amplification: Similar out-of-phase issues can occur, leading to confounded base- calling. Platform-Specific Error Types: Oxford Nanopore: Prone to errors with homopolymers (repeated sequences like AAAAA or TTTTT). PacBio SMRT Cells: Known for having random errors throughout the reads, making error rates less predictable. Handling Sequencing Errors Goal: Improve the accuracy and reliability of DNA sequencing data by addressing errors. Common Approaches: Masking Low-Quality Bases Trimming Low-Quality Ends Error Correction Techniques Masking Low Quality Bases Concept: Use the IUPAC standard to mask low-confidence bases with ‘N’ (any nucleotide). Why Mask?: Improves Downstream Analysis: Ignoring low-quality bases helps avoid misleading alignments or variant calls. How It Works: Replace bases with low-quality scores (e.g., below Q20) with ‘N’. This allows downstream software to ignore these uncertain bases. Trade-off: May lose some correct bases, but the overall data quality improves. Trimming Low-Quality Ends Concept: Remove stretches of low-quality bases from the ends of reads. Why Trim?: Sequencing Bias: Sequencing-by-synthesis technologies (e.g., Illumina) accumulate errors toward the 3' end of reads. How It Works: Identify and remove low-quality tails that fall below a set quality threshold. Trade-off: Potential loss of some high-quality bases, but the benefits to overall data integrity outweigh this loss. Error Correction Techniques Concept: Replace erroneous bases with the correct base call or adjust quality scores. Methods: K-mer-Based Correction: Use high-frequency k-mers to correct low-frequency erroneous k-mers. Consensus-Based Correction: Use paired-end read overlap to generate a consensus sequence for higher accuracy. Why Correct?: Reduces the error rate, making the data more reliable for assembly, variant calling, and other analyses. Tools: BFC, BayesHammer, SPAdes error correction. Error Correction with kmers Error High count kmers Low count kmers Any kmers which include the erroneous base have lower counts than kmers which don’t include the error Paired-End Consensus Correction Objectives: Used to correct errors at the 3’ of sequences in short-read data Description: Use overlapping low-quality bases from forward and reverse read pairs to increase the confidence of shared calls. In cases where they disagree the higher quality score can be used. Approach first used with Sanger. Usage: Common in Illumina data 3’ ends of Illumina data tend to be error prone Only works when the forward and reverse reads overlap Outcome: Dramatically improves the quality in a sequence and therefore improves follow-on analysis like genome assembly PE Consensus Correction Correcting PCR Library Errors Purpose: Error correction in PCR-amplified libraries aims to remove or mitigate biases and errors introduced during the amplification process, ensuring more accurate downstream analyses. When to Use: Non-Quantitative Applications Only: Correction is suitable for applications where accurate quantification is not required, as correction methods may skew quantification results. Requires Paired-End Reads: This approach relies on overlapping paired-end reads to generate a consensus sequence. Optimal for Larger Inserts: Larger insert sizes increase the likelihood of overlap between paired-end reads, improving the effectiveness of error correction. Assumption: Assumes that each template fragment in the library is unique. Reads mapping to the same position are considered PCR duplicates. Correcting PCR Library Errors Step-by-Step Approach 1. Identify Duplicate Reads: Identify read pairs where both forward and reverse reads map to the same genomic position. These are likely to have arisen from PCR amplification rather than independent sampling. 2. Utilize Unique Molecular Identifiers (UMIs) If Available: UMIs are short, random sequences added to each DNA fragment before PCR amplification. They help distinguish true duplicates (identical UMIs) from independent reads. 3. Generate a Consensus Sequence: For each set of duplicate reads, generate a single consensus sequence. This involves aligning the overlapping regions of the paired-end reads and using the higher-quality base calls where they differ. Quality Improvement: Consensus sequences reduce the impact of random sequencing errors and provide a more accurate representation of the original DNA fragments. Outcome: Reduces errors and biases introduced by PCR, leading to more accurate downstream analyses, such as variant calling and genome assembly. Correct PCR library errors AAAAAAAAAAAAAAAAAAAAA AAAAAAATAAAAAAAAAAAAA AAAAAAAAAAAGAAAAAAAAA AAAAAAAAAAAAAAAAAAAAA consensus read Read Duplication Analysis Objective: Assess the level of duplicate reads to identify over-sequencing, PCR artifacts, or optical duplicates. Types of Duplicates: PCR Duplicates: Caused by amplification of the same fragment during library preparation. Optical Duplicates: Caused by physical proximity of identical sequences on the sequencing flow cell. Why It’s Important: High duplication levels can skew quantification results (e.g., RNA-Seq) and affect variant calling accuracy. Identifying duplicates can help optimize sequencing depth and data usage. Tools Used: Picard, SAMtools. Outcome: Cleaner data with minimized duplication bias, resulting in more accurate quantification and variant analysis. Addressing PCR Duplicates These types of duplicates are controlled by the use of ‘Unique Molecular Identifiers (UMIs) 3’ to the sequencing primer binding sites (P5 and P7) a short random sequence is introduced These short sequences are usually 6-8 bases long This allows the identification of sequencing reads which originated from the same template molecule during PCA This is very important for sequence counting assays like RNA-Seq and genotyping to control for PCR duplicates Addressing Optical Duplicates This problem was not well understood by Illumina when first discovered When identical sequences are found coming from similar coordinates on the flowcell it indicates an optical duplicate. This is because it is unlikely that a small fragment of DNA will be independently sampled from a genome twice and land close to each other on a flowcell which can hold billions of fragments. The solution is the use of UMI when available. Optical duplicates originate from the same initial fragment that bound to the flowcell. If UMIs are not available you can identify optical duplicates as fragments with the same sequence being generated from physically close positions on the flowcell. This requires setting a maximum distance between identical sequences. K-mers K-mer Definition: A k-mer is a subsequence of length k extracted from a longer DNA sequence. For example, for k = 4, the sequence "ATCGG" contains the k-mers "ATCG" and "TCGG." Importance in Bioinformatics: K-mer analysis is a foundational technique used for various applications, including genome assembly, error correction, contamination detection, and genome size estimation. What K-mers Can Tell Us: By analyzing the frequency of k-mers, we can learn about the underlying sequence quality, complexity, and characteristics, such as repetitive regions, sequencing errors, and coverage. Selecting K-mer Size A small k (e.g., k=5) may result in too many repetitive or non-unique k-mers, especially in complex genomes, leading to ambiguity in analysis. A large value of ‘k’ increases the uniqueness of k-mers in the genome, but too large a value can reduce sensitivity and make the analysis computationally expensive. Considerations: Larger genomes with higher complexity require larger k-mers to differentiate between unique and repetitive regions. Higher sequencing depth allows the use of larger k-mers because of better coverage, reducing the chances of missing low-abundance k-mers. For most genome assembly projects, a k-mer size of 21 to 31 is a good starting point. For error correction or genome size estimation, a k-mer size around 19 to 25 is often optimal. Optimal k 100% % of Paired K-mers with Uniquely 90% 80% Assignable Location 70% 60% E.COLI 50% HUMAN 40% 30% 20% 10% 0% 8 10 12 14 16 18 20 Length of K-mer Reads (bp) Dr. Jianhua Ruan and Jay Shendure Optimal Kmer Size For Analysis Kmer Frequency Process of identifying all of the fragments of a given length, k, and counting how many times that kmer is found in your population Want a k-mer size where most of the fragments will be unique in the genome Presented as a histogram of data values Useful for many applications including; Prediction of genome size Identifying contamination in sequencing libraries Error correction of reads Quality control across sequencing runs Understanding a K-mer Frequency x-axis: k-mer abundance (number of times a k-mer appears in the dataset) y-axis: the number of k-mers with that abundance. unique k-mers: A peak at low abundance values represents sequencing errors (low-frequency unique k-mers). true k-mers: The main peak represents correctly sequenced k-mers (high-frequency unique k-mers). This represents the unique genome content repetitive k-mers: Higher abundance k-mers indicate repetitive regions in the genome. Kmer Frequency Analysis Kmer Peak Unique Content Errors Repeated Sequence Contamination Genome Size Estimation Can use kmer analysis to determine the total length of the genome you sequenced Identify the peak representing unique kmers from your sample on the plot Determine where the curve hits its maximum Determine the total number of kmers in the distribution Genome size = (total kmer number)/(peak depth) Genome Size GS = number of kmers/peak GS = 43,099,162,547/25 GS = 1,724 MB Transcriptome Kmer Frequency http://www.homolog.us/blogs/blog/2011/10/26/k-mer-distribution-of-a-transcriptome/ Popular K-mer Tools Popular Tools Jellyfish: Fast k-mer counting and analysis; suitable for large datasets. KMC (K-mer Counter): Highly efficient tool for counting and manipulating k-mers. GenomeScope: Utilizes k-mer spectra models to estimate genome size, heterozygosity, and repeat content. BFC: Error correction tool that uses k-mer frequencies for correcting sequencing errors in Illumina reads. Considerations When Choosing a Tool Speed and Memory Usage: Important for large genomes or deep sequencing projects. Output Format and Visualization: Look for tools that provide clear and interpretable plots and statistics. Ease of Integration: Tools that can be integrated into existing workflows or pipelines (e.g., Snakemake, Nextflow). Genome Composition Estimation Mixture model KFA with Genomescope (http://qb.cshl.edu/genomescope/) Summary Read Quality Assessment - Detects low-quality reads. Adapter and Contamination Filtering - Removes artifacts and contaminants. Trimming Low-Quality Bases - Improves read reliability. Error Correction - Corrects sequencing errors. K-mer Analysis - Provides insights into errors, contamination, and genome characteristics. Duplication Level Analysis - Reduces duplicate reads and biases. Outcome: High-quality, reliable datasets that enhance the accuracy of downstream bioinformatics analyses. BIOTECH 4BI3 - Bioinformatics Lecture 4 – Scoring Alignments and Similarity Searches Where are we going? DNA Sequencing DNA Read DNA Assembly Sequencing Quality Control Mapping Genome Expression Annotation Analysis Marker-Trait Population Polymorphism Genotyping Associations Analysis Discover Describe what a dotplot is and what information can be learned from it Learn how similarity scores are calculated between DNA sequences Understand how similarity scoring matrices are created and how they are used in comparing the Learning relatedness of proteins Detail the steps of the BLAST algorithm Outcomes Interpret BLAST output, including E-values, P- values, and alignment scores, to assess biological relevance Describe the Smith-Waterman alignment algorithm and be able to implement it Understand the difference between global and local alignments and when to apply each. Dotplots Definition: A dotplot is a graphical tool used to visualize the relationship between two sequences by plotting matching residues or nucleotides in a matrix. Axes Representation: One sequence is placed along the x-axis, and the other along the y-axis. Matches between the sequences are represented by dots where they agree. Visualization: Diagonal lines indicate regions of similarity or identity between sequences. Useful for spotting duplications, inversions, and repeats. Dotplots Advantages: 1. Easy visualization of regions shared between sequences. 2. Identifies structural features: Repeats Duplications Inversions Palindromes Software: MUMmer: A widely-used tool for generating dotplots, particularly for genome comparisons. Other tools: Gepard, Dotter. Limitations of Dotplots 1. Limited Scalability: a. Dotplots work well for small to moderately sized sequences but become cluttered and hard to interpret with large genomic sequences. b. Memory and computational constraints. As sequence sizes increase, the dotplot matrix grows exponentially, which can lead to performance issues. 2. Sensitivity to Noise: a. Noise in repetitive or low-complexity regions. Repetitive sequences or regions with low complexity can produce misleading results, creating false positives that appear as diagonal streaks or blocks. b. High sensitivity to random similarities. Even non-homologous sequences can show matches, which can confuse the analysis. 3. Lack of Quantitative Measure: a. No weighted scoring system. Dotplots provide a visual representation but do not generate numerical scores for similarity or alignment quality, making them less useful for detailed analysis. b. Subjective interpretation. Interpretation of dotplots can be subjective, especially when patterns are not obvious or regions of similarity are unclear. Limitations of Dotplots 4. Gaps and Insertions Handling: a. Difficulty visualizing complex indels: While small gaps or insertions can be observed, dotplots struggle to visualize large or complex insertions and deletions in an informative way. b. Does not show alignments directly: Dotplots only show similarity without explicitly aligning the sequences or considering gap penalties. 5. Limited Use in Divergent Sequences: a. Dotplots are more effective for comparing closely related sequences. As sequence divergence increases, meaningful patterns become harder to detect. Dotplots Interpreting Dotplots Perfect Agreement Duplicated Region Interpreting Dotplots Palindromic Sequences Homologous Sequences Interpreting Dotplots Microsatellties Sequence Inversion Dotplot - Comparing Two Sequences https://mummer4.github.io/tutorial/tutorial.html Dotplot – Self Comparison Compare a sequence against itself Note the reflection across the diagonal Useful to visualize duplicated regions. They look like lines parallel to the diagonal Regions of low complexity appear as blocks https://en.wikipedia.org/wiki/Dot_plot_%28bioinformatics%29 Sequence Similarity Definition: Sequence similarity refers to the degree of likeness or matching between two DNA, RNA, or protein sequences. It can be quantified by various methods, depending on the nature of the comparison (e.g., exact matches, evolutionary distances). Sequence similarity is a fundamental concept in bioinformatics that underpins many types of analysis, from gene annotation to evolutionary studies. Applications: Identifying Homologs: Determining if two sequences are evolutionarily related. Functional Inference: Inferring gene or protein function based on similarity. Phylogenetic Analysis: Reconstructing evolutionary trees based on sequence data. Sequence Similarity Methods to Assess Sequence Similarity: 1. Hamming Distance: Measures the number of mismatches between two sequences of equal length. Useful when no insertions or deletions (indels) are involved. 2. Edit Distance (Levenshtein Distance): Measures the minimum number of operations (insertions, deletions, or substitutions) required to transform one sequence into another. Accounts for indels and substitutions. Hamming Distance ATCGGGATGCCAGAGCCC ATCAGGATGTTAGAGCGC Hamming Distance is 4 Edit Distance ATCGGGATGC-CAGAGCCC AT--GGTTGCCCAGAGCGC Insertions - 2 Deletions – 1 Transversions – 2 Total = 5 Scoring Matrices Definition: Matrices that assign scores to different types of matches and mismatches based on the evolutionary or functional significance of the changes. Commonly used in protein alignments. Examples: PAM (Point Accepted Mutation) matrix: Focuses on mutations over evolutionary time. BLOSUM (Blocks Substitution Matrix): More commonly used for distant sequence relationships. DNA Alignment Scoring Definition: DNA Alignment Scoring refers to assigning numerical values (scores) to the matches, mismatches, and gaps between two or more DNA sequences during alignment. Goal: Maximize the alignment score to represent the best possible sequence comparison, balancing matches and differences (mismatches/gaps). Why Scoring is Important: Provides a quantitative way to compare sequences and assess similarity. Helps identify evolutionary relationships or functional similarities between sequences. DNA Alignment Scoring A G T C Identity Matrix: This simple scoring matrix assumes that the likelihood of mutating from A 1 0 0 0 one nucleotide to any other is equal G 0 1 0 0 Key Feature: T 0 0 1 0 Assumes that all mismatches are equally likely and doesn’t account for biological or C 0 0 0 1 evolutionary significance. Limitations: Simplistic and does not reflect the biological reality that certain mutations are more common than others. DNA Alignment Scoring A G T C Transition-Transversion Matrix: A biologically informed scoring matrix A 4 2 1 1 Transitions: More common in evolution and G 2 4 1 1 therefore given higher scores. T 1 1 4 2 Transversions: Less common in evolution and are given lower scores. C 1 1 2 4 Benefit: More biologically accurate than equal scoring with the identity matrix Gap Penalties Gaps represent insertions or deletions and should be penalized to prevent introducing too many gaps in the alignment. Types of Gap Penalties: 1. Constant Gap Penalty: A fixed penalty for each gap, regardless of length. 2. Affine Gap Penalty: Penalizes the introduction of a gap more than its length, encouraging the use of fewer, longer gaps rather than many shorter ones. Why Use Gap Penalties: Avoids excessive use of gaps that would artificially inflate alignment scores. More biologically realistic, as long gaps (indels) are often rarer than small mutations. Gap Penalties In the context of affine gap penalty, the word affine refers to a linear relationship between the gap length and the total penalty applied for introducing a gap in sequence alignment. An affine gap penalty has two components: 1. Gap opening penalty: A fixed cost or penalty for starting a gap. 2. Gap extension penalty: A smaller, recurring cost for each additional unit (base or amino acid) in the gap after the initial one. Affine gap penalty example: 𝐺𝑎𝑝 𝑝𝑒𝑛𝑎𝑙𝑡𝑦 = 𝜔𝑜𝑝𝑒𝑛 + (𝑛 × 𝜔𝑒𝑥𝑡𝑒𝑛𝑑 ) Where 𝜔𝑜𝑝𝑒𝑛 is the penalty for opening a gap and 𝜔𝑒𝑥𝑡𝑒𝑛𝑑 is the penalty for extending it. Interpreting an Alignment Score Key Concepts: Higher Scores mean a better alignment, indicating closer evolutionary or functional relationships between the sequences Lower Scores means a mismatches or gaps, suggesting more divergence between the sequences Factors Affecting Alignment Scores: 1. Choice of scoring matrix 2. Gap penalties applied Biological context: Depending on the organisms or sequences being compared, different matrices and gap penalties may be more appropriate. PAMN Matrices Point Accepted Mutations The first positional scoring matrix used to determine protein relatedness A quantitative way to measure the evolutionary distance between two protein sequences 1572 mutations observed within 71 closely related proteins (85%) to generate PAM1. All other PAMs are powers of PAM1 Idea is that some protein mutations are more likely based on what is observed in nature N represents the number of mutations per 100 amino acids Point Accepted Mutation (PAM) Matrix Definition: PAM matrices are substitution matrices that score alignments based on the likelihood of one amino acid replacing another during evolution. They measure evolutionary distances between protein sequences. Key Concept: A PAM matrix is built on the idea that some mutations are more likely to occur than others due to evolutionary pressure. Origin: The first PAM matrix (PAM1) was generated by Margaret Dayhoff by studying 1,572 mutations in 71 closely related proteins with at least 85% sequence identity. Other PAM matrices (PAMn) are derived by multiplying the PAM1 matrix by itself n times. Nomenclature: PAM1: Represents a 1% change per 100 amino acids. PAM250: Represents 250 accepted mutations per 100 amino acids, which indicates a much more distant evolutionary relationship. Creating a PAM Matrix 1. Align Protein Sequences: Perform global alignments on sequences with at least 85% identity. 2. Identify Accepted Mutations: Look for mutations in the alignments that are accepted evolutionarily (i.e., they have been preserved in nature). 3. Calculate Mutation Probabilities: Each entry in the matrix reflects the probability of one amino acid mutating into another over time. 4. Extrapolation to Higher PAMs: Multiply the PAM1 matrix by itself to create matrices for more distant evolutionary events, assuming that mutations follow a Markov process (future state depends only on the present, not past mutations). Interpreting a PAM Matrix Positive Scores: Indicate amino acid substitutions that occur more frequently than expected by chance (conservative mutations). Negative Scores: Indicate substitutions that occur less frequently than expected (non-conservative mutations). Usage: PAM matrices are typically used to align sequences that are more evolutionarily related (closer in time), with PAM250 commonly used for moderately divergent proteins. PAM 250 PAM 250 means we expect 250 amino acid changes per 100 amino acids A positive number means an amino acid change occurs more often than expected by chance A negative number means an amino acid change occurs less often than expected by chance Blocks Substitution (BLOSUM) Matrix Definition: BLOSUM matrices score alignments between proteins by considering how frequently pairs of amino acids are substituted in conserved regions of protein families. Unlike PAM matrices, BLOSUM matrices are based on observed substitutions in blocks of conserved sequences without gaps. Key Concept: BLOSUM matrices are derived from regions of proteins that are highly conserved, meaning they do not change much over evolutionary time, and are typically used for detecting more distant evolutionary relationships. Nomenclature: BLOSUM62: The most widely used matrix, created by clustering sequences that share at least 62% identity. BLOSUMn: Different BLOSUM matrices are built by clustering sequences at various identity thresholds. A lower number (e.g., BLOSUM45) is used for comparing more distantly related sequences, while higher numbers (e.g., BLOSUM80) are used for closely related sequences. How to Create a BLOSUM Matrix 1. Align Conserved Protein Blocks: Use alignments from highly conserved regions of proteins, which are generally free from gaps. 2. Cluster Sequences: Group sequences based on a specific identity threshold. For instance, BLOSUM62 clusters sequences with at least 62% identity. 3. Count Amino Acid Pairs: Within these clusters, count the number of times pairs of amino acids are aligned at a specific position. 4. Calculate Substitution Frequencies: For each position, calculate the frequency of observed amino acid pairs compared to the expected frequency (based on amino acid abundance in proteins). 5. Compute Log-Odds Ratios: For each amino acid pair, calculate a log-odds score, which reflects how much more likely the pair is to be found in conserved sequences than by random chance. Positive scores indicate pairs that occur more often than expected, and negative scores indicate pairs that occur less frequently than expected. Interpreting BLOSUM Matrices Positive Scores: Amino acid substitutions that occur more frequently than expected by chance, indicating "conservative" substitutions that maintain protein structure and function. Negative Scores: Substitutions that occur less frequently than expected, often representing "non-conservative" changes that are less likely to be evolutionarily preserved. Usage: BLOSUM matrices are used to compare protein sequences. Depending on the evolutionary distance of interest, different matrices (BLOSUM45, BLOSUM62, BLOSUM80) can be used. Lower numbers (e.g., BLOSUM45) are used for more distantly related sequences, while higher numbers (e.g., BLOSUM80) are better for closely related sequences. How to Create a BLOSUM Matrix Sequence Position 1 Position 2 Position 3 Seq-A T T A Seq-B A A Q Seq-C A Q Q Seq-D A T A 1. Find AA observed AA frequency Amino Count Acid A 6 Q 3 T 3 Total 12 How to Create a BLOSUM Matrix Sequence Position 1 Position 2 Position 3 Seq-A T T A Seq-B A A Q Seq-C A Q Q Seq-D A T A AA Pairs Count 2. Calculate the count of pairs of AA AA 4 QQ 1 TT 1 AT 5 AQ 5 QT 2 Total 18 How to Create a BLOSUM Matrix Sequence Position 1 Position 2 Position 3 Seq-A T T A Seq-B A A Q Seq-C A Q Q Seq-D A T A 3. Calculate the observed frequency of pairs of AA AA Pairs Count Observed AA 4 4/18 QQ 1 1/18 TT 1 1/18 AT 5 5/18 AQ 5 5/18 QT 2 2/18 How to Create a BLOSUM Matrix Sequence Position 1 Position 2 Position 3 Seq-A T T A Seq-B A A Q Seq-C A Q Q Seq-D A T A 4. Calculate the expected frequency of pairs of AA AA Pairs Expected AA 6/12 * 6/12 QQ 3/12 * 3/12 Amino Count Acid TT 3/12 * 3/12 A 6 AT 6/12 * 3/12 *2 Q 3 AQ 6/12 * 3/12 * 2 T 3 QT 3/12 * 3/12 * 2 Total 12 How to Create a BLOSUM Matrix 5. Calculate the log odds ratio 1/alpha * log2(observed/expected) AA Pairs Expected AA Pairs Observed AA Pairs 2log2(O/E) AA 6/12 * 6/12 AA 4/18 AA 2log2(0.222/0.250) = -0.170 QQ 3/12 * 3/12 QQ 1/18 QQ 2log2(0.056/0.063) = -0.170 TT 3/12 * 3/12 TT 1/18 TT 2log2(0.056/0.063) = -0.170 AT 6/12 * 3/12 *2 AT 5/18 AT 2log2(.278/.250) = 0.152 AQ 6/12 * 3/12 * 2 AQ 5/18 AQ 2log2(.278/.250) = 0.152 QT 3/12 * 3/12 * 2 QT 2/18 QT 2log2(0.111/0.125) = -0.107 http://courses.washington.edu/bioinfo/Pabio536/Data/Blossum%2062.jpg BLOSUM Matrix Read the article “Where did the BLOSUM62 alignment score matrix come from?” by Sean R Eddy Basic Local Alignment Search Tool (BLAST) Definition: BLAST is a widely used algorithm that searches a database of sequences to find regions of similarity to a query sequence. It’s crucial for identifying homologous sequences and assessing biological relevance. Key Concept: BLAST works by comparing a sequence (nucleotide or protein) to a large database to identify similar sequences. It is fast and efficient for both local alignments and large-scale searches. Applications: Identifying homologous genes or proteins. Inferring evolutionary relationships. Discovering functional similarities between sequences. Common Types of BLAST blastn: Compares a nucleotide sequence against a nucleotide database. blastp: Compares a protein sequence against a protein database. blastx: Compares a translated nucleotide sequence against a protein database. tblastn: Compares a protein sequence against a translated nucleotide database. tblastx: Compares translated nucleotide sequences against each other. PSI-BLAST: Iteratively searches for distant protein relationships by updating the search profile after each round. How BLAST Works 1. Query Segmentation (Word Size):The query sequence is divided into shorter, fixed-length words (k-mers).For DNA words are typically 11 nucleotides long. For proteins words are usually 3 amino acids long. These words represent potential alignment "seeds“ 2. Word Matching (Hits): BLAST scans the database to identify sequences that contain exact or near-exact matches to the query words. Matches (hits) between the query and the database sequences are identified at this stage. 3. Word Extension: Once hits are found, BLAST attempts to extend the matches in both directions, forming high-scoring segment pairs (HSPs). Extension occurs by adding adjacent nucleotides or amino acids until the score starts to drop significantly. During this phase, BLAST allows for mismatches or gaps, which help to improve the alignment over longer regions. How BLAST Works 4. Scoring and Filtering: The alignment is scored based on the scoring matrix (e.g., PAM or BLOSUM for protein alignments). BLAST filters out low-scoring HSPs, retaining only the highest-scoring matches for further evaluation. 5. Statistical Evaluation: The remaining alignments are evaluated using statistical methods to calculate the E-value (expected value). The E-value measures the likelihood that the match occurred by chance. A lower E-value indicates a more significant result. 6. Output: BLAST outputs the results in a ranked list, with the highest- scoring alignments at the top. For each hit, BLAST provides the alignment score, E-value, and a detailed alignment showing where matches, mismatches, and gaps occur. How BLAST Works How a BLAST Report is Organized Query sequence: Your sequence The database: The collection of sequences you are comparing the query sequence to Hits: These are the sequences in the database that share similarity with the query sequence Alignment: For each hit, the report shows an alignment of the hit sequence with the query sequence. This includes a score for the alignment and an estimate of the statistical significance of the match. These alignments are presented as high scoring pairs (HSP). The HSPs are segment pairs whose scores cannot be improved by extension or trimming of the alignment Score: The score reflects the quality of the alignment. Higher scores indicate better alignments. p-value: The probability that the observed match could have happened by chance in the database e-value: The E-value estimates the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the lower the E-value, the more significant the score. BLAST Report http://bioinformatics.bc.edu/~marth/BI820/images/cshlBlastReport.png BLAST Report Statistics p-value: The probability that the observed match could have happened by chance in the database. The value depends on the e- value e-value: The number of matches as good as the observed one that would have arisen by chance. The e-value depends on the length the query sequence, the size of the database, and the scoring matrix used. Suggestions to Interpret a BLAST Report P-value Meaning 10-1 Likely insignificant What is DNA alignment Definition: DNA alignment is the process of arranging two or more DNA sequences to identify regions of similarity. The goal is to line up the sequences so that similar or identical bases are in the same columns, maximizing the match score between them. Purpose: DNA alignment helps to: 1. Detect evolutionary relationships: Alignments can reveal how closely related two sequences are, helping to trace ancestry and speciation events. 2. Identify functional regions: Highly conserved (unchanged) regions in DNA are often important functional elements, such as genes or regulatory sequences. 3. Predict mutations and polymorphisms: Alignments highlight differences (mutations, SNPs, indels) between sequences, which are important in evolutionary biology and medical genetics. Types of DNA Alignment Global Alignment: 1. Attempts to align the sequences across their entire length, even if it requires introducing gaps to match the sequences fully 2. Used when comparing sequences of similar lengths that are expected to align across their entirety (e.g., two complete genes or genomes) 3. Algorithm: Needleman-Wunsch Local Alignment: 1. Finds regions of similarity within two sequences and aligns only those segments. This approach is more flexible, as it allows for partial matches 2. Used for sequences that are only partially similar or contain large regions of divergence (e.g., when comparing genes from different species) 3. Algorithm: Smith-Waterman Needleman-Wunsch General method of performing sequence comparison Goal is to maximize the alignment score between two sequences Finds the best global alignment. This means that all bases from both sequences are included in the alignment. Two step process All pairs of residues are compared/scored in a 2-D matrix using a Dynamic Programming approach All alignments are represented by pathways through this matrix Highest scoring path is the best global alignment The highest scoring path begins at the largest value in the last row or column and ends in the top left cell Needleman-Wunsch https://upload.wikimedia.org/wikipedia/commons/3/3f/Needleman-Wunsch_pairwise_sequence_alignment.png Smith-Waterman Alignment Definition: The Smith-Waterman algorithm is a dynamic programming algorithm used for local sequence alignment. It identifies the best matching subsequences between two sequences by maximizing the alignment score and finding the highest scoring local region. Key Features: Local Alignment: Finds regions of similarity within two sequences, rather than aligning them end-to-end (global alignment). Optimal for Substrings: Used when you want to find the best- matching local regions between sequences with large dissimilar segments. Gaps and Mismatches: Allows gaps (insertions or deletions) and mismatches in the alignment, with penalties applied to prevent overuse. More Smith-Waterman w(gap) = −1 w(match) = +2 w(mismatch) = -1 http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm Demonstrate SW NW vs SM Needleman-Wunsch Smith-Waterman Global alignments Local alignments Alignment scores can be negative Alignment scores can only be zero or positive No gap penalty required Works best with a gap penalty First row and column are calculated First row and column are set to zero with the gap penalty https://upload.wikimedia.org/wikipedia/commons/0/03/Alignment-Comparison-En.png What is Multiple Sequence Alignment (MSA) Definition: Multiple sequence alignment (MSA) is the process of aligning three or more biological sequences (DNA, RNA, or proteins) to identify regions of similarity across all sequences. Key Objective: The goal of MSA is to align sequences to maximize similarity, which can reveal evolutionary relationships, conserved regions, and functional elements shared by the sequences. Applications: Phylogenetics: Helps build evolutionary trees by identifying homologous sequences. Functional Analysis: Reveals conserved motifs or domains important for the function of proteins or genes. Protein Structure Prediction: Conserved regions may indicate structural or functional importance. Methods for MSA Progressive Alignment (e.g., ClustalW): Aligns sequences two at a time, starting with the most similar pair, then progressively aligns more sequences. Builds a guide tree to determine the order in which sequences are aligned. Advantage: Fast and easy to implement. Limitation: Sensitive to initial alignment errors, which can propagate as more sequences are aligned. Iterative Alignment (e.g., MUSCLE): Starts with an initial alignment, then refines it by repeatedly realigning subgroups to improve the overall score. Advantage: More accurate than progressive methods. Limitation: Requires more computational resources and time. Methods for MSA Consensus-Based Alignment (e.g., T-Coffee): Uses a combination of different alignment methods and creates a consensus by weighting and combining the results. Advantage: Typically more reliable when aligning divergent sequences. Limitation: Can be slower due to multiple alignment steps Challenges in MSA Gaps and Indels: Introducing gaps for insertions or deletions in one sequence can create difficulty when aligning sequences of varying lengths. Divergence: Highly divergent sequences can be challenging to align due to large evolutionary distances. Computational Complexity: Aligning multiple sequences requires significantly more computational power compared to pairwise alignments. MSA https://ugene.net/multiple-sequence-alignment-overview/ Lecture 5 – DNA Assembly BIOTECH 4BI3 - Bioinformatics Where are we going? DNA Sequencing DNA Read DNA Assembly Sequencing Quality Control Mapping Genome Expression Annotation Analysis Marker-Trait Population Polymorphism Genotyping Associations Analysis Discover Be able to describe what a genome assembly is and why it is done Discuss the different algorithms commonly used for assembling DNA sequencing reads Learning Objectives Understand the what ‘coverage’ is and how it impacts an assembly Discuss why genome assemblies are fragmented and what strategies can be used to help with this Describe what a DNA minimizer is and what role they play in modern DNA assembly When we sequence a genome we don’t have the technical ability to sequence a chromosome from end-to-end Instead the genome’s chromosomes are fragmented into many smaller pieces and then some subset of those are sequenced What is Genome Conceptually we are sequencing a single genome but practically we are fragmenting and Assembly sequencing millions of copies of identical genomes The more sequencing we do the higher the likelihood we will sequence the whole genome We use the Lander-Waterman equation to help us determine how much sequencing should be done to achieve our goal Lander-Waterman Equation An equation that was developed to help estimated the depth of coverage of sequencing required to sequence a genome to a minimum level of confidence C=LN/G C = The genome coverage. This is the average number of times you’d expect to sequence any particular base L = The length of the sequencing reads used N = The number sequence reads generated G = The haploid genome size of organism being sequenced Sequencing any particular base follows a Poisson probability distribution P(Y=y) = (Cy × e-C)/y! y = the number of times the base was sequenced C = The genome coverage Sequencing Coverage Example Probability of sequencing a base at a particular level of coverage; Assume coverage is 10X, what’s the probability of sequencing a base only once? P(Y=y) = (Cy * e-C)/y! P(Y=1) = (101 * e-10)/1! P(Y=1) = 0.00045 Probability of sequencing a base 3 times or less is P(Yc and c->b in such a way that the overlap of a and b is implied by the overlap with c String Graph Transitive reduction reduces the number of redundant edges in the graph making it easier to resolve repeats This leaves only the information required to solve the assembly Moves this to an Eularian path problem which is easier to solve efficiently Idealized String Graph Myers EW. The fragment assembly string graph. Bioinformatics. 2005 Sep 1;21 String Graph Continued Once reduced the nodes with an in-degree and out-degree of one are compressed into compound edges These represent unique stretches and are called ‘unitigs’ Repeat edges can also be compressed to have multiple in/out-degrees Example String Graph Myers EW. The fragment assembly string graph. Bioinformatics. 2005 Sep 1;21 De Bruijn Graph Assembly The nodes represent sequences of a fixed size (k-mer) Edges represent reads that exist containing the adjacent kmers The distance between any adjacent kmers is 1 nucleotide where the kmer-1 suffix of node 1 is the kmer-1 prefix of node2 Traversal through the graph generates the sequence with no alignment necessary Because nodes represent kmers and not reads this type of assembly paradigm makes assembly of NGS reads possible Travel each edge once while visiting each node multiple times (Eularian path) De Bruijn Graphs and K-mer Size Impact Too Small (Low k value): Over-connectivity: A small k-mer size results in more shared k-mers between non-adjacent sequences, leading to many false overlaps. Collapsing of Repeats: Repetitive sequences are more likely to share the same small k-mers, causing repeats to collapse and misrepresenting the genome structure. More Ambiguity: Higher chance of ambiguous paths in the graph, making it harder to resolve unique genome regions. Too Large (High k value): Fewer Overlaps: As k increases, fewer k-mers will overlap between reads due to sequencing errors and random mutations, leading to fragmentation of the graph. Lost Data: Larger k-mers require perfect matches between sequences, so even small sequencing errors or mutations can prevent overlaps, disconnecting parts of the assembly. Insufficient Coverage: High k-mer values may result in insufficient overlap, especially in low- coverage regions, which can lead to fragmented assemblies. De Bruijn Graph Assembly AGAC GACC TAGAC C C ACCC ATAGA A CCCA G GACC CAGA CCAG ACCTA T C A Sequence: ATAGACCCAGACCTA DBG Advantages 1. The number of nodes is a function of the kmer size not the amount of sequence 2. No sequence-to-sequence overlap has to be calculated as the reads are encoded in the DBG. This results in substantial computational savings 3. There is no construction of a final sequence from the aligned reads. The final sequence is generated by traversing the graph Why are assemblies fragments? Weaknesses of the sequencing technology The complexity of the genome Major issue is how repeats are represented and resolved during assembly DBG repeats result in fragmentation as the path through the graph is ambiguous. This can be mitigated through larger kmer size There is a practical limit to the kmer size because of the error rate of the sequencing technology. If set too high few reads with share identical kmers. Hamiltonian vs