Applied Genomics Lecture Notes (BIOT9008)
Francesca Bottacini
Summary
These lecture notes cover the principles of applied genomics, focusing on the analysis of genomic data. The document discusses genomes from simple bacteria to complex humans, including their structures and functions. Key topics include sequence analysis of microbial and human genomes and important applications of genomics.
Full Transcript
Applied genomics – BIOT9008 Week 1 (Lectures 1-2), Dr. Francesca Bottacini
© Francesca Bottacini – Distribution of this material outside the current BIOT9008 module is not permitted

BIOT9008 – Francesca Bottacini ([email protected])
Lectures delivered across 12 weeks (W1-12). Topics include:
- Genomic data analysis for sequencing data
- Analysis of genomes and metagenomes
- Comparative genomics and pan-genome analysis
- Differential gene expression
- SNP analysis
→ Introduction to Genomics – Lesk
→ Metagenomics for Microbiology – Izard

Module organisation
- Lectures: two hours/week (pre-recorded)
- Labs: two hours/week (live-recorded over Zoom)
- Assessment:
  - Week 6 MCQ: 25th October, 7pm (Canvas)
  - Week 12 MCQ: 6th December, 7pm (Canvas)
  - Practical evaluation: recorded PPT presentation, 20th December (Canvas)

Genomics
Genomics is the branch of molecular biology that studies the structure, function and evolution of genomes. This also includes the study of genes: their structure, function and expression. The field comprises several areas of research, including:
- Bacterial and viral genomics: study of the genome architecture and function of bacteria and viruses
- Genomics of complex organisms: study of the genome architecture and function of polyploid organisms (multiple chromosomes)
- Transcriptomics: the study of global RNA expression
- Genotyping: measurement of DNA diversity through polymorphisms and mutations
Several bioinformatic tools are being developed to systematically analyse the enormous amount of biological data generated by genomic technologies.
Genomic applications
Genomic approaches can be applied to study the entire DNA sequence of organisms and reveal the architecture of their chromosomes.
- Comparative genomics: study of the differences and relationships between genome sequence and function across different strains or species. → Similarities and differences in whole genomes, proteins, RNA and regulatory elements
- Functional genomics: using genomic data, this field attempts to determine protein functions and the mechanisms involved in gene expression, transcription, translation and protein-protein interaction
- Epigenomics: using genomic approaches, this field studies epigenetic modifications of DNA and their effect on gene expression, regulation and cellular processes
- Structural genomics: by combining genomic information, known 3D structures and modelling approaches, this field attempts to predict the structure and function of proteins
→ The study of single genes is the primary focus of Molecular Biology and does not fall into the area of genomics!

What is a genome?
A genome is the complete DNA sequence of an organism, containing all the hereditary information needed to build and maintain it. Genomes come in different sizes and structures: they can be circular or linear, and can interact with proteins that enhance their stability. Depending on the organism, the genome can be organised into different chromosomes. Each genome contains specific regions encoding the physical products of genetic information; these regions are traditionally called "genes".
Gene structure differs between organisms.

Bacterial genomes
- Single chromosome; circular molecule
- Size up to ~14 Mbp
- Bidirectional replication (leading and lagging strand)
- Origin of replication at one side of the chromosome; termination on the opposite side
- Higher ORF density on the leading strand
- G+C content: percentage of guanines and cytosines in a DNA molecule
- GC skew: guanine vs cytosine skew in a given DNA molecule, (G-C)/(G+C) → used to highlight the strand-specific guanine over-representation on the leading vs lagging strand
- Can be sequenced using a combination of long/short reads to obtain a complete chromosome

Viral genomes (bacteriophage)
- Microorganisms infecting and replicating in bacteria
- Extremely diverse in size, morphology and genomic organisation
- High variability in genome size (from a minimum of ~13 kb up to 100 kb jumbo phages) and highly diverse genome architecture
- Linear and typically double-stranded DNA molecule
- Mosaic architecture, where the genes for the early/middle/late infection cycle are encoded within specific regions of the genome
- Can be sequenced using a combination of long/short reads; short reads alone can sometimes be sufficient

Viral genomes (SARS-CoV)
- Belongs to the family of coronaviruses (CoVs)
- Enveloped viruses with a positive single-stranded RNA genome
- Approx. 30 kb in size
- Resembles the structure of a typical eukaryotic transcript
- Sequencing of this genome involves an additional step of reverse transcription to cDNA → the nature of this organism's nucleic acid influences the NGS pipeline used for its sequencing
- Can be sequenced using long or short reads, or a combination of the two

Human genome
- The human genome is encoded in 23 chromosome pairs: 22 pairs of autosomes and 1 pair of sex chromosomes
- Total (diploid) length of ~6 Gbp
- Approximately 30,000 genes
- Average GC content of ~41%
- Multiple chromosomes, with a significant number of repeated elements and DNA repeats
To obtain the complete sequence of each chromosome (as for other eukaryotic polyploid organisms), a combination of short- and long-read sequencing is necessary.
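The G+C content and GC skew statistics introduced for bacterial genomes can be computed in a few lines. A minimal sketch (function names are illustrative, and a toy fragment stands in for a real chromosome):

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_skew(seq: str) -> float:
    """GC skew = (G - C) / (G + C); positive values indicate guanine
    over-representation, as expected on the leading strand."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c)

fragment = "GGGATCGGAC"        # toy fragment, not a real chromosome
print(gc_content(fragment))    # 0.7
print(gc_skew(fragment))       # ~0.43 (5 G vs 2 C)
```

In practice the skew is computed in sliding windows along the chromosome and plotted cumulatively; the minimum and maximum of the cumulative skew often mark the origin and terminus of replication.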
Prokaryotic vs eukaryotic transcripts
- Prokaryotic: one transcript per operon (polycistronic mRNA)
- Eukaryotic: one transcript per gene (monocistronic mRNA; no operons)
(Figure: https://pediaa.com/)

NGS platforms available to date

DNA sequencing

RNA sequencing

Genomic data analysis
Before starting any genomic data analysis, a few considerations should be made:
- What am I analysing (DNA or RNA)?
- Which organism do my data come from?
- Which genome architecture should I expect?
- How many genes should I expect?
- What is the GC content of the organism?
- Which gene structure should I expect?
- Am I looking at large differences or small differences?
- Am I looking at a novel organism, or do I have a reference?
Irrespective of the quality of your data, you almost always obtain a result! A significant effort needs to be made to ensure that the correct method is applied to the correct dataset. … and sometimes, even then, things can go wrong!

Steps in genomic data analysis
Data collection: obtain your input data for analysis.
✓ Public data downloaded from public repositories
✓ Sequencing data obtained in the laboratory using NGS platforms
Data quality check and filtering: assess and ensure that your input data are of acceptable quality to proceed with your analysis.
✓ Number and quality of sequencing reads acceptable
✓ Public data derived from a curated database or assessed for quality
Data processing: preparation and formatting of the input data for the analysis pipeline → may involve adjustments or changes in file format.
Exploratory data analysis: run your input data through an analysis pipeline or tool to obtain your result.
Output visualisation: visualisation of the obtained output → the moment of truth!

Data collection: sequencing reads
Input your data into the FastQC tool to perform quality-control checks on raw sequencing data: read length, quality scores across the read sequence, etc. (Figure: example FastQC plots of good vs bad data.)

Data collection: contigs, genes or complete chromosomes
Contigs from assembled genomes:
- Set a threshold on contig number: filter out fragmented genomes whose assemblies show too many contigs, generally no more than 100 or 200 (depends on the organism).
- Set a threshold on genome size: filter out assemblies with an overall size below ~80% of the expected genome size for your organism.
- Check the GC content of your contigs; it should be comparable to the expected GC% of your organism.
Nucleotide or protein sequences:
- If obtained from public datasets, curated databases (e.g. RefSeq or UniProt) should be preferred.

Applied genomics – BIOT9008 Week 2 (Lectures 3-4), Dr. Francesca Bottacini

Input data
Depending on the type of analysis, genomic pipelines typically start from the following file formats:
- FASTQ
- GenBank
- FASTA/MULTIFASTA
  ✓ Protein
  ✓ Nucleotide

Genomic platform: Linux
The majority of genomic tools are designed to run on Linux platforms because they are free and designed to run in an open-source environment. Open source: code available to the whole scientific community to be reused and modified as needed. Linux is an open-source operating system whose kernel (core) is similar to MacOS: both are based on Unix and share very similar core commands. Linux comes in different flavours, called "distributions":
✓ Debian
✓ Fedora
✓ Suse
✓ Ubuntu
Ubuntu is by now one of the main Linux systems containing pre-compiled bioinformatic packages that can be installed in a semi-automatic way. Biolinux: a variant of Ubuntu specifically designed for bioinformatic applications.

Linux OS: layers
Applications → Shell → Kernel → Hardware

Linux file system
A file system is the way in which all components of the operating system are organised, stored, retrieved and updated on the hard disk. Each operating system has a different file system, therefore programs designed to be installed on a Windows PC cannot be installed on MacOS. → In Linux (and MacOS) the root (/) directory is the base of the file system and all other files are its children.
Linux: directory structure
Each Linux distribution comes with a fixed set of directories:
- / → root of the file system (equivalent to the C: disk on Windows)
- /home → where your home directory resides
- /mnt → a separate mount point, usually used for external hard drives and external storage units
Your sysadmin will look after everything else!

Access a Linux server/cluster
Secure remote login: ssh [email protected] (user @ remote server address)

Commands and options
The Linux command line (shell) is the Linux utility where all basic and advanced tasks (jobs) can be done by executing commands. Commands are executed on the Linux terminal; the terminal is a command-line interface for interacting with the system.
→ Commands in Linux are case-sensitive.
Using the terminal it is possible to do basic tasks such as creating, deleting and moving files, and also to perform advanced tasks: administrative tasks (including package installation and user management), networking tasks (ssh connection), security tasks, and many more. Most genomic analyses are not suitable to be run on a laptop or desktop PC, but require remote connection to a high-performance computing server or cluster. Connection to a remote server is performed using the terminal, and most bioinformatic pipelines are launched from it. Even our analyses in Galaxy last year were launched this way!
To invoke a bioinformatic tool we type the tool name, followed by several options (usually preceded by a minus "-" or double minus "--"). Command options provide a high level of customisation, based on:
- Input format
- Output format
- How we want to execute the task
- How we want to use the hardware resources

How to navigate the bash shell
Bash is the main Linux/Unix shell, accessible through the terminal (black window). It is composed of a series of commands and utilities used to:
- Navigate across directories in the filesystem (cd)
- Create/delete directories and files (mkdir, rm)
- Copy, cut and paste (cp, mv)
- Create and expand archives (tar, gzip)
- Edit text files (vi, nano)
But also other very useful options:
- Join and split files (cat)
- Obtain information on files/directories (wc, du)
- Connect to remote servers (ssh, ftp)
- Find and extract information on files (find, locate)
- Rename or replace files (mv)

Text parsing: the grep command
Grep is a Linux and Unix command used to search for matching text in a given file. It searches for and extracts lines containing a match to a given pattern (a single character or a word, including commas, spaces, etc.). It is one of the most useful commands on Linux and Unix-like systems for programmers and bioinformaticians.
Example 1: extract the sequence headers from a multi-FASTA file:
grep ">" input.fasta
Example 2: count the number of sequences in a multi-FASTA file:
grep ">" input.fasta | wc -l

Text parsing: the sed command
Sed is a Linux and Unix command that stands for "stream editor"; it can perform many functions on files, such as searching, find-and-replace, insertion or deletion.
Its typical form is sed 's/FIND/REPLACE/g' (s = substitute; g = apply the replacement globally). This command is mostly used in UNIX for substitution, i.e. find-and-replace. It is a powerful tool because you can edit files without even opening them: you can find and replace a word or character in one or a thousand files without opening them in a text editor.
Example 1: replace the phrase "fasta sequence" with "Fasta sequence" in a multi-FASTA file called input.fasta:
sed 's/fasta sequence/Fasta sequence/g' input.fasta
Example 2: delete the word "sequence" in a multi-FASTA file called input.fasta:
sed 's/ sequence//g' input.fasta

Genomic packages: GitHub
Most bioinformatic and genomic tools are publicly available and accessible in public repositories. GitHub is a repository of available packages (not only bioinformatics), directly available from the developers. From GitHub anyone can retrieve, download and install a tool on a personal machine. Bioinformatic tools are designed and distributed to be used on a Linux system; they can be downloaded and installed through the shell (terminal or command line). GitHub keeps a record of all modifications to the source code of each deposited tool. Nothing is kept secret, and the scientific community can access, reuse and distribute the software freely!

Genomic package manager: Bioconda
Bioconda is an open-source environment and package manager containing thousands of pre-compiled, ready-to-install packages for Linux. The majority of bioinformatic tools are available for easy installation through a conda environment:
✓ Create a new environment for a specific tool (e.g.
BLAST): conda create
✓ Load the environment: conda activate
✓ Install the tool in the appropriate environment: conda install BLAST
✓ Run your tool with the appropriate command
✓ Unload the environment: conda deactivate

Virtual OS platform: Docker
Docker is an open-source platform that enables developers to build, deploy, run, update and manage containerised applications. Applications are executed in an isolated environment called a "container". Docker is widely used in bioinformatics for its ability to provide reproducible, portable and isolated computing environments.
1. Reproducibility: running tools inside containers overcomes installation hurdles and allows software packages of any version (even unsupported ones) to be run.
2. Portability: Docker containers package the entire environment, including code, dependencies and configurations. This allows bioinformatics tools and pipelines to be run across different platforms.
3. Isolation: bioinformatics workflows often involve conflicting software versions. Docker isolates applications in containers, preventing dependency conflicts and ensuring that one tool's environment does not affect another.
4. Efficient resource usage: Docker containers are lightweight compared to traditional virtual machines.
5. Collaboration and sharing: researchers can share Docker images via repositories like Docker Hub, facilitating collaboration.

Genomic data analysis
A typical NGS genomic analysis pipeline starts with raw read files in FASTQ format as input. This file format contains the read sequences and their quality information. Raw reads are then checked for quality, to ensure that only the portions of reads with sufficient quality (>20 Phred score) are retained for data processing.
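The Phred threshold mentioned above can be illustrated by decoding FASTQ quality strings. A minimal sketch, assuming the common Phred+33 encoding (function names are illustrative):

```python
def phred_scores(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into per-base Phred scores
    (Phred+33 encoding assumed, as used by modern Illumina data)."""
    return [ord(ch) - offset for ch in qual]

def passes_filter(qual: str, threshold: float = 20.0) -> bool:
    """Keep a read only if its mean Phred score reaches the threshold
    (Phred 20 corresponds to a 1% base-call error probability)."""
    scores = phred_scores(qual)
    return sum(scores) / len(scores) >= threshold

print(passes_filter("IIII"))  # True  ('I' encodes Phred 40)
print(passes_filter("####"))  # False ('#' encodes Phred 2)
```

Real pipelines apply this per base and per read; the sketch only shows where the number 20 comes from.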
Following quality filtering, raw reads are assembled with an appropriate pipeline based on:
- Library type (paired-end or single-end)
- Read length (long or short reads, or both)
Read assembly produces contigs or scaffolds in FASTA format, which are then used as input for gene prediction and functional annotation.

Applied genomics – BIOT9008 Week 3 (Lectures 5-6), Dr. Francesca Bottacini

Genomic data analysis
Genomic analyses are based on the identification, measurement and comparison of genomic features:
- DNA sequence
- Protein sequence
- Chromosomal structural variation
- Gene expression
- Regulatory and functional element identification and annotation
To apply genomic analysis methods, the whole genome sequence of one or more organisms of interest must be known. Genome sequencing and genomic analysis are typically performed using a combination of high-throughput NGS sequencing and bioinformatics (read assembly, annotation and downstream analyses). Depending on the type of analysis, different input files are used!

Genomic data analysis
The genome sequence of novel and known organisms is commonly obtained using NGS sequencing approaches. Through NGS platforms, a large amount of sequence information is available for analysis.
A typical NGS pipeline is composed of four steps:
1. DNA extraction: preparation of genomic DNA (gDNA) for processing
2. Library preparation: fragmentation of gDNA to obtain a library of fragments to be used as a template for DNA synthesis
3. Sequencing: DNA synthesis and read-out of the synthesised DNA using the library as a template
4. Read generation and analysis: the generated FASTQ files undergo pre-processing and post-processing analysis
→ The complete genome sequence is obtained, along with the predicted structural components (genes) and their predicted functions.

Data download: SRA Toolkit
The Sequence Read Archive (SRA) is the sequence database where fastq reads are deposited and made publicly available. SRA Toolkit is a command-line tool that allows you to connect and download fastq files directly to your local Linux machine or server. It is a two-step process, with two commands to be run sequentially:
- prefetch: connect to the NCBI database using the http protocol and save your download package into a single *.sra file
- fastq-dump --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip --outdir . *.sra: extract paired reads into two gzipped fastq files from the downloaded sra file
At the end of the process you obtain your input for an assembly pipeline: *_R1.fastq.gz and *_R2.fastq.gz

Pre-processing of NGS reads
To perform a FASTQ quality check you need:
- The output of NGS sequencing (base calling and read generation): the FASTQ files containing the sequence data
- Adapter sequences: a file containing a list of adapter sequences which will be explicitly searched against the library
- Contaminant sequences: this option allows you to specify a file containing the list of contaminants used to screen over-represented sequences
Demultiplexing and read pre-processing
Pre-processing operations (on the reads, before starting the analysis):
- Adapter removal: using default adapter sequences, or providing custom ones
- Trimming: removal of read portions below the QC threshold
- Filtering: remove reads with average quality below a threshold, then filter by length
These are followed by read assembly, functional annotation and downstream analysis (post-processing).

Quality control: FastQC
FastQC provides a simple way to perform quality control on raw sequence data obtained from high-throughput sequencing platforms. The main functions of FastQC are:
- Import of data from BAM, SAM or FastQ files (any variant)
- Providing a quick overview to tell you in which areas there may be problems
- Summary graphs and tables to quickly assess your data
- Export of results to an HTML-based permanent report
fastqc -o outdir *.fastq
The fastqc pipeline will perform a fastq quality check on all files with extension .fastq in the working directory and will save the output report inside the directory specified as the output directory.

Post-processing: read assembly
Sequencing reads are stored as fastq files containing both sequence and base-quality information. Before starting to work with the reads, QC, filtering and read trimming are necessary to exclude reads (or portions of reads) with low quality. Downstream analyses can then start on the filtered "good reads". The assembly process consists of "putting back together" all the short reads to obtain a representation of the chromosome from which they originated.
→ Read assembly is a post-processing step, performed on the pre-processed, QC-filtered reads.
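The pre-processing operations described above (adapter removal, 3' quality trimming, length filtering) can be sketched for a single read. This is a minimal illustration, not a real trimmer; the default adapter string is the common Illumina TruSeq prefix, used here purely as an example, and Phred+33 encoding is assumed:

```python
def preprocess_read(seq: str, qual: str,
                    adapter: str = "AGATCGGAAGAGC",   # illustrative default
                    min_q: int = 20, min_len: int = 30):
    """Sketch of read pre-processing: clip at the first adapter match,
    trim low-quality bases from the 3' end (Phred+33), then drop reads
    that end up shorter than min_len. Returns None for discarded reads."""
    idx = seq.find(adapter)
    if idx != -1:                                 # adapter removal
        seq, qual = seq[:idx], qual[:idx]
    while qual and ord(qual[-1]) - 33 < min_q:    # 3' quality trimming
        seq, qual = seq[:-1], qual[:-1]
    if len(seq) < min_len:                        # length filter
        return None
    return seq, qual

print(preprocess_read("ACGTACGTAGATCGGAAGAGC", "I" * 21, min_len=5))
# ('ACGTACGT', 'IIIIIIII')
```

Production tools (e.g. the trimmers normally run before FastQC re-checks) additionally handle paired reads, partial adapter matches at read ends, and sliding-window quality.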
At the end of the assembly process, the pipeline generates nucleotide sequence files containing:
- A chromosome sequence (if the assembler managed to completely resolve the chromosome!)
- Contig sequences
- Scaffold sequences

Genome assembly: SPAdes
SPAdes is a common assembly pipeline used to assemble sequencing reads obtained from different sequencing platforms:
- Simple assembly (e.g. Illumina data only)
- Hybrid assembly (e.g. Illumina + Nanopore data combined)
Illumina only:
spades.py -k 21,33,55 --pe1-1 forward_reads_R1.fastq.gz --pe1-2 reverse_reads_R2.fastq.gz -o myassembly
Hybrid Illumina-Nanopore:
spades.py -k 21,33,55 --pe1-1 forward_reads_R1.fastq.gz --pe1-2 reverse_reads_R2.fastq.gz --nanopore nanopore_reads.fastq -o myassembly
As a result, the pipeline will save all the output files (assembled contigs + additional information) into the folder named "myassembly".

Gene prediction methods
Gene prediction of protein-coding genes relies on two approaches:
- Ab-initio methods: not based on previous knowledge; they rely on "signals" in the DNA sequence for gene identification
- Homology-based methods: based on previous knowledge of homologous predicted genes in a database, used to identify genes in a new sequence
Following genome assembly, the nucleotide sequence is used as input to predict protein-coding sequences and functional RNAs.

Genome annotation: Prokka
The Prokka annotation pipeline is the most widely used Open Reading Frame (ORF) prediction and annotation tool for functional annotation of assembled contigs and genomes.
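The "signals" that ab-initio methods look for can be illustrated with a toy ORF scan. This is only a sketch: it scans the forward strand for ATG…stop stretches, whereas real gene finders (such as the one underlying Prokka) use trained statistical models, all six reading frames and alternative start codons:

```python
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_len: int = 6):
    """Toy ab-initio scan: (start, end) coordinates of stretches that
    begin with ATG and run to the first in-frame stop codon.
    Forward strand only; nested ORFs are reported as-is."""
    seq = seq.upper()
    orfs = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] != START:
            continue
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOPS:
                if j + 3 - i >= min_len:
                    orfs.append((i, j + 3))
                break
    return orfs

print(find_orfs("ATGAAATAG"))  # [(0, 9)]
```

Homology-based methods would instead compare the assembled sequence against a database of known genes rather than scanning for such signals.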
Several parameters can be set based on the type of organism and the database you want to use as a reference for your annotation:
prokka --outdir mydir --locustag LOCUS --proteins referencedb.faa --evalue 0.0001 --gram pos --addgenes contigs.fasta
As a result, the annotation output will be stored inside the folder named mydir. The gbk and gff files will contain CDS and gene features.

Microbiomes and metagenomes
A microbiome is the community of microorganisms found living together and sharing the same habitat. Defined in 1988 by Whipps et al.: "A characteristic microbial community occupying a reasonably well-defined habitat which has distinct physio-chemical properties". Microbiomes have recently become the centre of attention in microbiology research.
→ Advances in NGS sequencing have allowed a massive investigation of all organisms in different contexts (habitats, hosts, body sites, health conditions, etc.).
- Genome: the complete set of genes present in a given organism. Genomics explores the complete genetic information of single organisms.
- Metagenome: the collection of genes and genomes from all members of a microbial community. Metagenomics explores a mixture of DNA from multiple organisms.

Human microbiome
The human microbiome is one of the largest "organs", weighing approximately two to three kilograms in an adult! The make-up of the microbiome changes during our life span.
→ A decrease in the number and diversity of its constituents is associated with disease and ageing. In fact, healthy individuals and centenarians are known to house a wider diversity of microbial partners than unhealthy individuals. Location-specific functions!
Hence the importance of profiling the microbiome composition in different body sites and in different conditions (health, disease, etc.).

Microbiome profiling
Classical microbiology studies were based on cultured microorganisms and communities. These studies established the importance of microbial interactions in their ecosystems. Limitations of classical approaches:
- Several microbes cannot be cultured (growth conditions unknown or not reproducible in the laboratory)
- Loss of microbial diversity when attempting to establish a microbial community under lab conditions
Metagenomic approaches allow the DNA of entire samples to be sequenced without the need for culturing → DNA is isolated directly from the sample and sequenced. This makes it possible to explore two aspects of a microbial community: who is there, and what are they capable of doing?

Microbiome analysis pipeline
Metagenome analysis avails of next-generation sequencing (NGS) and involves several steps. Total DNA is first extracted from a sample. Following fragmentation, the DNA undergoes adapter ligation for library preparation. The metagenomic libraries are sequenced using paired-end reads to maximise de novo assemblies. The reads are assembled into contigs. An additional step of genome binning may be performed to combine contigs from the same organisms (and attempt their genome reconstruction). Gene prediction and annotation are performed on all contigs to identify all encoded genes in the sample. Functional analysis is performed to detect gene functions and reconstruct metabolic activities.

Applied genomics – BIOT9008 Week 4 (Lectures 7-8), Dr. Francesca Bottacini

Microbiome profiling
Two main approaches to microbiome profiling:
- 16S rRNA sequencing: only the 16S rRNA gene is amplified and sequenced from the sample. This approach is also called "amplicon sequencing". → DNA is extracted from all cells in a sample and only a taxonomically informative marker gene (e.g. 16S rRNA) common to a specific group of interest is sequenced. The resultant amplicons are sequenced and bioinformatically assessed to determine which microorganisms are present (taxonomic classification based on 16S rRNA analysis) and their abundance.
- Deep (shotgun) sequencing: instead of targeting a specific genomic locus for amplification, all DNA is sheared into fragments and independently sequenced by a shotgun sequencing approach. The resulting reads are assembled into whole genomes.
Commonly used cut-offs: 97% 16S rRNA sequence identity for species separation; 94% for genus separation.

16S rRNA sequencing – who is there?
Metagenomic profiling based on 16S rRNA sequencing allows the species composition across different samples to be determined:
- Detect microbial diversity
- Comparison between health and disease
- Identification of bacterial species associated with healthy and diseased status
- Fluctuations and changes in microbial diversity over time (e.g. disease progression)
- Co-occurrence and co-exclusion of bacterial taxa and their relation to the health and disease status of their host
Amplicon sequencing does not allow the gene functions in a microbiome to be investigated.
→ Only based on taxonomic profiling of a single marker gene

Shotgun sequencing – what do they do?
Whole shotgun metagenomic sequencing allows sampling of all genes in all organisms present in a sample or community:
Detect microbial diversity
Comparison between health and disease
Identification of bacterial species and their genes associated with health and disease
Identification of metabolic pathways
Analysis of marker genes associated with a disease
Identification of microbe-host relationships
→ Culture-independent approach
→ Microbiome sequencing can be combined with sequencing of the host genome, in order to identify host genetic variations associated with microbial composition: microbiome genome-wide association (mGWAS)

Operational Taxonomic Unit
Operational Taxonomic Units (OTUs) represent unique biological sequences in your sample. Their abundance in the samples is reflected by their read counts.
Microbial community sequencing data is typically organised into large matrices where the columns represent samples and the rows contain the read counts of each OTU. → These tables record the number of reads counted for each biological sequence in each of your samples. They are called OTU tables or abundance tables.
OTU tables are the starting point for downstream OTU analysis:
Taxonomic analysis
Community composition (alpha/beta diversity)
Differential abundance analysis
Functional analysis

OTU challenges
1) Differences in microbial community abundances in biological samples may be affected by library sizes and difficulties in DNA extraction, affecting our ability to detect rarely occurring taxa.
→ Detection of rarely occurring taxa requires deeper sequencing (effect of sequencing depth).
2) Most OTU tables are sparse, with a high proportion of rare OTUs with zero counts (~90%). → This reflects our limit of detection of rare taxa in the sample.
3) The total number of reads obtained for a sample does not reflect the absolute number of microbes present. What is detected in each sample is just a fraction of the original environment. Microbiome data is compositional!
Uneven sampling depth, sparsity, and the fact that we attempt to draw inferences on taxon abundance using a limited number of taxa represent serious challenges for data interpretation!

OTU normalisation
Proper normalisation can remove biases and variation introduced by sampling and sequencing strategies, so that the normalised data better reflects the underlying biology.
1) Normalisation by rarefying: randomly removing data without replacement from each sample such that all samples have the same number of total counts. → Potentially reduces statistical power, depending upon how much data is removed, and does not address the challenge of compositional data.
2) Normalisation by scaling: multiplying the matrix counts by fixed values or proportions, i.e., scale factors. → May overestimate or underestimate the prevalence of zero counts! Putting all samples of varying sampling depth on the same scale ignores real differences!
3) Normalisation by log-ratio transformation: log transformation of the OTU table is the preferred method and is widely applied to compositional data. → The log transformation cannot be applied to zeros, so sparsity can be problematic for methods that rely on this transformation. Zero counts can be replaced with a small value, known as a pseudocount.
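As an illustration of point 3, the centred log-ratio (CLR) transform is one common log-ratio transformation for compositional count data. Below is a minimal Python sketch; the pseudocount value of 0.5 is an arbitrary choice for illustration, not a recommendation:

```python
import math

def clr_transform(counts, pseudocount=0.5):
    """Centred log-ratio transform of one sample's OTU counts.

    Zeros are replaced with a small pseudocount before taking logs,
    then each log-count is centred on the sample's log geometric mean.
    """
    adjusted = [c if c > 0 else pseudocount for c in counts]
    logs = [math.log(c) for c in adjusted]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [l - mean_log for l in logs]

# Example: one sample with a zero count for the third OTU
sample = [120, 30, 0, 850]
clr = clr_transform(sample)
# CLR values always sum to (approximately) zero by construction
print(clr)
```

Because the transform subtracts the mean log-abundance, downstream statistics operate on ratios between taxa rather than on raw proportions, which sidesteps the sum-to-one constraint of compositional data.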
Relative abundance
One of the goals of microbiome studies is to determine which taxa, if any, drive phenotypic changes among study groups. This involves identifying differentially abundant taxa between study groups.
Relative species abundance is a measure of how common or rare each species is relative to the other species in a defined community. → Percent composition of an organism relative to the total number of detected organisms.
Relative abundances are an example of compositional data!
Limitation: all taxa in two ecosystems may be identically abundant per unit volume except for one differentially abundant taxon. Because of that one taxon, the two ecosystems may appear to differ in the relative abundance of all taxa. → Relative abundances cannot be compared directly between samples!

Absolute abundance
Standard methods for determining changes in microbial taxa measure relative, rather than absolute, abundances because they are easier to compute.
→ Every increase in one taxon's relative abundance causes an equivalent decrease across the remaining taxa!
→ Relative abundances cannot fully capture how individual microbial taxa differ among samples or experimental conditions!
Absolute species abundance is the absolute quantification of an organism in a defined community. The absolute abundance of microbial taxa can be quantified by using known “anchor” points to convert relative data to absolute values. → Synthetic DNA spikes of known concentration are added to the sample before sequencing.
Absolute abundance requires the use of an internal standard of known abundance as an anchor (standardisation of counts).
→ Re-calculation of taxa abundances to obtain standardised absolute counts

Alpha diversity
Alpha diversity (α-diversity) is defined as the mean diversity of species across sites or habitats at a local scale. → It describes the "within-sample" diversity: a measure of how diverse a single sample is, usually taking into account the number of different species observed.
Common alpha diversity indices are:
Chao index: the Chao1 estimator is based on abundance data, while Chao2 is based on incidence data (presence/absence of taxa).
Shannon index: an estimator of both species richness and evenness, based on the assumption that the more species you observe, and the more even their abundances, the higher the diversity.
Simpson index: the probability that two entities (microbes, or reads) taken from the sample at random are of different types.

Beta diversity
Beta diversity (β-diversity) is defined as the ratio between regional and local species diversity.
Regional species richness: the number of species observed in all habitats together
Local species richness: the number of species observed in each habitat
→ It describes the "between-samples" diversity: a measure of species turnover between communities.
Beta diversity measures the change in species diversity from one environment to another, by counting the species that are not shared between two environments. A high beta diversity index indicates a low level of similarity, while a low beta diversity index indicates a high level of similarity.
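The Shannon and Simpson indices described above can be sketched directly from a vector of OTU counts. This is a minimal illustration: the natural log is assumed for Shannon, and Simpson is given in its Gini-Simpson form (1 − Σp²), matching the "probability of drawing two different types" definition:

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over observed taxa."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)

def simpson_index(counts):
    """Gini-Simpson index: probability that two reads drawn at
    random belong to different taxa, 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

even = [25, 25, 25, 25]    # four equally abundant taxa
skewed = [97, 1, 1, 1]     # one dominant taxon
print(shannon_index(even))    # ln(4), the maximum for four taxa
print(shannon_index(skewed))  # lower: same richness, poor evenness
```

The two example communities have identical richness (four taxa), so the drop in both indices for the skewed sample isolates the effect of evenness.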
Beta diversity is influenced by the presence/absence of rare or low-abundance taxa at the limit of detection. Common beta diversity metrics:
Jaccard index
UniFrac distance
Bray-Curtis dissimilarity

Jaccard distance
The Jaccard distance is a measure of how dissimilar two sets are. The corresponding Jaccard similarity indicates how alike two sets of data are, based on which members are shared and which are distinct. It takes values between 0 and 1, where 0 indicates no overlap and 1 indicates perfect overlap: if two datasets share exactly the same members, their Jaccard similarity index will be 1.
It is computed to build a distance matrix for clustering (or grouping) sample sets. The Jaccard coefficient is the number of common members in two sets divided by the total number of members across both sets.
Limitations:
It doesn't consider frequency (how many times a term occurs), only common members.
It does not consider that rare terms in a collection are more informative than frequent ones.
Different-sized sets with the same number of common members will give the same Jaccard similarity.

UniFrac distance
The UniFrac distance is a metric used for comparing biological communities based on phylogenetic information. It measures the phylogenetic distance between sets of taxa in a phylogenetic tree by dividing the branch length that is not shared between two samples by the branch length covered by either sample.
It reflects differences between lineages that are adapted to live specifically in one environment or the other.
The UniFrac metric has been used to:
cluster many different environments according to shared similarities in community composition
determine whether communities are different
compare many communities simultaneously using clustering
measure the contributions of different factors (e.g. chemistry and geographical location) to the similarities between samples
To compute the UniFrac distance we need to know the sum of the branch length observed in either sample (the observed branch length) and the sum of the branch length unique to a single sample (the unique branch length).

Bray-Curtis dissimilarity
Bray-Curtis dissimilarity is often used in ecology and biology to quantify how different two communities are in terms of the species found in them. It is always a number between 0 and 1:
If 0, the two communities share all the same species
If 1, the two communities do not share any species
The dissimilarity index is often multiplied by 100 and treated as a percentage: a Bray-Curtis dissimilarity of 0.21, for instance, can be referred to as a Bray-Curtis dissimilarity of 21%.
To calculate the Bray-Curtis dissimilarity between two communities you must assume that both have the same size, because the equation is based only on the counts themselves. If the two communities are not the same size, you will need to normalise your counts before doing the Bray-Curtis calculation (e.g. using an internal standard of known abundance added to each sample for standardisation of counts).

Visualisation: clustering and PCA
(Dis)similarity matrices resulting from beta diversity estimation are often used for visualisation methods.
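The Jaccard and Bray-Curtis calculations described above can be sketched for a pair of samples. This is a minimal illustration: Jaccard is computed here on presence/absence, and Bray-Curtis on raw counts of two equal-sized samples, as the text requires:

```python
def jaccard_distance(counts_a, counts_b):
    """Jaccard distance on presence/absence: 1 - |A ∩ B| / |A ∪ B|."""
    present_a = {i for i, c in enumerate(counts_a) if c > 0}
    present_b = {i for i, c in enumerate(counts_b) if c > 0}
    shared = len(present_a & present_b)
    union = len(present_a | present_b)
    return 1 - shared / union

def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: 1 - 2*C_ij / (S_i + S_j), where
    C_ij sums the lesser of the two counts for each taxon."""
    shared_sum = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1 - 2 * shared_sum / (sum(counts_a) + sum(counts_b))

a = [10, 5, 0, 3]   # counts per OTU in sample A
b = [8, 0, 4, 6]    # counts per OTU in sample B
print(jaccard_distance(a, b))  # 2 shared of 4 observed taxa -> 0.5
print(bray_curtis(a, b))
```

Note how the two metrics disagree by design: Jaccard only sees which taxa are present, while Bray-Curtis also weighs how many reads each shared taxon contributes.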
Cluster analysis, or clustering, is the process of establishing groups of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to objects in other groups. It is often applied to microbiome data to group subjects based on the taxa that naturally occur in them. The result of the clustering can then be further assessed for associations between the diversities and characteristics of interest.
Principal Component Analysis (PCA) is a dimension-reduction and amalgamation technique applied to visualise differences between samples. Due to the complexity of microbiome data, visualisation methods often use dimension-reduction approaches. Using PCA it is possible to group samples based on their dissimilarity, define clusters of more similar samples, and compare them.

Differential abundance analysis
Differential abundance analysis aims to find differences in the abundance of each taxon between two samples, assigning a significance value to each comparison.
For OTU differential abundance testing between groups (e.g., case vs. control) the count matrix is usually normalised to a fixed depth and then a nonparametric test is applied:
Mann-Whitney/Wilcoxon rank-sum test (two groups)
Kruskal-Wallis test (multiple groups)
Nonparametric tests are often preferred because OTU counts are not normally distributed.
Limitation: when analysing relative abundance data, this approach does not account for the fact that the data are compositional.
Following detection of significantly differing taxa or genera, these are selected for graphical representation based on their log2 fold change between the compared samples.
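The Mann-Whitney rank-sum test mentioned above can be sketched without external libraries. This is a minimal illustration using the normal approximation for the p-value, with no tie correction on the variance; a real analysis would use an established implementation such as scipy.stats.mannwhitneyu:

```python
import math

def mann_whitney_u(group1, group2):
    """Mann-Whitney U statistic with a two-sided normal-approximation
    p-value. Ties receive average ranks; the variance is not
    tie-corrected, so the p-value is approximate."""
    pooled = sorted((v, g) for g, vals in ((1, group1), (2, group2))
                    for v in vals)
    values = [v for v, _ in pooled]
    rank_of = [0.0] * len(pooled)
    i = 0
    while i < len(values):           # assign average ranks (1-based)
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        avg = (i + 1 + j) / 2        # mean of ranks i+1 .. j
        for k in range(i, j):
            rank_of[k] = avg
        i = j
    r1 = sum(rank_of[k] for k, (_, g) in enumerate(pooled) if g == 1)
    n1, n2 = len(group1), len(group2)
    u1 = r1 - n1 * (n1 + 1) / 2      # U statistic for group 1
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u1, p

u, p = mann_whitney_u([1, 2, 3], [4, 5, 6])  # fully separated groups
print(u, p)  # U = 0; p just under 0.05 with this approximation
```

Because the test operates on ranks rather than raw counts, it is insensitive to the exact (non-normal) shape of the OTU count distribution, which is why it is preferred here over a t-test.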
Correlation and association
Microbiome multi-omics data are typically noisy, sparse (zero-inflated), high-dimensional, non-normal, and often in the form of compositional measurements.
Correlation (Pearson or Spearman) and association analyses are essential in microbiome studies to identify combinations of factors. These analyses are conducted to identify important taxa in microbiome data, but also environmental factors or metabolite production.
Pearson correlation: evaluates the linear relationship between two continuous variables (a relationship following a straight line).
Spearman rank correlation: evaluates the monotonic relationship between two continuous or ordinal variables (the variables tend to change together, but not necessarily at a constant rate).
Correlation analysis is used to identify whether correlations exist between variables. It often yields artefactual associations between low-abundance microbes, and it fails to account for compositionality.
Association analysis in microbiome datasets is performed to identify whether microbial features or taxa are significantly associated with a disease/treatment/condition. As multiple variables are being tested, association analyses rely on multivariate analysis.
MaAsLin (Microbiome Multivariable Associations with Linear Models) is a method used to identify associations between multiple variables in large, heterogeneous meta-omics datasets.

Functional profiling
In addition to the taxonomic composition, microbiome studies often require the identification of differences in metabolic function between microbial populations.
Using 16S sequencing data: one can predict a functional profile using programs such as PICRUSt or Tax4Fun.
Using the relative abundance of taxa within the community, these programs predict the potential functionality based on the reference genomes available in public databases for each taxon present. However, these methods are a rough and inexact approximation!
Using shotgun metagenomic data: once a metagenome is assembled, gene predictions are made using tools such as Prokka, MetaGeneMark and Glimmer-MG. Following structural gene prediction, functional annotation is carried out using protein sequence homology-based searches (typically UBLAST and USEARCH) against databases of orthologues (e.g. EggNOG or COG), enzymes (e.g. KEGG), or protein domains and families (e.g. Pfam, TIGRFAMs, or InterPro).
Pathway enrichment analysis, clustering, and scoring can be performed using programs such as Pathfinder or KEGGscape.
The results are generally represented in the form of a table, heatmap, bar plot, or butterfly plot containing functional pathway abundance levels relative to a reference group, in order to visualise variation between different communities.

Challenges in metagenomic data analysis
Sampling problem: too few samples are typically taken, often because of cost limitations, so that the samples do not reasonably approximate the truth about the environment being sampled.
The microbial community in each biological sample may be represented by very different numbers of sequences (i.e., library sizes), reflecting differential efficiency of the sequencing process rather than true biological variation.
The full range of species in a metagenomic sequencing run is rarely saturated, so more bacterial species are observed with more sequencing (depending on coverage depth). Thus, samples with relatively few sequences may have inflated beta (β, or between-sample) diversity.
Most OTU tables are sparse, meaning that they contain a high proportion of zero counts (~90%).
This sparsity implies that the counts of rare OTUs are uncertain: such taxa sit at the limit of sequencing detection when there are many sequences per sample (i.e., a large library size) and are undetectable when there are few sequences per sample.
The total number of reads obtained for a sample does not reflect the absolute number of microbes present, since the sample is just a fraction of the original environment. Since the relative abundances sum to 1 and are non-negative, they represent compositional data.
Uneven sampling depth, sparsity, and the fact that researchers want to draw inferences on taxon abundance in the ecosystem represent serious challenges for interpreting data from microbial survey studies.

Applied genomics – BIOT9008 Week5 (Lectures 9-10) Dr. Francesca Bottacini

16S sequencing and analysis
16S analysis: QIIME
QIIME is an open-source bioinformatics pipeline for performing 16S-based microbiome analysis of DNA sequencing data. Sequencing reads need to be pre-processed (de-multiplexed and quality filtered) in order to construct an OTU table. It is a suite that combines numerous Python tools to perform the analysis.
QIIME processes raw sequencing data generated on Illumina or other platforms; the steps involved are the following:
1. Pre-processing: demultiplexing and quality filtering
2. OTU picking: read assembly and OTU generation
3. OTU table: generate the OTU table in biom format
4. Summarise taxa: taxonomic assignment and phylogenetic reconstruction
5. Diversity analysis: compute alpha and beta diversity
6. PICRUSt: metagenome function prediction
7.
Data visualisations: charts, plots, PCoA and Cytoscape networks
Example workflow with commands (one or more scripts per operation): https://sites.google.com/site/knightslabwiki/qiime-workflow
http://qiime.org

16S analysis: MOTHUR
MOTHUR is an open-source and platform-independent software package developed for the analysis of 16S sequencing data. It is a comprehensive single program for the analysis of microbial communities.
→ Facilitates the standardisation of output and procedures
→ Easier to implement and maintain
1. Data preparation
2. Quality control
3. Alignment to a selected database
4. Removal of poor alignments
5. OTU clustering
6. Classification
7. Diversity analysis
8. Data visualisation
https://mothur.org

MOTHUR vs QIIME
GreenGenes and SILVA are the two main 16S rRNA sequence databases commonly used for classification and metagenomic profiling purposes. The SILVA database is continuously updated, while GreenGenes is an older database; a newer version called GreenGenes2 is now available.
Based on a 2018 study on rumen microbiome profiling, MOTHUR with either the GreenGenes or the SILVA database assigned the highest number of OTUs.
→ Before choosing a pipeline, it is advisable to run a test on a mock or known dataset!
https://doi.org/10.3389/fmicb.2018.03010
https://astrobiomike.github.io/metagenomics/metagen_anvio

Metagenome assembly: MEGAHIT
MEGAHIT is an ultra-fast and memory-efficient NGS assembler. It is optimised for metagenomes, but may be used for generic single-genome assembly (small or mammalian size) and single-cell assembly. It is designed for assembling large and complex metagenomic data in a time- and cost-efficient manner.
Makes use of succinct de Bruijn graphs → a compressed representation of de Bruijn graph assembly.
Sequencing errors are highly problematic, leading to a high level of assembly fragmentation. → Suitable when analysing large datasets where contig length is not a concern (e.g. when only interested in the presence of single genes rather than their genomic context).
https://github.com/voutcn/megahit
https://academic.oup.com/bioinformatics/article/31/10/1674/177884

Metagenome assembly: MetaSPAdes
MetaSPAdes is a version of the popular SPAdes software, specifically designed for the assembly of metagenomic datasets. SPAdes works well for assembling low-complexity metagenomes, but its performance deteriorates on complex bacterial communities → hence the development of a dedicated tool.
MetaSPAdes first constructs the de Bruijn graph of all reads using SPAdes, transforms it into the assembly graph using various graph simplification procedures, and reconstructs long genomic fragments within the metagenome.
Good trade-off between accuracy and contiguity → the most popular assembler for shotgun metagenomic data; however, it takes significantly longer to compute than faster alternatives (e.g. MEGAHIT).
https://github.com/ablab/spades
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/

OTU clustering: UPARSE
UPARSE is a method used to generate OTU clusters from NGS sequencing data. The input to UPARSE is a set of sequences with associated abundances (the number of reads mapping to each sequence).
Clustering criteria:
1. All pairs of OTU representative sequences should have less than 97% identity to each other (reads within 97% identity of a representative are assigned to that OTU).
It uses a greedy algorithm, based on the assumption that high-abundance reads are more likely to be true biological sequences → OTU representatives are selected from the more abundant sequences.

Metagenome taxonomy: MetaPhlAn2
MetaPhlAn2 (Metagenomic Phylogenetic Analysis) is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. It is used to profile the relative abundance of microbial taxa in genomic data.
MetaPhlAn2 relies on clade-specific taxonomic marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic).
Taxonomic profiling of a sample, as the MetaPhlAn markers are clade-specific (an alternative to 16S-based approaches)
Estimation of organismal relative abundance
Species-level resolution for bacteria, archaea, eukaryotes and viruses
Profiling accuracy validated on synthetic datasets and reference metagenomes
https://huttenhower.sph.harvard.edu/metaphlan2/

Metagenome taxonomy: KRAKEN
KRAKEN is a popular tool for assigning taxonomic labels to short DNA sequences obtained by shotgun metagenomic sequencing. → It can be applied to reads, genes and even contigs!
It is significantly faster than MetaPhlAn, with 99% sensitivity at genus level. The standard Kraken database downloads and uses all complete bacterial, archaeal, and viral genomes in RefSeq, constantly updated → over 25k genomes. There is also an option to build a custom database.
The more comprehensive the database, the higher the chances of classifying metagenomic reads or contigs.
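Kraken's core idea of matching exact k-mers against a reference database can be illustrated with a toy sketch. This is not Kraken's actual algorithm, which maps each k-mer to the lowest common ancestor in a prebuilt taxonomy database; the genome sequences and names below are invented for illustration only:

```python
def kmers(seq, k=5):
    """Set of all overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(read, references, k=5):
    """Assign a read to the reference sharing the most exact k-mers;
    return None if no k-mer matches any reference."""
    best, best_hits = None, 0
    for name, genome in references.items():
        hits = len(kmers(read, k) & kmers(genome, k))
        if hits > best_hits:
            best, best_hits = name, hits
    return best

# Toy reference "genomes" (invented sequences for illustration)
refs = {
    "Escherichia": "ATGGCGTACGTTAGCCGTAATCGGA",
    "Lactobacillus": "ATGTTTAAACCCGGGTTTAAATTTC",
}
print(classify("TACGTTAGCCG", refs))  # shares k-mers only with Escherichia
```

A real classifier precomputes the k-mer index once (Kraken's database build step) rather than rescanning every genome per read; that indexing is what makes Kraken fast, and it is why a more comprehensive database classifies more reads.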
http://ccb.jhu.edu/software/kraken/

Reads mapping: BBTools/BBMap
BBTools is a suite of bioinformatic tools used for the analysis of metagenomic and metatranscriptomic datasets. BBMap is a component of BBTools used to align reads to a reference sequence (contig or gene). It is a splice-aware aligner, can be used for either DNA or RNA sequences, and can align sequences from all major platforms (including long reads with higher error rates)!
Input: a reference sequence in FASTA format to be indexed, and reads in FASTQ format (either interleaved or in a single file).
It performs global alignments, but can be instructed to produce local alignments too.
Perfectmode: reads must match the reference perfectly
Semiperfectmode: tolerance for N bases and for reads going over the edges of the reference
https://jgi.doe.gov/data-and-tools/software-tools/bbtools/

Reads mapping: Minimap2
Minimap2 is a general-purpose sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include:
mapping PacBio or Oxford Nanopore genomic reads to a reference genome
finding overlaps between long reads with error rates up to ~15%
splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or direct RNA reads against a reference genome
aligning Illumina single- or paired-end reads
assembly-to-assembly alignment
full-genome alignment between two closely related species with divergence below ~15%
https://lh3.github.io/minimap2/minimap2.html
https://academic.oup.com/bioinformatics/article/34/18/3094/4994778

Reads mapping: COVERM
COVERM is a fast DNA read aligner that also computes the relative abundance of a sequence in a metagenomic dataset.
It uses Minimap2 for alignment and generates:
number of mapped reads
depth of coverage
relative abundance of the reference sequence in the dataset
It is easy to use and fast, especially suited for large datasets. CoverM genome: calculates the coverage of a genome sequence.
https://github.com/wwood/CoverM

Metagenome rapid annotation: MG-RAST
MG-RAST (Metagenomic Rapid Annotations using Subsystems Technology) is an open-source web application server that performs automatic phylogenetic and functional analysis of metagenomes. It is one of the biggest public repositories for metagenomic data.
PRO: assessment of sequence quality and annotation against multiple reference databases is performed automatically with minimal input from the user.
CONS: not suitable for data that cannot be uploaded to an external server; speed also depends on server load.
It computes an initial metabolic reconstruction for the metagenome and allows comparison of metabolic reconstructions of metagenomes and genomes.
https://www.mg-rast.org

Functional annotation: PROKKA
Prokka is an open-source sequence annotation tool that performs structural and functional annotation of prokaryotic sequences. Prokka is not designed to work with metagenomes, but it is often used for them anyway. A new --metagenome option has recently been implemented to tell the Prodigal gene predictor to look harder for partial genes, which are common in highly fragmented assemblies. It is advisable to remove all small contigs (