Reading Alignment and Taxon Identification

Questions and Answers

What is one of the characteristics of big data?

It is dimensionally simple
It is only obtained from a single source
It can be easily modelled with numerical formulae
It is multi-layered (correct)

What is the main goal of the omic approach?

To study all genes simultaneously (correct)
To study only the genome
To study only the transcriptome
To study a single gene at a time

What is the name of the researcher who longitudinally measured his own omics over 14 months?

Michael Snyder (correct)
Sulston
Waterston
None of the above

What is the name of the project that freely released genomic data without restrictions?

Human Genome Project (A)

What is the main purpose of data reproducibility in omic experiments?

To define standards and guidelines for submitting data (A)

What is the typical structure of the results of an omic experiment?

A matrix with more rows than columns (D)

What is the name of the open database that collects data regarding different tumour types?

TCGA (B)

What is one of the principles of the open-data agreements in the scientific community?

Data must be made freely available to the scientific community (C)

What is the main challenge in dealing with omic data?

The variables are more than the sample size (A)

What is the name of the standards for submitting data in microarray experiments?

MIAME (B)

What is the significance threshold mentioned in the text?

0.05 (C)

What is the limitation of the software mentioned in the text?

It is limited by the ability to reconstruct the genome of the organism (B)

Why are de novo approaches preferred in certain cases?

Because the genome sequence is not available (D)

What is the first step in de novo approaches?

Filtration from contaminant sequences (C)

What type of sequences can be detected using the MEGAN approach?

Off-target species sequences (A)

What is the advantage of working with the transcriptome compared to the genome?

It covers mainly protein-coding regions (B)

What is the median length of reconstructed transcripts mentioned in the text?

60% (C)

Why are virus and bacteria genomes easier to reconstruct?

Because they have smaller genomes (D)

What is the primary goal of digital normalization in transcriptome analysis?

To reduce the number of reads from deeply sequenced molecules (D)

What is the approximate error rate of Illumina sequencing per base?

1-2% (B)

What is the purpose of the median k-mer abundance method in normalization?

To estimate the coverage of a particular read (D)

What is the name of the algorithm used to assemble transcripts from NGS data?

Trinity (C)

What is the purpose of the N50 metric in assembly evaluation?

To measure the continuity of the contigs (D)

What is the name of the tool used for reference-free quality assessment of de novo transcriptome assemblies?

TransRate (C)

What is the primary advantage of using Kraken for taxonomic classification of NGS data?

It is faster than other classification methods (A)

What is the purpose of the de Bruijn graph in assembly?

To find a Eulerian path in the graph (B)

What is the primary limitation of the Overlap-Layout-Consensus method for assembly?

It is computationally intensive (C)

What is the purpose of the inchworm phase in the Trinity algorithm?

To reconstruct linear contigs from k-mers (D)

Study Notes

Genomics and NGS Data Analysis

1. Reads Alignment and Taxon Assignment

Reads are aligned using BLAST (against the non-redundant, nucleotide, or environmental databases)
Each hit is associated with a taxon, and each read is assigned to the Lowest Common Ancestor (LCA) of the set of taxa identified by its hits
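
The LCA assignment can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any specific tool: the toy taxonomy in PARENT, the made-up hits, and the function names lineage and lowest_common_ancestor are all assumptions made for this example.

```python
# Toy taxonomy as child -> parent links (hypothetical, for illustration only).
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia fergusonii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, taxon first."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa):
    """Return the deepest node shared by the lineages of all given taxa."""
    lineages = [lineage(t) for t in taxa]
    shared = set(lineages[0]).intersection(*lineages[1:])
    # Walking up from the first taxon, the first shared node is the deepest one.
    for node in lineages[0]:
        if node in shared:
            return node
    return "root"


# Taxa of the BLAST hits for one read (made-up hits).
hits = ["Escherichia coli", "Salmonella enterica"]
print(lowest_common_ancestor(hits))  # Enterobacteriaceae
```

A read whose hits all fall within one genus is assigned to that genus, while a read hitting distant taxa is pushed up to a higher, less specific node.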

2. Faster Approach for Classification

Uses marker genes that are either present in nearly all microbes (in single copy) or specific to certain clades
Approach was originally implemented in MetaPhlAn (used to analyze several trillion bases of metagenomic sequences)
Cannot classify the entire gene content of a sample, since that would require comparing every read to a known gene

3. Kraken Approach

Avoids alignments and uses an exact-match database built from k-mers linked to the LCA of all organisms whose genomes contain that k-mer
Searches each k-mer in the input read inside the pre-computed database
Builds a classification tree that includes only the matched LCA taxa and their ancestors
Assigns a weight to each node equal to the number of k-mers found in the read that are linked to that taxon
Scores each root-to-leaf (RTL) path in the classification tree and selects the maximal RTL path
Assigns the read to the taxon corresponding to the leaf of the path
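
The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not Kraken's actual code: the taxonomy in PARENT, the k-mer table in DATABASE, the short k-mer length K, and the function names are all hypothetical, and tie-breaking between equally scored paths is not handled.

```python
from collections import Counter

K = 5  # toy k-mer length (assumption; real databases use much longer k-mers)

# Hypothetical taxonomy (child -> parent) and k-mer -> LCA taxon database.
PARENT = {"species_A": "genus_X", "species_B": "genus_X", "genus_X": "root"}
DATABASE = {
    "ACGTA": "species_A",
    "CGTAC": "species_A",
    "GTACG": "genus_X",  # k-mer shared by species_A and species_B
    "TTTTT": "species_B",
}


def kmers(read, k=K):
    """Yield every overlapping k-mer of the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def path_to_root(taxon):
    """Return the taxon followed by all of its ancestors."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def classify(read):
    """Assign a read to the taxon that ends the highest-scoring root-to-leaf path."""
    # Node weight = number of k-mers in the read linked to that taxon.
    weights = Counter(DATABASE[km] for km in kmers(read) if km in DATABASE)
    if not weights:
        return None  # no k-mer matched: the read stays unclassified

    # Path score = sum of the weights of the node and all of its ancestors.
    def score(taxon):
        return sum(weights[node] for node in path_to_root(taxon))

    return max(weights, key=score)


print(classify("ACGTACG"))  # species_A
```

In the example read, two k-mers point to species_A and one to genus_X, so the path root, genus_X, species_A scores 3 and the read is assigned to species_A.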

4. Contaminant Removal

Most RNA-seq library preparations select and enrich for mRNAs, but the libraries may still contain contaminating RNA (rRNAs)
SortMeRNA software is used to detect and remove contaminating RNA
Uses a sliding window approach to search for short similarity regions between the reads and the chosen rRNA database
Selects and removes reads in which enough windows match (a proportion larger than ¼); all of these threshold values are empirical

5. Statistical Robustness

Statistical robustness is normally ensured by keeping coverage homogeneous, at about 100x on average
In transcriptome analysis, each gene has a different expression level, so coverage is uneven across transcripts
Data are normalized using the median k-mer abundance method, which estimates the coverage of each read
Digital normalization eliminates most of the reads, but it can lose some connections in the de Bruijn graph and fragment the assembly
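
A minimal sketch of normalization by median k-mer abundance is shown below. It assumes an exact in-memory counter rather than the memory-efficient counting structures used by real tools, and the values of K and CUTOFF and the function names are hypothetical choices for this toy example.

```python
from collections import Counter
from statistics import median

K = 4       # toy k-mer length (assumption; real runs use longer k-mers)
CUTOFF = 2  # hypothetical coverage cutoff; reads at or above it are dropped


def kmers(read, k=K):
    """Return every overlapping k-mer of the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, cutoff=CUTOFF):
    """Keep a read only while its estimated coverage is below the cutoff."""
    counts = Counter()  # k-mer abundances among the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read)
        if not read_kmers:
            continue  # read shorter than k: coverage cannot be estimated, skip it
        # The median k-mer abundance estimates the coverage of this read and is
        # robust to the few rare k-mers created by sequencing errors.
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Five copies of a deeply sequenced read plus one read from a rare transcript.
reads = ["ACGTACGT"] * 5 + ["TTGCAAGC"]
print(digital_normalization(reads))  # ['ACGTACGT', 'ACGTACGT', 'TTGCAAGC']
```

Redundant copies of the abundant read are discarded once its estimated coverage reaches the cutoff, while the single read from the rare transcript is retained.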

6. Assembly Algorithms

Overlap-Layout-Consensus (OLC) method is used to assemble the original molecule
Performs all-against-all pairwise comparisons between the reads
Builds a graph connecting the partially overlapping reads
Manipulates the graph to produce a read layout and performs multiple sequence alignments to produce a consensus sequence
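
The overlap and layout steps can be illustrated with a toy example. This sketch assumes error-free reads, so a greedy merge of the best-overlapping pair stands in for the layout and consensus steps (no multiple sequence alignment is performed). MIN_OVERLAP and all function names are hypothetical.

```python
from itertools import permutations

MIN_OVERLAP = 3  # hypothetical minimum overlap length for this toy example


def overlap(a, b, min_len=MIN_OVERLAP):
    """Length of the longest suffix of a equal to a prefix of b (0 if too short)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads):
    """All-against-all comparison: maps each edge (a, b) to its overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        length = overlap(a, b)
        if length:
            graph[(a, b)] = length
    return graph


def layout_and_merge(reads):
    """Greedy layout: repeatedly merge the pair with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        graph = overlap_graph(reads)
        if not graph:
            break  # no overlaps left: the remaining sequences are separate contigs
        (a, b), length = max(graph.items(), key=lambda item: item[1])
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[length:])
    return reads


reads = ["ATGGCGT", "GCGTACA", "ACATTAG"]
print(overlap_graph(reads))
print(layout_and_merge(reads))  # ['ATGGCGTACATTAG']
```

The all-against-all comparison in overlap_graph grows quadratically with the number of reads, which is why this method becomes computationally intensive on large NGS datasets.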

7. Trinity Assembly

Solves the issues of uneven coverage, sequencing biases, and repeat sequences
Composed of 3 modules: inchworm, chrysalis, and butterfly
Inchworm phase reconstructs the linear contigs
Chrysalis phase groups connected components into clusters
Butterfly phase reconstructs full-length linear transcripts

8. Quality Evaluation

Assembly statistics include: number of contigs, overall assembly size, median contig length, and N50
TransRate is a reference-free quality assessment tool for de novo transcriptome assemblies
Detects multiple common artifacts of assembly, including chimeras, structural errors, incomplete assemblies, and base errors
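
The basic assembly statistics are straightforward to compute. The sketch below uses made-up contigs, and the function names n50 and assembly_stats are assumptions for this example; it does not reproduce the scoring performed by TransRate.

```python
from statistics import median


def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def assembly_stats(contigs):
    """Summarize an assembly given as a list of contig sequences."""
    lengths = [len(contig) for contig in contigs]
    return {
        "num_contigs": len(lengths),
        "assembly_size": sum(lengths),
        "median_length": median(lengths) if lengths else 0,
        "n50": n50(lengths),
    }


# Made-up contigs of length 100, 80, 60, 40 and 20 bases.
contigs = ["A" * n for n in (100, 80, 60, 40, 20)]
print(assembly_stats(contigs))
```

Here the total size is 300 bases; the two longest contigs (100 + 80) already cover more than half of it, so the N50 is 80 while the median contig length is 60.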

9. Illumina Sequencing

Quality control and sequencing: RNA sample quality is assessed using the RNA Integrity Number (RIN) algorithm
Fragmentation and retrotranscription of the RNA sample
Hybridization of the cDNA fragments to the flow cell
Amplification of the sample and addition of labelled nucleotides
Record of the light signal that comes from the incorporation of specific labelled nucleotides

Big Data in Genomics

Big data in genomics refers to complex, multi-layered data that:
- Is obtained by integrating data from different sources
- Cannot be easily modeled with numerical formulae due to complicated correlations
- Is dimensionally challenging, requiring significant time and storage
Example of multi-layered data: Geographic Information System (GIS), which stores, transforms, integrates, and visualizes large amounts of data from different sources related to positions on Earth's surface

Omics Approach and Precision Medicine

The technological revolution has enabled the study of all genes simultaneously (omics approach) and of multiple omic datasets simultaneously (multi-omics approach)
This has led to the development of precision medicine, which emphasizes the systematic use of individual patient information to select and optimize medical care
Example: Snyderome, a research study where a researcher measured his own omics (whole genome, transcriptome, proteome, metabolome, and clinical tests) over 14 months to provide a partial overview of his health state and observe possible predispositions to pathologies

The open-data agreement between the scientific community and scientific journals was established due to patent issues arising from the Human Genome Project (HGP)
The agreement's two main principles are:
- Data must be made freely available to the scientific community
- Data must be released within 24 hours of generation to encourage research and maximize the benefits of the HGP for society
The principles have revolutionized the way science is done in life sciences and have become the foundation for all subsequent large international projects

Data Reproducibility and Standards

Data reproducibility is crucial, and standards and guidelines have been established for submitting data
Minimum Information About a Microarray Experiment (MIAME) is a set of standards for reporting microarray-based gene expression data
Reviewers check for compliance with guidelines and ensure data is submitted to public databases before reviewing a paper
