Read Alignment and Taxon Identification

HaleBarium

Questions and Answers

What is one of the characteristics of big data?

It is multi-layered

What is the main goal of the omic approach?

To study all genes simultaneously

What is the name of the researcher who longitudinally measured his own omics over 14 months?

Michael Snyder

What is the name of the project that freely released genomic data without restrictions?

Human Genome Project

What is the main purpose of data reproducibility in omic experiments?

To define standards and guidelines for submitting data

What is the typical structure of the results of an omic experiment?

A matrix with more rows than columns

What is the name of the open database that collects data regarding different tumour types?

TCGA

What is one of the principles of the open-data agreements in the scientific community?

Data must be made freely available to the scientific community

What is the main challenge in dealing with omic data?

The variables are more than the sample size

What is the name of the standards for submitting data in microarray experiments?

MIAME

What is the significance threshold mentioned in the text?

0.05

What is the limitation of the software mentioned in the text?

It is limited by the ability to reconstruct the genome of the organism

Why are de novo approaches preferred in certain cases?

Because the genome sequence is not available

What is the first step in de novo approaches?

Filtration from contaminant sequences

What type of sequences can be detected using the MEGAN approach?

Off-target species sequences

What is the advantage of working with the transcriptome compared to the genome?

It covers mainly protein-coding regions

What is the median length of reconstructed transcripts mentioned in the text?

60%

Why are virus and bacteria genomes easier to reconstruct?

Because they have smaller genomes

What is the primary goal of digital normalization in transcriptome analysis?

To reduce the number of reads from deeply sequenced molecules

What is the approximate error rate of Illumina sequencing per base?

1-2%

What is the purpose of the median k-mer abundance method in normalization?

To estimate the coverage of a particular read

What is the name of the algorithm used to assemble transcripts from NGS data?

Trinity

What is the purpose of the N50 metric in assembly evaluation?

To measure the continuity of the contigs

What is the name of the tool used for reference-free quality assessment of de novo transcriptome assemblies?

TransRate

What is the primary advantage of using Kraken for taxonomic classification of NGS data?

It is faster than other classification methods

What is the purpose of the de Bruijn graph in assembly?

To find an Eulerian path in the graph

What is the primary limitation of the Overlap-Layout-Consensus method for assembly?

It is computationally intensive

What is the purpose of the inchworm phase in the Trinity algorithm?

To reconstruct linear contigs from k-mers

Study Notes

Genomics and NGS Data Analysis

1. Read Alignment and Taxon Assignment

  • Reads are aligned using BLAST (alignment is performed against non-redundant, nucleotide, or environmental databases)
  • Each hit is associated with a taxon and each read is assigned to the Lowest Common Ancestor (LCA) of the set of taxa identified
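
The LCA assignment described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular tool: the toy taxonomy in `PARENT` and the function names are hypothetical, and the BLAST step is assumed to have already produced the list of hit taxa for a read.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Salmonella": "Proteobacteria",
    "Salmonella enterica": "Salmonella",
}


def lineage(taxon: str) -> List[str]:
    """Return the path from the root down to `taxon`."""
    path: List[str] = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def lowest_common_ancestor(taxa: List[str]) -> str:
    """Return the deepest taxon shared by the lineages of all hit taxa."""
    lineages = [lineage(t) for t in taxa]
    lca = "root"
    # Walk down the lineages level by level until they diverge.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


# Taxa of the hits of one read (assumed to come from the alignment step).
hits = ["Escherichia coli", "Escherichia albertii"]
print(lowest_common_ancestor(hits))
```

A read whose hits all fall in one species is assigned to that species, while a read hitting distant taxa is pushed up to a higher rank.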

2. Faster Approach for Classification

  • Uses marker genes present in nearly all microbes (single copy) or specific to certain clades
  • Approach was originally implemented in MetaPhlAn (used to analyze several trillion bases of metagenomic sequences)
  • Cannot classify the entire gene content of a sample, since that would require comparing every read to a known gene

3. Kraken Approach

  • Avoids alignments and uses an exact-match database built from k-mers linked to the LCA of all organisms whose genomes contain that k-mer
  • Searches each k-mer in the input read inside the pre-computed database
  • Builds a tree of taxa including only the matched LCA
  • Assigns a weight to each node equal to the number of k-mers found in the read that are linked to that taxon
  • Scores each root-to-leaf (RTL) path in the classification tree and selects the maximal RTL path
  • Assigns the read to the taxon corresponding to the leaf of the path
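
The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Kraken's actual code: the k-mer size `K`, the taxonomy `PARENT`, and the database `KMER_TO_LCA` are hypothetical toy values, and ties between equally scored paths are not given any special treatment.

```python
from collections import Counter
from typing import Dict, Optional

K = 4  # toy k-mer size (assumption; real databases use much longer k-mers)

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "E. coli": "Bacteria",
    "S. enterica": "Bacteria",
}

# Hypothetical pre-computed database: k-mer -> LCA of the genomes containing it.
KMER_TO_LCA: Dict[str, str] = {
    "ACGT": "E. coli",
    "CGTA": "E. coli",
    "GTAC": "Bacteria",
    "TTTT": "S. enterica",
}


def classify(read: str) -> Optional[str]:
    """Assign a read to the taxon at the end of the best root-to-leaf path."""
    # Node weight = number of k-mers in the read linked to that taxon.
    weights: Counter = Counter()
    for i in range(len(read) - K + 1):
        taxon = KMER_TO_LCA.get(read[i:i + K])
        if taxon is not None:
            weights[taxon] += 1
    if not weights:
        return None  # no k-mer matched: the read stays unclassified

    def path_score(taxon: str) -> int:
        # Sum the weights of all nodes on the path from the root to `taxon`.
        score = 0
        node: Optional[str] = taxon
        while node is not None:
            score += weights[node]
            node = PARENT[node]
        return score

    # Because weights are positive, the best-scoring taxon is always a leaf
    # of the pruned tree of matched taxa.
    return max(weights, key=path_score)


print(classify("ACGTAC"))
```

Only exact dictionary lookups are performed per k-mer, which is what lets this kind of approach avoid alignment entirely.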

4. Contaminant Removal

  • Most RNA-seq libraries select and enrich for mRNAs, but may contain contaminating RNA (rRNAs)
  • SortMeRNA software is used to detect and remove contaminating RNA
  • Uses a sliding window approach to search for short similarity regions between the reads and the chosen rRNA database
  • Selects and removes reads in which a large enough proportion of windows (more than ¼) match the database; these thresholds are empirical
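
The windowing idea can be illustrated with a deliberately simplified sketch. It is not SortMeRNA's algorithm: exact window matching stands in for its similarity search, and the window length `WINDOW`, the function names, and the toy sequences are assumptions. Only the ¼ proportion comes from the notes.

```python
from typing import Iterable, List, Set, Tuple

WINDOW = 6           # toy window length (assumption)
MIN_FRACTION = 0.25  # proportion of matching windows quoted in the notes


def index_windows(references: Iterable[str], window: int = WINDOW) -> Set[str]:
    """Collect every window of the rRNA reference sequences."""
    return {
        ref[i:i + window]
        for ref in references
        for i in range(len(ref) - window + 1)
    }


def is_rrna(read: str, ref_windows: Set[str], window: int = WINDOW,
            min_fraction: float = MIN_FRACTION) -> bool:
    """Flag a read when more than `min_fraction` of its windows match."""
    n_windows = len(read) - window + 1
    if n_windows <= 0:
        return False  # read shorter than one window: cannot be evaluated
    matched = sum(read[i:i + window] in ref_windows for i in range(n_windows))
    return matched / n_windows > min_fraction


def filter_reads(reads: Iterable[str],
                 references: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split reads into (kept, removed) according to the window test."""
    ref_windows = index_windows(references)
    kept: List[str] = []
    removed: List[str] = []
    for read in reads:
        (removed if is_rrna(read, ref_windows) else kept).append(read)
    return kept, removed


rrna_db = ["ACGTACGTTAGCCGATAGCT"]  # hypothetical rRNA reference
kept, removed = filter_reads(["ACGTACGTTAGC", "GGGGGGGGGGGG"], rrna_db)
print(kept, removed)
```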

5. Statistical Robustness

  • Statistical robustness is normally ensured by keeping coverage homogeneous, at about 100x on average
  • In transcriptome analysis, however, each gene has a different expression level, so coverage is uneven
  • Data are therefore normalized using the median k-mer abundance method, which estimates the coverage of each read
  • Digital normalization eliminates most of the reads, but can lose some connections in the de Bruijn graph and fragment the assembly
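
A minimal sketch of normalization by median k-mer abundance is shown below. The values of `k` and `cutoff` are toy assumptions, an exact in-memory counter is used for simplicity, and the function names are hypothetical.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def kmers(read: str, k: int) -> List[str]:
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads: Iterable[str], k: int = 4,
                          cutoff: int = 3) -> List[str]:
    """Keep a read only while its estimated coverage is below `cutoff`."""
    counts: Counter = Counter()  # k-mer abundances among the reads kept so far
    kept: List[str] = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        # The median abundance of the read's k-mers estimates its coverage.
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Ten copies of a deeply sequenced molecule plus one rare read (toy data).
reads = ["ACGTACGT"] * 10 + ["TTGCAAGC"]
print(digital_normalization(reads))
```

Redundant reads from the highly expressed molecule are dropped once their estimated coverage reaches the cutoff, while the rare read is retained.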

6. Assembly Algorithms

  • Overlap-Layout-Consensus (OLC) method is used to assemble the original molecule
  • Performs all-against-all pairwise comparisons between the reads
  • Builds a graph connecting the partially overlapping reads
  • Manipulates the graph to produce a read layout and performs multiple sequence alignments to produce a consensus sequence
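
The overlap and layout steps can be sketched on error-free toy reads as follows. This is an illustration under simplifying assumptions: overlaps are exact suffix-prefix matches, the layout is a greedy merge of the best-overlapping pair, and the consensus step (multiple sequence alignment) is skipped because the toy reads contain no errors. The function names and `min_len` are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads: List[str], min_len: int = 3) -> Dict[Tuple[str, str], int]:
    """All-against-all comparison: edge (a, b) -> overlap length."""
    graph: Dict[Tuple[str, str], int] = {}
    for a, b in permutations(reads, 2):
        olen = overlap(a, b, min_len)
        if olen:
            graph[(a, b)] = olen
    return graph


def greedy_layout(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        graph = overlap_graph(contigs, min_len)
        if not graph:
            break  # no remaining overlaps: contigs stay separate
        (a, b), olen = max(graph.items(), key=lambda item: item[1])
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[olen:])
    return contigs


print(greedy_layout(["ATGGCGT", "GCGTACG", "TACGGAT"]))
```

The all-against-all comparison is repeated after every merge here, which makes the quadratic cost of the pairwise step easy to see.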

7. Trinity Assembly

  • Solves the issues of uneven coverage, sequencing biases, and repeat sequences
  • Composed of 3 modules: inchworm, chrysalis, and butterfly
  • Inchworm phase reconstructs the linear contigs
  • Chrysalis phase groups connected components into clusters
  • Butterfly phase reconstructs full-length linear transcripts
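
As a rough illustration of how linear contigs can be reconstructed from k-mers, the sketch below greedily extends a seed k-mer in both directions. It is loosely inspired by the Inchworm idea and is not Trinity's implementation: the k-mer size, the seeding rule, and all function names are assumptions, and error handling and reverse complements are ignored.

```python
from collections import Counter
from typing import Iterable, List


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurring in the reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend(contig: str, counts: Counter, k: int, to_right: bool) -> str:
    """Greedily extend a contig with the most abundant overlapping k-mer."""
    while True:
        anchor = contig[-(k - 1):] if to_right else contig[:k - 1]
        candidates = [anchor + b if to_right else b + anchor for b in "ACGT"]
        candidates = [km for km in candidates if counts.get(km, 0) > 0]
        if not candidates:
            return contig
        best = max(candidates, key=lambda km: counts[km])
        del counts[best]  # each k-mer is used at most once
        contig = contig + best[-1] if to_right else best[0] + contig


def greedy_contigs(reads: Iterable[str], k: int = 4) -> List[str]:
    """Build linear contigs, seeding each from the most abundant unused k-mer."""
    counts = count_kmers(reads, k)
    contigs: List[str] = []
    while counts:
        seed, _ = counts.most_common(1)[0]
        del counts[seed]
        contig = extend(seed, counts, k, to_right=True)
        contig = extend(contig, counts, k, to_right=False)
        contigs.append(contig)
    return contigs


print(greedy_contigs(["ATGGCGTAC", "GCGTACGGA"]))
```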

8. Quality Evaluation

  • Assembly statistics include: number of contigs, overall assembly size, median contig length, and N50
  • TransRate is a reference-free quality assessment tool for de novo transcriptome assemblies
  • Detects multiple common artifacts of assembly, including chimeras, structural errors, incomplete assemblies, and base errors
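
The basic assembly statistics listed above can be computed directly from the contig sequences. The sketch below uses the usual definition of N50 (the length of the contig at which the cumulative length, summed from the longest contig down, reaches half of the assembly size); the function names and the toy contigs are hypothetical.

```python
from statistics import median
from typing import Dict, List, Sequence


def n50(lengths: Sequence[int]) -> int:
    """Contig length at which the cumulative sum reaches half the assembly."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def assembly_stats(contigs: List[str]) -> Dict[str, float]:
    """Number of contigs, assembly size, median contig length, and N50."""
    lengths = [len(c) for c in contigs]
    return {
        "n_contigs": len(lengths),
        "assembly_size": sum(lengths),
        "median_length": median(lengths),
        "n50": n50(lengths),
    }


toy_contigs = ["A" * 10, "A" * 8, "A" * 4, "A" * 2]  # toy sequences
print(assembly_stats(toy_contigs))
```

These numbers describe contiguity only; they say nothing about correctness, which is why a tool such as TransRate is used alongside them.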

9. Illumina Sequencing

  • Quality control and sequencing: RNA sample quality is assessed using the RNA Integrity Number (RIN) algorithm
  • Fragmentation and retrotranscription of the RNA sample
  • Hybridization of the cDNA fragments to the flow cell
  • Amplification of the sample and addition of labelled nucleotides
  • Recording of the light signal that comes from the incorporation of specific labelled nucleotides

Big Data in Genomics

  • Big data in genomics refers to complex, multi-layered data that:
    • Is obtained by integrating data from different sources
    • Cannot be easily modeled with numerical formulae due to complicated correlations
    • Is dimensionally challenging, requiring significant time and storage
  • Example of multi-layered data: Geographic Information System (GIS), which stores, transforms, integrates, and visualizes large amounts of data from different sources related to positions on Earth's surface

Omics Approach and Precision Medicine

  • The technological revolution has enabled the study of all genes simultaneously (omics approach) and of multiple omic datasets simultaneously (multi-omics approach)
  • This has led to the development of precision medicine, which emphasizes the systematic use of individual patient information to select and optimize medical care
  • Example: the Snyderome, a study in which the researcher Michael Snyder longitudinally measured his own omics (whole genome, transcriptome, proteome, metabolome, and clinical tests) over 14 months to obtain a partial overview of his health state and observe possible predispositions to pathologies

Open-Data Agreement and Data Sharing

  • The open-data agreement between the scientific community and scientific journals was established due to patent issues arising from the Human Genome Project (HGP)
  • The agreement's 2 main principles are:
    • Data must be made freely available to the scientific community
    • Data must be released within 24 hours of generation to encourage research and maximize the benefits of the HGP for society
  • The principles have revolutionized the way science is done in life sciences and have become the foundation for all subsequent large international projects

Data Reproducibility and Standards

  • Data reproducibility is crucial, and standards and guidelines have been established for submitting data
  • Minimum Information About a Microarray Experiment (MIAME) is a set of standards for reporting microarray-based gene expression data
  • Reviewers check for compliance with guidelines and ensure data is submitted to public databases before reviewing a paper
