Questions and Answers
What is one of the characteristics of big data?
What is the main goal of the omic approach?
What is the name of the researcher who longitudinally measured his own omics over 14 months?
What is the name of the project that freely released genomic data without restrictions?
What is the main purpose of data reproducibility in omic experiments?
What is the typical structure of the results of an omic experiment?
What is the name of the open database that collects data regarding different tumour types?
What is one of the principles of the open-data agreements in the scientific community?
What is the main challenge in dealing with omic data?
What is the name of the standards for submitting data in microarray experiments?
What is the significance threshold mentioned in the text?
What is the limitation of the software mentioned in the text?
Why are de novo approaches preferred in certain cases?
What is the first step in de novo approaches?
What type of sequences can be detected using the MEGAN approach?
What is the advantage of working with the transcriptome compared to the genome?
What is the median length of reconstructed transcripts mentioned in the text?
Why are virus and bacteria genomes easier to reconstruct?
What is the primary goal of digital normalization in transcriptome analysis?
What is the approximate error rate of Illumina sequencing per base?
What is the purpose of the median k-mer abundance method in normalization?
What is the name of the algorithm used to assemble transcripts from NGS data?
What is the purpose of the N50 metric in assembly evaluation?
What is the name of the tool used for reference-free quality assessment of de novo transcriptome assemblies?
What is the primary advantage of using Kraken for taxonomic classification of NGS data?
What is the purpose of the de Bruijn graph in assembly?
What is the primary limitation of the Overlap-Layout-Consensus method for assembly?
What is the purpose of the inchworm phase in the Trinity algorithm?
Study Notes
Genomics and NGS Data Analysis
1. Reads Alignment and Taxon Assignment
- Reads are aligned using BLAST (the alignment is performed against non-redundant, nucleotide, or environmental databases)
- Each hit is associated with a taxon and each read is assigned to the Lowest Common Ancestor (LCA) of the set of taxa identified
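The LCA assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular tool: the `PARENT` taxonomy, the function names, and the example hits are all hypothetical, and a real pipeline would load the taxonomy from a reference database.

```python
def lineage(taxon, parent):
    """Return the path from the root of the taxonomy down to `taxon`."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(taxa, parent):
    """Return the deepest taxon shared by the lineages of all `taxa`."""
    lineages = [lineage(t, parent) for t in taxa]
    lca = None
    # Walk down from the root while every lineage agrees on the taxon.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca


# Hypothetical toy taxonomy (child -> parent); "root" has no parent.
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Escherichia coli": "Escherichia",
    "Firmicutes": "Bacteria",
    "Bacillus": "Firmicutes",
}

# Taxa of the BLAST hits obtained for one read (made-up example).
hits = ["Escherichia coli", "Salmonella"]
print(lowest_common_ancestor(hits, PARENT))
```

A read whose hits span several genera is therefore assigned to a higher rank, while a read with hits in a single species stays at that species.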
2. Faster Approach for Classification
- Uses marker genes present in nearly all microbes (single copy) or specific to certain clades
- Approach was originally implemented in MetaPhlAn (used to analyze several trillion bases of metagenomic sequences)
- Cannot classify the entire gene content of a sample: only reads matching a marker gene are assigned, and classifying everything would require comparing every read to a known gene
3. Kraken Approach
- Avoids alignments and uses an exact-match database built from k-mers linked to the LCA of all organisms whose genomes contain that k-mer
- Searches each k-mer in the input read inside the pre-computed database
- Builds a classification tree containing only the matched LCA taxa and their ancestors
- Assigns a weight to each node equal to the number of k-mers found in the read that are linked to that taxon
- Scores each root-to-leaf (RTL) path in the classification tree by summing the weights of its nodes and selects the maximal RTL path
- Assigns the read to the taxon corresponding to the leaf of that path
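The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not Kraken's implementation: the k-mer database `KMER_TO_LCA`, the taxonomy `PARENT`, and the tiny `k=3` are hypothetical toy values, and the sketch ignores details such as tie handling between equally scored paths.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); "root" has no parent.
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Escherichia coli": "Escherichia",
    "Firmicutes": "Bacteria",
    "Bacillus": "Firmicutes",
}

# Hypothetical pre-computed database: k-mer -> LCA of the genomes containing it.
KMER_TO_LCA = {
    "ACG": "Escherichia coli",
    "CGT": "Escherichia",
    "GTA": "Proteobacteria",
    "TAC": "Bacillus",
}


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def classify_read(read, kmer_to_lca, parent, k):
    """Assign `read` to the leaf of the highest-scoring root-to-leaf path."""
    # Node weight = number of k-mers of the read linked to that taxon.
    weights = Counter(
        kmer_to_lca[km] for km in kmers(read, k) if km in kmer_to_lca
    )
    if not weights:
        return None  # no k-mer matched: the read stays unclassified

    def path_score(taxon):
        # Sum the weights of the taxon and of all its ancestors.
        score = 0
        while taxon is not None:
            score += weights[taxon]
            taxon = parent.get(taxon)
        return score

    # All weights are positive, so the best-scoring matched taxon is
    # always a leaf of the pruned classification tree.
    return max(weights, key=path_score)


print(classify_read("ACGTAC", KMER_TO_LCA, PARENT, k=3))
```

Because every lookup is an exact dictionary match, no alignment is computed, which is where the speed advantage over BLAST-based classification comes from.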
4. Contaminant Removal
- Most RNA-seq libraries select and enrich for mRNAs, but may contain contaminating RNA (rRNAs)
- SortMeRNA software is used to detect and remove contaminating RNA
- Uses a sliding window approach to search for short similarity regions between the reads and the chosen rRNA database
- Selects and removes reads in which enough windows (a proportion larger than ¼) match the database; all of these threshold values are empirical
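The sliding-window idea can be illustrated with a deliberately simplified sketch. It is not how SortMeRNA is implemented: here a window counts as a hit only on an exact match against the reference, and the function names, the window size, and the example sequences are assumptions made for illustration. Only the ¼ proportion comes from the notes above.

```python
def window_hit_fraction(read, reference_windows, window):
    """Fraction of the read's sliding windows found in the reference set."""
    windows = [read[i:i + window] for i in range(len(read) - window + 1)]
    if not windows:
        return 0.0
    hits = sum(w in reference_windows for w in windows)
    return hits / len(windows)


def filter_rrna(reads, rrna_references, window=4, min_fraction=0.25):
    """Split reads into (kept, removed) using the window-hit proportion."""
    # Index every window of every rRNA reference sequence once.
    reference_windows = {
        ref[i:i + window]
        for ref in rrna_references
        for i in range(len(ref) - window + 1)
    }
    kept, removed = [], []
    for read in reads:
        fraction = window_hit_fraction(read, reference_windows, window)
        # A read is flagged as rRNA when more than 1/4 of its windows hit.
        (removed if fraction > min_fraction else kept).append(read)
    return kept, removed


# Made-up reference fragment and reads, for illustration only.
rrna_refs = ["GGTTACCTTGTTACGACTT"]
reads = ["ACCTTGTTACG", "AAAAAAAAAAA"]
kept, removed = filter_rrna(reads, rrna_refs)
print(kept, removed)
```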
5. Statistical Robustness
- Statistical robustness is normally ensured by keeping the coverage homogeneous, at about 100x on average
- In transcriptome analysis, each gene is characterized by different expression levels
- Normalize data using the median k-mer abundance method
- Digital normalization eliminates most of the reads, but loses some connections in the de Bruijn graph and fragments the assembly
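A minimal sketch of normalization by median k-mer abundance is shown below. It assumes the usual streaming formulation: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is still below a cutoff. The function name, `k=4`, and `cutoff=2` are toy assumptions; real tools use much larger k-mers and memory-efficient counting structures.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only while its median k-mer abundance is below `cutoff`."""
    counts = Counter()  # abundance of each k-mer among the reads kept so far
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate
        # The median is barely affected by a few rare, error-containing k-mers.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Made-up reads: one highly expressed sequence and one rare sequence.
reads = ["ACGTACGT"] * 5 + ["TTGGCCAA"]
print(digital_normalization(reads))
```

Redundant reads from the abundant sequence are discarded once its estimated coverage reaches the cutoff, while the rare sequence is retained.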
6. Assembly Algorithms
- Overlap-Layout-Consensus (OLC) method is used to assemble the original molecule
- Sets up all possible all-against-all pairwise comparisons between the reads
- Builds a graph connecting the partially overlapping reads
- Manipulates the graph to produce a read layout and performs multiple sequence alignments to produce a consensus sequence
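The overlap and layout steps can be sketched on error-free toy reads. This is an illustration under strong assumptions, not a production assembler: overlaps are exact suffix-prefix matches, the layout is a greedy merge of the best-overlapping pair, and the consensus step (a multiple sequence alignment in real OLC assemblers) is replaced by plain string concatenation. All names and reads are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that is a prefix of `b`, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """All-against-all comparison: edge (a, b) -> overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            graph[(a, b)] = n
    return graph


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        graph = overlap_graph(reads, min_len)
        if not graph:
            break  # no remaining overlaps: the assembly stays fragmented
        (a, b), n = max(graph.items(), key=lambda item: item[1])
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])  # stand-in for the consensus step
    return reads


# Made-up, distinct, error-free reads sampled from "ATGGCGTGCA".
print(greedy_assemble(["ATGGCG", "GCGTGC", "GTGCA"]))
```

The all-against-all comparison in `overlap_graph` grows quadratically with the number of reads, which is the main reason this approach scales poorly to large NGS datasets.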
7. Trinity Assembly
- Solves the issues of uneven coverage, sequencing biases, and repeat sequences
- Composed of three modules: Inchworm, Chrysalis, and Butterfly
- The Inchworm phase reconstructs linear contigs
- The Chrysalis phase groups connected components into clusters
- The Butterfly phase reconstructs full-length linear transcripts
8. Quality Evaluation
- Assembly statistics include: number of contigs, overall assembly size, median contig length, and N50
- TransRate is a reference-free quality assessment tool for de novo transcriptome assemblies
- Detects multiple common artifacts of assembly, including chimeras, structural errors, incomplete assemblies, and base errors
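The basic assembly statistics listed above are straightforward to compute from the contig lengths. The sketch below uses the standard definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly); the function names and the example lengths are made up for illustration.

```python
from statistics import median


def n50(lengths):
    """Contig length at which the cumulative sum reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def assembly_stats(lengths):
    """Number of contigs, assembly size, median contig length, and N50."""
    return {
        "contigs": len(lengths),
        "assembly_size": sum(lengths),
        "median_length": median(lengths) if lengths else 0,
        "n50": n50(lengths),
    }


# Made-up contig lengths.
contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(assembly_stats(contig_lengths))
```

For a transcriptome these numbers should be read with care, since transcripts naturally differ in length and a high N50 does not by itself indicate a correct assembly; this is part of the motivation for reference-free tools such as TransRate.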
9. Illumina Sequencing
- Quality control and sequencing: RNA sample quality is assessed using the RNA Integrity Number (RIN) algorithm
- Fragmentation and retrotranscription of the RNA sample
- Hybridization of the cDNA fragments to the flow cell
- Amplification of the sample and addition of labelled nucleotides
- Recording of the light signal emitted when specific labelled nucleotides are incorporated
Big Data in Genomics
- Big data in genomics refers to complex, multi-layered data that:
  - Is obtained by integrating data from different sources
  - Cannot be easily modeled with numerical formulae due to complicated correlations
  - Is dimensionally challenging, requiring significant time and storage
- Example of multi-layered data: Geographic Information System (GIS), which stores, transforms, integrates, and visualizes large amounts of data from different sources related to positions on Earth's surface
Omics Approach and Precision Medicine
- The technological revolution has enabled the study of all genes simultaneously (omics approach) and of multiple omic datasets simultaneously (multi-omics approach)
- This has led to the development of precision medicine, which emphasizes the systematic use of individual patient information to select and optimize medical care
- Example: the Snyderome, a study in which the researcher Michael Snyder measured his own omics (whole genome, transcriptome, proteome, metabolome, and clinical tests) over 14 months to obtain a partial overview of his health state and observe possible predispositions to pathologies
Open-Data Agreement and Data Sharing
- The open-data agreement between the scientific community and scientific journals was established due to patent issues arising from the Human Genome Project (HGP)
- The agreement's two main principles are:
  - Data must be made freely available to the scientific community
  - Data must be released within 24 hours of generation to encourage research and maximize the benefits of the HGP for society
- The principles have revolutionized the way science is done in life sciences and have become the foundation for all subsequent large international projects
Data Reproducibility and Standards
- Data reproducibility is crucial, and standards and guidelines have been established for submitting data
- Minimum Information About a Microarray Experiment (MIAME) is a set of standards for reporting microarray-based gene expression data
- Reviewers check for compliance with guidelines and ensure data is submitted to public databases before reviewing a paper
Description
This quiz covers the process of read alignment and taxon identification, including the use of BLAST and the assignment of reads to the Lowest Common Ancestor (LCA).