Questions and Answers
What does the term 'bioinformatics' refer to?
Bioinformatics is a field that involves analyzing biological data, extracting patterns, and generating hypotheses based on those patterns.
What are the primary activities involved in bioinformatics? (Choose all that apply.)
Which of the following are NOT types of data commonly used in bioinformatics? (Choose all that apply.)
What is meant by 'high throughput data' in the context of Biology?
Which of these advancements are contributing factors to the rise of high throughput data in Biology? (Choose all that apply.)
What is a typical example of a high-throughput data generation activity?
The sequence of letters in DNA is analogous to the sequence of letters in a written language.
What is 'DNA or whole genome sequencing'?
What is Moore's Law, and why is it relevant in the context of DNA sequencing cost reduction?
What led to a significant decrease in the cost of DNA sequencing around 2008?
How large is the amount of data generated in genomics compared to other fields, such as astronomy, Twitter or YouTube?
How has the time and cost associated with sequencing a human genome changed over time?
Genomics is the only area of biology that generates 'omics data'.
What are 'omes' and '-omics' in the context of biological research?
Which of the following is NOT an example of an 'ome'?
What does Sydney Brenner's quote, "Drowning in a sea of data and starving for knowledge" highlight?
What is Sir Paul Nurse's perspective on the role of data generation in scientific research?
Bioinformatics is a highly interdisciplinary field that combines various scientific disciplines.
What is the key role of 'domain knowledge' in bioinformatics?
What is the difference between 'explainable methods' and 'interpretable models' in the context of AI and ML?
Study Notes
Bioinformatics
- Bioinformatics is a broad, inclusive field
- Analyze biological data to find patterns and generate hypotheses
- Activities include:
- Storage, searching, and retrieving data
- Creating thematic databases (primary and/or derived data)
- Integrating diverse data types
Learning Objectives
- Big data in biology: where does data come from?
- Data generation: is it the end or a means?
- Making sense of data
- Four illustrative examples
- Concerns
Types of Data
- Image data (static or video): examples include T2*-weighted, diffusion-weighted, and T1-weighted MRI images
- Audio data: examples include elephant calls, plant sounds, bird and insect chirps
- Text data: a variety of formats, including spatial transcriptomics techniques and gene expression data.
- Numerical data: examples include expression data sets, possibly from TCGA genetic data.
Where are we getting data from? / How are we getting data?
- This question prompts discussion of the origins of large biological datasets.
Throughput (from Lecture 13)
- Output per unit time
- Example: 2,000 chapatis per hour
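The definition above is a simple ratio. A minimal sketch, where the function name `throughput` and the 8-hour figure are illustrative assumptions rather than anything from the lecture:

```python
def throughput(units_produced: float, hours: float) -> float:
    """Return output per unit time (here: units per hour)."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return units_produced / hours


# The chapati example from the notes: 2,000 chapatis made in 1 hour.
print(throughput(2000, 1))   # 2000.0

# Hypothetical: 16,000 chapatis over an 8-hour shift is the same throughput.
print(throughput(16000, 8))  # 2000.0
```

The same ratio applies to sequencing instruments, with bases read in place of chapatis.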
High Throughput Data in Biology
- A consequence of advancements in multiple areas
- Includes algorithms, chemistry, computer hardware, cross-disciplinary studies, genetic engineering, instrumentation, microscopy, software, and spectroscopy
Genome Sequencing
- A typical high-throughput data generation activity
Text and DNA
- Analogies between text and DNA illustrating their similar sequence-based nature.
- Data content depends on the order of letters (nucleotides).
- Emphasis on the importance of sequence in both DNA and text
DNA Double Helix Features
- The two DNA strands are antiparallel
- Has specific end points (5' and 3')
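Because the strands are antiparallel and each has a 5' and a 3' end, the partner of a strand is its reverse complement when both are read 5' to 3'. A minimal sketch, using a made-up six-base sequence and a hypothetical helper name:

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3'.

    The input is assumed to be written 5' to 3' and to contain only A, C, G, T.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


top = "ATGCCG"  # hypothetical sequence, not taken from the lecture
print(reverse_complement(top))  # CGGCAT
```

Reversing before complementing is what encodes the antiparallel orientation: complementing alone would give the partner strand written 3' to 5'.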
DNA or Whole Genome Sequencing
- The order of nucleotides (A, C, G, and T) in DNA.
Cost of Sequencing
- The cost of sequencing (USD per megabase of DNA) has fallen dramatically since 2003, driven by technological advances.
- Moore's Law, the trend of computing power doubling roughly every two years (strictly, the transistor count on a chip), is the usual benchmark for this decline.
- The arrival of next-generation sequencing (NGS) technologies around 2008 contributed significantly to the reduction; from then on, costs fell faster than the Moore's Law benchmark.
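To see what a Moore's Law style benchmark implies for cost, one can assume the cost halves every two years. The sketch below does only that arithmetic; the starting cost of 1,000 USD per megabase is a placeholder, not a figure from the lecture, and the function name is hypothetical.

```python
def moore_benchmark(start_cost: float, years: float, halving_period: float = 2.0) -> float:
    """Cost after `years` if it halves once every `halving_period` years."""
    return start_cost * 0.5 ** (years / halving_period)


START_COST = 1000.0  # hypothetical USD per megabase, for illustration only

for years in (0, 2, 4, 6):
    print(years, moore_benchmark(START_COST, years))
# 0 1000.0
# 2 500.0
# 4 250.0
# 6 125.0
```

A real cost curve that drops below this line at a given year has outpaced the benchmark.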
How Big is Genomics Data? (2015)
- Data phase: a comparison of data sizes across fields (astronomy, Twitter, YouTube, genomics) as of 2015.
- Genomics data is very large.
Genome Analysis (Then vs. Now)
- Cost of sequencing is significantly lower now than in the past
- The number of laboratories needed to sequence a genome has declined greatly
- Speed of sequencing has greatly increased.
Is Genomics the Only -Omics Generating Data?
- No. There are many other -omics.
-Omics and -omes
- There are many -omes and -omics
- The genome is static
- Other -omes are spatiotemporally dynamic and condition-dependent
Mitochondrial Variations
- Descriptions of different types of mitochondrial structures
- "Low" energy vs. "high" energy states of mitochondria.
A Few -Omics (Other Than Genome)
- Several types of data besides genomes are considered here.
- Proteome
- Microbiome
- Glycome
- Metabolome
- Epigenome
- Transcriptome
- Lipidome
Summary So Far...
- Data related to biology is generated in vast amounts
- A consequence of high-throughput -omics studies.
- Advancement in various domains has driven this.
- Data generation has become more affordable
Data Generation: End or Means?
- A question about the purpose of generating data and if it's an end goal or an intermediary process.
Does Data Mean Knowledge?
- Question about whether large amounts of data equate to true understanding.
- Quotation from Sydney Brenner (Nobel Prize winner) on the distinction between data and knowledge.
Converting Data Into Knowledge
- Observations, questions, hypotheses, experiments, and predictions are essential components of the scientific process.
- This is an iterative process, not just data collection.
Inter-connected + Inter-related -Omics
- The visual on this slide links big data and art.
Bioinformatics--Making Sense of Data
- Understanding large datasets requires specific techniques.
Bioinformatics--A Multi-disciplinary Field
- Bioinformatics is built on concepts from biology, statistics, and computer science.
Statistics and Machine Learning Algorithms
- Traditional methods may assume normal data distribution
- New methods need to be developed to handle diverse biological data.
Data Science Analysis Stack
- A layered approach to analyzing biological data, highlighting various stages and components.
Where Does Domain Knowledge Come In?
- Shows a layered approach highlighting the role of different kinds of knowledge throughout the data analysis process.
Biologists' Viewpoint (the Scientific Method)
- The scientific method is shown as a cyclical process
- It begins with observation and existing knowledge, which generate questions, then hypotheses, then predictions, before the cycle repeats.
Summary So Far (Another summary)
- High-throughput -omics produces large datasets
- Making sense requires domain knowledge, advanced statistical methods (AI and ML), and computer science advances (storage, search, etc)
Illustrative Examples: 1 of 4
- Uses image data and a linear support vector machine (SVM) as the illustration
A Brain Disorder
- Normal Pressure Hydrocephalus (NPH): a brain condition with symptoms.
- Accumulation of cerebrospinal fluid (CSF) in the brain cavities, but without high pressure
Problem Statement
- 3D MRI images may provide early signs of NPH before clinical symptoms arise.
- Early detection allows better treatment and management.
- Machine learning can aid this process by assisting diagnosis and helping manage patients.
3D MRI Images
- Description of 3D MRI images of the brain for NPH and related data interpretation
Input Data
- Input data is 3D MRI images; positive (affected individuals) and negative (not affected individuals)
Data Analysis
- Data is used to train a machine learning (ML) algorithm
- This algorithm can classify new, unseen images
- Specific training requirements and skill sets are necessary to accomplish this
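The train-then-classify workflow above can be sketched with a toy linear SVM written from scratch. This is not the study's implementation: each "image" is reduced to two made-up numeric features, labels are +1 (affected) and -1 (not affected), and training uses plain subgradient descent on the hinge loss. All names and values are illustrative assumptions.

```python
def train_linear_svm(points, labels, lr=0.1, lam=0.01, epochs=100):
    """Fit weights w and bias b by subgradient descent on the regularised hinge loss."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: push the boundary away from x.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Safely classified: apply only the regularisation shrink.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable training data (two features per "image").
train_x = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
train_y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(train_x, train_y)

# Classify new, unseen points.
print(predict(w, b, (2.5, 2.5)))
print(predict(w, b, (-2.5, -2.5)))
```

Real 3D MRI volumes would first need to be turned into feature vectors, which is part of the training and skill requirement the notes mention.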
Outcome
- The performance of the ML algorithm is compared to senior and junior medical doctors.
- The performance appears comparable.
Illustrative Example 2 of 4
- Type of data: Text
- Algorithm: Profile hidden Markov models (HMMs)
Globin Genes are Evolutionarily Related
- Myoglobin and the different forms of hemoglobin are discussed in terms of their evolution.
Problem Statement
- Problem statement regarding protein biochemistry and the implications of different amino acid sequences.
- Specifically focusing on myoglobin, hemoglobin alpha, beta, and gamma subunits.
Approach: Gather Data
- Collect sequences of globins from diverse species (myoglobin and different hemoglobins)
Sequence Comparisons
- Comparisons between sequences to identify conserved elements and differences
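One simple way to identify conserved elements is to scan an alignment column by column. The sketch below does only that; it is not the profile HMM approach named for this example, and the three short sequences are invented for illustration, not real globin sequences.

```python
def conserved_positions(aligned):
    """Return 0-based column indices where every aligned sequence has the same residue."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [
        i
        for i, column in enumerate(zip(*aligned))
        if len(set(column)) == 1 and column[0] != "-"  # ignore all-gap columns
    ]


# Hypothetical pre-aligned toy sequences (one-letter amino acid codes).
toy_alignment = [
    "MVLSAH",
    "MVLTAH",
    "MGLSDH",
]

print(conserved_positions(toy_alignment))  # [0, 2, 5]
```

A profile HMM generalises this idea by storing, for each column, a probability for every residue rather than a yes/no conserved flag.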
Outcome
- Focus on the sequence positions that are conserved across all variants of these proteins.
- Discussion of why myoglobin is monomeric, and why the hemoglobins are oligomeric.
- Explanation of the differences in oxygen saturation curves.
Illustrative Example 3 of 4
- Type of data: Text
- End goal is hypothesis generation
- Algorithm: Large language models (LLMs)
Background Leading to the Objective
- The objective is to find the function of an orphan protein
- DNA sequencing is the basis for the exploration.
One of the Ways to Predict Function is...
- Illustrates an approach to find function by identifying patterns associated with similar biological proteins.
Gathering Data for “Learning”
- Gathering data using biological literature and databases
- Identifying patterns to assign function.
Problem Statement (another problem statement)
- The task is finding the function of some unknown or uncharacterized proteins.
- The information required is taken from research and medical literature.
Solution
- This problem is solved by training a large language model to read and summarize countless research documents; then assign function or identify patterns
- There is a significant reduction in the time needed to accomplish this compared to a human's workload.
Illustrative Example 4 of 4
- Type of data: numeric + text
- End goal: societal benefit
- Algorithm: statistical tests of significance
Background Leading to the Objective (another background)
- Problem of diagnosing a pelvic mass
- Diagnosis depends on correctly identifying cancerous vs. non-cancerous occurrences.
- Identifying experts in the field is difficult.
Question
- Need a test to identify a cancerous pelvic mass
- High sensitivity (missing no cancerous occurrences/instances)
- High specificity (flagging only cancerous instances, i.e., few false positives)
- Proteomics is the approach.
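Sensitivity and specificity are both ratios computed from the counts of correct and incorrect test results. A minimal sketch with made-up counts (these are not results from the study):

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of cancerous cases the test correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-cancerous cases the test correctly clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts: 50 cancerous masses, 100 non-cancerous masses.
tp, fn = 45, 5    # cancerous: 45 detected, 5 missed
tn, fp = 90, 10   # non-cancerous: 90 cleared, 10 wrongly flagged

print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.9
```

Missing no cancerous case means driving `false_neg` to zero; flagging only cancerous cases means driving `false_pos` to zero.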
Solution
- Development of a multivariate index based on biomarkers using proteomics studies.
- The use of serum levels of five proteins as biomarkers.
Concerns
- Issues with the reproducibility of scientific studies.
- Inconsistencies and insufficient details across published research can create difficulties in replicating and validating results.
Output: Trust Blindly or With Caution?
- Output quality depends on the input data
- Quality control is essential to ensure reliable results
- Input that is not well curated may yield unreliable results.
AI & ML Algorithms Are Quite Powerful
- Describes an algorithm metaphorically as a black box
- The internal workings are often complicated and not readily transparent
Ask Questions About Output
- Importance of critical examination of prediction results.
A Tethered Cow
- An analogy used to illustrate the structure of a protein
- Different structural domains are analogous to parts of the cow and/or the tether
Architecture of a Protein
- Showing the components of a protein: the flexible tether, anchoring domain, and catalytic domain
- Identifying patterns within protein structure/sequences
Detecting "Patterns" in Related Proteins
- Finding patterns in the protein sequence can relate to function.
AI & ML: Explain and Interpret Results
- Methods to explain or interpret the outputs from AI/ML are needed
- A crucial step for understanding the reasoning/basis of predictions made by these algorithms
Gut Feeling/Intuition of Algorithms
- The significance of intuition or gut feeling in making decisions
- This type of thought process also has value.
Errors in Databases and Propagation
- Errors in biological databases are a concern
- Care should be taken when using data from these sources
- Data consistency and reliability are critical.
Clues to Questions at the Level of...
- Diagram showing the different levels of biological organization (from ecosystem to molecules).