Probability Models in Biological Data Analysis

Questions and Answers

What differentiates synonymous variations from non-synonymous variations?

Synonymous variations do not alter the protein's amino acid sequence (correct)
Synonymous variations always result in gene deletions
Non-synonymous variations occur only in non-coding regions
Non-synonymous variations do not change the amino acid sequence

Which of the following best describes the purpose of the Variant Call Format (VCF)?

To summarize genomic variations across multiple samples (correct)
To provide a graphical user interface for genetics software
To encode the structure of genes and proteins
To list the regulatory networks in different organisms

Which type of genomic variation includes the exchange of DNA segments between non-homologous chromosomes?

Polymorphisms
Large duplications
Translocations (correct)
Insertions

What type of elements are included in genes but are not part of the coding region?

Introns (D)

What does the term 'rank' refer to in the context of matrices?

The number of independent columns in a matrix (B)

Which equation represents a system solvable through matrices?

X1 + 3X2 = 8 and X1 + 2X2 = 6 (D)

What does 'm by n' represent in a matrix?

The rows and columns in the matrix respectively (D)

How can matrices be used in biology?

To represent gene regulatory networks (D)

What constitutes a column being considered independent in a matrix?

It cannot be formed by a combination of other columns (C)

Which of the following statements about matrices is incorrect?

Matrices are only relevant in mathematics. (B)

In the context of systems of equations, what does the matrix of coefficients represent?

The relationships between variables (A)

Which of the following is a common application of matrices in technology?

Modeling infectious disease transmission (C)

Which genomic variation involves a change in a single nucleotide that may or may not affect the protein coding sequence?

Single Nucleotide Variations (D)

What is a primary characteristic of the Variant Call Format (VCF) file structure?

It includes a header region with metadata about the samples. (A)

What type of analysis can be conducted to investigate the relationship between genetic variations and disease susceptibility?

Genome Wide Association Studies (GWAS) (B)

Which term refers to genetic variations that do not result in a change in the amino acid sequence of a protein?

Synonymous variations (B)

Which of the following components is NOT usually included in the sample-specific information within a VCF file?

Total gene expression (A)

What is the main reason matrices are used in relation to equations?

They represent systems of equations. (D)

Which statement about the rank of a matrix is true?

Rank is the count of independent columns in a matrix. (A)

In the context of matrix representation, what can the row of '1, 0, 0, 1, 0' indicate?

It can represent connections in a graph. (A)

What condition must be satisfied for a column to be considered independent in a matrix?

It cannot be the zero vector or a combination of other columns. (A)

How is the term 'm by n' used in relation to matrices?

It denotes the number of rows and columns in a matrix. (A)

Which application of matrices is specifically mentioned in the context of biology?

Gene regulatory networks. (D)

What is a common feature of independent columns in a matrix?

They cannot be obtained from linear combinations of other columns. (D)

What is the purpose of using matrices in graphic representations?

To visualize equations and computations. (D)

What are the components of Quantitative PCR?

Determine Double Delta-CT (A), Reverse Transcription (B), Amplify (C), Capture fluorescence (D)

What does qPCR stand for?

Quantitative Polymerase Chain Reaction

Microarrays can be used for total RNA analysis.

True (A)

What is the purpose of capturing fluorescence in qPCR?

To measure the amount of DNA during amplification

What is one of the reasons for conducting transcriptomic experiments?

General gene discovery (A)

What does the term Delta-CT refer to in qPCR?

The difference between the target gene's CT value and the housekeeping gene's CT value

What does qPCR stand for?

Quantitative Polymerase Chain Reaction (B)

What are the main steps in Quantitative PCR?

Reverse Transcription, Amplify, Capture fluorescence, Repeat steps for 29 more rounds, Determine Double Delta-CT

The calculated Fold Change in qPCR is given by the formula 2 raised to the power of -ΔΔCT.

2^(-ΔΔCT)

What is the primary purpose of microarrays?

To detect changes in gene expression (A)

List some reasons for conducting transcriptomic experiments.

General gene discovery, compare experimental treatments, drug treatments, knockout/knockin studies, toxicology, developmental stages

Next generation sequencing only uses Total RNA.

False (B)

What does Delta-CT represent in qPCR?

The difference between Gene of Interest Experimental and House Keeping Gene Experimental (A), The difference between House Keeping Gene Control and Gene of Interest Control (C)

PCR stands for ______.

Polymerase Chain Reaction

What is the formula for the sample mean?

m = µ = (x1 + x2 + ... + xN) / N

What is the formula for sample variance?

S² = [(x1 - m)² + ... + (xN - m)²] / (N - 1)

What is the expected value formula?

m = E(x) = p1 * x1 + p2 * x2 + ... + pN * xN

What does independence in probability indicate?

P(EF) = P(E) * P(F)

Which of the following distributions is used to describe rare events in a large population?

Poisson (D)

What type of probability distribution is used for modeling random events occurring over time?

Exponential (C)

What is the t-test primarily used for?

Testing differences between means of two groups (C)

What is the Bonferroni correction in multiple testing?

α/m

What is the use of the Central Limit Theorem?

It states that as N goes to infinity, the sample mean will be approximately normally distributed.

The odds ratio quantifies the strength of the association between two events in terms of the odds of exposure in _____ compared to controls.

cases

What are the parts of a network?

All of the above (D)

An unweighted graph has edges with non-negative weights.

False (B)

What is the degree of a node?

The number of edges connected to it.

What differentiates a multiplex network from a multilayer network?

Edges are different in subnetworks (B)

A path is defined as a walk that does not __________ itself.

intersect

Name one method for determining gene regulatory networks (GRNs).

Transcriptomic studies.

Phylogenetic trees are used to represent evolutionary relationships.

True (A)

What are 'walks' in a network?

A sequence of nodes connected by edges (B)

Which of the following are factors considered in feature selection?

Correlations? (A), Are things meaningful? (B), What samples to use? (C)

Classification categorizes data based on shared characteristics.

True (A)

What is the primary goal of prediction in machine learning?

To determine the outcome of a future event.

Random Forest uses a technique called ______ to construct decision trees.

ensemble method

What does the K in K Nearest Neighbors represent?

The number of closest matches considered.

What types of data can K Nearest Neighbors work with?

Binary data (A), Continuous data (C)

Match the following algorithms with their primary function:

K Nearest Neighbors = Supervised classification based on nearest matches
Random Forest = Ensemble method using multiple decision trees
K-Means Clustering = Unsupervised classification into clusters
Decision Tree = Data categorization based on feature splits

K-Means Clustering is a type of supervised classification.

False (B)

What do dimensions of matrices affect when multiplying them?

The ability to multiply them.

Which of the following are parts of a gene? (Select all that apply)

Promoters (A), Regulatory elements (B), Introns (C), Exons (D)

What are the primary tools used to determine parts of a gene?

Next-generation sequencing or SNP-arrays.

What is the difference between PCR and qPCR?

qPCR measures the quantity of DNA in real-time, whereas PCR amplifies DNA without measuring its quantity during the process.

What is sample space?

The set of all possible outcomes in a probability experiment.

What is a statistic that quantifies the strength of the association between two events?

Odds ratio.

Independence follows the rule: P(EF) = P(E)P(F).

True (A)

What is the difference between a directed and undirected graph?

A directed graph has edges with a direction, while an undirected graph has edges without direction.

What does a node's degree tell you?

It tells you the number of connections (edges) a node has.

What is a classification in machine learning?

A process of identifying the category to which new data points belong.

What is the difference between supervised and unsupervised learning?

Supervised learning uses labeled data while unsupervised learning uses unlabelled data.

What are the three different times in which you can select features?

Before collecting data, during data preparation, and after model evaluation.

During which specific situations can features be selected?

At the start of the experiment, during data review, and after experiment analysis (A)

Which choice accurately describes a phase in which features are selected?

Feature selection can be done at various stages including pre-analysis and post-analysis. (D)

What is one key aspect of the timing in feature selection?

Can be revisited at different points of the research process. (D)

Which scenario might not involve feature selection timing?

Determining features solely based on literature review prior to the project. (A)

When is it ideal to reassess feature selection?

At any transitional point in the study when new information arises. (B)

Study Notes

Course Information

This course is cross-listed as BINF5354, STAT5354, and STAT6354.
Lectures are held on Tuesdays from 9:00 AM to 10:20 AM.
Labs are held on Thursdays from 9:00 AM to 10:20 AM.
Dr. Jon Mohl is the instructor and his office hours are Tuesdays from 2:00 PM to 3:00 PM in CCSB 2.0306.
The teaching assistant's office hours are Thursdays from 10:30 AM to 11:30 AM in BE302.
Announcements and assignments will be posted on Blackboard.
The primary communication methods are email or through Teams.

Additional Resources

Recommended books include "Linear Algebra and Learning from Data" by Gilbert Strang, "Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids" by R.Durbin, S.Eddy, A.Krogh, and G.Mitchison, and "Introduction to Probability Models" by Sheldon M. Ross.
Students are encouraged to read articles related to the course topics.

Grading

The course grade is determined by homework assignments (50%), a midterm exam (20%), and a group project (30%).
The group project consists of a proposal, a final presentation, and a final report.

Attendance & Civility

Attendance is mandatory.
Students are expected to be civil, keep their phones on silent, and pay attention during lectures and labs.

Course Objectives

Students will learn practical skills including programming in R and Python, and working on Linux machines.
Students will gain theoretical knowledge in linear algebra, probability/statistics, machine learning, and bioinformatics.

Projects

Projects will utilize genomics data from VCF files.

Matrices

A matrix is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns.
Matrices are used to represent systems of equations, graphs, and other mathematical objects.
The number of rows and columns in a matrix is referred to as its dimensions, for example, a matrix with m rows and n columns is called an m x n matrix.
The rank of a matrix is the number of linearly independent columns.
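
As a minimal sketch of these ideas, the Python below solves the example system from the questions above (X1 + 3X2 = 8 and X1 + 2X2 = 6) and counts independent columns. The function names `solve_2x2` and `rank` are illustrative and not from the course materials; the rank is computed by Gaussian elimination with exact fractions, which is one of several possible approaches.

```python
from fractions import Fraction

def solve_2x2(A, b):
    """Solve a 2 x 2 linear system A x = b with Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("columns are dependent; no unique solution")
    x1 = Fraction(b[0] * a22 - a12 * b[1], det)
    x2 = Fraction(a11 * b[1] - b[0] * a21, det)
    return x1, x2

def rank(M):
    """Count linearly independent columns via Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in M]
    r = 0  # number of pivots found so far
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # this column depends on earlier ones
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * p for a, p in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

# Coefficient matrix and right-hand side of the example system.
A = [[1, 3], [1, 2]]
b = [8, 6]
x1, x2 = solve_2x2(A, b)
```

Here the matrix of coefficients `A` is 2 x 2 with rank 2, so the system has a unique solution; a matrix such as `[[1, 2], [2, 4]]` has rank 1 because its second column is twice the first.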

Uses of Matrices in Biology

Matrices are used to model gene regulatory networks, metabolic networks, infectious disease models, and survival analysis.

Genomics

Genomics is the study of genomes, which are the complete sets of genetic material in an organism.
Genomes are made up of DNA, which includes regulatory factors and repetitive elements.
Variations within a genome include large deletions, large duplications, translocations, and single nucleotide variations (SNVs).
SNVs include insertions (adding a nucleotide), deletions (removing a nucleotide), and polymorphisms (variations in DNA sequence within a population).

Variant Call Format (VCF)

VCF is a file format for storing sequence information, including DNA variants.
It is a tab-delimited file with a header region containing version information, chromosome/scaffold/contig (sequence of DNA within a chromosome) details, and information about the content of the file itself.
The VCF file can contain information on one or more samples.
It contains general position information like chromosome, position, reference allele, and alternate allele(s), and also information about the gene.
Sample-specific information included in VCF files includes allele calls, site depth, the number of counts per allele, and quality.
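
The layout described above can be sketched by parsing a single data line. This is an illustration only: the record and the sample names are made up, and the function `parse_vcf_line` is a hypothetical helper rather than part of any VCF library. The nine fixed column names follow the VCF specification; the FORMAT keys used here (GT for allele calls, DP for depth, AD for per-allele counts, GQ for genotype quality) are common but not required in every file.

```python
# Fixed leading columns of a VCF data line (per the VCF specification).
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]

def parse_vcf_line(line, sample_names):
    """Split one tab-delimited VCF record into fixed and per-sample fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED, cols[:9]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # one or more alternate alleles
    keys = record["FORMAT"].split(":")        # e.g. GT:DP:AD:GQ
    record["samples"] = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(sample_names, cols[9:])
    }
    return record

# Made-up record for illustration only (not real data).
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP:AD:GQ\t0/1:14:8,6:99\t1/1:16:0,16:48"
rec = parse_vcf_line(line, ["sampleA", "sampleB"])
```

In a real file the sample names come from the `#CHROM` header line, and header lines starting with `##` would be skipped before parsing records.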

Group Project

Students are divided into 3 groups.
Each group will form a testable hypothesis about population analysis or disease.
The groups need to compile genetic information and determine the software necessary to analyze the data.
Students will run the analysis on “Biolinux machines” (computer systems specifically designed for biological research) and prepare a preliminary presentation outlining the results.
They will complete the project by presenting their final findings in a written report and final presentation.

Matrices

Matrices represent systems of linear equations, graphs, and biological networks.
A matrix is a rectangular array of numbers.
The dimensions of a matrix are defined by the number of rows and columns.
The rank of a matrix is the number of linearly independent columns.
Matrices are used to represent systems of equations, graphs, gene regulatory networks, metabolic networks, infectious disease models, and survival analysis.

Genomics

Genomics is the study of genomes, including the sequencing and analysis of genetic information.
Genomes contain a variety of elements, including genes, regulatory factors, repetitive elements, exons, introns, and untranscribed elements.
Variations within the genome include large deletions, large duplications, translocations, and single-nucleotide variations (SNVs).
SNVs can be insertions, deletions, or polymorphisms.

Applications of Genomics

Genomic information can be used for population and ancestry structure studies, genome-wide association studies (GWAS), and determining the effects of variants on gene function.
SNVs can be synonymous or non-synonymous, meaning they may or may not alter the amino acid sequence of a protein.
Variants can also disrupt regulatory elements affecting gene expression.
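
To make the synonymous versus non-synonymous distinction concrete, here is a small sketch that compares the amino acid encoded before and after a single-nucleotide change. The codon table is deliberately partial (only the three codons used in the example), and `classify_snv` is an illustrative name, not a function from the course.

```python
# Partial codon table: only the codons needed for this illustration.
CODON_TO_AA = {"GAA": "Glu", "GAG": "Glu", "GTG": "Val"}

def classify_snv(ref_codon, alt_codon):
    """Label a single-codon change as synonymous or non-synonymous."""
    ref_aa = CODON_TO_AA[ref_codon]
    alt_aa = CODON_TO_AA[alt_codon]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"

# GAA -> GAG keeps glutamate; GAG -> GTG replaces glutamate with valine.
silent = classify_snv("GAA", "GAG")
changed = classify_snv("GAG", "GTG")
```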

Variant Call Format (VCF)

VCF is a file format used to store sequence information, including variant calls.
VCF files are tab-delimited and contain a header region providing information about the file format, chromosome/scaffold/contig details, and content information.
VCF files can contain data for one or more samples and include general position information as well as sample-specific information.
VCF files can contain all sites or only those with variants.

Group Project

Students will work in groups to develop and test a hypothesis related to population analysis or a specific disease.
Each group will collect genetic information and determine the necessary software for their analysis.
The groups will present preliminary findings and reports, run their analyses on Biolinux machines, and provide final write-ups and presentations.

Transcriptomics

Transcriptomics studies the complete set of RNA transcripts produced by an organism.
Useful for understanding gene expression patterns and identifying genes involved in specific processes.
Common techniques include real-time PCR (qPCR), microarray analysis, and next-generation sequencing (NGS).

Real-time PCR (qPCR)

A technique used for quantifying specific DNA or RNA molecules.
Uses reverse transcription to convert RNA to cDNA followed by amplification with specific primers.
Fluorescent probes detect amplified DNA in real-time, allowing for quantification of target molecule.

Quantitative PCR

Uses a cycle threshold (CT) value to measure the amount of target molecule present in a sample.
Lower CT values indicate higher target molecule abundance.
Calculation of ΔΔCT is used to determine fold change in gene expression between experimental and control groups.

Microarrays

A technology that allows for simultaneous analysis of thousands of genes.
Short DNA sequences (probes) specific to individual genes are attached to a solid surface.
Fluorescently labeled cDNA from samples is hybridized to the array.
The amount of fluorescence bound to each probe reflects the abundance of the corresponding mRNA in the sample.

Next-Generation Sequencing (NGS)

High-throughput sequencing technology that can sequence millions or billions of DNA fragments simultaneously.
Total RNA can be sequenced using NGS to obtain a snapshot of the transcriptome.
Poly-A capture is used to isolate mRNA, and specific probe sequences can be used to enrich for target transcripts.

Why do transcriptomic experiments?

Gene discovery: Identify new genes involved in specific processes or cellular functions.
Compare experimental treatments: Determine the effect of treatments (drugs, knockouts/knockins, toxins) on gene expression.
Toxicology: Assess the impact of toxins on gene expression and cellular pathways.
Developmental stages: Study changes in gene expression during development.

Transcriptomics

Transcriptomics is the study of the complete set of RNA transcripts in a cell or organism
It involves studying the structure, function, and regulation of RNA molecules
Three common techniques for transcriptomic analysis: real-time PCR (qPCR), Microarrays, and Next Generation Sequencing

Real-Time PCR (qPCR)

Amplifies DNA by using primers to specifically target DNA segments
Is used for detection, cloning, and sequencing of specific regions of DNA
Quantification of PCR products is achieved by measuring fluorescence during the reaction

Quantitative PCR

Reverse Transcription: Conversion of RNA to cDNA to be used in PCR reaction
Amplify: Amplifies the DNA from the reverse transcription step
Capture Fluorescence: Fluorescence is measured, which is proportional to the amount of cDNA amplified
Repeat: Steps 2 and 3 are repeated for 29 more cycles
Determine the Double Delta-CT: Formula used to calculate fold change of the gene of interest
- Gene of Interest Experimental (TE)
- Gene of Interest Control (TC)
- House Keeping Gene Experimental (HE)
- House Keeping Gene Control (HC)

Calculating Fold Change

Delta-CT (ΔCT): Difference in cycle threshold (CT) between the gene of interest and the housekeeping gene
- ΔCT_E = TE - HE
- ΔCT_C = TC - HC
Double Delta-CT (ΔΔCT): Difference between the experimental and control ΔCT values, used to calculate the relative fold change in gene expression
- ΔΔCT = ΔCT_E - ΔCT_C
Fold Change: Given by 2^(-ΔΔCT), the expression of the gene of interest in the experimental sample relative to the control
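
The calculation can be sketched in a few lines of Python. The CT values below are invented for illustration, and `fold_change` is a hypothetical helper name; the arguments follow the TE, HE, TC, HC abbreviations defined above, and the fold change is taken as 2 raised to the power of -ΔΔCT.

```python
def fold_change(te, he, tc, hc):
    """Relative expression of the gene of interest via the double delta-CT method."""
    delta_ct_exp = te - he                  # gene of interest vs housekeeping, experimental
    delta_ct_ctl = tc - hc                  # gene of interest vs housekeeping, control
    ddct = delta_ct_exp - delta_ct_ctl      # double delta-CT
    return 2 ** (-ddct)

# Invented CT values: the target crosses threshold 3 cycles earlier
# in the experimental sample, relative to the housekeeping gene.
fc = fold_change(te=22.0, he=18.0, tc=25.0, hc=18.0)
```

With these numbers ΔΔCT is -3, so the fold change is 8: a lower CT means more starting template, and each cycle corresponds to roughly a doubling.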

Microarrays

Use a solid surface with thousands of probes that represent different genes
Samples of cDNA, labeled with fluorescent dyes, are hybridized to the microarray
The amount of fluorescence detected at each probe indicates the level of expression of the corresponding gene
Advantages of Microarrays:
- Allows the simultaneous assessment of the expression of thousands of genes
- Provides high-throughput analysis of gene expression patterns
- Identifies gene expression profiles that correlate with specific biological states or conditions
Limitations of Microarrays:
- Can be expensive and technically challenging
- May not be as sensitive or specific as qPCR
- Limited to known, pre-selected target genes

Next-generation Sequencing

Involves sequencing millions or billions of DNA fragments simultaneously
Provides a more comprehensive view of the transcriptome, uncovering novel transcripts, gene fusions, and splice variants
The technology enables researchers to identify the abundance and expression of all transcripts in a sample
This approach allows for the discovery of previously unknown genes and transcripts and the identification of changes in gene expression in response to various stimuli
Steps involved in Next Generation Sequencing:
- Isolating total RNA from a sample
- Poly-A Capture: Selectively isolates mRNA molecules that are polyadenylated
- Specific Probes: Used to target and enrich for specific sequences of interest

Why Conduct Transcriptomic Experiments?

Gene Discovery: Identifies novel genes or transcripts
Comparing Experimental Treatments: To study the effects of different treatments on gene expression
- Drug treatments
- Knockout/knockin studies
- Toxicology
- Developmental stages

Designing a Transcriptomic Experiment

Assumptions: Understand the specific hypothesis being tested and have data for the control group
Expected Results: Define the types of changes in gene expression expected to be observed in the experiment based on the hypothesis and its potential significance
Potential Pitfalls: Identify factors that could affect the outcome of the experiment
- Variations in RNA isolation, contamination, and technical errors during PCR amplification

Mean and Variance

Sample Mean: The average of a set of data points, denoted by 'm' or 'µ'.
Sample Variance: Measures how spread out the data are around the mean, denoted by 'S²', calculated by summing the squared deviations of each data point from the mean and dividing by N - 1.

Expected Value

The average value of a random variable, denoted by 'E(x)', calculated by taking the weighted average of all possible values of the variable, where the weights are the probabilities of each value.

Variance

A measure of how spread out the data is from the expected value, denoted by 'σ²'.
Calculated as the probability-weighted average of the squared deviations of each possible value from the expected value.

Standard Deviation

The square root of the variance, a measure of dispersion that is in the same units as the data.
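
A short sketch of these quantities using Python's standard library. The data set is made up; the second half uses a fair six-sided die as an assumed example of a random variable with known probabilities. Note that `statistics.variance` divides by N - 1 (the sample variance), whereas `statistics.pvariance` divides by N.

```python
import math
import statistics
from fractions import Fraction

# Sample statistics on a made-up data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]
m = statistics.mean(data)        # sample mean
s2 = statistics.variance(data)   # sample variance, divides by N - 1
s = math.sqrt(s2)                # sample standard deviation

# Expected value and variance of a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6
expected = sum(p * x for p, x in zip(probs, values))
var = sum(p * (x - expected) ** 2 for p, x in zip(probs, values))
```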

Sample Space

The set of all possible outcomes of an experiment.
Examples:
- A coin toss: {Heads, Tails}
- Rolling a die: {1, 2, 3, 4, 5, 6}
- Two coin tosses: {HH, HT, TH, TT}

Probability

The likelihood of an event occurring.
Examples:
- Tossing a fair coin: 1/2 probability of getting heads
- Rolling a specific number on a fair die: 1/6 probability

Independence

Two events are independent if the occurrence of one does not affect the probability of the other.
P(E and F) = P(E) * P(F)

Conditional Probabilities

The probability of an event happening given that another event has already occurred.
P(E|F) = P(E and F) / P(F)
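
Both rules can be checked by enumerating a small sample space. The sketch below uses the two-coin-toss sample space from above, with E = "first toss is heads" and F = "second toss is heads"; the helper `prob` and the event definitions are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin tosses: HH, HT, TH, TT (equally likely).
space = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for outcome in space if event(outcome)), len(space))

def E(outcome):
    return outcome[0] == "H"   # first toss is heads

def F(outcome):
    return outcome[1] == "H"   # second toss is heads

p_e, p_f = prob(E), prob(F)
p_ef = prob(lambda o: E(o) and F(o))
p_e_given_f = p_ef / p_f       # conditional probability P(E|F)
```

Because P(E and F) equals P(E) * P(F), the events are independent, and conditioning on F leaves the probability of E unchanged.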

Probability Distributions

Describe the probability of each possible outcome of a random variable.
Examples:
- Binomial: For events with two possible outcomes (e.g., coin toss).
- Poisson: For rare events (e.g., mutations in a cell population).
- Exponential: For events occurring at a constant rate over time (e.g., protein decay).
- Gaussian (Normal): For averages of many trials (e.g., height).
- Log-normal: When the logarithm of a variable has a normal distribution.
- Chi-squared: For the distance squared in multiple dimensions.
- Multivariable Gaussian: For probabilities of a vector of variables.

Normal Distribution

A continuous probability distribution that is bell-shaped and symmetrical.
Many biological and physiological measurements follow a normal distribution.
Central Limit Theorem: As the sample size increases, the distribution of the sample mean will tend towards a normal distribution.
Skewness: A measure of the asymmetry of the distribution. Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail.

Poisson Distribution

A discrete probability distribution describing the probability of a given number of events occurring in a fixed interval of time or space.
Used for rare events.

Exponential Distribution

A continuous probability distribution that describes the time until the next event.
Used for random events that occur at a constant rate over time.

Binomial Distribution

A discrete probability distribution used to calculate the probability of a certain number of successes in a given number of trials.
Each trial has two possible outcomes: success or failure.

Odds Ratio

A statistic used to quantify the strength of the association between two events.
Measures the odds of exposure in cases versus controls.
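
As a worked sketch, the odds ratio can be computed from a 2 x 2 table of exposure counts in cases and controls. The counts below are invented, and `odds_ratio` is an illustrative helper name.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of exposure in cases divided by odds of exposure in controls."""
    odds_cases = cases_exposed / cases_unexposed
    odds_controls = controls_exposed / controls_unexposed
    return odds_cases / odds_controls

# Invented counts: 30 of 100 cases exposed, 10 of 100 controls exposed.
example_or = odds_ratio(30, 70, 10, 90)
```

An odds ratio of 1 means exposure is equally likely in cases and controls; values above 1 indicate that exposure is more common among cases.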

T-test

A statistical test used to determine if there is a significant difference between the means of two groups.
Paired (Dependent) T-test: Used when the two groups are related (e.g., before and after treatment).
Equal Variance (Independent) T-test: Used when two groups have the same variance.
Unequal Variance (Independent) T-test: Used when two groups have different variances.
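
The test statistics for two of these variants can be sketched with the standard library. This computes only the t statistic and degrees of freedom; turning them into a p-value needs the t distribution, which is not in the standard library, so that step is left out here. The measurements are made up, and the unequal-variance version uses the Welch-Satterthwaite approximation for the degrees of freedom.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, df) for the unequal-variance (Welch) t-test."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def paired_t(before, after):
    """Return (t, df) for the paired t-test on matched measurements."""
    diffs = [y - x for x, y in zip(before, after)]
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Made-up measurements for illustration.
control = [5.0, 5.2, 4.8, 5.1]
treated = [6.1, 5.9, 6.3, 6.0]
t_welch, df_welch = welch_t(control, treated)
t_paired, df_paired = paired_t([10, 12, 9, 11], [12, 13, 11, 14])
```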

Multiple Hypothesis Testing

The problem of testing multiple hypotheses simultaneously, leading to an increased risk of Type I errors (false positives).

Multiple Testing Correction

Methods used to control the risk of false positives when testing multiple hypotheses.
Bonferroni correction: Divides the significance level (alpha) by the number of tests.
Benjamini-Hochberg procedure: Adjusts the p-values based on their rank to control the false discovery rate.
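
Both corrections can be sketched as functions that return adjusted p-values. The Bonferroni version multiplies each p-value by the number of tests, which is equivalent to comparing the raw p-value with alpha divided by the number of tests. The Benjamini-Hochberg version shown is the usual step-up adjustment; the p-values are invented and the function names are illustrative.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Rank-based adjustment: p * m / rank, made monotone from the largest p down."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values from four tests.
pvals = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
```

At a threshold of 0.05, Bonferroni keeps two of these four results while Benjamini-Hochberg keeps all four, which reflects the more conservative nature of Bonferroni.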

Network Terms

Nodes: Points or entities in a network
Edges: Connections between nodes
Leaves: Nodes with only one connection

Weighted vs. Unweighted Networks

Unweighted: All edges have equal importance
Weighted: Edges have different values or strengths

Directed vs. Undirected Networks

Directed: Edges have a direction, one-way flow
Undirected: Edges are bidirectional, two-way flow

Cycles and Acyclic Networks

Cycle: A closed path in a network, starting and ending at the same node
Acyclic: Networks with no cycles

Trees

Tree: A connected acyclic network

Multilayer and Multiplex Networks

Multilayer: Multiple networks connected by edges between the subnetworks
Multiplex: Same nodes, but edges represent different relationships or contexts

Degree of a Node

The number of edges connected to a node
Indicates the node's influence or connectivity

Walks and Paths

Walks: Sequence of nodes with connected edges
Self-avoiding walks: Walks that don't repeat nodes
Paths: Walks that don't intersect themselves
Shortest path: The path between two nodes with the fewest edges (or the lowest total weight in a weighted network); algorithms exist to find this most efficient route
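
A minimal sketch of node degree and a shortest path in an unweighted, undirected network. The five-node graph is invented, and breadth-first search is used as one common way to find a path with the fewest edges; it is not necessarily the algorithm covered in the lectures.

```python
from collections import deque

# Undirected, unweighted network stored as an adjacency list (invented example).
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def degree(g, node):
    """Number of edges connected to a node."""
    return len(g[node])

def shortest_path(g, start, goal):
    """Breadth-first search; returns a fewest-edge path or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in g[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

route = shortest_path(graph, "A", "E")
```

Node E has degree 1, which makes it a leaf in the sense defined under Network Terms.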

Phylogenetic Trees

Represent evolutionary relationships between organisms

Gene Regulatory Networks (GRNs)

Nodes are genes or regulatory elements
Edges represent regulatory interactions between genes

Determining GRNs

Transcriptomic studies: Analyze gene expression levels
Knockout: Deactivate a gene to study its effect
Knockin: Insert a gene to observe its influence
Drugs: Study drug interactions with specific genes

Multiplex Networks in Different Contexts

Different cell types or conditions can be represented in a multiplex network
Allows analysis of complex interactions across different contexts

Protein-Protein Interactions

Nodes represent proteins
Edges signify interactions between proteins, such as binding or complex formation
Important for understanding cellular processes and signaling pathways

Machine Learning Intro

Machine learning is a field that uses computer science, mathematics, statistics, and biology to solve problems.

Defining the Problem

Machine learning solves problems through classification and prediction.
Classification categorizes data into groups based on shared characteristics.
Prediction determines the outcome of a future event.
Key considerations when defining a problem:
- Accuracy, sensitivity, and specificity.
- Throughput, especially whether the solution is limited to a small number of use cases.
- Which features are necessary.
- How many and which samples to use.

Feature Selection

Feature selection determines which features are important for machine learning.
Feature selection helps to:
- Identify meaningful features.
- Determine correlations between features.
Filter methods test for correlations, such as univariate and multivariate analysis.
Wrapper methods select and test groups of features to identify the best combination.
Embedded methods are part of the machine learning algorithm itself, where feature selection is integrated into the learning process.

Decision Tree

Decision trees can work with continuous, discrete, and categorical data.
Steps:
- Determine the best split to separate data based on a specific feature.
- Move samples along the tree based on the split criteria.
- Repeat the process until a decision is reached.
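
The first step, finding the best split, can be sketched for a single continuous feature. This example scores candidate thresholds by weighted Gini impurity, which is an assumption: the notes do not say which criterion the course uses. The values, labels, and function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, higher for mixed groups."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Invented expression values with two classes.
values = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labels = ["low", "low", "low", "high", "high", "high"]
threshold, impurity = best_split(values, labels)
```

A full tree would apply `best_split` recursively to each side of the chosen threshold until the groups are pure or too small to divide.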

Random Forest

Random forest is an ensemble method that combines multiple decision trees for improved prediction.
Steps:
- Subset the various features with replacement, meaning features are chosen randomly with possible duplicates.
- Construct decision trees using the subset of features.
- Ensemble method combines the predictions from all the decision trees to determine the final prediction.

Random Forest Variable Importance

Mean Decrease Gini: A measure of how much each feature contributes to reducing the impurity in the decision tree.
- Higher Mean Decrease Gini indicates a more important feature.
Random forest can be used to identify the most important variables based on the Mean Decrease Gini score.

K Nearest Neighbors (KNN)

KNN is a supervised classification method that groups samples based on their similarity to known samples.
How it works:
- Determines the class of an unknown sample by considering its K nearest neighbors, those with the smallest distance in multi-dimensional space.
KNN requires data to be binary or continuous, with the option to use principal component analysis to transform data if necessary.
Distance measures quantify the similarity between samples:
- Euclidean distance (straight-line distance between two points).
- Manhattan distance (sum of the absolute differences between coordinates across all dimensions).
- Jaccard similarity coefficient (overlap in presence/absence between two sets).
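
A minimal sketch of the three measures and of a KNN vote built on top of them. The function names, the default of K = 3, and the toy samples are assumptions made for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_similarity(a, b):
    """Shared items divided by all items; 1 - similarity gives a distance."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def knn_classify(query, samples, labels, k=3, distance=euclidean):
    """Assign the majority label among the k samples closest to the query."""
    ranked = sorted(zip(samples, labels), key=lambda pair: distance(query, pair[0]))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature samples with known labels.
samples = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
labels = ["normal", "normal", "normal", "tumor", "tumor"]
predicted = knn_classify((1.5, 1.5), samples, labels, k=3)
print(predicted)
```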

K-Means Clustering

K-means clustering is an unsupervised classification method that groups samples into K clusters.
How it works:
- Randomly selects points in the data as initial centroids (representatives of each cluster).
- Assigns each sample to the closest centroid.
- Re-calculates the centroids based on the assigned samples.
- Repeats the process until the cluster assignments stabilize.
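
The four steps above can be sketched in plain Python. The `initial_centroids` parameter is an addition for this example so that a run can be made repeatable; by default the starting centroids are chosen at random from the data, as described. The points are made up.

```python
import random


def kmeans(points, k, initial_centroids=None, seed=0, max_iter=100):
    """Cluster points (tuples of numbers) into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Step 1: pick starting centroids (randomly unless supplied).
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return assignments, centroids


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

Because the result depends on the starting centroids, K-means is often run several times with different random starts.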

Matrices

Dimensions matter when multiplying matrices
Used to represent systems of equations
Used to represent networks

Genomics

Parts of a gene:
- Exons and Introns
- Untranslated regions (UTRs)
- Regulatory elements:
  - Promoters
Determined via next-generation sequencing or SNP-arrays

Transcriptomics

PCR = Polymerase Chain Reaction
qPCR = Quantitative PCR
qPCR, microarrays, and RNAseq experiments are all used to analyze gene expression
Microarrays use probes to detect mRNA, while RNAseq sequences the entire transcriptome
RNAseq is considered the most accurate method, but it is also the most expensive

Probability

Sample space: set of all possible outcomes of an experiment
Types of probability distributions:
- Normal distribution
- Poisson distribution
- Binomial distribution
Statistic quantifying association between events:
- Correlation coefficient
Multiple testing correction:
- Correcting for the increased probability of false positives when conducting multiple statistical tests

Independence

P(E and F) = P(E) * P(F)

Conditional Probabilities

P(E|F) = Probability of Event E, given that Event F has already occurred
Formula (Bayes' rule): P(E|F) = P(F|E)P(E) / P(F)
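
A short worked example of this formula, P(E|F) = P(F|E)P(E) / P(F). The probabilities are invented for illustration: E stands for "sample carries a variant" and F for "assay is positive".

```python
def conditional_probability(p_f_given_e, p_e, p_f):
    """Bayes' rule: P(E|F) = P(F|E) * P(E) / P(F)."""
    return p_f_given_e * p_e / p_f


p_e = 0.01              # P(E): made-up prior probability of carrying the variant
p_f_given_e = 0.95      # P(F|E): assay positive when the variant is present
p_f_given_not_e = 0.05  # P(F|not E): assay positive when the variant is absent

# P(F) from the law of total probability.
p_f = p_f_given_e * p_e + p_f_given_not_e * (1 - p_e)

p_e_given_f = conditional_probability(p_f_given_e, p_e, p_f)
print(round(p_e_given_f, 3))
```

With these numbers a positive assay raises the probability of carrying the variant from 1% to roughly 16%, because positives from the much larger variant-free group still dominate.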

Networks

Multilayer: Different networks connected by edges
Multiplex: Nodes are the same, but edges are different in the subnetworks

Directed graph: Edges have direction
Undirected graph: Edges have no direction
Node's degree: Number of edges connected to a node
Path: Sequence of nodes connected by edges
Networks can be used to model and explain biological concepts, such as protein-protein interactions or gene regulatory networks
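
Node degree can be computed directly from an edge list. In this sketch the node names are placeholders rather than real interactions, and the split into in-degree and out-degree for directed graphs is a standard extension that the notes do not spell out.

```python
from collections import defaultdict


def degrees(edges):
    """Degree of each node in an undirected graph given as (u, v) pairs."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def in_out_degrees(edges):
    """In-degree and out-degree of each node when edges point from u to v."""
    in_degree, out_degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_degree[u] += 1
        in_degree[v] += 1
    return dict(in_degree), dict(out_degree)


# Placeholder edges, e.g. "geneA regulates geneB".
edges = [("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneD")]
undirected = degrees(edges)
in_deg, out_deg = in_out_degrees(edges)
print(undirected, in_deg, out_deg)
```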

Machine Learning

Classification: Categorize data into groups
Prediction: Estimate the value of a variable
Three different times to select features:
- During data collection
- During data preprocessing
- During model training
Decision tree: Model used for classification and regression that uses a tree-like structure to make decisions
Supervised learning: Train a model on labeled data
Unsupervised learning: Train a model on unlabeled data

Mid-Term Layout

Part 1: Knowledge base (50 points)
- In-class
Part 2: Critical Review of a paper (50 points)
- Take home (Due Oct 17)

Synonymous vs. Non-Synonymous Variations

Synonymous variations do not change the amino acid sequence, while non-synonymous variations do.

Variant Call Format (VCF)

The Variant Call Format (VCF) is a standardized file format used to store and exchange genetic variation data.

Genomic Variations

Translocations involve the exchange of DNA segments between non-homologous chromosomes.

Gene Components

Introns are elements included in genes but not part of the coding region.

Matrix Terminology

The term "rank" in matrices refers to the number of linearly independent rows or columns in a matrix.

Matrix Equations

A system of equations can be solved using matrices if it can be represented in the form Ax = b, where A is the matrix of coefficients, x is the vector of unknowns, and b is the vector of constants.
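
As a small worked example, the system X1 + 3X2 = 8 and X1 + 2X2 = 6 from the quiz can be written as Ax = b and solved. The sketch uses Cramer's rule for a 2 by 2 system, which is one of several possible methods; the function name is hypothetical.

```python
def solve_2x2(A, b):
    """Solve a 2-by-2 system Ax = b with Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("Coefficient matrix is singular; no unique solution.")
    x1 = (b[0] * a22 - a12 * b[1]) / det
    x2 = (a11 * b[1] - b[0] * a21) / det
    return x1, x2


A = [[1, 3],   # coefficients of X1 + 3*X2 = 8
     [1, 2]]   # coefficients of X1 + 2*X2 = 6
b = [8, 6]     # constants on the right-hand side

x1, x2 = solve_2x2(A, b)
print(x1, x2)
```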

Matrix Dimensions

"m by n" in a matrix represents its dimensions, indicating that it has m rows and n columns.

Matrices in Biology

Matrices can be used in biology to model genetic relationships, analyze protein interactions, and study population dynamics.

Independent Columns

A column in a matrix is considered independent if it cannot be expressed as a linear combination of the other columns.

Matrix Properties

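A matrix can have more columns than rows or more rows than columns. The incorrect statement is that matrices are only relevant in mathematics; they are used in many other fields, including biology.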

Matrix of Coefficients

In the context of systems of equations, the matrix of coefficients represents the coefficients of the variables in each equation.

Matrices in Technology

Matrices have applications in computer graphics, data analysis, and machine learning.

Single Nucleotide Variation (SNV)

SNV involves a change in a single nucleotide in a DNA sequence.

VCF File Structure

A primary characteristic of VCF file structure is its use of tab-delimited text format for storing variant information.

Genetic Variation and Disease

Genome-wide association studies (GWAS) can be conducted to investigate the relationship between genetic variations and disease susceptibility.

Silent Mutations

Silent mutations are genetic variations that do not result in a change in the amino acid sequence of a protein.

VCF File Content

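Genotype information is usually part of the sample-specific information in a VCF file (for example, the GT field in each sample column), whereas site-level annotations are stored in the fixed columns such as INFO.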

Matrices and Equations

Matrices are used in relation to equations to simplify and solve systems of linear equations.

Matrix Rank

The rank of a matrix is always less than or equal to the smaller of its number of rows and its number of columns.
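
Rank can be computed by row reduction and counting the pivots. This is a minimal sketch using exact fractions to avoid rounding problems; the function name and the example matrices are chosen for illustration.

```python
from fractions import Fraction


def matrix_rank(matrix):
    """Rank of a matrix (list of rows), found by Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue  # this column depends on earlier columns
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


dependent = [[1, 2], [2, 4]]      # second column is twice the first
independent = [[1, 3], [1, 2]]    # columns are independent
print(matrix_rank(dependent), matrix_rank(independent))
```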

Matrix Row Representation

A row of '1, 0, 0, 1, 0' in a matrix can indicate that a specific entity or variable is present in only the first and fourth positions.

Column Independence

For a column to be considered independent in a matrix, it must not be expressible as a linear combination (a weighted sum) of the other columns.

Matrix Notation

"m by n" is used to describe the dimensions of a matrix, indicating that it has "m" rows and "n" columns.

Matrices in Biology (Application)

Matrices are specifically applied in bioinformatics for sequence alignment analysis, where they can represent DNA or protein sequences.

Independent Column Feature

Independent columns in a matrix often contain information about distinct variables or characteristics.

Matrices in Graphics

Matrices are used in graphic representations to perform transformations, such as rotations, translations, and scaling of objects.

Quantitative PCR Components

Quantitative PCR (qPCR) components include:
- DNA template
- Primers
- PCR master mix
- Fluorescent dye or probe

qPCR Definition

qPCR stands for Quantitative Polymerase Chain Reaction.

Microarray RNA Analysis

Microarrays can be used for total RNA analysis to study gene expression patterns across a large number of genes.

qPCR Fluorescence

Capturing fluorescence in qPCR is used to quantify the amount of target DNA present in the reaction.
- Increased fluorescence indicates higher amounts of amplified target DNA.

Transcriptomic Experiments

Transcriptomic experiments are conducted to investigate changes in gene expression, which can provide insights into biological processes, diseases, and drug response.

Delta-CT (qPCR)

Delta-CT in qPCR represents the difference in cycle thresholds (CT) between the target gene and a reference gene.

qPCR Steps

The main steps in Quantitative PCR include:
- Denaturation (separation of DNA strands)
- Annealing (primers bind to DNA)
- Extension (new DNA strands are synthesized)

Fold Change (qPCR)

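The calculated Fold Change in qPCR is commonly given by 2 raised to the power of -ΔΔCT, where ΔΔCT is the ΔCT of the test sample minus the ΔCT of the control sample (if ΔΔCT is defined with the opposite sign, the exponent is +ΔΔCT). It indicates the relative expression level of the target gene.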
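
A minimal sketch of the fold-change calculation. It assumes the common convention that ΔCT = CT(target) - CT(reference), ΔΔCT = ΔCT(treated) - ΔCT(control), and fold change = 2^(-ΔΔCT); if ΔΔCT is defined with the opposite sign, the sign of the exponent flips. The CT values and the function name are made up.

```python
def fold_change(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Relative expression of the target gene, treated versus control."""
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2 ** (-delta_delta_ct)


# Made-up cycle thresholds: the target crosses the threshold two cycles
# earlier in the treated sample, while the reference gene is unchanged.
fc = fold_change(
    ct_target_treated=22.0, ct_ref_treated=18.0,
    ct_target_control=24.0, ct_ref_control=18.0,
)
print(fc)
```

Each cycle roughly doubles the amount of product, so a two-cycle head start corresponds to about four times more starting template.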

Microarray Purpose

The primary purpose of microarrays is to measure the expression levels of thousands of genes simultaneously, allowing for comprehensive gene expression profiling.

Transcriptomic Experiment Reasons

Reasons for conducting transcriptomic experiments include:
- Understanding gene expression patterns in different conditions
- Identifying biomarkers for disease diagnosis
- Studying drug response and toxicity

Next Generation Sequencing

Next generation sequencing (NGS) is not limited to total RNA; it can also be used for whole genome sequencing, exome sequencing, and other applications.

Delta-CT Significance

Delta-CT in qPCR represents the difference in cycle thresholds (CT) between the target gene and a reference gene, quantifying the relative expression level of the target gene.
- A smaller Delta-CT indicates higher target gene expression.

PCR Definition

PCR stands for Polymerase Chain Reaction.

Sample Mean Formula

The formula for the sample mean (denoted by $\bar{x}$) is:
- $\bar{x}$ = (Σx) / n, where Σx represents the sum of all values in the sample, and n is the sample size.

Sample Variance Formula

The formula for sample variance (denoted by s²) is:
- s² = Σ(x - $\bar{x}$)² / (n - 1), where x represents each data point, $\bar{x}$ is the sample mean, and n is the sample size.
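
The sample mean and sample variance formulas translate directly into code. The data values are made up for illustration.

```python
def sample_mean(values):
    """x-bar = (sum of x) / n."""
    return sum(values) / len(values)


def sample_variance(values):
    """s^2 = sum of (x - x-bar)^2 divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("Sample variance needs at least two values.")
    mean = sample_mean(values)
    return sum((x - mean) ** 2 for x in values) / (n - 1)


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sample_mean(values)
variance = sample_variance(values)
print(mean, variance)
```

Dividing by n - 1 rather than n corrects for the fact that the mean was itself estimated from the same sample.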

Expected Value Formula

The expected value (denoted by E(X)) is calculated by:
- E(X) = ΣxP(x), where x represents each possible value of the random variable X, and P(x) is the probability of that value occurring.
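
A short sketch of the expected value formula for a discrete random variable. The distribution below (number of variant alleles carried by an individual) uses invented probabilities.

```python
def expected_value(distribution):
    """E(X) = sum of x * P(x) over all possible values x."""
    total_probability = sum(distribution.values())
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("Probabilities must sum to 1.")
    return sum(x * p for x, p in distribution.items())


# Made-up distribution: value -> probability.
allele_count = {0: 0.49, 1: 0.42, 2: 0.09}
ev = expected_value(allele_count)
print(ev)
```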

Independence in Probability

Independence in probability means that the occurrence of one event does not affect the probability of another event occurring.

Rare Events Distribution

The Poisson distribution is used to describe rare events in a large population.

Events Over Time

The exponential distribution is used for modeling random events occurring over time.

T-Test Purpose

The t-test is primarily used to compare the means of two groups.

Bonferroni Correction

The Bonferroni correction in multiple testing adjusts the significance threshold to account for the increased chance of false positives when performing multiple statistical tests.
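
In its simplest form, the correction divides the significance level by the number of tests. The p-values below are made up, and the function name is hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the adjusted threshold and which tests remain significant."""
    threshold = alpha / len(p_values)
    significant = [p <= threshold for p in p_values]
    return threshold, significant


p_values = [0.001, 0.012, 0.03, 0.2]
threshold, significant = bonferroni(p_values)
print(threshold, significant)
```

Here 0.03 would pass an uncorrected 0.05 cutoff but fails the corrected threshold of 0.05 / 4.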

Central Limit Theorem

The Central Limit Theorem states that the distribution of sample means from a population will approach a normal distribution as the sample size increases.

Odds Ratio

The odds ratio quantifies the strength of the association between two events as the odds of exposure in one group (for example, cases) divided by the odds of exposure in the controls.
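
A minimal sketch of the calculation from a 2 by 2 table of counts, assuming a case-control layout. The counts are invented for illustration.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls


# Made-up counts: 30 of 100 cases and 10 of 100 controls were exposed.
or_value = odds_ratio(
    exposed_cases=30, unexposed_cases=70,
    exposed_controls=10, unexposed_controls=90,
)
print(round(or_value, 2))
```

An odds ratio of 1 means no association; values above 1 mean exposure is more common among cases.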

Network Components

The parts of a network are:
- Nodes (represent entities)
- Edges (represent connections between entities)

Unweighted Graph

An unweighted graph has edges without any weights assigned to them.

Node Degree

The degree of a node refers to the number of edges connected to it.

Multiplex vs. Multilayer Network

In a multiplex network, every layer contains the same set of nodes but a different type of edge (interaction). A multilayer network is more general: its layers can be different networks, with edges connecting nodes across layers.

Path Definition

A path is defined as a walk that does not visit any node more than once.

GRN Determination

One method for determining gene regulatory networks (GRNs) is Bayesian network inference, which uses probabilistic relationships between genes to reconstruct the network structure.

Phylogenetic Trees

Phylogenetic trees are used to represent evolutionary relationships between organisms or genes.

Network Walks

'Walks' in a network are sequences of nodes and edges, where the node at the end of one edge is the beginning of the next. Nodes and edges may be repeated in a walk; a walk that does not repeat any node is a path.

Feature Selection Considerations

Factors considered in feature selection include:
- Relevance (degree of association with the target)
- Redundancy (how much overlap exists between features)
- Cost (of obtaining and processing features)

Classification Definition

Classification categorizes data into predetermined categories based on shared characteristics.

Prediction in Machine Learning

The primary goal of prediction in machine learning is to build models that can accurately predict future outcomes based on historical data.

Random Forest Technique

Random Forest uses a technique called bagging (bootstrap aggregating) to construct decision trees.

K Nearest Neighbors (K)

The K in K Nearest Neighbors represents the number of nearest neighbors to consider when classifying a new data point.

KNN Data Types

K Nearest Neighbors can work with numerical, categorical, and mixed data, provided a suitable distance measure is chosen or the data are first transformed into binary or continuous values.

Algorithm Functions

Algorithm & Primary Function:
- K-Means Clustering: Unsupervised clustering algorithm that groups data points into clusters based on their similarity.
- K Nearest Neighbors: Supervised classification algorithm that classifies a new data point based on its proximity to known labeled data points.
- Decision Tree: Supervised classification and regression algorithm that uses a tree-like structure to make predictions.
- Random Forest: Ensemble learning method that combines multiple decision trees to improve prediction accuracy and reduce variance.

Matrix Dimension Impact

Dimensions of matrices affect the resulting matrix after multiplication:
- If the number of columns in the first matrix is not equal to the number of rows in the second matrix, multiplication is not possible.
- The resulting matrix will have the same number of rows as the first matrix and the same number of columns as the second matrix.
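
Both rules can be seen in a small matrix multiplication written out by hand. The function name and the example matrices are chosen for illustration.

```python
def matmul(A, B):
    """Multiply an m-by-n matrix A by an n-by-p matrix B (lists of rows)."""
    rows_a, cols_a = len(A), len(A[0])
    rows_b, cols_b = len(B), len(B[0])
    if cols_a != rows_b:
        raise ValueError(
            f"Cannot multiply {rows_a}x{cols_a} by {rows_b}x{cols_b}: "
            "columns of the first must equal rows of the second."
        )
    # The result has rows_a rows and cols_b columns.
    return [
        [sum(A[i][k] * B[k][j] for k in range(cols_a)) for j in range(cols_b)]
        for i in range(rows_a)
    ]


A = [[1, 2, 3],
     [4, 5, 6]]   # 2 by 3
B = [[1, 0],
     [0, 1],
     [1, 1]]      # 3 by 2

C = matmul(A, B)  # 2 by 2 result
print(C)
```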

Gene Components

Parts of a gene include:
- Promoter (regulates gene expression)
- Exons (coding sequences)
- Introns (non-coding sequences)
- 5' untranslated region (UTR)
- 3' untranslated region (UTR)
- Polyadenylation signal (signals the end of the transcript)

Gene Determination Tools

Primary tools used to determine parts of a gene include:
- DNA sequencing
- Gene prediction algorithms
- RNA sequencing

PCR vs. qPCR

PCR (Polymerase Chain Reaction) amplifies DNA, while qPCR (Quantitative Polymerase Chain Reaction) quantifies the amount of DNA amplified.

Sample Space

Sample space is the set of all possible outcomes of an experiment or random phenomenon.

Association Strength

A statistic that quantifies the strength of the association between two events is the correlation coefficient.

Independence Rule

Independence follows the rule: P(EF) = P(E)P(F), where P(EF) is the probability of both events E and F occurring, P(E) is the probability of event E occurring, and P(F) is the probability of event F occurring.
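
The rule can be checked by enumerating a small sample space. This sketch uses two fair dice, an example chosen here for illustration, with exact fractions so that the equality test is not affected by rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a function of one outcome."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))


def first_is_even(outcome):      # event E
    return outcome[0] % 2 == 0


def second_is_six(outcome):      # event F
    return outcome[1] == 6


def total_is_twelve(outcome):    # event G, which depends on both dice
    return sum(outcome) == 12


p_e, p_f, p_g = prob(first_is_even), prob(second_is_six), prob(total_is_twelve)
p_ef = prob(lambda o: first_is_even(o) and second_is_six(o))
p_fg = prob(lambda o: second_is_six(o) and total_is_twelve(o))

e_f_independent = p_ef == p_e * p_f   # E and F satisfy P(EF) = P(E)P(F)
f_g_independent = p_fg == p_f * p_g   # F and G do not
print(e_f_independent, f_g_independent)
```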

Directed vs. Undirected Graph

A directed graph has edges with a specific direction, while an undirected graph has edges without a specific direction.

Node Degree

A node's degree tells you the number of connections it has to other nodes in a network.

Classification

Classification in machine learning is the process of assigning data points to predefined categories based on their characteristics.

Supervised vs. Unsupervised Learning

Supervised learning uses labeled data to train models, while unsupervised learning uses unlabeled data to discover patterns.

Feature Selection Timing

Three different times in which you can select features:
- Pre-processing: Features are selected before training the model.
- During training: Features are selected during the model training process.
- Post-processing: Features are selected after the model has been trained.

Feature Selection Situations

Features are selected during specific situations:
- High dimensionality: When there are many features and only a few are relevant.
- Overfitting: When the model is too complex and performs well on the training data but poorly on new data.
- Computational efficiency: When reducing the number of features can speed up the training process.

Feature Selection Phase

A phase in which features are selected is the feature engineering phase. This phase involves selecting, transforming, and creating new features that improve the performance of machine learning models.

Feature Selection Timing Aspect

One key aspect of the timing in feature selection is that it can influence the model's performance. If features are selected before training, the model might miss out on valuable information. If features are selected after training, the model might not perform well on new data.

Feature Selection Scenario

A scenario that might not involve feature selection timing is when the data is already clean and relevant, with a few features that are well-defined and contribute directly to the model's performance.

Feature Selection Reassessment

It is ideal to reassess feature selection when the data distribution changes significantly, when new data becomes available, or when there are changes in the problem that require a different set of features.
