Questions and Answers
What differentiates synonymous variations from non-synonymous variations?
Which of the following best describes the purpose of the Variant Call Format (VCF)?
Which type of genomic variation includes the exchange of DNA segments between non-homologous chromosomes?
What type of elements are included in genes but are not part of the coding region?
What does the term 'rank' refer to in the context of matrices?
Which equation represents a system solvable through matrices?
What does 'm by n' represent in a matrix?
How can matrices be used in biology?
What constitutes a column being considered independent in a matrix?
Which of the following statements about matrices is incorrect?
In the context of systems of equations, what does the matrix of coefficients represent?
Which of the following is a common application of matrices in technology?
Which genomic variation involves a change in a single nucleotide that may or may not affect the protein coding sequence?
What is a primary characteristic of the Variant Call Format (VCF) file structure?
What type of analysis can be conducted to investigate the relationship between genetic variations and disease susceptibility?
Which term refers to genetic variations that do not result in a change in the amino acid sequence of a protein?
Which of the following components is NOT usually included in the sample-specific information within a VCF file?
What is the main reason matrices are used in relation to equations?
Which statement about the rank of a matrix is true?
In the context of matrix representation, what can the row of '1, 0, 0, 1, 0' indicate?
What condition must be satisfied for a column to be considered independent in a matrix?
How is the term 'm by n' used in relation to matrices?
Which application of matrices is specifically mentioned in the context of biology?
What is a common feature of independent columns in a matrix?
What is the purpose of using matrices in graphic representations?
What are the components of Quantitative PCR?
What does qPCR stand for?
Microarrays can be used for total RNA analysis.
What is the purpose of capturing fluorescence in qPCR?
What is one of the reasons for conducting transcriptomic experiments?
What does the term Delta-CT refer to in qPCR?
What are the main steps in Quantitative PCR?
The calculated Fold Change in qPCR is given by the formula 2 raised to the power of ΔΔCT.
What is the primary purpose of microarrays?
List some reasons for conducting transcriptomic experiments.
Next generation sequencing only uses Total RNA.
What does Delta-CT represent in qPCR?
PCR stands for ______.
What is the formula for the sample mean?
What is the formula for sample variance?
What is the expected value formula?
What does independence in probability indicate?
Which of the following distributions is used to describe rare events in a large population?
What type of probability distribution is used for modeling random events occurring over time?
What is the t-test primarily used for?
What is the Bonferroni correction in multiple testing?
What is the use of the Central Limit Theorem?
The odds ratio quantifies the strength of the association between two events in terms of the odds of exposure in _____ compared to controls.
What are the parts of a network?
An unweighted graph has edges with non-negative weights.
What is the degree of a node?
What differentiates a multiplex network from a multilayer network?
A path is defined as a walk that does not __________ itself.
Name one method for determining gene regulatory networks (GRNs).
Phylogenetic trees are used to represent evolutionary relationships.
What are 'walks' in a network?
Which of the following are factors considered in feature selection?
Classification categorizes data based on shared characteristics.
What is the primary goal of prediction in machine learning?
Random Forest uses a technique called ______ to construct decision trees.
What does the K in K Nearest Neighbors represent?
What types of data can K Nearest Neighbors work with?
Match the following algorithms with their primary function:
K-Means Clustering is a type of supervised classification.
What do dimensions of matrices affect when multiplying them?
Which of the following are parts of a gene? (Select all that apply)
What are the primary tools used to determine parts of a gene?
What is the difference between PCR and qPCR?
What is sample space?
What is a statistic that quantifies the strength of the association between two events?
Independence follows the rule: P(EF) = P(E)P(F).
What is the difference between a directed and undirected graph?
What does a node's degree tell you?
What is a classification in machine learning?
What is the difference between supervised and unsupervised learning?
What are the three different times in which you can select features?
During which specific situations can features be selected?
Which choice accurately describes a phase in which features are selected?
What is one key aspect of the timing in feature selection?
Which scenario might not involve feature selection timing?
When is it ideal to reassess feature selection?
Study Notes
Course Information
- This course is cross-listed as BINF5354, STAT5354, and STAT6354.
- Lectures are held on Tuesdays from 9:00 AM to 10:20 AM.
- Labs are held on Thursdays from 9:00 AM to 10:20 AM.
- Dr. Jon Mohl is the instructor and his office hours are Tuesdays from 2:00 PM to 3:00 PM in CCSB 2.0306.
- The teaching assistant's office hours are Thursdays from 10:30 AM to 11:30 AM in BE302.
- Announcements and assignments will be posted on Blackboard.
- The primary communication methods are email or through Teams.
Additional Resources
- Recommended books include "Linear Algebra and Learning from Data" by Gilbert Strang, "Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids" by R.Durbin, S.Eddy, A.Krogh, and G.Mitchison, and "Introduction to Probability Models" by Sheldon M. Ross.
- Students are encouraged to read articles related to the course topics.
Grading
- The course grade is determined by homework assignments (50%), a midterm exam (20%), and a group project (30%).
- The group project consists of a proposal, a final presentation, and a final report.
Attendance & Civility
- Attendance is mandatory.
- Students are expected to be civil, keep their phones on silent, and pay attention during lectures and labs.
Course Objectives
- Students will learn practical skills including programming in R and Python, and working on Linux machines.
- Students will gain theoretical knowledge in linear algebra, probability/statistics, machine learning, and bioinformatics.
Projects
- Projects will utilize genomics data from VCF files.
Matrices
- A matrix is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns.
- Matrices are used to represent systems of equations, graphs, and other mathematical objects.
- The number of rows and columns in a matrix is referred to as its dimensions. For example, a matrix with m rows and n columns is called an m x n ("m by n") matrix.
- The rank of a matrix is the number of linearly independent columns.
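As a rough illustration of rank, the sketch below counts pivots after Gaussian elimination, which equals the number of linearly independent columns. The function name `matrix_rank` and the example matrix are made up for this note; exact fractions are used only to avoid rounding issues.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Count pivots after Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot here: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Hypothetical 3 x 3 matrix: column 3 = column 1 + column 2,
# so only two columns are linearly independent.
A = [[1, 2, 3],
     [2, 4, 6],
     [0, 1, 1]]
print(len(A), "x", len(A[0]), "matrix with rank", matrix_rank(A))
```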
Uses of Matrices in Biology
- Matrices are used to model gene regulatory networks, metabolic networks, infectious disease models, and survival analysis
Genomics
- Genomics is the study of genomes, which are the complete sets of genetic material in an organism.
- Genomes are filled with DNA, including regulatory factors and repetitive elements.
- Variations within a genome include large deletions, large duplications, translocations, and single nucleotide variations (SNVs).
- In these notes, SNVs cover single-nucleotide insertions (adding a nucleotide), deletions (removing a nucleotide), and polymorphisms (variations in DNA sequence within a population).
Variant Call Format (VCF)
- VCF is a file format for storing sequence information, including DNA variants.
- It is a tab-delimited file with a header region containing version information, chromosome/scaffold/contig (sequence of DNA within a chromosome) details, and information about the content of the file itself.
- The VCF file can contain information on one or more samples.
- It contains general position information like chromosome, position, reference allele, and alternate allele(s), and can also carry annotations such as information about the gene.
- Sample-specific information included in VCF files includes allele calls, site depth, the number of counts per allele, and quality.
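The layout described above can be sketched with a small parser. The record below is invented for illustration (two hypothetical samples, `sampleA` and `sampleB`); the column names follow the standard VCF header line, and the FORMAT keys GT, DP, AD, and GQ correspond to allele calls, depth, per-allele counts, and quality. This is a minimal sketch, not a replacement for a real VCF library.

```python
# Made-up VCF content: two header (##) lines, the column line, one variant record.
VCF_TEXT = """\
##fileformat=VCFv4.2
##contig=<ID=chr1>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB
chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP:AD:GQ\t0/1:14:8,6:99\t1/1:16:0,16:48
"""

def parse_vcf(text):
    """Return (meta_lines, records); each record is a dict keyed by column name."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)                      # header region
        elif line.startswith("#CHROM"):
            header = line.lstrip("#").split("\t")  # column names, samples from index 9
        elif line:
            fields = line.split("\t")
            record = dict(zip(header[:9], fields[:9]))   # general position information
            keys = record["FORMAT"].split(":")
            record["samples"] = {                        # sample-specific information
                name: dict(zip(keys, value.split(":")))
                for name, value in zip(header[9:], fields[9:])
            }
            records.append(record)
    return meta, records

meta, records = parse_vcf(VCF_TEXT)
first = records[0]
print(first["CHROM"], first["POS"], first["REF"], first["ALT"], first["samples"]["sampleA"])
```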
Group Project
- Students are divided into 3 groups.
- Each group will form a testable hypothesis about population analysis or disease.
- The groups need to compile genetic information and determine the software necessary to analyze the data.
- Students will run the analysis on “Biolinux machines” (computer systems specifically designed for biological research) and prepare a preliminary presentation outlining the results.
- They will complete the project by presenting their final findings in a written report and final presentation.
Matrices
- Matrices represent systems of linear equations, graphs, and biological networks.
- A matrix is a rectangular array of numbers.
- The dimensions of a matrix are defined by the number of rows and columns.
- The rank of a matrix is the number of linearly independent columns.
- Matrices are used to represent systems of equations, graphs, gene regulatory networks, metabolic networks, infectious disease models, and survival analysis.
Genomics
- Genomics is the study of genomes, including the sequencing and analysis of genetic information.
- Genomes contain a variety of elements, including genes, regulatory factors, repetitive elements, exons, introns, and untranscribed elements.
- Variations within the genome include large deletions, large duplications, translocations, and single-nucleotide variations (SNVs).
- SNVs can be insertions, deletions, or polymorphisms.
Applications of Genomics
- Genomic information can be used for population and ancestry structure studies, genome-wide association studies (GWAS), and determining the effects of variants on gene function.
- SNVs can be synonymous or non-synonymous, meaning they may or may not alter the amino acid sequence of a protein.
- Variants can also disrupt regulatory elements affecting gene expression.
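To make the synonymous versus non-synonymous distinction concrete, here is a minimal sketch that compares the amino acid encoded before and after a single-nucleotide change. `CODON_TABLE` is only a four-codon excerpt of the standard genetic code, and `classify_snv` is a hypothetical helper name.

```python
# Tiny excerpt of the standard genetic code (DNA codon -> one-letter amino acid).
CODON_TABLE = {"GAA": "E", "GAG": "E", "GTG": "V", "GAC": "D"}

def classify_snv(ref_codon, alt_codon):
    """Label a codon change by whether the encoded amino acid changes."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"

print(classify_snv("GAA", "GAG"))  # both codons encode glutamate (E)
print(classify_snv("GAG", "GTG"))  # glutamate (E) becomes valine (V)
```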
Variant Call Format (VCF)
- VCF is a file format used to store sequence information, including variant calls.
- VCF files are tab-delimited and contain a header region providing information about the file format, chromosome/scaffold/contig details, and content information.
- VCF files can contain data for one or more samples and include general position information as well as sample-specific information.
- VCF files can contain all sites or only those with variants.
Group Project
- Students will work in groups to develop and test a hypothesis related to population analysis or a specific disease.
- Each group will collect genetic information and determine the necessary software for their analysis.
- The groups will run their analyses on Biolinux machines, present preliminary findings, and provide final write-ups and presentations.
Transcriptomics
- Transcriptomics studies the complete set of RNA transcripts produced by an organism.
- Useful for understanding gene expression patterns and identifying genes involved in specific processes.
- Common techniques include real-time PCR (qPCR), microarray analysis, and next-generation sequencing (NGS).
Real-time PCR (qPCR)
- A technique used for quantifying specific DNA or RNA molecules.
- Uses reverse transcription to convert RNA to cDNA followed by amplification with specific primers.
- Fluorescent probes detect amplified DNA in real time, allowing for quantification of the target molecule.
Quantitative PCR
- Uses a cycle threshold (CT) value to measure the amount of target molecule present in a sample.
- Lower CT values indicate higher target molecule abundance.
- Calculation of ΔΔCT is used to determine fold change in gene expression between experimental and control groups
Microarrays
- A technology that allows for simultaneous analysis of thousands of genes.
- Short DNA sequences (probes) specific to individual genes are attached to a solid surface.
- Fluorescently labeled cDNA from samples is hybridized to the array.
- The amount of fluorescence bound to each probe reflects the abundance of the corresponding mRNA in the sample.
Next-Generation Sequencing (NGS)
- High-throughput sequencing technology that can sequence millions or billions of DNA fragments simultaneously.
- Total RNA can be sequenced using NGS to obtain a snapshot of the transcriptome.
- Poly-A capture is used to isolate mRNA, and specific probe sequences can be used to enrich for target transcripts.
Why do transcriptomic experiments?
- Gene discovery: Identify new genes involved in specific processes or cellular functions.
- Compare experimental treatments: Determine the effect of treatments (drugs, knockouts/knockins, toxins) on gene expression.
- Toxicology: Assess the impact of toxins on gene expression and cellular pathways.
- Developmental stages: Study changes in gene expression during development.
Transcriptomics
- Transcriptomics is the study of the complete set of RNA transcripts in a cell or organism
- It involves studying the structure, function, and regulation of RNA molecules
- Three common techniques for transcriptomic analysis: real-time PCR (qPCR), Microarrays, and Next Generation Sequencing
Real-Time PCR (qPCR)
- Amplifies DNA by using primers to specifically target DNA segments
- Is used for detection, cloning, and sequencing of specific regions of DNA
- Quantification of PCR products is achieved by measuring fluorescence during the reaction
Quantitative PCR
- Reverse Transcription: Conversion of RNA to cDNA to be used in PCR reaction
- Amplify: Amplifies the DNA from the reverse transcription step
- Capture Fluorescence: Fluorescence is measured, which is proportional to the amount of cDNA amplified
- Repeat: Steps 2 and 3 are repeated for 29 cycles
- Determine the Double Delta-CT: Formula used to calculate fold change of the gene of interest, based on four CT values:
  - Gene of Interest Experimental (TE)
  - Gene of Interest Control (TC)
  - House Keeping Gene Experimental (HE)
  - House Keeping Gene Control (HC)
Calculating Fold Change
- Delta-CT (ΔCT): Difference in cycle threshold (CT) between the gene of interest and the housekeeping gene
  - ΔCTE = TE - HE
  - ΔCTC = TC - HC
- Double Delta-CT (ΔΔCT): Used to calculate the relative fold change in gene expression
  - ΔΔCT = ΔCTE - ΔCTC
- Fold Change: Given by 2^(-ΔΔCT), the expression of the gene of interest in the experimental sample relative to the control
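A worked example may help here. The sketch below uses the four CT values named above (TE, HE, TC, HC), computes ΔΔCT as ΔCTE minus ΔCTC, and returns the fold change 2^(-ΔΔCT). The CT numbers are invented, and `fold_change` is a hypothetical helper name.

```python
def fold_change(te, he, tc, hc):
    """Relative expression of the gene of interest, experimental vs control."""
    delta_ct_e = te - he                 # ΔCT for the experimental sample
    delta_ct_c = tc - hc                 # ΔCT for the control sample
    delta_delta_ct = delta_ct_e - delta_ct_c
    return 2 ** (-delta_delta_ct)

# Made-up CT values: the target crosses the threshold two cycles earlier
# in the experimental sample, while the housekeeping gene is unchanged.
print(fold_change(te=22.0, he=18.0, tc=24.0, hc=18.0))
```

With these numbers ΔCTE = 4, ΔCTC = 6, and ΔΔCT = -2, so the gene of interest is expressed four times more in the experimental sample.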
Microarrays
- Use a solid surface with thousands of probes that represent different genes
- Samples of cDNA, labeled with fluorescent dyes, are hybridized to the microarray
- The amount of fluorescence detected at each probe indicates the level of expression of the corresponding gene
- Advantages of Microarrays:
  - Allows the simultaneous assessment of the expression of thousands of genes
  - Provides high-throughput analysis of gene expression patterns
  - Identifies gene expression profiles that correlate with specific biological states or conditions
- Limitations of Microarrays:
  - Can be expensive and technically challenging
  - May not be as sensitive or specific as qPCR
  - Limited to known, pre-selected target genes
Next-generation Sequencing
- Involves sequencing millions or billions of DNA fragments simultaneously
- Provides a more comprehensive view of the transcriptome, uncovering novel transcripts, gene fusions, and splice variants
- The technology enables researchers to identify the abundance and expression of all transcripts in a sample
- This approach allows for the discovery of previously unknown genes and transcripts, and the identification of changes in gene expression in response to various stimuli
- Steps involved in Next Generation Sequencing:
  - Isolating total RNA from a sample
  - Poly-A Capture: Selectively isolates mRNA molecules that are polyadenylated
  - Specific Probes: Used to target and amplify specific sequences of interest
Why Conduct Transcriptomic Experiments?
- Gene Discovery: Identifies novel genes or transcripts
- Comparing Experimental Treatments: To study the effects of different treatments on gene expression
  - Drug treatments
  - Knockout/knockin studies
- Toxicology
- Developmental stages
Designing a Transcriptomic Experiment
- Assumptions: Understand the specific hypothesis being tested and have data for the control group
- Expected Results: Define the types of changes in gene expression expected to be observed in the experiment based on the hypothesis and its potential significance
- Potential Pitfalls: Identify factors that could affect the outcome of the experiment
  - Variations in RNA isolation, contamination, and technical errors during PCR amplification
Mean and Variance
- Sample Mean: The average of a set of data points, usually denoted by 'x̄' (or 'm'); 'µ' is normally reserved for the population mean.
- Sample Variance: Measures how spread out the data is from the mean, denoted by 'S²', calculated by summing the squared deviations of each data point from the mean and dividing by n - 1.
Expected Value
- The average value of a random variable, denoted by 'E(x)', calculated by taking the weighted average of all possible values of the variable, where the weights are the probabilities of each value.
Variance
- A measure of how spread out the data is from the expected value, denoted by 'σ²'.
- Calculated by averaging the squared deviations of each data point from the expected value.
Standard Deviation
- The square root of the variance, a measure of dispersion that is in the same units as the data.
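The quantities in the last few sections can be checked numerically. This is a minimal sketch using Python's standard `statistics` module, where `variance` divides by n - 1 (sample variance) and `pvariance` divides by n; the data values are made up, and the fair six-sided die is used only as an example of an expected value.

```python
import statistics
from fractions import Fraction

data = [2, 4, 4, 4, 5, 5, 7, 9]           # made-up measurements

mean = statistics.mean(data)               # sum of values / n
sample_var = statistics.variance(data)     # squared deviations summed, divided by n - 1
pop_var = statistics.pvariance(data)       # squared deviations summed, divided by n
sample_sd = statistics.stdev(data)         # square root of the sample variance

# Expected value and variance of a fair die: weight each outcome by its probability.
outcomes = range(1, 7)
p = Fraction(1, 6)
expected = sum(x * p for x in outcomes)
die_var = sum((x - expected) ** 2 * p for x in outcomes)

print(mean, sample_var, pop_var, sample_sd)
print(expected, die_var)
```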
Sample Space
- The set of all possible outcomes of an experiment.
- Examples:
  - A coin toss: {Heads, Tails}
  - Rolling a die: {1, 2, 3, 4, 5, 6}
  - Two coin tosses: {HH, HT, TH, TT}
Probability
- The likelihood of an event occurring.
- Examples:
  - Tossing a fair coin: 1/2 probability of getting heads
  - Rolling a specific number on a fair die: 1/6 probability
Independence
- Two events are independent if the occurrence of one does not affect the probability of the other.
- P(E and F) = P(E) * P(F)
Conditional Probabilities
- The probability of an event happening given that another event has already occurred.
- P(E|F) = P(E and F) / P(F)
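Both rules can be verified by enumerating the two-coin-toss sample space {HH, HT, TH, TT}. The sketch assumes all four outcomes are equally likely; the helper names (`prob`, `cond_prob`, `independent`) and the three example events are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin tosses: HH, HT, TH, TT (assumed equally likely).
sample_space = ["".join(t) for t in product("HT", repeat=2)]

def prob(event):
    """P(event), where event is a function outcome -> bool."""
    return Fraction(sum(1 for o in sample_space if event(o)), len(sample_space))

def cond_prob(e, f):
    """P(E|F) = P(E and F) / P(F)."""
    return prob(lambda o: e(o) and f(o)) / prob(f)

def independent(e, f):
    """True when P(E and F) = P(E) * P(F)."""
    return prob(lambda o: e(o) and f(o)) == prob(e) * prob(f)

first_heads = lambda o: o[0] == "H"
second_heads = lambda o: o[1] == "H"
any_heads = lambda o: "H" in o

print(independent(first_heads, second_heads), cond_prob(first_heads, second_heads))
print(independent(first_heads, any_heads), cond_prob(first_heads, any_heads))
```

The two tosses are independent, so conditioning on the second toss leaves P(first is heads) at 1/2. Knowing that at least one toss was heads does change it, so those two events are not independent.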
Probability Distributions
- Describe the probability of each possible outcome of a random variable.
- Examples:
  - Binomial: For events with two possible outcomes (e.g., coin toss).
  - Poisson: For rare events (e.g., mutations in a cell population).
  - Exponential: For events occurring at a constant rate over time (e.g., protein decay).
  - Gaussian (Normal): For averages of many trials (e.g., height).
  - Log-normal: When the logarithm of a variable has a normal distribution.
  - Chi-squared: For the sum of squared standard normal variables, i.e. the distance squared in multiple dimensions.
  - Multivariate Gaussian: For probabilities of a vector of variables.
Normal Distribution
- A continuous probability distribution that is bell-shaped and symmetrical.
- Many biological and physiological measurements follow a normal distribution.
- Central Limit Theorem: As the sample size increases, the distribution of the sample mean will tend towards a normal distribution, even when the underlying data are not normally distributed (provided the variance is finite).
- Skewness: A measure of the asymmetry of the distribution. Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail.
Poisson Distribution
- A discrete probability distribution describing the probability of a given number of events occurring in a fixed interval of time or space.
- Used for rare events.
Exponential Distribution
- A continuous probability distribution that describes the time until the next event.
- Used for random events that occur at a constant rate over time.
Binomial Distribution
- A discrete probability distribution used to calculate the probability of a certain number of successes in a given number of trials.
- Each trial has two possible outcomes: success or failure.
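As a sketch of how the two discrete distributions above are evaluated, the functions below implement the standard binomial and Poisson probability mass functions. The function names and the example numbers (1000 trials with success probability 0.002) are assumptions chosen to show that the Poisson approximates the binomial for rare events.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with expected count lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Two heads in four fair coin tosses.
print(binomial_pmf(2, 4, 0.5))

# Rare events: many trials, small p. The Poisson with lam = n * p comes out close.
n, p = 1000, 0.002
print(binomial_pmf(2, n, p), poisson_pmf(2, n * p))
```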
Odds Ratio
- A statistic used to quantify the strength of the association between two events.
- Measures the odds of exposure in cases versus controls.
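For a 2 x 2 case-control table the calculation is short. The counts below are invented, and `odds_ratio` is a hypothetical helper name; the sketch assumes no cell is zero.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls

# Made-up study: 40 of 100 cases and 20 of 100 controls carry the variant.
print(odds_ratio(40, 60, 20, 80))
```

A value above 1 suggests the exposure is more common among cases, a value of 1 suggests no association, and a value below 1 suggests the exposure is more common among controls.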
T-test
- A statistical test used to determine if there is a significant difference between the means of two groups.
- Paired (Dependent) T-test: Used when the two groups are related (e.g., before and after treatment).
- Equal Variance (Independent) T-test: Used when two groups have the same variance (Student's t-test).
- Unequal Variance (Independent) T-test: Used when two groups have different variances (Welch's t-test).
Multiple Hypothesis Testing
- The problem of testing multiple hypotheses simultaneously, leading to an increased risk of Type I errors (false positives).
Multiple Testing Correction
- Methods used to control the risk of false positives when testing multiple hypotheses.
- Bonferroni correction: Divides the significance level (alpha) by the number of tests.
- Benjamini-Hochberg procedure: Adjusts the p-values based on their rank to control the false discovery rate.
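Both corrections can be sketched in a few lines. The p-values below are made up, and the function names are hypothetical. The Benjamini-Hochberg version shown is the usual adjusted p-value form: each sorted p-value is multiplied by m / rank, and a running minimum taken from the largest rank downwards keeps the adjusted values monotone.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values):
    """Return rank-based adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                          # largest p-value first
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.005]                      # made-up test results
print(bonferroni_reject(p_values))
print(benjamini_hochberg(p_values))
```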
Network Terms
- Nodes: Points or entities in a network
- Edges: Connections between nodes
- Leaves: Nodes with only one connection
Weighted vs. Unweighted Networks
- Unweighted: All edges have equal importance
- Weighted: Edges have different values or strengths
Directed vs. Undirected Networks
- Directed: Edges have a direction, one-way flow
- Undirected: Edges are bidirectional, two-way flow
Cycles and Acyclic Networks
- Cycle: A closed path in a network, starting and ending at the same node
- Acyclic: Networks with no cycles
Trees
- Tree: A connected acyclic network
Multilayer and Multiplex Networks
- Multilayer: Multiple networks connected by edges between the subnetworks
- Multiplex: Same nodes, but edges represent different relationships or contexts
Degree of a Node
- The number of edges connected to a node
- Indicates the node's influence or connectivity
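Node degree is easy to compute from an edge list. The sketch below assumes an undirected, unweighted network; the gene names are placeholders.

```python
from collections import Counter

# Hypothetical undirected network given as a list of edges.
edges = [("geneA", "geneB"), ("geneA", "geneC"), ("geneA", "geneD"), ("geneC", "geneD")]

def degrees(edge_list):
    """Count how many edges touch each node."""
    counts = Counter()
    for u, v in edge_list:
        counts[u] += 1
        counts[v] += 1
    return counts

deg = degrees(edges)
leaves = [node for node, d in deg.items() if d == 1]   # nodes with only one connection
print(dict(deg), leaves)
```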
Walks and Paths
- Walks: Sequence of nodes with connected edges
- Self-avoiding walks: Walks that don't repeat nodes
- Paths: Walks that don't intersect themselves
- Shortest path: The route between two nodes with the fewest edges (or the lowest total weight), found with algorithms such as breadth-first search or Dijkstra's algorithm
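One common way to find a shortest path in an unweighted network is breadth-first search, sketched below. This is an illustrative choice rather than a method named in these notes, and the adjacency dictionary is a made-up undirected graph.

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns a list of nodes, or None if goal is unreachable."""
    previous = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:   # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical undirected graph stored as adjacency lists.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
         "D": ["B", "C", "E"], "E": ["D"]}
print(shortest_path(graph, "A", "E"))
```

Because each node is visited at most once, the returned route never repeats a node, so it is a path in the sense defined above.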
Phylogenetic Trees
- Represents evolutionary relationships between organisms
Gene Regulatory Networks (GRNs)
- Nodes are genes or regulatory elements
- Edges represent regulatory interactions between genes
Determine GRNs
- Transcriptomic studies: Analyze gene expression levels
- Knockout: Deactivate a gene to study its effect
- Knockin: Insert a gene to observe its influence
- Drugs: Study drug interactions with specific genes
Multiplex Networks in Different Contexts
- Different cell types or conditions can be represented in a multiplex network
- Allows analysis of complex interactions across different contexts
Protein-Protein Interactions
- Nodes represent proteins
- Edges signify interactions between proteins, such as binding or complex formation
- Important for understanding cellular processes and signaling pathways
Machine Learning Intro
- Machine learning is a field that uses computer science, mathematics, statistics, and biology to solve problems.
Defining the Problem
- Machine learning solves problems through classification and prediction.
- Classification categorizes data into groups based on shared characteristics.
- Prediction determines the outcome of a future event.
- Key considerations when defining a problem:
  - Accuracy, sensitivity, and specificity.
  - Throughput, especially whether the solution is limited to a small number of use cases.
  - Which features are necessary.
  - How many and which samples to use.
Feature Selection
- Feature selection determines which features are important for machine learning.
- Feature selection helps to:
  - Identify meaningful features.
  - Determine correlations between features.
- Filter methods test for correlations, such as univariate and multivariate analysis.
- Wrapper methods select and test groups of features to identify the best combination.
- Embedded methods are part of the machine learning algorithm itself, where feature selection is integrated into the learning process.
Decision Tree
- Decision trees can work with continuous, discrete, and categorical data.
- Steps:
  - Determine the best split to separate data based on a specific feature.
  - Move samples along the tree based on the split criteria.
  - Repeat the process until a decision is reached.
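The first step, finding the best split, can be sketched for a single continuous feature. The example assumes Gini impurity as the split criterion, which is one common choice; the function names and the expression values are made up.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, larger for mixed groups."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = (None, float("inf"))
    cuts = sorted(set(values))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2     # candidate split halfway between neighbors
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up expression levels for one gene and the class of each sample.
expression = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
status = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]
print(best_split(expression, status))
```

A full tree would apply the same search to each resulting subset, over all features, until the groups are pure or a stopping rule is met.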
Random Forest
- Random forest is a ensemble method that combines multiple decision trees for improved prediction.
- Steps:
- Draw a bootstrap sample of the data (sampling with replacement, so duplicates are possible) and choose a random subset of features.
- Construct decision trees using the subset of features.
- Ensemble method combines the predictions from all the decision trees to determine the final prediction.
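The sampling and voting parts of these steps can be sketched as follows. The sketch assumes the common formulation in which samples (rows) are drawn with replacement and features are subset without duplicates; the course material may define the subsetting differently. Tree construction itself is omitted, and all function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices at random for one tree."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Combine the per-tree class predictions into one final prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = bootstrap_indices(6, rng)            # may contain duplicate row indices
features = random_feature_subset(4, 2, rng)
final = majority_vote(["disease", "healthy", "disease"])
print(final)  # disease
```

Each tree in the forest would be trained on its own `rows` and `features`, and `majority_vote` would then be applied to the trees' predictions for a new sample.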
Random Forest Variable Importance
- Mean Decrease Gini: A measure of how much each feature contributes to reducing the impurity in the decision tree.
- Higher Mean Decrease Gini indicates a more important feature.
- Random forest can be used to identify the most important variables based on the Mean Decrease Gini score.
K Nearest Neighbors (KNN)
- KNN is a supervised classification method that groups samples based on their similarity to known samples.
- How it works:
- Determines the class of an unknown sample by considering its K nearest neighbors, those with the smallest distance in multi-dimensional space.
- KNN requires data to be binary or continuous, with the option to use principal component analysis to transform data if necessary.
- Distance metrics measure the similarity between samples:
- Euclidean distance (straight-line distance between two points in Cartesian space).
- Manhattan distance (sum of the absolute differences between coordinates across all dimensions).
- Jaccard similarity coefficient (presence/absence between two sets).
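The three measures and the KNN voting rule can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the function names, the default `k=3`, and the toy points are assumptions, and ties in the vote are not handled specially.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_similarity(a, b):
    """Shared items divided by all items present in either set."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Label the query by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes in two dimensions.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["A", "A", "A", "B", "B", "B"]
predicted = knn_predict(points, labels, (2.0, 2.0), k=3)
print(predicted)  # A
```

Passing `distance=manhattan` switches the metric without changing the voting logic.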
K-Means Clustering
- K-means clustering is an unsupervised method that groups samples into K clusters.
- How it works:
- Randomly selects points in the data as initial centroids (representatives of each cluster).
- Assigns each sample to the closest centroid.
- Re-calculates the centroids based on the assigned samples.
- Repeats the process until the cluster assignments stabilize.
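The loop described above can be sketched in plain Python. The sketch assumes squared Euclidean distance, initial centroids drawn from the data points, and a cap on iterations (`max_iter`); the names and the toy data are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random data points as initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assign each sample to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # assignments have stabilized
            break
        assignment = new_assignment
        # Recalculate each centroid from the samples assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return centroids, assignment

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, assignment = kmeans(points, k=2)
print(sorted(centroids))
```

On data this clearly separated the result does not depend on the seed, but in general k-means can settle on different clusterings for different initial centroids, so it is often run several times.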
Matrices
- Dimensions matter when multiplying matrices
- Used to represent systems of equations
- Used to represent networks
Genomics
- Parts of a gene:
- Exons and Introns
- Untranslated regions (UTRs)
- Regulatory elements:
- Promoters
- Determined via next-generation sequencing or SNP-arrays
Transcriptomics
- PCR = Polymerase Chain Reaction
- qPCR = Quantitative PCR
- qPCR, microarrays, and RNAseq experiments are all used to analyze gene expression
- Microarrays use probes to detect mRNA, while RNAseq sequences the entire transcriptome
- RNAseq is considered the most accurate method, but it is also the most expensive
Probability
- Sample space: set of all possible outcomes of an experiment
- Types of probability distributions:
- Normal distribution
- Poisson distribution
- Binomial distribution
- Statistic quantifying association between events:
- Correlation coefficient
- Multiple testing correction:
- Correcting for the increased probability of false positives when conducting multiple statistical tests
Independence
- P(E and F) = P(E) * P(F)
Conditional Probabilities
- P(E|F) = Probability of Event E, given that Event F has already occurred
- Formula (Bayes' theorem): P(E|F) = P(F|E)P(E) / P(F)
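A short worked example may help make the conditional-probability formula concrete. The numbers below are hypothetical (a 1% prior, a test with 95% sensitivity and a 5% false-positive rate), and P(F) is obtained from the law of total probability, which the notes do not state explicitly.

```python
def conditional_probability(p_f_given_e, p_e, p_f):
    """P(E|F) = P(F|E) * P(E) / P(F)."""
    if p_f <= 0:
        raise ValueError("P(F) must be positive")
    return p_f_given_e * p_e / p_f

# Hypothetical events: E = sample carries a variant, F = assay reports positive.
p_e = 0.01               # P(E)
p_f_given_e = 0.95       # P(F|E)
p_f_given_not_e = 0.05   # P(F|not E)

# Law of total probability: P(F) = P(F|E)P(E) + P(F|not E)P(not E)
p_f = p_f_given_e * p_e + p_f_given_not_e * (1 - p_e)

posterior = conditional_probability(p_f_given_e, p_e, p_f)
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate assay, the low prior keeps P(E|F) at roughly 16% in this made-up scenario.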
Networks
- Multilayer: Different networks connected by edges
- Multiplex: Nodes are the same, but edges are different in the subnetworks
Networks
- Directed graph: Edges have direction
- Undirected graph: Edges have no direction
- Node's degree: Number of edges connected to a node
- Path: Sequence of nodes connected by edges
- Networks can be used to model and explain biological concepts, such as protein-protein interactions or gene regulatory networks
Machine Learning
- Classification: Categorize data into groups
- Prediction: Estimate the value of a variable
- Three different times to select features:
- During data collection
- During data preprocessing
- During model training
- Decision tree: Model used for classification and regression that uses a tree-like structure to make decisions
- Supervised learning: Train a model on labeled data
- Unsupervised learning: Train a model on unlabeled data
Mid-Term Layout
- Part 1: Knowledge base (50 points)
- In-class
- Part 2: Critical Review of a paper (50 points)
- Take home (Due Oct 17)
Synonymous vs. Non-Synonymous Variations
- Synonymous variations do not change the amino acid sequence, while non-synonymous variations do.
Variant Call Format (VCF)
- The Variant Call Format (VCF) is a standardized file format used to store and exchange genetic variation data.
Genomic Variations
- Translocations involve the exchange of DNA segments between non-homologous chromosomes.
Gene Components
- Introns are elements included in genes but not part of the coding region.
Matrix Terminology
- The term "rank" in matrices refers to the number of linearly independent rows or columns in a matrix.
Matrix Equations
- A system of equations can be solved using matrices if it can be represented in the form Ax = b, where A is the matrix of coefficients, x is the vector of unknowns, and b is the vector of constants.
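As an illustration of the Ax = b form, the sketch below solves a small system by Gaussian elimination with partial pivoting. It is one possible approach among several, the function name `solve_linear_system` is hypothetical, and it assumes a square matrix A with a unique solution.

```python
def solve_linear_system(A, b):
    """Solve Ax = b for a square matrix A by Gaussian elimination."""
    n = len(A)
    # Build the augmented matrix [A | b] as floats, leaving the inputs untouched.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Partial pivoting: move the row with the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from all rows below the pivot row.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution, from the last unknown to the first.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x

# The system 2x + y = 5 and x - y = 1, written as A, x, b.
A = [[2, 1], [1, -1]]
b = [5, 1]
x = solve_linear_system(A, b)
print(x)  # [2.0, 1.0]
```

The singular-matrix check corresponds to the case where the columns of A are not linearly independent, so its rank is below n.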
Matrix Dimensions
- "m by n" in a matrix represents its dimensions, indicating that it has m rows and n columns.
Matrices in Biology
- Matrices can be used in biology to model genetic relationships, analyze protein interactions, and study population dynamics.
Independent Columns
- A column in a matrix is considered independent if it cannot be expressed as a linear combination of the other columns.
Matrix Properties
- A matrix can have more rows than columns or more columns than rows; the statement that it can have more columns than rows but not vice versa is incorrect.
Matrix of Coefficients
- In the context of systems of equations, the matrix of coefficients represents the coefficients of the variables in each equation.
Matrices in Technology
- Matrices have applications in computer graphics, data analysis, and machine learning.
Single Nucleotide Variation (SNV)
- SNV involves a change in a single nucleotide in a DNA sequence.
VCF File Structure
- A primary characteristic of VCF file structure is its use of tab-delimited text format for storing variant information.
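Because each record is a tab-delimited line, a VCF data line can be split into its fixed columns with ordinary string handling. The sketch below is a simplified illustration, not a full parser: it ignores header lines, and the example record is made up. The eight fixed column names follow the VCF specification; the function name is hypothetical.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_record(line):
    """Split one VCF data line (not a '#' header line) into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are comma-separated
    # INFO holds semicolon-separated key=value pairs or bare flags.
    record["INFO"] = dict(
        item.split("=", 1) if "=" in item else (item, True)
        for item in record["INFO"].split(";")
    )
    # An optional FORMAT column is followed by one column per sample.
    if len(fields) > 9:
        keys = fields[8].split(":")
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record

# Made-up example record with a single sample column.
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20;AF=0.5\tGT:DP\t0/1:20"
rec = parse_vcf_record(line)
print(rec["POS"], rec["ALT"], rec["samples"][0]["GT"])  # 12345 ['G'] 0/1
```

For real data, an established VCF library is usually a better choice than hand-written parsing.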
Genetic Variation and Disease
- Genome-wide association studies (GWAS) can be conducted to investigate the relationship between genetic variations and disease susceptibility.
Silent Mutations
- Silent mutations are genetic variations that do not result in a change in the amino acid sequence of a protein.
VCF File Content
- Genotype information (the GT field) is usually included in the sample-specific columns of a VCF file.
Matrices and Equations
- Matrices are used in relation to equations to simplify and solve systems of linear equations.
Matrix Rank
- The rank of a matrix is always less than or equal to the smaller of its number of rows and its number of columns.
Matrix Row Representation
- A row of '1, 0, 0, 1, 0' in a matrix can indicate that a specific entity or variable is present in only the first and fourth positions.
Column Independence
- For a column to be considered independent in a matrix, it cannot be expressed as a linear combination of the other columns, meaning it carries information that the other columns together cannot reproduce.
Matrix Notation
- "m by n" is used to describe the dimensions of a matrix, indicating that it has "m" rows and "n" columns.
Matrices in Biology (Application)
- Matrices are specifically applied in bioinformatics for sequence alignment analysis, where they can represent DNA or protein sequences.
Independent Column Feature
- Independent columns in a matrix often contain information about distinct variables or characteristics.
Matrices in Graphics
- Matrices are used in graphic representations to perform transformations, such as rotations, translations, and scaling of objects.
Quantitative PCR Components
- Quantitative PCR (qPCR) components include:
- DNA template
- Primers
- PCR master mix
- Fluorescent dye or probe
qPCR Definition
- qPCR stands for Quantitative Polymerase Chain Reaction.
Microarray RNA Analysis
- Microarrays can be used for total RNA analysis to study gene expression patterns across a large number of genes.
qPCR Fluorescence
- Capturing fluorescence in qPCR is used to quantify the amount of target DNA present in the reaction.
- Increased fluorescence indicates higher amounts of amplified target DNA.
Transcriptomic Experiments
- Transcriptomic experiments are conducted to investigate changes in gene expression, which can provide insights into biological processes, diseases, and drug response.
Delta-CT (qPCR)
- Delta-CT in qPCR represents the difference in cycle thresholds (CT) between the target gene and a reference gene.
qPCR Steps
- The main steps in Quantitative PCR include:
- Denaturation (separation of DNA strands)
- Annealing (primers bind to DNA)
- Extension (new DNA strands are synthesized)
Fold Change (qPCR)
- The calculated Fold Change in qPCR is given by the formula 2 raised to the power of −ΔΔCT, that is 2^(−ΔΔCT), indicating the expression level of the target gene relative to the control condition.
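The fold-change calculation can be sketched as below, assuming the widely used 2^(−ΔΔCT) convention, in which ΔCT is the target CT minus the reference CT and ΔΔCT is the treated ΔCT minus the control ΔCT, with roughly 100% amplification efficiency. The CT values are hypothetical.

```python
def delta_ct(ct_target, ct_reference):
    """Delta-CT: target gene CT minus reference gene CT."""
    return ct_target - ct_reference

def fold_change(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Relative expression of the target gene, treated versus control."""
    ddct = (delta_ct(ct_target_treated, ct_ref_treated)
            - delta_ct(ct_target_control, ct_ref_control))
    return 2 ** (-ddct)

# Hypothetical CT values: the target crosses the threshold 2 cycles earlier when treated.
fc = fold_change(ct_target_treated=22.0, ct_ref_treated=18.0,
                 ct_target_control=24.0, ct_ref_control=18.0)
print(fc)  # 4.0
```

Here the treated Delta-CT is 4 and the control Delta-CT is 6, so ΔΔCT is −2 and the target gene is about four times more expressed in the treated sample.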
Microarray Purpose
- The primary purpose of microarrays is to measure the expression levels of thousands of genes simultaneously, allowing for comprehensive gene expression profiling.
Transcriptomic Experiment Reasons
- Reasons for conducting transcriptomic experiments include:
- Understanding gene expression patterns in different conditions
- Identifying biomarkers for disease diagnosis
- Studying drug response and toxicity
Next Generation Sequencing
- Next generation sequencing (NGS) does not always use only total RNA; it can also be used for whole genome sequencing, exome sequencing, and other applications.
Delta-CT Significance
- Delta-CT in qPCR represents the difference in cycle thresholds (CT) between the target gene and a reference gene, quantifying the relative expression level of the target gene.
- A smaller Delta-CT indicates higher target gene expression.
PCR Definition
- PCR stands for Polymerase Chain Reaction.
Sample Mean Formula
- The formula for the sample mean (denoted by $\bar{x}$) is:
- $\bar{x} = \frac{\sum x}{n}$, where $\sum x$ represents the sum of all values in the sample, and $n$ is the sample size.
Sample Variance Formula
- The formula for sample variance (denoted by s²) is:
- $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, where $x$ represents each data point, $\bar{x}$ is the sample mean, and $n$ is the sample size.
Expected Value Formula
- The expected value (denoted by E(X)) is calculated by:
- $E(X) = \sum x P(x)$, where $x$ represents each possible value of the random variable $X$, and $P(x)$ is the probability of that value occurring.
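The sample mean, sample variance, and expected value formulas translate directly into code. The sketch below uses small made-up inputs; the function names are hypothetical.

```python
def sample_mean(xs):
    """Sum of the values divided by the sample size n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def expected_value(distribution):
    """Sum of x * P(x) over a {value: probability} mapping."""
    return sum(x * p for x, p in distribution.items())

data = [2, 4, 4, 6]                              # made-up sample
die = {face: 1 / 6 for face in range(1, 7)}      # fair six-sided die

print(sample_mean(data))               # 4.0
print(round(sample_variance(data), 4)) # 2.6667
print(round(expected_value(die), 4))   # 3.5
```

Dividing by n − 1 rather than n is what distinguishes the sample variance from the population variance.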
Independence in Probability
- Independence in probability means that the occurrence of one event does not affect the probability of another event occurring.
Rare Events Distribution
- The Poisson distribution is used to describe rare events in a large population.
Events Over Time
- The exponential distribution is used for modeling random events occurring over time.
T-Test Purpose
- The t-test is primarily used to compare the means of two groups.
Bonferroni Correction
- The Bonferroni correction in multiple testing adjusts the significance threshold to account for the increased chance of false positives when performing multiple statistical tests.
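In its simplest form, the Bonferroni correction divides the significance level by the number of tests. The sketch below assumes a family-wise level of 0.05 and uses made-up p-values; the function name is hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the adjusted threshold and which tests remain significant."""
    m = len(p_values)
    threshold = alpha / m  # each test must pass this stricter cutoff
    return threshold, [p <= threshold for p in p_values]

p_values = [0.001, 0.012, 0.03, 0.2]  # made-up results of four tests
threshold, significant = bonferroni(p_values)
print(threshold)    # 0.0125
print(significant)  # [True, True, False, False]
```

The third test (p = 0.03) would pass an uncorrected 0.05 cutoff but not the corrected one, which is how the correction reduces false positives.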
Central Limit Theorem
- The Central Limit Theorem states that the distribution of sample means from a population will approach a normal distribution as the sample size increases.
Odds Ratio
- The odds ratio quantifies the strength of the association between two events in terms of the odds of exposure in one group compared to the controls.
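For a 2x2 table of exposure versus case/control status, the odds ratio is the odds of exposure among cases divided by the odds of exposure among controls. The counts below are made up, and the argument names are hypothetical.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls

# Made-up counts: 30 of 100 cases and 10 of 100 controls carry the exposure.
ratio = odds_ratio(exposed_cases=30, unexposed_cases=70,
                   exposed_controls=10, unexposed_controls=90)
print(round(ratio, 2))  # 3.86
```

A value above 1 suggests the exposure is more common among cases; a value of 1 indicates no association.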
Network Components
- The parts of a network are:
- Nodes (represent entities)
- Edges (represent connections between entities)
Unweighted Graph
- An unweighted graph has edges without any weights assigned to them.
Node Degree
- The degree of a node refers to the number of edges connected to it.
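Node degree can be read off an adjacency matrix, which also shows how matrices represent networks. The sketch below assumes an undirected, unweighted graph with no self-loops or duplicate edges; the protein names are placeholders.

```python
def adjacency_matrix(nodes, edges):
    """Square 0/1 matrix with a 1 wherever two nodes share an edge."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1  # undirected: mirror the entry
    return matrix

def degrees(nodes, edges):
    """Degree of each node, computed as the row sum of the adjacency matrix."""
    matrix = adjacency_matrix(nodes, edges)
    return {node: sum(row) for node, row in zip(nodes, matrix)}

# Placeholder protein-protein interaction network.
nodes = ["P1", "P2", "P3", "P4"]
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
print(degrees(nodes, edges))  # {'P1': 3, 'P2': 2, 'P3': 2, 'P4': 1}
```

For a directed graph the mirrored entry would be dropped, and row sums and column sums would give out-degree and in-degree respectively.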
Multiplex vs. Multilayer Network
- A multiplex network has the same set of nodes in every layer, with each layer holding a different type of interaction, while a multilayer network connects different networks (layers), possibly with different nodes, through edges between the layers.
Path Definition
- A path is defined as a walk that does not visit any node more than once.
GRN Determination
- One method for determining gene regulatory networks (GRNs) is Bayesian network inference, which uses probabilistic relationships between genes to reconstruct the network structure.
Phylogenetic Trees
- Phylogenetic trees are used to represent evolutionary relationships between organisms or genes.
Network Walks
- 'Walks' in a network are sequences of nodes and edges, where the node at the end of one edge is the beginning of the next; nodes and edges may be repeated (a walk in which each edge is traversed only once is called a trail).
Feature Selection Considerations
- Factors considered in feature selection include:
- Relevance (degree of association with the target)
- Redundancy (how much overlap exists between features)
- Cost (of obtaining and processing features)
Classification Definition
- Classification categorizes data into predetermined categories based on shared characteristics.
Prediction in Machine Learning
- The primary goal of prediction in machine learning is to build models that can accurately predict future outcomes based on historical data.
Random Forest Technique
- Random Forest uses a technique called bagging (bootstrap aggregating) to construct decision trees.
K Nearest Neighbors (K)
- The K in K Nearest Neighbors represents the number of nearest neighbors to consider when classifying a new data point.
KNN Data Types
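- K Nearest Neighbors can work with numerical data directly; categorical or mixed data require a suitable encoding or distance measure (for example, a presence/absence measure such as the Jaccard coefficient).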
Algorithm Functions
- Algorithm & Primary Function:
- K-Means Clustering: Unsupervised clustering algorithm that groups data points into clusters based on their similarity.
- K Nearest Neighbors: Supervised classification algorithm that classifies a new data point based on its proximity to known labeled data points.
- Decision Tree: Supervised classification and regression algorithm that uses a tree-like structure to make predictions.
- Random Forest: Ensemble learning method that combines multiple decision trees to improve prediction accuracy and reduce variance.
Matrix Dimension Impact
- Dimensions of matrices affect the resulting matrix after multiplication:
- If the number of columns in the first matrix is not equal to the number of rows in the second matrix, multiplication is not possible.
- The resulting matrix will have the same number of rows as the first matrix and the same number of columns as the second matrix.
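The dimension rule can be checked in code before multiplying. This is a plain-Python sketch with matrices stored as lists of rows; the function name `matmul` is hypothetical.

```python
def matmul(A, B):
    """Multiply an m-by-n matrix A by an n-by-p matrix B (lists of rows)."""
    m, n = len(A), len(A[0])
    n_b, p = len(B), len(B[0])
    if n != n_b:
        raise ValueError(
            f"cannot multiply {m}x{n} by {n_b}x{p}: "
            "columns of the first must equal rows of the second"
        )
    # Entry (i, j) is the dot product of row i of A and column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

A = [[1, 2, 3],
     [4, 5, 6]]      # 2 by 3
B = [[1, 0],
     [0, 1],
     [1, 1]]         # 3 by 2
C = matmul(A, B)     # result is 2 by 2
print(C)  # [[4, 5], [10, 11]]
```

Calling `matmul(A, A)` raises a `ValueError`, since a 2 by 3 matrix cannot be multiplied by another 2 by 3 matrix.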
Gene Components
- Parts of a gene include:
- Promoter (regulates gene expression)
- Exons (coding sequences)
- Introns (non-coding sequences)
- 5' untranslated region (UTR)
- 3' untranslated region (UTR)
- Polyadenylation signal (signals end of gene)
Gene Determination Tools
- Primary tools used to determine parts of a gene include:
- DNA sequencing
- Gene prediction algorithms
- RNA sequencing
PCR vs. qPCR
- PCR (Polymerase Chain Reaction) amplifies DNA, while qPCR (Quantitative Polymerase Chain Reaction) quantifies the amount of DNA amplified.
Sample Space
- Sample space is the set of all possible outcomes of an experiment or random phenomenon.
Association Strength
- A statistic that quantifies the strength of the association between two events is the correlation coefficient.
Independence Rule
- Independence follows the rule: P(EF) = P(E)P(F), where P(EF) is the probability of both events E and F occurring, P(E) is the probability of event E occurring, and P(F) is the probability of event F occurring.
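The independence rule can be verified by enumerating a small sample space. The sketch below uses two fair dice and exact fractions; the events are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """Fraction of equally likely outcomes for which the event holds."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

def independent(e, f):
    """Check whether P(E and F) equals P(E) * P(F)."""
    p_both = probability(lambda o: e(o) and f(o))
    return p_both == probability(e) * probability(f)

first_even = lambda o: o[0] % 2 == 0       # E: first die shows an even number
second_high = lambda o: o[1] > 4           # F: second die shows 5 or 6
sum_at_least_10 = lambda o: sum(o) >= 10   # G: the two dice total 10 or more

print(independent(first_even, second_high))      # True
print(independent(first_even, sum_at_least_10))  # False
```

E and F concern different dice, so the product rule holds (1/2 × 1/3 = 1/6). E and G are not independent: P(E and G) is 1/9, while P(E)P(G) is 1/12.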
Directed vs. Undirected Graph
- A directed graph has edges with a specific direction, while an undirected graph has edges without a specific direction.
Node Degree
- A node's degree tells you the number of connections it has to other nodes in a network.
Classification
- Classification in machine learning is the process of assigning data points to predefined categories based on their characteristics.
Supervised vs. Unsupervised Learning
- Supervised learning uses labeled data to train models, while unsupervised learning uses unlabeled data to discover patterns.
Feature Selection Timing
- Three different times in which you can select features:
- Pre-processing: Features are selected before training the model.
- During training: Features are selected during the model training process.
- Post-processing: Features are selected after the model has been trained.
Feature Selection Situations
- Features are selected during specific situations:
- High dimensionality: When there are many features and only a few are relevant.
- Overfitting: When the model is too complex and performs well on the training data but poorly on new data.
- Computational efficiency: When reducing the number of features can speed up the training process.
Feature Selection Phase
- A phase in which features are selected is the feature engineering phase. This phase involves selecting, transforming, and creating new features that improve the performance of machine learning models.
Feature Selection Timing Aspect
- One key aspect of the timing in feature selection is that it can influence the model's performance. If features are selected before training, the model might miss out on valuable information. If features are selected after training, the model might not perform well on new data.
Feature Selection Scenario
- A scenario that might not involve feature selection timing is when the data is already clean and relevant, with a few features that are well-defined and contribute directly to the model's performance.
Feature Selection Reassessment
- It is ideal to reassess feature selection when the data distribution changes significantly, when new data becomes available, or when there are changes in the problem that require a different set of features.
Description
This quiz explores concepts in probability models and their applications in biological data analysis. The content is based on the interdisciplinary course material, drawing from topics covered in lectures and recommended readings. Test your understanding of the key principles and methodologies discussed in the course.