Probability Models in Biological Data Analysis

Created by
@BelovedSulfur

Questions and Answers

What differentiates synonymous variations from non-synonymous variations?

  • Synonymous variations do not alter the protein's amino acid sequence (correct)
  • Synonymous variations always result in gene deletions
  • Non-synonymous variations occur only in non-coding regions
  • Non-synonymous variations do not change the amino acid sequence

Which of the following best describes the purpose of the Variant Call Format (VCF)?

  • To summarize genomic variations across multiple samples (correct)
  • To provide a graphical user interface for genetics software
  • To encode the structure of genes and proteins
  • To list the regulatory networks in different organisms

Which type of genomic variation includes the exchange of DNA segments between non-homologous chromosomes?

  • Polymorphisms
  • Large duplications
  • Translocations (correct)
  • Insertions

What type of elements are included in genes but are not part of the coding region?

  • Introns

    What does the term 'rank' refer to in the context of matrices?

      • The number of independent columns in a matrix

    Which equation represents a system solvable through matrices?

      • X1 + 3X2 = 8 and X1 + 2X2 = 6

    What does 'm by n' represent in a matrix?

      • The rows and columns in the matrix respectively

    How can matrices be used in biology?

      • To represent gene regulatory networks

    What constitutes a column being considered independent in a matrix?

      • It cannot be formed by a combination of other columns

    Which of the following statements about matrices is incorrect?

      • Matrices are only relevant in mathematics.

    In the context of systems of equations, what does the matrix of coefficients represent?

      • The relationships between variables

    Which of the following is a common application of matrices in technology?

      • Modeling infectious disease transmission

    Which genomic variation involves a change in a single nucleotide that may or may not affect the protein coding sequence?

      • Single Nucleotide Variations

    What is a primary characteristic of the Variant Call Format (VCF) file structure?

      • It includes a header region with metadata about the samples.

    What type of analysis can be conducted to investigate the relationship between genetic variations and disease susceptibility?

      • Genome Wide Association Studies (GWAS)

    Which term refers to genetic variations that do not result in a change in the amino acid sequence of a protein?

      • Synonymous variations

    Which of the following components is NOT usually included in the sample-specific information within a VCF file?

      • Total gene expression

    What is the main reason matrices are used in relation to equations?

      • They represent systems of equations.

    Which statement about the rank of a matrix is true?

      • Rank is the count of independent columns in a matrix.

    In the context of matrix representation, what can the row of '1, 0, 0, 1, 0' indicate?

      • It can represent connections in a graph.

    What condition must be satisfied for a column to be considered independent in a matrix?

      • It cannot be the zero vector or a combination of other columns.

    How is the term 'm by n' used in relation to matrices?

      • It denotes the number of rows and columns in a matrix.

    Which application of matrices is specifically mentioned in the context of biology?

      • Gene regulatory networks.

    What is a common feature of independent columns in a matrix?

      • They cannot be obtained from linear combinations of other columns.

    What is the purpose of using matrices in graphic representations?

      • To visualize equations and computations.

    What are the components of Quantitative PCR?

      • Determine Double Delta-CT

    What does qPCR stand for?

      • Quantitative Polymerase Chain Reaction

    Microarrays can be used for total RNA analysis.

      • True

    What is the purpose of capturing fluorescence in qPCR?

      • To measure the amount of DNA during amplification

    What is one of the reasons for conducting transcriptomic experiments?

      • General gene discovery

    What does the term Delta-CT refer to in qPCR?

      • The difference between the target gene's CT value and the housekeeping gene's CT value

    What are the main steps in Quantitative PCR?

      • Reverse Transcription, Amplify, Capture fluorescence, Repeat steps for 29 more rounds, Determine Double Delta-CT

    The calculated Fold Change in qPCR is given by the formula 2 raised to the power of negative ΔΔCT.

      • 2^(-ΔΔCT)

    What is the primary purpose of microarrays?

      • To detect changes in gene expression

    List some reasons for conducting transcriptomic experiments.

      • General gene discovery, compare experimental treatments, drug treatments, knockout/knockin studies, toxicology, developmental stages

    Next generation sequencing only uses Total RNA.

      • False

    What does Delta-CT represent in qPCR?

      • The difference between Gene of Interest Experimental and House Keeping Gene Experimental

    PCR stands for ______.

      • Polymerase Chain Reaction

    What is the formula for the sample mean?

      • m = µ = (x1 + x2 + ... + xN) / N

    What is the formula for sample variance?

      • S² = [(x1 - m)² + ... + (xN - m)²] / (N - 1)

    What is the expected value formula?

      • m = E(x) = p1 * x1 + p2 * x2 + ... + pN * xN

    What does independence in probability indicate?

      • P(EF) = P(E) * P(F)

    Which of the following distributions is used to describe rare events in a large population?

      • Poisson

    What type of probability distribution is used for modeling random events occurring over time?

      • Exponential

    What is the t-test primarily used for?

      • Testing differences between means of two groups

    What is the Bonferroni correction in multiple testing?

      • Test each hypothesis at significance level α/m, where m is the number of tests

    What is the use of the Central Limit Theorem?

      • It states that as N goes to infinity, the sample mean will be approximately normally distributed.

    The odds ratio quantifies the strength of the association between two events in terms of the odds of exposure in _____ compared to controls.

      • cases

    What are the parts of a network?

      • All of the above

    An unweighted graph has edges with non-negative weights.

      • False

    What is the degree of a node?

      • The number of edges connected to it.

    What differentiates a multiplex network from a multilayer network?

      • Edges are different in subnetworks

    A path is defined as a walk that does not __________ itself.

      • intersect

    Name one method for determining gene regulatory networks (GRNs).

      • Transcriptomic studies.

    Phylogenetic trees are used to represent evolutionary relationships.

      • True

    What are 'walks' in a network?

      • A sequence of nodes connected by edges

    Which of the following are factors considered in feature selection?

      • Correlations

    Classification categorizes data based on shared characteristics.

      • True

    What is the primary goal of prediction in machine learning?

      • To determine the outcome of a future event.

    Random Forest uses a technique called ______ to construct decision trees.

      • ensemble method

    What does the K in K Nearest Neighbors represent?

      • The number of closest matches considered.

    What types of data can K Nearest Neighbors work with?

      • Binary data

    Match the following algorithms with their primary function:

      • K Nearest Neighbors = Supervised classification based on nearest matches
      • Random Forest = Ensemble method using multiple decision trees
      • K-Means Clustering = Unsupervised classification into clusters
      • Decision Tree = Data categorization based on feature splits

    K-Means Clustering is a type of supervised classification.

      • False

    What do dimensions of matrices affect when multiplying them?

      • The ability to multiply them.

    Which of the following are parts of a gene? (Select all that apply)

      • Promoters

    What are the primary tools used to determine parts of a gene?

      • Next-generation sequencing or SNP-arrays.

    What is the difference between PCR and qPCR?

      • qPCR measures the quantity of DNA in real-time, whereas PCR amplifies DNA without measuring its quantity during the process.

    What is sample space?

      • The set of all possible outcomes in a probability experiment.

    What is a statistic that quantifies the strength of the association between two events?

      • Correlation coefficient.

    Independence follows the rule: P(EF) = P(E)P(F).

      • True

    What is the difference between a directed and undirected graph?

      • A directed graph has edges with a direction, while an undirected graph has edges without direction.

    What does a node's degree tell you?

      • It tells you the number of connections (edges) a node has.

    What is a classification in machine learning?

      • A process of identifying the category to which new data points belong.

    What is the difference between supervised and unsupervised learning?

      • Supervised learning uses labeled data while unsupervised learning uses unlabelled data.

    What are the three different times in which you can select features?

      • Before collecting data, during data preparation, and after model evaluation.

    During which specific situations can features be selected?

      • At the start of the experiment, during data review, and after experiment analysis

    Which choice accurately describes a phase in which features are selected?

      • Feature selection can be done at various stages including pre-analysis and post-analysis.

    What is one key aspect of the timing in feature selection?

      • Can be revisited at different points of the research process.

    Which scenario might not involve feature selection timing?

      • Determining features solely based on literature review prior to the project.

    When is it ideal to reassess feature selection?

      • At any transitional point in the study when new information arises.

    Study Notes

    Course Information

    • This course is cross-listed as BINF5354, STAT5354, and STAT6354.
    • Lectures are held on Tuesdays from 9:00 AM to 10:20 AM.
    • Labs are held on Thursdays from 9:00 AM to 10:20 AM.
    • Dr. Jon Mohl is the instructor and his office hours are Tuesdays from 2:00 PM to 3:00 PM in CCSB 2.0306.
    • The teaching assistant's office hours are Thursdays from 10:30 AM to 11:30 AM in BE302.
    • Announcements and assignments will be posted on Blackboard.
    • The primary communication methods are email or through Teams.

    Additional Resources

    • Recommended books include "Linear Algebra and Learning from Data" by Gilbert Strang, "Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids" by R.Durbin, S.Eddy, A.Krogh, and G.Mitchison, and "Introduction to Probability Models" by Sheldon M. Ross.
    • Students are encouraged to read articles related to the course topics.

    Grading

    • The course grade is determined by homework assignments (50%), a midterm exam (20%), and a group project (30%).
    • The group project consists of a proposal, a final presentation, and a final report.

    Attendance & Civility

    • Attendance is mandatory.
    • Students are expected to be civil, keep their phones on silent, and pay attention during lectures and labs.

    Course Objectives

    • Students will learn practical skills including programming in R and Python, and working on Linux machines.
    • Students will gain theoretical knowledge in linear algebra, probability/statistics, machine learning, and bioinformatics.

    Projects

    • Projects will utilize genomics data from VCF files.

    Matrices

    • A matrix is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns.
    • Matrices are used to represent systems of equations, graphs, and other mathematical objects.
    • The number of rows and columns in a matrix is referred to as its dimensions, for example, a matrix with m rows and n columns is called an m x n matrix.
    • The rank of a matrix is the number of linearly independent columns.
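
As a worked illustration, the quiz system X1 + 3X2 = 8 and X1 + 2X2 = 6 can be written as a 2 x 2 coefficient matrix and a right-hand side. The following is a minimal pure-Python sketch, not course code; the helper names `rank` and `solve_2x2` are assumptions made for this example, and a numerical library would normally be used instead.

```python
from fractions import Fraction

def rank(matrix):
    """Count linearly independent columns via Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    pivot_row = 0
    for col in range(n_cols):
        # Find a row at or below pivot_row with a nonzero entry in this column.
        pivot = next((r for r in range(pivot_row, n_rows) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        # Eliminate this column from every row below the pivot.
        for r in range(pivot_row + 1, n_rows):
            factor = rows[r][col] / rows[pivot_row][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
        if pivot_row == n_rows:
            break
    return pivot_row

def solve_2x2(a, b):
    """Solve a 2 x 2 linear system with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("coefficient matrix is singular")
    x1 = Fraction(b[0] * a[1][1] - a[0][1] * b[1], det)
    x2 = Fraction(a[0][0] * b[1] - b[0] * a[1][0], det)
    return x1, x2

coefficients = [[1, 3], [1, 2]]   # X1 + 3*X2 = 8 ; X1 + 2*X2 = 6
right_hand_side = [8, 6]

solution = solve_2x2(coefficients, right_hand_side)
print(rank(coefficients), solution)   # full rank, so the solution is unique
print(rank([[1, 2], [2, 4]]))         # second column is twice the first
```

Because the coefficient matrix has two independent columns, the system has exactly one solution (X1 = 2, X2 = 2); the second matrix has only one independent column.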

    Uses of Matrices in Biology

    • Matrices are used to model gene regulatory networks, metabolic networks, infectious disease models, and survival analysis.

    Genomics

    • Genomics is the study of genomes, which are the complete sets of genetic material in an organism.
    • Genomes consist of DNA and include regulatory factors and repetitive elements in addition to genes.
    • Variations within a genome include large deletions, large duplications, translocations, and single nucleotide variations (SNVs).
    • SNVs include insertions (adding a nucleotide), deletions (removing a nucleotide), and polymorphisms (variations in DNA sequence within a population).

    Variant Call Format (VCF)

    • VCF is a file format for storing sequence information, including DNA variants.
    • It is a tab-delimited file with a header region containing version information, chromosome/scaffold/contig (sequence of DNA within a chromosome) details, and information about the content of the file itself.
    • The VCF file can contain information on one or more samples.
    • Each record contains general position information (chromosome, position, reference allele, and alternate allele(s)) and may also carry annotations, such as the affected gene, in the INFO column.
    • Sample-specific information included in VCF files includes allele calls, site depth, the number of counts per allele, and quality.
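
To make the layout concrete, the sketch below parses a tiny VCF-style text with the standard library only. The file contents, the sample name `sample1`, and the function name `parse_vcf` are invented for illustration; it assumes well-formed, tab-delimited input and is not a substitute for an established VCF parser.

```python
# Hypothetical VCF content; every value here is made up for the example.
vcf_text = """##fileformat=VCFv4.2
##contig=<ID=chr1>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1
chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP:AD\t0/1:30:14,16
"""

def parse_vcf(text):
    """Return (meta_lines, records); each record is a dict keyed by column name."""
    meta, columns, records = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)                          # header metadata
        elif line.startswith("#"):
            columns = line.lstrip("#").split("\t")     # column names
        elif line:
            fields = dict(zip(columns, line.split("\t")))
            fields["ALT"] = fields["ALT"].split(",")   # one or more alternate alleles
            format_keys = fields["FORMAT"].split(":")
            # Every column after FORMAT holds the calls for one sample.
            fields["samples"] = {
                name: dict(zip(format_keys, fields[name].split(":")))
                for name in columns[9:]
            }
            records.append(fields)
    return meta, records

meta, records = parse_vcf(vcf_text)
first = records[0]
print(first["CHROM"], first["POS"], first["REF"], first["ALT"])
print(first["samples"]["sample1"])   # allele call (GT), depth (DP), per-allele counts (AD)
```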

    Group Project

    • Students are divided into 3 groups.
    • Each group will form a testable hypothesis about population analysis or disease.
    • The groups need to compile genetic information and determine the software necessary to analyze the data.
    • Students will run the analysis on “Biolinux machines” (computer systems specifically designed for biological research) and prepare a preliminary presentation outlining the results.
    • They will complete the project by presenting their final findings in a written report and final presentation.

    Matrices

    • Matrices represent systems of linear equations, graphs, and biological networks.
    • A matrix is a rectangular array of numbers.
    • The dimensions of a matrix are defined by the number of rows and columns.
    • The rank of a matrix is the number of linearly independent columns.
    • Matrices are used to represent systems of equations, graphs, gene regulatory networks, metabolic networks, infectious disease models, and survival analysis.
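
One way to see how a matrix represents a graph is an adjacency matrix, where each row marks which nodes a given node connects to. The sketch below builds one for a hypothetical five-gene undirected network; the gene names and edges are placeholders, not data from the course.

```python
# Hypothetical 5-gene network; names and edges are placeholders.
genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
edges = [("geneB", "geneA"), ("geneB", "geneD"), ("geneC", "geneE")]

index = {name: i for i, name in enumerate(genes)}
adjacency = [[0] * len(genes) for _ in genes]
for source, target in edges:
    # Undirected network: mark the connection in both directions.
    adjacency[index[source]][index[target]] = 1
    adjacency[index[target]][index[source]] = 1

for name, row in zip(genes, adjacency):
    print(name, row)

# The neighbors of geneB are read directly from its row.
row_b = adjacency[index["geneB"]]
neighbors_of_b = [genes[j] for j, flag in enumerate(row_b) if flag]
print(neighbors_of_b)
```

Here the row for geneB is 1, 0, 0, 1, 0, meaning it is connected to the first and fourth genes. A directed network would fill in only one of the two entries per edge.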

    Genomics

    • Genomics is the study of genomes, including the sequencing and analysis of genetic information.
    • Genomes contain a variety of elements, including genes, regulatory factors, repetitive elements, exons, introns, and untranscribed elements.
    • Variations within the genome include large deletions, large duplications, translocations, and single-nucleotide variations (SNVs).
    • SNVs can be insertions, deletions, or polymorphisms.

    Applications of Genomics

    • Genomic information can be used for population and ancestry structure studies, genome-wide association studies (GWAS), and determining the effects of variants on gene function.
    • SNVs can be synonymous or non-synonymous, meaning they may or may not alter the amino acid sequence of a protein.
    • Variants can also disrupt regulatory elements affecting gene expression.

    Variant Call Format (VCF)

    • VCF is a file format used to store sequence information, including variant calls.
    • VCF files are tab-delimited and contain a header region providing information about the file format, chromosome/scaffold/contig details, and content information.
    • VCF files can contain data for one or more samples and include general position information as well as sample-specific information.
    • VCF files can contain all sites or only those with variants.

    Group Project

    • Students will work in groups to develop and test a hypothesis related to population analysis or a specific disease.
    • Each group will collect genetic information and determine the necessary software for their analysis.
    • The groups will present preliminary findings and reports, run their analyses on Biolinux machines, and provide final write-ups and presentations.

    Transcriptomics

    • Transcriptomics studies the complete set of RNA transcripts produced by an organism.
    • Useful for understanding gene expression patterns and identifying genes involved in specific processes.
    • Common techniques include real-time PCR (qPCR), microarray analysis, and next-generation sequencing (NGS).

    Real-time PCR (qPCR)

    • A technique used for quantifying specific DNA or RNA molecules.
    • Uses reverse transcription to convert RNA to cDNA followed by amplification with specific primers.
    • Fluorescent probes detect amplified DNA in real-time, allowing for quantification of the target molecule.

    Quantitative PCR

    • Uses a cycle threshold (CT) value to measure the amount of target molecule present in a sample.
    • Lower CT values indicate higher target molecule abundance.
    • Calculation of ΔΔCT is used to determine fold change in gene expression between experimental and control groups.

    Microarrays

    • A technology that allows for simultaneous analysis of thousands of genes.
    • Short DNA sequences (probes) specific to individual genes are attached to a solid surface.
    • Fluorescently labeled cDNA from samples is hybridized to the array.
    • The amount of fluorescence bound to each probe reflects the abundance of the corresponding mRNA in the sample.

    Next-Generation Sequencing (NGS)

    • High-throughput sequencing technology that can sequence millions or billions of DNA fragments simultaneously.
    • Total RNA can be sequenced using NGS to obtain a snapshot of the transcriptome.
    • Poly-A capture is used to isolate mRNA, and specific probe sequences can be used to enrich for target transcripts.

    Why do transcriptomic experiments?

    • Gene discovery: Identify new genes involved in specific processes or cellular functions.
    • Compare experimental treatments: Determine the effect of treatments (drugs, knockouts/knockins, toxins) on gene expression.
    • Toxicology: Assess the impact of toxins on gene expression and cellular pathways.
    • Developmental stages: Study changes in gene expression during development.

    Transcriptomics

    • Transcriptomics is the study of the complete set of RNA transcripts in a cell or organism
    • It involves studying the structure, function, and regulation of RNA molecules
    • Three common techniques for transcriptomic analysis: real-time PCR (qPCR), Microarrays, and Next Generation Sequencing

    Real-Time PCR (qPCR)

    • Amplifies DNA by using primers to specifically target DNA segments
    • Is used for detection, cloning, and sequencing of specific regions of DNA
    • Quantification of PCR products is achieved by measuring fluorescence during the reaction

    Quantitative PCR

    • Reverse Transcription: Conversion of RNA to cDNA to be used in PCR reaction
    • Amplify: Amplifies the DNA from the reverse transcription step
    • Capture Fluorescence: Fluorescence is measured, which is proportional to the amount of cDNA amplified
    • Repeat: Steps 2 and 3 (amplify and capture fluorescence) are repeated for 29 more cycles
    • Determine the Double Delta-CT: Used to calculate the fold change of the gene of interest from four CT values:
      • Gene of Interest Experimental (TE)
      • Gene of Interest Control (TC)
      • House Keeping Gene Experimental (HE)
      • House Keeping Gene Control (HC)

    Calculating Fold Change

    • Delta-CT (ΔCT): Difference in cycle threshold (CT) between the gene of interest and the housekeeping gene
      • ΔCTE: TE-HE
      • ΔCTC: TC-HC
    • Double Delta-CT (ΔΔCT): Used to calculate the relative fold change in gene expression
      • ΔΔCT = ΔCTE - ΔCTC
    • Fold Change: Given by 2^(-ΔΔCT), the relative expression of the gene of interest in the experimental sample compared with the control
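
The calculation can be sketched in a few lines of Python. This is a minimal example with made-up CT values; the function name `fold_change` is an assumption for illustration, and the base of 2 assumes the amount of product roughly doubles each cycle.

```python
def fold_change(te, he, tc, hc):
    """Relative expression by the double delta-CT method.

    te / tc: CT of the gene of interest (experimental / control)
    he / hc: CT of the housekeeping gene (experimental / control)
    """
    delta_ct_e = te - he                       # ΔCTE
    delta_ct_c = tc - hc                       # ΔCTC
    delta_delta_ct = delta_ct_e - delta_ct_c   # ΔΔCT
    return 2 ** (-delta_delta_ct)

# Made-up CT values: ΔCTE = 4, ΔCTC = 7, ΔΔCT = -3.
print(fold_change(te=22.0, he=18.0, tc=25.0, hc=18.0))
```

With these values the gene of interest crosses the threshold three cycles earlier (relative to the housekeeping gene) in the experimental sample, which corresponds to an 8-fold increase.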

    Microarrays

    • Use a solid surface with thousands of probes that represent different genes
    • Samples of cDNA, labeled with fluorescent dyes, are hybridized to the microarray
    • The amount of fluorescence detected at each probe indicates the level of expression of the corresponding gene
    • Advantages of Microarrays:
      • Allows the simultaneous assessment of the expression of thousands of genes
      • Provides high-throughput analysis of gene expression patterns
      • Identifies gene expression profiles that correlate with specific biological states or conditions
    • Limitations of Microarrays:
      • Can be expensive and technically challenging
      • May not be as sensitive or specific as qPCR
      • Limited to known, pre-selected target genes

    Next-generation Sequencing

    • Involves sequencing millions or billions of DNA fragments simultaneously
    • Provides a more comprehensive view of the transcriptome, uncovering novel transcripts, gene fusions, and splice variants
    • The technology enables researchers to identify the abundance and expression of all transcripts in a sample
    • This approach allows for the discovery of previously unknown genes and transcripts and the identification of changes in gene expression in response to various stimuli
    • Steps involved in Next Generation Sequencing:
      • Isolating total RNA from a sample
      • Poly-A Capture: Selectively isolates mRNA molecules that are polyadenylated
      • Specific Probes: Used to target and amplify specific sequences of interest

    Why Conduct Transcriptomic Experiments?

    • Gene Discovery: Identifies novel genes or transcripts
    • Comparing Experimental Treatments: To study the effects of different treatments on gene expression
      • Drug treatments
      • Knockout/knockin studies
      • Toxicology
      • Developmental stages

    Designing a Transcriptomic Experiment

    • Assumptions: Understand the specific hypothesis being tested and have data for the control group
    • Expected Results: Define the types of changes in gene expression expected to be observed in the experiment based on the hypothesis and its potential significance
    • Potential Pitfalls: Identify factors that could affect the outcome of the experiment
      • Variations in RNA isolation, contamination, and technical errors during PCR amplification

    Mean and Variance

    • Sample Mean: The average of a set of data points, denoted by 'm' or 'µ'.
    • Sample Variance: Measures how spread out the data are around the mean, denoted by 'S²', calculated by summing the squared deviations of each data point from the mean and dividing by N - 1.

    Expected Value

    • The average value of a random variable, denoted by 'E(x)', calculated by taking the weighted average of all possible values of the variable, where the weights are the probabilities of each value.

    Variance

    • A measure of how spread out a random variable is around its expected value, denoted by 'σ²'.
    • Calculated as the probability-weighted average of the squared deviations of each possible value from the expected value: σ² = E[(x - E(x))²].

    Standard Deviation

    • The square root of the variance, a measure of dispersion that is in the same units as the data.
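
The quantities above can be computed directly. The sketch below uses made-up measurements for the sample statistics and a fair six-sided die for the expected value and variance; the variable names are illustrative only.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]           # made-up measurements
m = sum(data) / len(data)                 # sample mean
s2 = sum((x - m) ** 2 for x in data) / (len(data) - 1)   # sample variance (divide by N - 1)
s = math.sqrt(s2)                         # sample standard deviation

# The standard library gives the same results.
assert math.isclose(m, statistics.mean(data))
assert math.isclose(s2, statistics.variance(data))

# Expected value and variance of a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
expected = sum(p * x for p, x in zip(probs, values))
variance = sum(p * (x - expected) ** 2 for p, x in zip(probs, values))
print(m, s2, s, expected, variance)
```

Note the difference between the two variance calculations: the sample version divides by N - 1, while the random-variable version weights each squared deviation by its probability.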

    Sample Space

    • The set of all possible outcomes of an experiment.
    • Examples:
      • A coin toss: {Heads, Tails}
      • Rolling a die: {1, 2, 3, 4, 5, 6}
      • Two coin tosses: {HH, HT, TH, TT}

    Probability

    • A number between 0 and 1 that measures how likely an event is to occur.
    • Examples:
      • Tossing a fair coin: 1/2 probability of getting heads
      • Rolling a specific number on a fair die: 1/6 probability

    Independence

    • Two events are independent if the occurrence of one does not affect the probability of the other.
    • P(E and F) = P(E) * P(F)

    Conditional Probabilities

    • The probability of an event happening given that another event has already occurred.
    • P(E|F) = P(E and F) / P(F)
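
Both rules can be checked by enumerating a small sample space. The sketch below uses two fair coin tosses, with E = "first toss is heads" and F = "second toss is heads"; the helper `prob` is an assumption for this example and only applies when all outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

sample_space = list(product("HT", repeat=2))   # HH, HT, TH, TT

def prob(event):
    """Probability of an event (a predicate) when all outcomes are equally likely."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))

def first_heads(outcome):      # event E
    return outcome[0] == "H"

def second_heads(outcome):     # event F
    return outcome[1] == "H"

def both_heads(outcome):       # event E and F
    return first_heads(outcome) and second_heads(outcome)

p_e, p_f, p_ef = prob(first_heads), prob(second_heads), prob(both_heads)
print(p_ef == p_e * p_f)   # independence: P(E and F) = P(E) * P(F)
print(p_ef / p_f)          # conditional probability P(E|F)
```

Because the tosses are independent, P(E|F) comes out equal to P(E): knowing the second toss tells you nothing about the first.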

    Probability Distributions

    • Describe the probability of each possible outcome of a random variable.
    • Examples:
      • Binomial: For events with two possible outcomes (e.g., coin toss).
      • Poisson: For rare events (e.g., mutations in a cell population).
      • Exponential: For events occurring at a constant rate over time (e.g., protein decay).
      • Gaussian (Normal): For averages of many trials (e.g., height).
      • Log-normal: When the logarithm of a variable has a normal distribution.
      • Chi-squared: For the sum of squares of independent standard normal variables (e.g., a squared distance in multiple dimensions).
      • Multivariable Gaussian: For probabilities of a vector of variables.

    Normal Distribution

    • A continuous probability distribution that is bell-shaped and symmetrical.
    • Many biological and physiological measurements follow a normal distribution.
    • Central Limit Theorem: As the sample size increases, the distribution of the sample mean tends towards a normal distribution, even when the underlying data are not normally distributed (provided the variance is finite).
    • Skewness: A measure of the asymmetry of the distribution. Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail.

    Poisson Distribution

    • A discrete probability distribution describing the probability of a given number of events occurring in a fixed interval of time or space.
    • Used for rare events.

    Exponential Distribution

    • A continuous probability distribution that describes the time until the next event.
    • Used for random events that occur at a constant rate over time.

    Binomial Distribution

    • A discrete probability distribution used to calculate the probability of a certain number of successes in a given number of trials.
    • Each trial has two possible outcomes: success or failure.
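
The three distributions described above have simple closed forms. The sketch below writes them out with the standard library; the function names and the example parameters are assumptions chosen for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) when lam events are expected per interval."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def exponential_survival(t, rate):
    """P(waiting time until the next event exceeds t) for a constant event rate."""
    return math.exp(-rate * t)

print(binomial_pmf(5, 10, 0.5))         # 5 heads in 10 fair coin tosses
print(poisson_pmf(0, 2.0))              # no events when 2 are expected
print(exponential_survival(1.0, 2.0))   # waiting longer than 1 time unit at rate 2
```

The last two values are equal: seeing no events in one interval (Poisson) is the same as the waiting time to the next event exceeding that interval (exponential).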

    Odds Ratio

    • A statistic used to quantify the strength of the association between two events.
    • Measures the odds of exposure in cases versus controls.
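
For a 2 x 2 table of exposure by case/control status, the odds ratio is the odds of exposure among cases divided by the odds of exposure among controls. A minimal sketch with made-up counts follows; the function name and argument names are assumptions for this example.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure in cases divided by odds of exposure in controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls

# Made-up counts: 30 of 100 cases exposed, 10 of 100 controls exposed.
print(odds_ratio(30, 70, 10, 90))
```

A value above 1 means exposure is more common among cases; a value of 1 means no association. This sketch does not handle tables with a zero cell, which would raise a division error.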

    T-test

    • A statistical test used to determine if there is a significant difference between the means of two groups.
    • Paired (Dependent) T-test: Used when the two groups are related (e.g., before and after treatment).
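    • Equal Variance (Independent) T-test: Used when the two groups are independent and can be assumed to have the same variance (Student's t-test).
    • Unequal Variance (Independent) T-test: Used when the two groups are independent and have different variances (Welch's t-test).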
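
As an illustration of the unequal-variance case, the sketch below computes Welch's t statistic and its approximate degrees of freedom with the standard library. The data and the function name `welch_t` are made up; turning the statistic into a p-value requires the t distribution, which is not in the standard library and is left out here.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two independent samples."""
    va = statistics.variance(a) / len(a)   # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)   # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

control = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up measurements
treated = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(control, treated)
print(t_stat, dof)
```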

    Multiple Hypothesis Testing

    • The problem of testing multiple hypotheses simultaneously, leading to an increased risk of Type I errors (false positives).

    Multiple Testing Correction

    • Methods used to control the risk of false positives when testing multiple hypotheses.
    • Bonferroni correction: Divides the significance level (alpha) by the number of tests.
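    • Benjamini-Hochberg procedure: Controls the false discovery rate by ranking the p-values from smallest to largest and comparing the p-value at rank k with (k/m) × alpha, where m is the number of tests.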
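
The two corrections can be compared on a small set of p-values. This is a minimal sketch with invented p-values; the function names are assumptions, and the Benjamini-Hochberg version shown is the standard step-up form (reject every hypothesis up to the largest rank k whose p-value is at most (k/m) × alpha).

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]   # made up
print(sum(bonferroni(p_values)), sum(benjamini_hochberg(p_values)))
```

On these values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, reflecting that controlling the false discovery rate is less strict than controlling the chance of any false positive.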

    Network Terms

    • Nodes: Points or entities in a network
    • Edges: Connections between nodes
    • Leaves: Nodes with only one connection

    Weighted vs. Unweighted Networks

    • Unweighted: All edges have equal importance
    • Weighted: Edges have different values or strengths

    Directed vs. Undirected Networks

    • Directed: Edges have a direction, one-way flow
    • Undirected: Edges are bidirectional, two-way flow

    Cycles and Acyclic Networks

    • Cycle: A closed path in a network, starting and ending at the same node
    • Acyclic: Networks with no cycles

    Trees

    • Tree: A connected acyclic network

    Multilayer and Multiplex Networks

    • Multilayer: Multiple networks connected by edges between the subnetworks
    • Multiplex: Same nodes, but edges represent different relationships or contexts

    Degree of a Node

    • The number of edges connected to a node
    • Indicates the node's influence or connectivity

    Walks and Paths

    • Walks: Sequence of nodes with connected edges
    • Self-avoiding walks: Walks that don't repeat nodes
    • Paths: Walks that don't intersect themselves
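    • Shortest path: The route between two nodes with the fewest edges (or the lowest total edge weight); algorithms such as breadth-first search and Dijkstra's algorithm find it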
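
Degree and shortest paths are easy to compute on a small example. The sketch below uses a made-up undirected, unweighted network and breadth-first search; the node names and the function name `shortest_path` are assumptions for illustration.

```python
from collections import deque

# Made-up undirected, unweighted network.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("D", "E")]

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

# Degree: number of edges connected to each node.
degree = {node: len(neighbors) for node, neighbors in graph.items()}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:   # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

print(degree)
print(shortest_path(graph, "A", "E"))
```

Node E has degree 1, so it is a leaf, and the shortest path from A to E goes through D. Breadth-first search only applies to unweighted networks; weighted networks need an algorithm such as Dijkstra's.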

    Phylogenetic Trees

    • Represent evolutionary relationships between organisms

    Gene Regulatory Networks (GRNs)

    • Nodes are genes or regulatory elements
    • Edges represent regulatory interactions between genes

    Determining GRNs

    • Transcriptomic studies: Analyze gene expression levels
    • Knockout: Deactivate a gene to study its effect
    • Knockin: Insert a gene to observe its influence
    • Drugs: Study drug interactions with specific genes

    Multiplex Networks in Different Contexts

    • Different cell types or conditions can be represented in a multiplex network
    • Allows analysis of complex interactions across different contexts

    Protein-Protein Interactions

    • Nodes represent proteins
    • Edges signify interactions between proteins, such as binding or complex formation
    • Important for understanding cellular processes and signaling pathways

    Machine Learning Intro

    • Machine learning is a field that uses computer science, mathematics, statistics, and biology to solve problems.

    Defining the Problem

    • Machine learning solves problems through classification and prediction.
    • Classification categorizes data into groups based on shared characteristics.
    • Prediction determines the outcome of a future event.
    • Key considerations when defining a problem:
      • Accuracy, sensitivity, and specificity.
      • Throughput, especially whether the solution is limited to a small number of use cases.
      • Which features are necessary.
      • How many and which samples to use.

    Feature Selection

    • Feature selection determines which features are important for machine learning.
    • Feature selection helps to:
      • Identify meaningful features.
      • Determine correlations between features.
    • Filter methods test for correlations, such as univariate and multivariate analysis.
    • Wrapper methods select and test groups of features to identify the best combination.
    • Embedded methods are part of the machine learning algorithm itself, where feature selection is integrated into the learning process.

    Decision Tree

    • Decision trees can work with continuous, discrete, and categorical data.
    • Steps:
      • Determine the best split to separate data based on a specific feature.
      • Move samples along the tree based on the split criteria.
      • Repeat the process until a decision is reached.
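
    The "best split" step can be sketched for a single continuous feature. This example assumes Gini impurity as the split criterion (other criteria exist); the data and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return (threshold, impurity)."""
    unique = sorted(set(values))
    best = (None, gini(labels))  # no split at all is the baseline
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Impurity of the split, weighted by the size of each side.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


expression = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]  # made-up expression levels
status = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]
threshold, impurity = best_threshold(expression, status)
```

    A full decision tree would repeat this search over every feature and then recurse on each side of the chosen split.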

    Random Forest

    • Random forest is an ensemble method that combines multiple decision trees for improved prediction.
    • Steps:
      • Draw a random subset of the data for each tree: samples are drawn with replacement (so duplicates are possible), and a random subset of features is considered when splitting.
      • Construct decision trees using the subset of features.
      • Combine the predictions from all the decision trees (for example, by majority vote) to determine the final prediction.

    Random Forest Variable Importance

    • Mean Decrease Gini: A measure of how much each feature contributes to reducing the impurity in the decision tree.
      • Higher Mean Decrease Gini indicates a more important feature.
    • Random forest can be used to identify the most important variables based on the Mean Decrease Gini score.

    K Nearest Neighbors (KNN)

    • KNN is a supervised classification method that groups samples based on their similarity to known samples.
    • How it works:
      • Determines the class of an unknown sample by considering its K nearest neighbors, those with the smallest distance in multi-dimensional space.
    • KNN requires data to be binary or continuous, with the option to use principal component analysis to transform data if necessary.
    • Distance metrics measure the similarity between samples:
      • Euclidean distance (straight-line distance between two points).
      • Manhattan distance (sum of the absolute differences between coordinates across all dimensions).
      • Jaccard similarity coefficient (presence/absence between two sets).
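
    The three measures and a basic KNN vote can be sketched as follows. The points, labels, and function names are invented for illustration, and ties are broken arbitrarily in this simplified version.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_similarity(a, b):
    """Shared items divided by all items present in either set."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def knn_predict(train_points, train_labels, query, k, distance=euclidean):
    """Classify `query` by majority vote among its k nearest labeled points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["A", "A", "A", "B", "B", "B"]
predicted = knn_predict(points, labels, (2.0, 2.0), k=3)
```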

    K-Means Clustering

    • K-means clustering is an unsupervised method that groups samples into K clusters.
    • How it works:
      • Randomly selects points in the data as initial centroids (representatives of each cluster).
      • Assigns each sample to the closest centroid.
      • Re-calculates the centroids based on the assigned samples.
      • Repeats the process until the cluster assignments stabilize.
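
    A minimal sketch of the loop described above. To keep the example reproducible, the initial centroids are passed in rather than drawn at random (in practice they could be chosen with `random.sample`); the data and the function name `k_means` are assumptions for illustration.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Basic k-means: returns (cluster index per point, final centroids)."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assign each point to the index of its closest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments have stabilized
        assignments = new_assignments
        # Move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments, centroids


points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignments, centroids = k_means(points, [points[0], points[3]])
```

    The result can depend on the starting centroids, which is why k-means is often run several times with different initializations.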

    Matrices

    • Dimensions matter when multiplying matrices
    • Used to represent systems of equations
    • Used to represent networks

    Genomics

    • Parts of a gene:
      • Exons and Introns
      • Untranslated regions (UTRs)
      • Regulatory elements:
        • Promoters
    • Determined via next-generation sequencing or SNP-arrays

    Transcriptomics

    • PCR = Polymerase Chain Reaction
    • qPCR = Quantitative PCR
    • qPCR, microarrays, and RNAseq experiments are all used to analyze gene expression
    • Microarrays use probes to detect mRNA, while RNAseq sequences the entire transcriptome
    • RNAseq is considered the most accurate method, but it is also the most expensive

    Probability

    • Sample space: set of all possible outcomes of an experiment
    • Types of probability distributions:
      • Normal distribution
      • Poisson distribution
      • Binomial distribution
    • Statistic quantifying association between events:
      • Correlation coefficient
    • Multiple testing correction:
      • Correcting for the increased probability of false positives when conducting multiple statistical tests

    Independence

    • P(E and F) = P(E) * P(F)

    Conditional Probabilities

    • P(E|F) = Probability of Event E, given that Event F has already occurred
    • Formula (Bayes' theorem): P(E|F) = P(F|E)P(E) / P(F)
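
    A short worked example of computing P(E|F) from P(F|E), P(E), and P(F). The events and all probabilities below are hypothetical, chosen only to show the arithmetic.

```python
def bayes_posterior(p_f_given_e, p_e, p_f):
    """P(E|F) = P(F|E) * P(E) / P(F)."""
    return p_f_given_e * p_e / p_f


# Hypothetical setup: E = "sample carries a variant", F = "assay is positive".
p_e = 0.01              # P(E)
p_f_given_e = 0.95      # P(F|E)
p_f_given_not_e = 0.05  # P(F|not E)

# The law of total probability gives P(F).
p_f = p_f_given_e * p_e + p_f_given_not_e * (1 - p_e)
posterior = bayes_posterior(p_f_given_e, p_e, p_f)
```

    Here the posterior is about 0.16: even with a positive assay, the low prior P(E) keeps P(E|F) small.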

    Networks

    • Multilayer: Different networks connected by edges
    • Multiplex: Nodes are the same, but edges are different in the subnetworks

    Networks

    • Directed graph: Edges have direction
    • Undirected graph: Edges have no direction
    • Node's degree: Number of edges connected to a node
    • Path: Sequence of nodes connected by edges
    • Networks can be used to model and explain biological concepts, such as protein-protein interactions or gene regulatory networks

    Machine Learning

    • Classification: Categorize data into groups
    • Prediction: Estimate the value of a variable
    • Three different times to select features:
      • During data collection
      • During data preprocessing
      • During model training
    • Decision tree: Model used for classification and regression that uses a tree-like structure to make decisions
    • Supervised learning: Train a model on labeled data
    • Unsupervised learning: Train a model on unlabeled data

    Mid-Term Layout

    • Part 1: Knowledge base (50 points)
      • In-class
    • Part 2: Critical Review of a paper (50 points)
      • Take home (Due Oct 17)

    Synonymous vs. Non-Synonymous Variations

    • Synonymous variations do not change the amino acid sequence, while non-synonymous variations do.

    Variant Call Format (VCF)

    • The Variant Call Format (VCF) is a standardized file format used to store and exchange genetic variation data.

    Genomic Variations

    • Translocations involve the exchange of DNA segments between non-homologous chromosomes.

    Gene Components

    • Introns are elements included in genes but not part of the coding region.

    Matrix Terminology

    • The term "rank" in matrices refers to the number of linearly independent rows or columns in a matrix.

    Matrix Equations

    • A system of equations can be solved using matrices if it can be represented in the form Ax = b, where A is the matrix of coefficients, x is the vector of unknowns, and b is the vector of constants.
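
    As an illustration of the Ax = b form, the sketch below solves a small two-equation system with Cramer's rule. The system x1 + 3*x2 = 8, x1 + 2*x2 = 6 and the helper name `solve_2x2` are chosen for the example; larger systems would normally use Gaussian elimination or a numerical library.

```python
from fractions import Fraction


def solve_2x2(a, b):
    """Solve a 2x2 system A x = b with Cramer's rule; A must be invertible."""
    (a11, a12), (a21, a22) = a
    det = Fraction(a11) * a22 - Fraction(a12) * a21
    if det == 0:
        raise ValueError("matrix is singular; no unique solution")
    x1 = (Fraction(b[0]) * a22 - Fraction(a12) * b[1]) / det
    x2 = (Fraction(a11) * b[1] - Fraction(b[0]) * a21) / det
    return x1, x2


# x1 + 3*x2 = 8 and x1 + 2*x2 = 6
A = [[1, 3], [1, 2]]  # matrix of coefficients
b = [8, 6]            # vector of constants
x1, x2 = solve_2x2(A, b)
```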

    Matrix Dimensions

    • "m by n" in a matrix represents its dimensions, indicating that it has m rows and n columns.

    Matrices in Biology

    • Matrices can be used in biology to model genetic relationships, analyze protein interactions, and study population dynamics.

    Independent Columns

    • A column in a matrix is considered independent if it cannot be expressed as a linear combination of the other columns.

    Matrix Properties

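    • A matrix can have more columns than rows or more rows than columns; the statement that a matrix can have more columns than rows "but not vice versa" is incorrect.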

    Matrix of Coefficients

    • In the context of systems of equations, the matrix of coefficients represents the coefficients of the variables in each equation.

    Matrices in Technology

    • Matrices have applications in computer graphics, data analysis, and machine learning.

    Single Nucleotide Variation (SNV)

    • SNV involves a change in a single nucleotide in a DNA sequence.

    VCF File Structure

    • A primary characteristic of VCF file structure is its use of tab-delimited text format for storing variant information.

    Genetic Variation and Disease

    • Genome-wide association studies (GWAS) can be conducted to investigate the relationship between genetic variations and disease susceptibility.

    Silent Mutations

    • Silent mutations are genetic variations that do not result in a change in the amino acid sequence of a protein.

    VCF File Content

    • Genotype information is the standard sample-specific information in a VCF file; each sample column typically records the genotype call for that variant.

    Matrices and Equations

    • Matrices are used in relation to equations to simplify and solve systems of linear equations.

    Matrix Rank

    • The rank of a matrix is always less than or equal to the smaller of its number of rows and its number of columns.
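
    Rank can be computed by row reduction. The sketch below uses exact fractions to avoid rounding problems; the example matrices and the function name `matrix_rank` are illustrative.

```python
from fractions import Fraction


def matrix_rank(matrix):
    """Rank via Gaussian elimination on an exact copy of the matrix."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


full = [[1, 3], [1, 2]]
dependent = [[1, 2, 3], [2, 4, 6], [0, 1, 1]]  # second row is 2 x the first
wide = [[1, 2, 3], [4, 5, 6]]                  # 2 by 3
```

    In the `dependent` matrix one row is a multiple of another, so the rank drops below the number of rows.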

    Matrix Row Representation

    • A row of '1, 0, 0, 1, 0' in a matrix can indicate that a specific entity or variable is present in only the first and fourth positions.

    Column Independence

    • For a column to be considered independent in a matrix, it cannot be expressed as a linear combination of the other columns; that is, no weighted sum of the other columns reproduces it.

    Matrix Notation

    • "m by n" is used to describe the dimensions of a matrix, indicating that it has "m" rows and "n" columns.

    Matrices in Biology (Application)

    • Matrices are specifically applied in bioinformatics for sequence alignment analysis, where they can represent DNA or protein sequences.

    Independent Column Feature

    • Independent columns in a matrix often contain information about distinct variables or characteristics.

    Matrices in Graphics

    • Matrices are used in graphic representations to perform transformations, such as rotations, translations, and scaling of objects.

    Quantitative PCR Components

    • Quantitative PCR (qPCR) components include:
      • DNA template
      • Primers
      • PCR master mix
      • Fluorescent dye or probe

    qPCR Definition

    • qPCR stands for Quantitative Polymerase Chain Reaction.

    Microarray RNA Analysis

    • Microarrays can be used for total RNA analysis to study gene expression patterns across a large number of genes.

    qPCR Fluorescence

    • Capturing fluorescence in qPCR is used to quantify the amount of target DNA present in the reaction.
      • Increased fluorescence indicates higher amounts of amplified target DNA.

    Transcriptomic Experiments

    • Transcriptomic experiments are conducted to investigate changes in gene expression, which can provide insights into biological processes, diseases, and drug response.

    Delta-CT (qPCR)

    • Delta-CT in qPCR represents the difference in cycle thresholds (CT) between the target gene and a reference gene.

    qPCR Steps

    • The main steps in Quantitative PCR include:
      • Denaturation (separation of DNA strands)
      • Annealing (primers bind to DNA)
      • Extension (new DNA strands are synthesized)

    Fold Change (qPCR)

    • The calculated Fold Change in qPCR is given by the formula 2 raised to the power of −ΔΔCT, where ΔΔCT is the ΔCT of the test sample minus the ΔCT of the control sample; it indicates the relative expression level of the target gene.
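
    A small worked example of the fold-change arithmetic, using the common convention fold change = 2^(−ΔΔCT) and assuming roughly 100% amplification efficiency. The CT values and function names are hypothetical.

```python
def delta_ct(ct_target, ct_reference):
    """Delta-CT: target gene CT minus reference gene CT."""
    return ct_target - ct_reference


def fold_change(ct_target_test, ct_ref_test, ct_target_control, ct_ref_control):
    """Relative expression of the target gene, test sample vs. control sample."""
    ddct = delta_ct(ct_target_test, ct_ref_test) - delta_ct(
        ct_target_control, ct_ref_control
    )
    return 2 ** (-ddct)


# Hypothetical CT values: the target crosses threshold 2 cycles earlier
# in the test sample, while the reference gene is unchanged.
fc = fold_change(22.0, 18.0, 24.0, 18.0)
```

    Here ΔΔCT is −2, so the target gene is about 4 times more abundant in the test sample than in the control.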

    Microarray Purpose

    • The primary purpose of microarrays is to measure the expression levels of thousands of genes simultaneously, allowing for comprehensive gene expression profiling.

    Transcriptomic Experiment Reasons

    • Reasons for conducting transcriptomic experiments include:
      • Understanding gene expression patterns in different conditions
      • Identifying biomarkers for disease diagnosis
      • Studying drug response and toxicity

    Next Generation Sequencing

    • Next generation sequencing (NGS) does not always use only total RNA; it can also be used for whole genome sequencing, exome sequencing, and other applications.

    Delta-CT Significance

    • Delta-CT in qPCR represents the difference in cycle thresholds (CT) between the target gene and a reference gene, quantifying the relative expression level of the target gene.
      • A smaller Delta-CT indicates higher target gene expression.

    PCR Definition

    • PCR stands for Polymerase Chain Reaction.

    Sample Mean Formula

    • The formula for the sample mean (denoted by $\bar{x}$) is:
      • $\bar{x}$ = (Σx) / n, where Σx represents the sum of all values in the sample, and n is the sample size.

    Sample Variance Formula

    • The formula for sample variance (denoted by s²) is:
      • s² = Σ(x - $\bar{x}$)² / (n - 1), where x represents each data point, $\bar{x}$ is the sample mean, and n is the sample size.
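
    The sample mean and sample variance formulas translate directly into code. The data values below are made up for illustration.

```python
def sample_mean(values):
    """Sum of the values divided by the sample size n."""
    return sum(values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sample_mean(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sample_mean(data)
variance = sample_variance(data)
```

    Python's standard `statistics` module provides `mean` and `variance` functions that compute the same quantities.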

    Expected Value Formula

    • The expected value (denoted by E(X)) is calculated by:
      • E(X) = ΣxP(x), where x represents each possible value of the random variable X, and P(x) is the probability of that value occurring.
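
    For a discrete random variable, the expected value is a probability-weighted sum. A minimal sketch, using a fair six-sided die as a stand-in example:

```python
def expected_value(distribution):
    """`distribution` maps each possible value x to its probability P(x)."""
    total = sum(distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in distribution.items())


die = {face: 1 / 6 for face in range(1, 7)}
ev = expected_value(die)
```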

    Independence in Probability

    • Independence in probability means that the occurrence of one event does not affect the probability of another event occurring.

    Rare Events Distribution

    • The Poisson distribution is used to describe rare events in a large population.

    Events Over Time

    • The exponential distribution is used for modeling random events occurring over time.

    T-Test Purpose

    • The t-test is primarily used to compare the means of two groups.

    Bonferroni Correction

    • The Bonferroni correction in multiple testing adjusts the significance threshold to account for the increased chance of false positives when performing multiple statistical tests.

    Central Limit Theorem

    • The Central Limit Theorem states that the distribution of sample means from a population will approach a normal distribution as the sample size increases.

    Odds Ratio

    • The odds ratio quantifies the strength of the association between two events as the odds of an event (such as an exposure) in one group divided by the odds of the same event in another group, for example cases versus controls.
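
    A brief sketch of the odds ratio for a 2-by-2 table of cases and controls. The counts are invented, and the function assumes no cell is zero.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls


# Hypothetical study: 30 of 100 cases exposed, 10 of 100 controls exposed.
ratio = odds_ratio(30, 70, 10, 90)
```

    An odds ratio of 1 would indicate no association; here the ratio is roughly 3.9.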

    Network Components

    • The parts of a network are:
      • Nodes (represent entities)
      • Edges (represent connections between entities)

    Unweighted Graph

    • An unweighted graph has edges without any weights assigned to them.

    Node Degree

    • The degree of a node refers to the number of edges connected to it.

    Multiplex vs. Multilayer Network

    • A multiplex network has the same nodes in every layer, with each layer representing a different type of interaction; a multilayer network connects different networks, which may have different nodes, through edges between the layers.

    Path Definition

    • A path is defined as a walk that does not visit any node more than once.

    GRN Determination

    • One method for determining gene regulatory networks (GRNs) is Bayesian network inference, which uses probabilistic relationships between genes to reconstruct the network structure.

    Phylogenetic Trees

    • Phylogenetic trees are used to represent evolutionary relationships between organisms or genes.

    Network Walks

    • 'Walks' in a network are sequences of nodes and edges, where the node at the end of one edge is the beginning of the next; nodes and edges may be repeated.

    Feature Selection Considerations

    • Factors considered in feature selection include:
      • Relevance (degree of association with the target)
      • Redundancy (how much overlap exists between features)
      • Cost (of obtaining and processing features)

    Classification Definition

    • Classification categorizes data into predetermined categories based on shared characteristics.

    Prediction in Machine Learning

    • The primary goal of prediction in machine learning is to build models that can accurately predict future outcomes based on historical data.

    Random Forest Technique

    • Random Forest uses a technique called bagging (bootstrap aggregating) to construct decision trees.

    K Nearest Neighbors (K)

    • The K in K Nearest Neighbors represents the number of nearest neighbors to consider when classifying a new data point.

    KNN Data Types

    • K Nearest Neighbors works directly with numerical data; categorical or mixed data can be used when a suitable distance measure or encoding is chosen.

    Algorithm Functions

    • Algorithm & Primary Function:
      • K-Means Clustering: Unsupervised clustering algorithm that groups data points into clusters based on their similarity.
      • K Nearest Neighbors: Supervised classification algorithm that classifies a new data point based on its proximity to known labeled data points.
      • Decision Tree: Supervised classification and regression algorithm that uses a tree-like structure to make predictions.
      • Random Forest: Ensemble learning method that combines multiple decision trees to improve prediction accuracy and reduce variance.

    Matrix Dimension Impact

    • Dimensions of matrices affect the resulting matrix after multiplication:
      • If the number of columns in the first matrix is not equal to the number of rows in the second matrix, multiplication is not possible.
      • The resulting matrix will have the same number of rows as the first matrix and the same number of columns as the second matrix.
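
    The dimension rule can be checked in a small pure-Python multiplication. The matrices and the function name `mat_mul` are illustrative only.

```python
def mat_mul(a, b):
    """Multiply an m-by-n matrix by an n-by-p matrix (lists of rows)."""
    n_rows_a, n_cols_a = len(a), len(a[0])
    n_rows_b, n_cols_b = len(b), len(b[0])
    if n_cols_a != n_rows_b:
        raise ValueError(
            f"cannot multiply {n_rows_a}x{n_cols_a} by {n_rows_b}x{n_cols_b}"
        )
    # Entry (i, j) is the dot product of row i of a and column j of b.
    return [
        [sum(a[i][k] * b[k][j] for k in range(n_cols_a)) for j in range(n_cols_b)]
        for i in range(n_rows_a)
    ]


a = [[1, 2, 3], [4, 5, 6]]    # 2 by 3
b = [[1, 0], [0, 1], [1, 1]]  # 3 by 2
product = mat_mul(a, b)       # 2 by 2
```

    Multiplying `a` by itself would raise an error, because a 2-by-3 matrix has 3 columns but only 2 rows.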

    Gene Components

    • Parts of a gene include:
      • Promoter (regulates gene expression)
      • Exons (coding sequences)
      • Introns (non-coding sequences)
      • 5' untranslated region (UTR)
      • 3' untranslated region (UTR)
      • Polyadenylation signal (signals end of gene)

    Gene Determination Tools

    • Primary tools used to determine parts of a gene include:
      • DNA sequencing
      • Gene prediction algorithms
      • RNA sequencing

    PCR vs. qPCR

    • PCR (Polymerase Chain Reaction) amplifies DNA, while qPCR (Quantitative Polymerase Chain Reaction) also measures the amount of DNA as it is amplified.

    Sample Space

    • Sample space is the set of all possible outcomes of an experiment or random phenomenon.

    Association Strength

    • A statistic that quantifies the strength of the association between two events is the correlation coefficient.

    Independence Rule

    • Independence follows the rule: P(EF) = P(E)P(F), where P(EF) is the probability of both events E and F occurring, P(E) is the probability of event E occurring, and P(F) is the probability of event F occurring.

    Directed vs. Undirected Graph

    • A directed graph has edges with a specific direction, while an undirected graph has edges without a specific direction.

    Node Degree

    • A node's degree tells you the number of connections it has to other nodes in a network.

    Classification

    • Classification in machine learning is the process of assigning data points to predefined categories based on their characteristics.

    Supervised vs. Unsupervised Learning

    • Supervised learning uses labeled data to train models, while unsupervised learning uses unlabeled data to discover patterns.

    Feature Selection Timing

    • Three different times in which you can select features:
      • Pre-processing: Features are selected before training the model.
      • During training: Features are selected during the model training process.
      • Post-processing: Features are selected after the model has been trained.

    Feature Selection Situations

    • Features are selected during specific situations:
      • High dimensionality: When there are many features and only a few are relevant.
      • Overfitting: When the model is too complex and performs well on the training data but poorly on new data.
      • Computational efficiency: When reducing the number of features can speed up the training process.

    Feature Selection Phase

    • A phase in which features are selected is the feature engineering phase. This phase involves selecting, transforming, and creating new features that improve the performance of machine learning models.

    Feature Selection Timing Aspect

    • One key aspect of the timing in feature selection is that it can influence the model's performance. If features are selected before training, the model might miss out on valuable information. If features are selected after training, the model might not perform well on new data.

    Feature Selection Scenario

    • A scenario that might not involve feature selection timing is when the data is already clean and relevant, with a few features that are well-defined and contribute directly to the model's performance.

    Feature Selection Reassessment

    • It is ideal to reassess feature selection when the data distribution changes significantly, when new data becomes available, or when there are changes in the problem that require a different set of features.

    Description

    This quiz explores concepts in probability models and their applications in biological data analysis. The content is based on the interdisciplinary course material, drawing from topics covered in lectures and recommended readings. Test your understanding of the key principles and methodologies discussed in the course.
