Questions and Answers
Which of the following is a primary use of mixture models?
- To simplify complex data into single distributions
- To ensure all statistical samples come from one population
- To eliminate the need for data normalization
- To detect heterogeneity due to hidden variables (correct)
Constructing a mixture of two normal distributions always results in a distribution that is also normal.
False (B)
What type of models are useful for ChIP-Seq data analysis?
Zero-inflated models
The goal of ______ is to group objects into clusters such that each cluster contains similar objects.
Match each distance metric with its description:
In the context of mixture models, what does heterogeneity refer to?
In k-means clustering, the number of clusters does not need to be specified by the user.
What is the purpose of the Expectation-Maximization (EM) algorithm in the context of mixture models?
A Gamma-Poisson mixture distribution is also known as the ______ distribution.
Why are zero-inflated models used?
Clustering can be interpreted as finding a latent variable that determines the group or class to which an observation belongs.
What is the significance of CpG islands in genomics?
In a Poisson distribution, the expectation and the ______ are equal.
What information does ChIP-Seq data provide?
Euclidean distance is calculated by summing the absolute differences between coordinates.
What does the term 'finite' refer to in the context of mixture models?
K-means clustering strives to minimize the ______ sum of squares.
Name the libraries that may be needed to run the code for mixture models.
What assumption does k-means clustering make about the distance metric used?
The generative model for CpG island sequence data is called a Poisson distribution.
Which distance metric is particularly useful when analyzing co-occurrence of traits in ecological data?
The ______ distance is the sum of absolute differences of coordinates.
Why is the sample mean and variance expected to be close in a Poisson distribution?
The outcome of k-means clustering is independent of how the clusters are initialized.
What distinguishes a zero-inflated model from a standard statistical model?
In Hamming distance, we count only the number of unequal ______.
In the context of clustering, define what is meant by '(dis)similarity'.
When comparing mutation patterns in HIV, co-existence is more important than co-absence.
Suppose we are analyzing genomic data and we want to see where cytosine is followed by guanine. Which of the following options are we most likely analyzing?
In a two-component equally likely mixture model, if one component is a normal distribution with a mean of 120 and the other has a mean of 160, the ______-Maximization algorithm will estimate the means of each component to be 120 and 160.
What is the formula for calculating Euclidean distance?
Hamming distance calculates dissimilarity based on the number of matching feature co-occurrences.
Which of the following is a reason to use mixture models?
For k-means clustering one of the disadvantages is that the users need to specify the number of ______.
Explain Jaccard Dissimilarity.
The generative model used for CpG island sequence data is called the Kernel method.
What is the downside of k-means clustering?
The absolute difference |x - y| is also known as the _______________ distance metric.
In the simple example where we flip a fair coin, what does heads mean?
The negative binomial distribution has 3 parameters.
What does $p$ mean in negative binomial distribution?
Flashcards
What are CpG sites?
Regions of DNA where a cytosine nucleotide (C) is followed by a guanine nucleotide (G).
What are CpG islands?
Genome regions where CpG sites occur at a high frequency.
What is a mixture model?
A statistical model that combines multiple probability distributions.
Why use mixture models?
What samples call for mixture models?
What happens when we flip heads in our mixture?
What is the EM Algorithm?
What is a Poisson distribution used for?
What is lambda in Poisson?
Gamma-Poisson intuition?
Gamma-Poisson's nickname?
What are zero inflated models?
What is ChIP-Seq data?
What is clustering?
What is the goal of clustering?
What is a distance measure?
What is Euclidean distance?
What is Hamming distance?
Where does Jaccard distance apply?
Clustering algorithms?
What is K-means clustering?
Study Notes
- Lecture 8 covers mixture models and clustering
- Emergency financial aid information can be found on the university website
Libraries Needed
- The code chunks require 3 libraries
- mixtools
- philentropy
- cowplot
Goals of the Chapter
- Understand mixture models
- Construct mixtures of two normal distributions and gamma-Poisson distributions
- Use zero-inflated models for data like ChIP-Seq data
- Understand clustering
- Learn measures of (dis)similarity and distance
- Perform k-means clustering in R
CpG Islands: Example
- CpG sites are DNA regions where cytosine is followed by guanine
- The direction needs to be 5' → 3'
- CpG islands are genome regions with a high frequency of CpG sites
- Question: given the sequence of an unknown genomic region, how do we decide whether it is a CpG island?
- The method involves collecting DNA sequence data from known CpG islands and non-islands
- A generative probabilistic model is fit separately to the CpG island sequences and to the non-island sequences
- The fitted models give the probabilities P(s | CpG island) and P(s | non-island) for an arbitrary DNA sequence s
- A statistic called score is defined for a sequence s
- Score(s) = log[P(s | CpG island) / P(s | non-island)]
- Sequences from CpG islands tend to have large scores
- The generative model used for sequence data is a Markov chain
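The score above can be sketched in a few lines of Python (the lecture's own code is in R). This is only an illustration: it assumes first-order Markov chains, the transition probabilities are made-up toy values rather than fitted ones, and the helper names `make_transitions` and `score` are hypothetical. The probability of the first base is assumed equal under both models, so it cancels in the ratio.

```python
import math

BASES = "ACGT"

def make_transitions(c_row):
    """Toy transition table: every row is uniform except the row for 'C'."""
    table = {prev: {nxt: 0.25 for nxt in BASES} for prev in BASES}
    table["C"] = dict(zip(BASES, c_row))
    return table

# Assumption: toy values only. In islands, C is followed by G more often.
ISLAND = make_transitions([0.2, 0.3, 0.3, 0.2])
NON_ISLAND = make_transitions([0.3, 0.3, 0.1, 0.3])

def score(seq):
    """Log-likelihood ratio log[P(s | island) / P(s | non-island)]."""
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        total += math.log(ISLAND[prev][nxt] / NON_ISLAND[prev][nxt])
    return total

print(score("CGCGCG"))  # positive: many CpG sites
print(score("CACACA"))  # negative: C is never followed by G
```

Under these toy tables, a sequence rich in CG dinucleotides gets a positive score and one that avoids them gets a negative score, which matches the statement that island sequences tend to have large scores.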
Mixture Models
- The histogram of scores will show the distributions of CpG islands vs. non-islands
- Mixture models detect heterogeneity due to hidden variables
- Statistical samples often do not come from a single population
- Two types of mixture models exist
- Finite
- Infinite
Simple Example
- To sample from a mixture with two equally likely components, decompose the process into two steps
- Flip a fair coin
- If the coin lands on heads, generate a random number from a normal distribution
- Mean of 120 and Standard Deviation of 10
- If the coin lands on tails, generate a random number from a normal distribution
- Mean of 160 and Standard Deviation of 10
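The two-step procedure can be written as a short Python sketch (the lecture itself uses R). The function name `sample_mixture` and the fixed seed are assumptions made for reproducibility.

```python
import random

def sample_mixture(n, seed=0):
    """Draw n values from an equal mixture of N(120, 10) and N(160, 10)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        heads = rng.random() < 0.5  # step 1: flip a fair coin
        if heads:
            values.append(rng.gauss(120, 10))  # step 2a: heads component
        else:
            values.append(rng.gauss(160, 10))  # step 2b: tails component
    return values

values = sample_mixture(10000)
print(sum(values) / len(values))  # should be near (120 + 160) / 2
```

A histogram of `values` would show two bumps, one around each component mean, which is why the mixture is not itself a normal distribution.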
Expectation-Maximization (EM) Algorithm
- The EM algorithm is used to infer hidden groups of data
- In the lecture's code, the EM algorithm is run with a library called mixtools
- EM is an iterative method
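To make the iteration concrete, here is a minimal pure-Python sketch of EM for a mixture of two normal distributions. It is not the mixtools implementation. The function names, the starting values (100 and 180), the number of iterations, and the simulated data are all assumptions chosen for illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def em_two_normals(xs, mu1, mu2, sigma1=10.0, sigma2=10.0, weight1=0.5, n_iter=50):
    """EM for a two-component normal mixture, starting from the given guesses."""
    n = len(xs)
    for _ in range(n_iter):
        # E-step: probability that each point belongs to component 1
        resp = []
        for x in xs:
            p1 = weight1 * normal_pdf(x, mu1, sigma1)
            p2 = (1 - weight1) * normal_pdf(x, mu2, sigma2)
            resp.append(p1 / (p1 + p2))
        # M-step: re-estimate parameters using those probabilities as weights
        n1 = sum(resp)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, xs)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, xs)) / n2
        sigma1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / n1)
        sigma2 = math.sqrt(sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, xs)) / n2)
        weight1 = n1 / n
    return mu1, mu2, sigma1, sigma2, weight1

# Simulated data: equal mixture of N(120, 10) and N(160, 10)
rng = random.Random(42)
xs = [rng.gauss(120, 10) if rng.random() < 0.5 else rng.gauss(160, 10)
      for _ in range(2000)]

mu1, mu2, sigma1, sigma2, weight1 = em_two_normals(xs, mu1=100.0, mu2=180.0)
print(mu1, mu2, weight1)
```

The group labels used to simulate the data are never passed to `em_two_normals`; the estimated means should nevertheless land close to 120 and 160.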
Gamma-Poisson Mixture
- Poisson distribution is used to model count data
- It has only one parameter λ
- Expectation of the Poisson distribution is λ
- Variance of the Poisson distribution is λ
- Samples from a Poisson distribution have a sample mean and a sample variance that are close to each other
- The Gamma-Poisson mixture distribution is also known as the negative binomial distribution, used to model count data
- A negative binomial distribution has two parameters
- r
- p
- It models the number of failures in a sequence of independent trials before a specified number of successes
- p represents the success probability of each trial, similar to the binomial distribution
- r is the number of successes at which the experiment is stopped
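The contrast between a plain Poisson sample and a Gamma-Poisson mixture can be checked by simulation. The Python sketch below is an illustration under assumed parameter values (rate 10 for the Poisson; gamma shape 2 and scale 5, so the mixture also has mean 10). The helper `sample_poisson` is a hypothetical name for a simple textbook sampler, since the Python standard library has no Poisson generator.

```python
import math
import random
import statistics

def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for moderate values of lam."""
    threshold = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

rng = random.Random(1)
n = 5000

# Plain Poisson with a fixed rate
poisson_counts = [sample_poisson(rng, 10.0) for _ in range(n)]

# Gamma-Poisson: draw the rate from a gamma distribution, then a Poisson count
mixture_counts = [sample_poisson(rng, rng.gammavariate(2.0, 5.0)) for _ in range(n)]

print(statistics.mean(poisson_counts), statistics.variance(poisson_counts))
print(statistics.mean(mixture_counts), statistics.variance(mixture_counts))
```

The first pair of numbers should be close to each other, while in the second pair the variance is clearly larger than the mean.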
Zero-Inflated Models
- In real data analysis, missing values are often encountered
- Example, in a clinical trial with Alzheimer's patients, some refuse to disclose personal biometric information
- Zero-inflated models are mixture models with a distribution mixed with zeroes
- Examples of Zero-inflated Models are
- zero-inflated Poisson
- hurdle models
- zero-inflated negative binomial
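As an illustration of "a distribution mixed with zeroes", the following Python sketch draws from a zero-inflated Poisson. The extra-zero probability (0.3), the Poisson rate (4), and the helper names are assumptions chosen only for the example.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for moderate values of lam."""
    threshold = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def sample_zero_inflated_poisson(rng, zero_prob, lam):
    """With probability zero_prob emit a structural zero, else a Poisson count."""
    if rng.random() < zero_prob:
        return 0
    return sample_poisson(rng, lam)

rng = random.Random(7)
counts = [sample_zero_inflated_poisson(rng, 0.3, 4.0) for _ in range(5000)]
zero_fraction = counts.count(0) / len(counts)

print(zero_fraction)   # fraction of zeros in the zero-inflated sample
print(math.exp(-4.0))  # fraction of zeros expected from Poisson(4) alone
```

The sample contains far more zeros than a Poisson distribution with the same rate would produce on its own.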
ChIP-Seq Data
- ChIP-Seq data consists of DNA sequences from chromatin immunoprecipitation (ChIP)
- The technology maps locations along genomic DNA of transcription factors, nucleosomes, histone modifications, etc.
- The example data were measured on chromosome 22 from ChIP-Seq of antibodies for the STAT1 protein and the H3K4me3 histone modification
Clustering
- Clustering is another way to interpret the mixture model:
- Observe a variable X
- Assume a hidden/latent categorical variable g determines the cluster / group / class to which X belongs
- Clustering amounts to finding this latent variable g
- Clustering groups objects into clusters such that each cluster has similar objects
Clustering Measures
- Clustering involves computing a distance matrix
- Euclidean distance, i.e., the square root of the sum of squared coordinate differences
- Mahalanobis distance (unequal weight per direction)
- Weighted Euclidean distance, χ², ...
- Manhattan/Hamming distance
- Measurements of co-occurrence in ecological/sociological data
- Jaccard dissimilarity
One Dimensional Case
- Distance between two numbers x and y is given by |x − y|
- In one dimension, the absolute difference |x - y| is the Euclidean distance between x and y
- We use this distance to measure the dissimilarity of our objects and then group "similar" objects into a cluster
Higher Dimensions
- If the data points have higher dimensions, the absolute difference is replaced by a distance between vectors, such as the Euclidean distance
- Example:
- Clinical trial with numeric measurements of age, body mass index, blood sugar
- Each data point is a vector with length equal to 3
- Association study uses DNA microarray to find mutations at genetic loci
- Each data point is a binary vector of given length
- Before clustering, the distances between the individuals need to be computed
Euclidean Distance
- The Euclidean distance between A = (a1, ..., ap) and B = (b1, ..., bp) in a p-dimensional space is
- d(A, B) = √[(a1 - b1)² + (a2 - b2)² + ... + (ap - bp)²]
- When p = 1, we have d(A, B) = |a1 - b1|
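A direct Python translation of the formula, given as a small sketch (the function name `euclidean` is an assumption):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(euclidean((2,), (5,)))      # 3.0, i.e. |2 - 5| when p = 1
```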
Hamming Distance
- Consider two binary vectors of length equal to p
- For A = (a1, a2, ..., ap) and B = (b1, b2, ..., bp), the Hamming distance is defined as
- d(A, B) = |a1 - b1| + ... + |ap - bp|
- d(A, B) counts the number of coordinates where A and B are different
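For binary vectors the sum of absolute differences is simply a count of mismatches, as this short Python sketch shows (the function name `hamming` is an assumption):

```python
def hamming(a, b):
    """Number of coordinates at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))  # 2: positions 2 and 3 differ
```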
Jaccard Distance
- The occurrence of traits / features is translated into presence/absence, encoded as 1's and 0's
- Co-occurrence is often more informative than co-absence
- When comparing mutation patterns in HIV, co-existence of a mutation in two different strains is a more important observation than its co-absence
- The Jaccard index is used for this reason
- J(S, T) = f11 / (f01 + f10 + f11), where f11 is the number of features present in both S and T, and f10 and f01 are the numbers of features present in only one of them
- Jaccard dissimilarity: dJ(S, T) = 1 - J(S, T) = (f01 + f10) / (f01 + f10 + f11)
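The counts in the formula can be computed directly from two presence/absence vectors. The Python sketch below is illustrative; the function name and the choice to raise an error when no feature is present in either vector (where the index is undefined) are assumptions.

```python
def jaccard_dissimilarity(s, t):
    """1 - f11 / (f01 + f10 + f11) for two equal-length 0/1 vectors."""
    if len(s) != len(t):
        raise ValueError("vectors must have the same length")
    f11 = sum(1 for a, b in zip(s, t) if a == 1 and b == 1)  # present in both
    f10 = sum(1 for a, b in zip(s, t) if a == 1 and b == 0)  # only in s
    f01 = sum(1 for a, b in zip(s, t) if a == 0 and b == 1)  # only in t
    denom = f01 + f10 + f11
    if denom == 0:
        raise ValueError("undefined when no feature is present in either vector")
    return (f01 + f10) / denom

s = (1, 1, 0, 0, 1)
t = (1, 0, 1, 0, 1)
print(jaccard_dissimilarity(s, t))  # f11 = 2, f10 = 1, f01 = 1, so 2 / 4 = 0.5
```

Note that the fourth position, where both vectors are 0, does not enter the calculation: co-absence is ignored.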
Clustering Algorithms
- Use:
- Hierarchical clustering
- K-means clustering (or centroid-based clustering)
- Hierarchical clustering is based on the distance matrix
- K-means clustering always assumes that Euclidean distance is used
K-Means Clustering
- K-means clustering minimizes the within-cluster sum of squares
- Advantages of k-means clustering include
- Easy to understand
- Computationally fast
- Widely applicable
- Disadvantages of k-means clustering include
- Users need to specify the number of clusters
- Output depends on the initialization
One Dimensional Clustering
- One can check that the clustering is optimal, meaning that the total within-cluster sum of squares is minimized
- K-means groups the data points into two clusters
- The result is consistent: 1, 2, 3 are in one cluster and 11, 12, 13 are in the other
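The optimality claim can be verified by brute force. The Python sketch below assumes the six data points are 1, 2, 3, 11, 12, 13 (as the stated result suggests), enumerates every split into two non-empty clusters, and keeps the one with the smallest total within-cluster sum of squares. The helper name `within_ss` is an assumption.

```python
from itertools import product

def within_ss(cluster):
    """Sum of squared deviations from the cluster mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)

data = [1, 2, 3, 11, 12, 13]  # assumed example data

best_total, best_split = None, None
for labels in product([0, 1], repeat=len(data)):
    first = [x for x, lab in zip(data, labels) if lab == 0]
    second = [x for x, lab in zip(data, labels) if lab == 1]
    if not first or not second:
        continue  # skip splits that leave one cluster empty
    total = within_ss(first) + within_ss(second)
    if best_total is None or total < best_total:
        best_total, best_split = total, (first, second)

print(best_split, best_total)
```

Exhaustive search is only feasible for tiny data sets like this one; it is shown here as a check, not as a clustering method.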
Two Dimensional Clustering
- Each number in the plot represents a data point, which is a vector of a given length
- The value of the number is the cluster ID
- K-means is an iterative procedure that starts from an initial guess for the cluster centers and then iteratively updates that guess
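The iterative procedure can be sketched in plain Python as follows. This is a minimal version of the standard assign-then-update loop, not the R function used in the lecture. The function names, the toy 2-D points, and the initial centers are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def kmeans(points, centers, n_iter=100):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point
        labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: each center becomes the mean of its members
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep a center with no members
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers

points = [(1, 1), (1.5, 2), (2, 1), (8, 8), (9, 8.5), (8.5, 9)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centers)
```

The number of clusters is fixed by the number of initial centers passed in, and a different starting guess can lead to a different result, which reflects the two disadvantages listed for k-means.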