Mixture Models, Clustering and CpG Islands

Questions and Answers

Which of the following is a primary use of mixture models?

  • To simplify complex data into single distributions
  • To ensure all statistical samples come from one population
  • To eliminate the need for data normalization
  • To detect heterogeneity due to hidden variables (correct)

Constructing a mixture of two normal distributions always results in a distribution that is also normal.

False (B)

What type of models are useful for ChIP-Seq data analysis?

Zero-inflated models

The goal of ______ is to group objects into clusters such that each cluster contains similar objects.

clustering

Match each distance metric with its description:

  • Euclidean distance = The straight-line distance between two points.
  • Manhattan distance = The sum of the absolute differences of their coordinates.
  • Hamming distance = Number of positions where the bit vectors differ.
  • Jaccard distance = Measure of dissimilarity based on the ratio of shared to total features.

In the context of mixture models, what does heterogeneity refer to?

The variability in data due to underlying, unobserved factors (A)

In k-means clustering, the number of clusters does not need to be specified by the user.

False (B)

What is the purpose of the Expectation-Maximization (EM) algorithm in the context of mixture models?

To infer the hidden groups of data

A Gamma-Poisson mixture distribution is also known as the ______ distribution.

negative binomial

Why are zero-inflated models used?

To address the issue of having more zeros than expected in a dataset (D)

Clustering can be interpreted as finding a latent variable that determines the group or class to which an observation belongs.

True (A)

What is the significance of CpG islands in genomics?

Genome regions where CpG sites occur at a high frequency

In a Poisson distribution, the expectation and the ______ are equal.

variance

What information does ChIP-Seq data provide?

The locations along genomic DNA where certain proteins interact (B)

Euclidean distance is calculated by summing the absolute differences between coordinates.

False (B)

What does the term 'finite' refer to in the context of mixture models?

the number of components (A)

K-means clustering strives to minimize the ______ sum of squares.

within-cluster

Name the libraries that may be needed to run the code for mixture models.

mixtools, philentropy, cowplot

What assumption does k-means clustering make about the distance metric used?

It always assumes that the Euclidean distance is used. (D)

The generative model for CpG island sequence data is called a Poisson distribution.

False (B)

Which distance metric is particularly useful when analyzing co-occurrence of traits in ecological data?

Jaccard distance (B)

The ______ distance is the sum of absolute differences of coordinates.

Manhattan

Why are the sample mean and sample variance expected to be close for data from a Poisson distribution?

The Poisson distribution has only one parameter λ, which is both its expectation and its variance

The outcome of k-means clustering is independent of how the clusters are initialized.

False (B)

What distinguishes a zero-inflated model from a standard statistical model?

It accounts for an excess of zero values beyond what's typically expected. (C)

In Hamming distance, we count only the number of unequal ______.

coordinates

In the context of clustering, define what is meant by '(dis)similarity'.

How alike or unalike different data objects are

When comparing mutation patterns in HIV, co-existence is more important than co-absence.

True (A)

Suppose we are analyzing genomic data and we want to see where cytosine is followed by guanine. Which of the following options are we most likely analyzing?

CpG Sites (D)

In a two-component equally likely mixture model, if one component is a normal distribution with a mean of 120 and the other has a mean of 160, the ______-Maximization algorithm will estimate the means of the two components to be approximately 120 and 160.

Expectation

What is the formula for calculating Euclidean distance?

$d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_p - b_p)^2}$

Hamming distance calculates dissimilarity based on the number of matching feature co-occurrences.

False (B)

Which of the following is a reason to use mixture models?

The statistical sample does not come from only one population. (C)

For k-means clustering one of the disadvantages is that the users need to specify the number of ______.

clusters

Explain Jaccard Dissimilarity.

The Jaccard dissimilarity measures how dissimilar two sets of features are: it is one minus the Jaccard index, the ratio of shared features to features present in at least one of the two sets.

The generative model used for CpG island sequence data is called the Kernel method.

False (B)

What is the downside of k-means clustering?

Users need to specify the number of clusters (B)

The absolute difference |x - y| is also known as the _______________ distance metric.

Euclidean

In the simple example where we flip a fair coin, what does heads mean?

The value is generated from a normal distribution with mean 120 and SD 10

The negative binomial distribution has 3 parameters.

False (B)

What does $p$ mean in negative binomial distribution?

the success probability of each trial (B)

Flashcards

What are CpG sites?

Regions of DNA where a cytosine nucleotide (C) is followed by a guanine nucleotide (G).

What are CpG islands?

Genome regions where CpG sites occur at a high frequency.

What is a mixture model?

A statistical model that combines multiple probability distributions.

Why use mixture models?

Used to detect heterogeneity due to hidden variables.

What samples call for mixture models?

Statistical samples that do not come from a single population.

What happens when we flip heads in our mixture?

A number is generated from a normal distribution with mean 120 and standard deviation 10.

What is the EM Algorithm?

An algorithm used to infer hidden groups in mixture distributions.

What is a Poisson distribution used for?

A distribution used to model count data where the mean and variance are equal.

What is lambda in Poisson?

The single parameter λ of a Poisson distribution, equal to both its expectation and its variance.

Gamma-Poisson intuition?

Sampling random numbers from Poisson distributions whose λ varies from draw to draw, with λ following a gamma distribution.

Gamma-Poisson's nickname?

The negative binomial distribution, used to model count data.

What are zero inflated models?

Mixture models that combine a distribution with extra zeroes, used when the data contain more zeros than expected, for example because of missing values.

What is ChIP-Seq data?

Sequences of DNA obtained from chromatin immunoprecipitation.

What is clustering?

Grouping objects into clusters with similar objects in each one.

What is the goal of clustering?

Finding a latent variable that determines which cluster an observation belongs to.

What is a distance measure?

A way to measure how similar or dissimilar two objects are.

What is Euclidean distance?

distance = sqrt(sum of squared differences between coordinates)

What is Hamming distance?

Counts number of coordinates where two vectors differ.

Where does Jaccard distance apply?

Traits or features in ecological or mutation data.

Clustering algorithms?

The most common clustering algorithms covered here are hierarchical clustering and k-means clustering.

What is K-means clustering?

Divides samples into clusters, minimizing within-cluster sum of squares.

Study Notes

  • Lecture 8 covers mixture models and clustering

Libraries Needed

  • The code chunks require three R libraries: mixtools, philentropy, and cowplot

Goals of the Chapter

  • Understand mixture models
  • Construct mixtures of two normal distributions and gamma-Poisson distributions
  • Use zero-inflated models for data like ChIP-Seq data
  • Understand clustering
  • Learn measures of (dis)similarity and distance
  • Perform k-means clustering in R

CpG Islands: Example

  • CpG sites are DNA regions where cytosine is followed by guanine
  • The direction needs to be 5' → 3'
  • CpG islands are genome regions with a high frequency of CpG sites
  • Question: given the sequence of an unknown genomic region, how can we decide whether it is a CpG island?
  • Approach: collect DNA sequence data from known CpG islands and known non-islands
  • Fit a generative probabilistic model to each of the two classes
  • The fitted models give the probabilities P(s | CpG island) and P(s | non-island) for an arbitrary DNA sequence s
  • The score of a sequence s is the log-likelihood ratio: Score(s) = log[P(s | CpG island) / P(s | non-island)]
  • Sequences from CpG islands tend to have large scores
  • The generative model used for sequence data is a Markov chain
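
The scoring steps above can be sketched in Python. This is a minimal illustration, not the lecture's code: it assumes a first-order Markov chain, the toy training sequences are hypothetical, the pseudocount of 1 is an assumption to avoid zero probabilities, and the probability of the first base is ignored for simplicity.

```python
import math

ALPHABET = "ACGT"

def fit_markov_chain(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }

def log_likelihood(seq, transitions):
    """Log-probability of the transitions in seq (first base ignored)."""
    return sum(math.log(transitions[cur][nxt]) for cur, nxt in zip(seq, seq[1:]))

def score(seq, island_model, background_model):
    """Score(s) = log[P(s | CpG island) / P(s | non-island)]."""
    return log_likelihood(seq, island_model) - log_likelihood(seq, background_model)

# Hypothetical toy training data; real models need many known regions.
island_train = ["CGCGCGGCGCCGCG", "GCGCGCGCCGGCGC"]
background_train = ["ATTATAATGCATTA", "TTAATACATTGAAT"]

island_model = fit_markov_chain(island_train)
background_model = fit_markov_chain(background_train)

cg_rich_score = score("CGCGCG", island_model, background_model)
at_rich_score = score("ATATTA", island_model, background_model)
print(cg_rich_score, at_rich_score)
```

With these toy models the CG-rich sequence gets a positive score and the AT-rich one a negative score, matching the note that sequences from CpG islands tend to have large scores.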

Mixture Models

  • The histogram of scores will show the distributions of CpG islands vs. non-islands
  • Mixture models detect heterogeneity due to hidden variables
  • They are needed when a statistical sample does not come from only one population
  • Two types of mixture models exist: finite (a finite number of components) and infinite

Simple Example

  • To draw from a mixture of two equally likely components, decompose the process into two steps
  • Step 1: flip a fair coin
  • Step 2: if the coin lands on heads, generate a random number from a normal distribution with mean 120 and standard deviation 10; if it lands on tails, generate a random number from a normal distribution with mean 160 and standard deviation 10
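
The coin-flip procedure can be written out as a short Python sketch. The function name `sample_mixture` and the fixed seed are illustrative choices, not part of the lecture.

```python
import random
import statistics

def sample_mixture(n, seed=0):
    """Draw n values from the two-component mixture described above."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        heads = rng.random() < 0.5           # flip a fair coin
        mean = 120 if heads else 160         # heads: N(120, 10), tails: N(160, 10)
        values.append(rng.gauss(mean, 10))
    return values

draws = sample_mixture(10_000)
overall_mean = statistics.mean(draws)
overall_sd = statistics.stdev(draws)
print(overall_mean, overall_sd)
```

The overall mean sits near 140, halfway between the two components, and the overall standard deviation is well above 10 because the sample mixes two populations. A histogram of `draws` would show two bumps rather than one bell curve.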

Expectation-Maximization (EM) Algorithm

  • The EM algorithm is used to infer hidden groups of data
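  • The lecture's code runs EM through the R library mixtools
  • EM is an iterative method: it alternates between estimating each observation's group membership probabilities (expectation step) and re-estimating the component parameters (maximization step)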
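
To make the iteration concrete, here is a minimal pure-Python sketch of EM for a two-component normal mixture. It is not the mixtools implementation; the starting values, the fixed number of iterations, and the simulated data are all assumptions made for illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def em_two_normals(data, mu=(100.0, 180.0), sigma=(20.0, 20.0), weight=0.5, n_iter=100):
    """Fit a two-component normal mixture by EM from rough starting values."""
    mu1, mu2 = mu
    s1, s2 = sigma
    w = weight                                # mixing weight of component 1
    n = len(data)
    for _ in range(n_iter):
        # E-step: probability that each point belongs to component 1
        resp = []
        for x in data:
            p1 = w * normal_pdf(x, mu1, s1)
            p2 = (1 - w) * normal_pdf(x, mu2, s2)
            resp.append(p1 / (p1 + p2))
        # M-step: weighted re-estimates of the means, SDs and mixing weight
        n1 = sum(resp)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / n2
        s1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1)
        s2 = math.sqrt(sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2)
        w = n1 / n
    return {"means": (mu1, mu2), "sds": (s1, s2), "weight": w}

# Simulated data from the coin-flip mixture: N(120, 10) or N(160, 10)
rng = random.Random(1)
data = [rng.gauss(120 if rng.random() < 0.5 else 160, 10) for _ in range(1000)]

fit = em_two_normals(data)
print(fit)
```

The group labels are never given to the algorithm, yet the estimated means land close to 120 and 160. The estimates are approximate because they come from a finite sample.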

Gamma-Poisson Mixture

  • Poisson distribution is used to model count data
  • It has only one parameter λ
  • Expectation of the Poisson distribution is λ
  • Variance of the Poisson distribution is λ
  • As a result, the sample mean and sample variance of data from a Poisson distribution should be close to each other
  • The Gamma-Poisson mixture arises when λ itself varies, following a gamma distribution; it is also known as the negative binomial distribution and is used to model count data
  • A negative binomial distribution has two parameters, r and p
  • It models the number of failures in a sequence of independent trials before r successes have occurred
  • p is the success probability of each trial, as in the binomial distribution
  • r is the number of successes at which the experiment is stopped
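
A short simulation can illustrate the difference between a plain Poisson sample and a Gamma-Poisson sample. This is a sketch under stated assumptions: the values λ = 10, r = 2 and p = 1/6 are arbitrary, the helper `poisson_draw` is a hand-written sampler (Knuth's multiplication method) because Python's standard library has no Poisson generator, and the gamma distribution is parameterized with shape r and scale (1 − p)/p so that the mixture matches the negative binomial with parameters r and p.

```python
import math
import random
import statistics

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value; suitable for small to moderate lam."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(0)
n = 20_000

# Plain Poisson: lambda is fixed, so mean and variance should be close.
plain = [poisson_draw(rng, 10.0) for _ in range(n)]
plain_mean = statistics.mean(plain)
plain_var = statistics.variance(plain)

# Gamma-Poisson: lambda is redrawn from a gamma distribution for each count.
r, p = 2, 1 / 6
mixed = [poisson_draw(rng, rng.gammavariate(r, (1 - p) / p)) for _ in range(n)]
mixed_mean = statistics.mean(mixed)
mixed_var = statistics.variance(mixed)

print(plain_mean, plain_var)
print(mixed_mean, mixed_var)
```

Both samples have a mean near 10, but only the plain Poisson sample has a variance near 10. The extra variability in λ makes the Gamma-Poisson counts much more spread out, which is why the negative binomial is used for overdispersed count data.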

Zero-Inflated Models

  • In real data analysis, missing values are often encountered
  • Example: in a clinical trial with Alzheimer's patients, some participants refuse to disclose personal biometric information
  • Zero-inflated models are mixture models that combine a distribution with extra zeroes, so they can account for more zeros than the distribution alone would produce
  • Examples of zero-inflated models: zero-inflated Poisson, hurdle models, and zero-inflated negative binomial
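
As an illustration, a zero-inflated Poisson sample can be simulated with the same two-step idea used for other mixtures. This is a sketch: the values λ = 3 and an extra-zero probability of 0.3 are arbitrary assumptions, and `poisson_draw` is a hand-written helper rather than a library function.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value; suitable for small to moderate lam."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def zero_inflated_poisson(rng, lam, extra_zero_prob):
    """With probability extra_zero_prob return 0, otherwise a Poisson count."""
    if rng.random() < extra_zero_prob:
        return 0
    return poisson_draw(rng, lam)

rng = random.Random(0)
lam, extra_zero_prob = 3.0, 0.3
sample = [zero_inflated_poisson(rng, lam, extra_zero_prob) for _ in range(20_000)]

observed_zero_fraction = sample.count(0) / len(sample)
poisson_zero_fraction = math.exp(-lam)   # zeros expected from Poisson alone
mixture_zero_fraction = extra_zero_prob + (1 - extra_zero_prob) * poisson_zero_fraction
print(observed_zero_fraction, poisson_zero_fraction, mixture_zero_fraction)
```

A Poisson distribution with λ = 3 produces zeros about 5% of the time, while the simulated sample contains roughly a third zeros. That excess is what a zero-inflated model is designed to capture.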

ChIP-Seq Data

  • ChIP-Seq data consists of DNA sequences from chromatin immunoprecipitation (ChIP)
  • The technology maps locations along genomic DNA of transcription factors, nucleosomes, histone modifications, etc.
  • The example data were measured on chromosome 22 from ChIP-Seq of antibodies for the STAT1 protein and the H3K4me3 histone modification

Clustering

  • Clustering gives another way to interpret the mixture model:
  • Observe a variable X
  • Assume a hidden/latent categorical variable g determines the cluster / group / class to which X belongs
  • Clustering amounts to finding this latent variable g
  • Clustering groups objects into clusters such that each cluster has similar objects

Clustering Measures

  • Clustering relies on a distance matrix that records the distance between every pair of objects
  • Euclidean distance, i.e., the square root of the sum of squared coordinate differences
  • Mahalanobis distance (unequal weight per direction)
  • Weighted Euclidean distance, e.g., the χ² distance
  • Manhattan/Hamming distance
  • Jaccard dissimilarity, a measure of co-occurrence used for ecological/sociological data

One Dimensional Case

  • Distance between two numbers x and y is given by |x − y|
  • In one dimension, the absolute difference |x − y| is the same as the Euclidean distance
  • The distance measures the dissimilarity of the objects, and "similar" objects are then grouped into a cluster

Higher Dimensions

  • If the data points have more than one coordinate, the absolute difference is replaced by a distance between vectors, such as the Euclidean distance
  • Example: in a clinical trial with numeric measurements of age, body mass index, and blood sugar, each data point is a vector of length 3
  • Example: in an association study that uses DNA microarrays to find mutations at genetic loci, each data point is a binary vector of a given length
  • Before clustering, the distances between the individuals need to be computed

Euclidean Distance

  • The Euclidean distance between $A = (a_1, \dots, a_p)$ and $B = (b_1, \dots, b_p)$ in a p-dimensional space is
  • $d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_p - b_p)^2}$
  • When p = 1, we have $d(A, B) = |a_1 - b_1|$

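
The formula translates directly into a few lines of Python. The function name and the example points are illustrative.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("both points must have the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

two_dim = euclidean_distance((0, 0), (3, 4))   # 3-4-5 right triangle
one_dim = euclidean_distance((2,), (7,))       # reduces to |2 - 7|
print(two_dim, one_dim)
```

Both calls return 5.0, which also shows the p = 1 case reducing to the absolute difference. From Python 3.8 onward, the standard library's `math.dist` computes the same quantity.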
Hamming Distance

  • Consider two binary vectors of length p, $A = (a_1, a_2, \dots, a_p)$ and $B = (b_1, b_2, \dots, b_p)$
  • The Hamming distance between them is $d(A, B) = |a_1 - b_1| + \dots + |a_p - b_p|$
  • Because every coordinate is 0 or 1, d(A, B) counts the number of coordinates where A and B are different
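
A minimal Python sketch of the Hamming distance for binary vectors follows. The function name and the example vectors are illustrative.

```python
def hamming_distance(a, b):
    """Sum of |a_i - b_i|, i.e. the number of differing coordinates for 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("both vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))

first = [1, 0, 1, 1, 0]
second = [1, 1, 0, 1, 0]
distance = hamming_distance(first, second)
print(distance)
```

The two vectors differ in the second and third positions, so the distance is 2.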

Jaccard Distance

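  • The occurrence of traits or features is encoded as presence/absence, using 1's and 0's
  • Co-occurrence is more informative than co-absence
  • When comparing mutation patterns in HIV, the co-existence of a mutation in two different strains is a more important observation than its co-absence
  • The Jaccard index is used for this reason
  • For two binary vectors A and B, let $f_{11}$ be the number of positions where both are 1, and $f_{10}$ and $f_{01}$ the numbers of positions where only A or only B is 1
  • $J(A, B) = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$
  • Jaccard dissimilarity: $d_J(A, B) = 1 - J(A, B) = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}}$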
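
The Jaccard dissimilarity can be sketched in Python as follows. The function name and the two presence/absence vectors are hypothetical examples, not data from the lecture.

```python
def jaccard_dissimilarity(a, b):
    """(f01 + f10) / (f01 + f10 + f11) for two 0/1 vectors of equal length."""
    f11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # present in both
    f10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # present only in a
    f01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # present only in b
    denominator = f11 + f10 + f01
    if denominator == 0:
        raise ValueError("undefined when neither vector has any feature present")
    return (f10 + f01) / denominator

strain_1 = [1, 1, 0, 0, 0, 0]
strain_2 = [1, 0, 1, 0, 0, 0]
dissimilarity = jaccard_dissimilarity(strain_1, strain_2)

# Appending positions where both strains lack the mutation changes nothing.
padded = jaccard_dissimilarity(strain_1 + [0] * 10, strain_2 + [0] * 10)
print(dissimilarity, padded)
```

Positions where both vectors are 0 never enter the formula, so co-absence is ignored. The Hamming distance, by contrast, would treat those padded positions as agreement between the two strains.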

Clustering Algorithms

  • Two common clustering algorithms are hierarchical clustering and k-means clustering (also called centroid-based clustering)
  • Hierarchical clustering is based on the distance matrix
  • K-means clustering always assumes that Euclidean distance is used

K-Means Clustering

  • K-means clustering minimizes the within-cluster sum of squares
  • Advantages of k-means clustering: easy to understand, computationally fast, widely applicable
  • Disadvantages of k-means clustering: users need to specify the number of clusters, and the output depends on the initialization
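
The points above can be seen in a minimal Python sketch of the standard iterative k-means procedure (often called Lloyd's algorithm). It is not the R routine used in the lecture; the random initialization from the data, the seed, and the example points are assumptions for illustration.

```python
import random

def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centers, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initial guess: k random data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center (Euclidean)
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:             # no change means convergence
            break
        labels = new_labels
        # Update step: move each center to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean_point(members)
    wcss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9)]
labels, centers, wcss = k_means(points, k=2)
print(labels, centers, wcss)
```

The user has to supply `k`, and the starting centers come from a random draw, which is where the dependence on initialization enters. On well-separated data like this example the result is stable, but on harder data different seeds can give different clusterings.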

One Dimensional Clustering

  • Example: cluster the numbers 1, 2, 3, 11, 12, 13 into two groups
  • K-means groups them into two clusters, and the result matches intuition: 1, 2, 3 are in one cluster and 11, 12, 13 are in the other
  • One can check that this clustering is optimal, which means the total within-cluster sum of squares is minimized
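
The optimality claim can be checked by brute force for a data set this small. The sketch below assumes the data are the six numbers named in the result (1, 2, 3, 11, 12, 13) and tries every way of splitting them into two non-empty clusters.

```python
from itertools import product

def within_cluster_ss(cluster):
    """Sum of squared deviations from the cluster mean."""
    if not cluster:
        return 0.0
    center = sum(cluster) / len(cluster)
    return sum((x - center) ** 2 for x in cluster)

data = [1, 2, 3, 11, 12, 13]

best = None
for labels in product([0, 1], repeat=len(data)):
    if len(set(labels)) < 2:                 # skip splits with an empty cluster
        continue
    total = sum(
        within_cluster_ss([x for x, lab in zip(data, labels) if lab == j])
        for j in (0, 1)
    )
    if best is None or total < best[0]:
        best = (total, labels)

best_total, best_labels = best
print(best_total, best_labels)
```

The smallest total within-cluster sum of squares is 4.0, reached by putting 1, 2, 3 in one cluster and 11, 12, 13 in the other. Exhaustive search is only feasible for tiny data sets, which is why k-means uses an iterative procedure instead.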

Two Dimensional Clustering

  • In the plot, each number marks a data point (a vector of a given length), and the value of the number is the cluster ID
  • K-means is an iterative procedure: it starts from an initial guess for the cluster centers and then repeatedly updates that guess
