Machine Learning: Clustering and PCA Concepts

Questions and Answers

What is the primary goal of supervised learning?

  • To discover similarities in data without any target attribute.
  • To find and predict patterns related to a target attribute. (correct)
  • To organize data based on a priori grouping.
  • To segment data instances into clusters.

Clustering is considered a supervised learning task.

False (B)

Name one application of clustering in marketing.

Segment customers according to their similarities.

In clustering, data instances that are similar are grouped into one ______.

cluster

Match the following applications with their clustering examples:

  • T-Shirt sizing = Grouping people according to size
  • Marketing = Segmenting customers based on similarities
  • Document organization = Producing a topic hierarchy
  • Medicine = Grouping patients with similar conditions

What is one key characteristic of unsupervised learning?

It explores data to find intrinsic structures. (D)

Clustering only applies to fields related to technology.

False (B)

Why is clustering important in the context of online documents?

To organize them according to content similarities.

What is the primary function of the first principal component in PCA?

It defines the line in p-dimensional space that is closest to the n observations. (A)

The proportions of variance explained (PVE) by each principal component sum to more than one.

False (B)

What is the recommended action if the variables in PCA are in different units?

Scale each variable to have a standard deviation equal to one.

The first principal component explains _____% of the variance in the data according to the information provided.

62.0

How much variance is explained by the next principal component after the first?

24.7% (D)

Match the following terms related to PCA with their correct descriptions:

  • First Principal Component = Defines the closest line to the n observations
  • Variance Explained = Proportion of total variance attributed to a component
  • Scaling = Ensures variables are on the same scale
  • Cumulative PVE = Total variance explained by multiple components

The last two principal components explain a high percentage of the variance in the data.

False (B)

What is the main purpose of principal component analysis (PCA)?

To reduce the dimensionality of data while retaining as much variance as possible.

What is a significant disadvantage of K-means clustering?

It requires pre-specifying the number of clusters K. (C)

Agglomerative clustering starts with all data points in one cluster.

False (B)

What is a dendrogram?

A tree-like structure that represents a nested sequence of clusters.

In hierarchical clustering, the __________ method builds the dendrogram from the bottom level.

Agglomerative

Match the following clustering types with their descriptions:

  • Agglomerative Clustering = Merges the most similar clusters from the bottom up
  • Divisive Clustering = Starts with one cluster and splits into smaller clusters
  • Dendrogram = A tree structure showing cluster relationships
  • Singleton Clusters = Clusters consisting of a single data point

In a dendrogram, cutting it at a certain height results in how many clusters?

Different clusters based on the cut height (C)

Divisive clustering results in all data points merged into one cluster.

False (B)

How does agglomerative clustering determine which clusters to merge?

It merges clusters with the least distance between them.

Which of the following best describes the K-means clustering algorithm?

It assigns data points based on user-defined clusters. (B)

In K-means clustering, the centroids are recalculated after every assignment of data points.

True (A)

What is the primary goal of clustering in a data set?

To group similar data points together based on certain characteristics.

The K-means algorithm aims to minimize the sum of squared errors (SSE) across all clusters, defined as SSE = $\sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$, where $m_j$ is the ________ of the jth cluster.

centroid

Match the aspects of clustering with their correct descriptions:

  • Partitional clustering = Divides data into non-overlapping subsets
  • Hierarchical clustering = Creates a tree of clusters
  • Distance function = Measures similarity or dissimilarity
  • Clustering quality = Determines the effectiveness of the clusters formed

Which of the following is NOT a stopping criterion for K-means clustering?

Maximum number of clusters specified by the user (C)

Clustering algorithms can be used in various fields such as medicine, sociology, and biology.

True (A)

What characteristic should be maximized and minimized for clustering quality?

Maximize inter-cluster distance and minimize intra-cluster distance.

What is the time complexity of the k-means algorithm?

O(tkn) (B)

K-means is considered a non-linear algorithm.

False (B)

What is one method to deal with outliers in k-means clustering?

Remove outliers or perform random sampling

K-means clustering is sensitive to __________.

outliers

Which of the following is a strength of k-means clustering?

It is one of the most popular clustering algorithms. (D)

The k-means algorithm always finds the global optimum regardless of the initialization.

False (B)

What technique can be used to solve the optimization problem in PCA?

Singular-value decomposition (D)

Match the following terms related to k-means with their descriptions:

  • Centroid = A point that represents the average position of all points in a cluster
  • SSE = Sum of squared errors used to measure clustering goodness
  • Outlier = A data point that is significantly different from the rest
  • Random Sampling = Selecting a small subset of data points to reduce the chance of outliers

What user input is necessary for k-means clustering to function?

The number of clusters (k)

The first principal component is uncorrelated with the second principal component.

True (A)

What are the principal component scores for the first principal component generally referred to as?

$z_{11}, z_{21}, \ldots, z_{n1}$

The principal component directions are the ordered sequence of ______ of the matrix $X^TX$.

eigenvectors

Match the following terms with their definitions:

  • Principal Component = A linear combination of variables that retains maximum variance
  • Singular-value Decomposition (SVD) = A technique for solving PCA optimization problems
  • Loading Vector = Defines the direction in feature space
  • Eigenvalue = Represents the variance associated with a principal component

In PCA, how many principal components can be at most generated?

min(n - 1, p) (C)

The second principal component can be correlated with the first principal component.

False (B)

List the three crimes featured in the USArrests data set.

Assault, Murder, Rape

Flashcards

Supervised Learning

Discovering patterns in data where attributes are related to a target attribute. These patterns predict future values of the target attribute.

Unsupervised Learning

Exploring data without a target attribute to discover intrinsic structures.

Clustering

A technique used to find similarity groups (clusters) in data by grouping similar data instances together and dissimilar instances into different clusters.

Clustering and Unsupervised Learning

Clustering is often considered synonymous with unsupervised learning because it groups data without pre-defined labels.

Clustering in Marketing

Clustering in marketing helps group customers based on similarities for targeted advertising and product recommendations.

Clustering Text Documents

Clustering helps organize a collection of text documents based on content similarity, creating a topic hierarchy.

Clustering in Various Fields

Clustering has applications in diverse fields, including medicine, psychology, botany, sociology, and more.

Clustering in Text Data

The rapid growth of online documents has made clustering important for organizing and understanding text data.

K-means clustering

A specific method for clustering data points into groups, with each group centered around a 'centroid'.

Centroid

The center of a cluster in K-means clustering, calculated as the average of all data points in that cluster.

K in K-means

The number of clusters (groups) that the K-means algorithm will create.

Seeds in K-means

The initial data points chosen to represent the centroids at the beginning of the K-means algorithm.

Data point assignment in K-means

The process of assigning each data point to the closest centroid, based on a chosen distance metric.

Centroid re-computation in K-means

The recalculation of centroids after data points have been assigned to clusters.

Convergence criterion in K-means

A condition used to stop the K-means algorithm, ensuring that the clusters have stabilized.

Hierarchical Clustering

A clustering technique where you don't need to predefine the number of clusters (K). Instead, it builds a hierarchical tree structure called a dendrogram.

Agglomerative Clustering

Hierarchical clustering where you start with individual data points and gradually merge them into larger clusters based on similarity.

Merging in Agglomerative Clustering

The process of merging the two most similar clusters in agglomerative clustering. This process continues until all data points are in one cluster.

Dendrogram

A visual representation of the hierarchical clustering process. The dendrogram shows the relationships and merges between clusters.

Complete Linkage Distance

The distance between two clusters is the maximum distance between any data point in one cluster and any data point in the other; every pair of points across the two clusters is considered.

Distance Between Clusters

A measure of the closeness between two clusters. Different distance metrics can be used to calculate similarity between clusters.

2-Dimensional Data Visualization

A common visual representation of data points in two dimensions. Each point represents an observation.

Divisive Clustering

A method used to split data into child clusters based on dissimilarity. The process starts with a single cluster and subdivides it.

K-means Time Complexity

The time complexity of k-means is O(tkn), where:

  • n is the number of data points
  • k is the number of clusters
  • t is the number of iterations

Because k and t are typically small, k-means is considered a linear algorithm.

K-means Sensitivity to Seeds

K-means is sensitive to the initial placement of cluster centroids (seeds). A poor initial seeding can lead to suboptimal clustering results.

Outliers in K-means

Outliers are data points that are significantly different from other data points in the dataset. In k-means, outliers can skew the cluster centroids, leading to inaccurate clustering results.

Addressing Outliers in K-means

One approach to address outliers in k-means is to remove data points that are far away from the centroids during the clustering process. This can help improve the accuracy of the results by removing noisy or irrelevant data.

Random Sampling for Outliers

Another method to manage outliers in k-means is to randomly sample data points before clustering. This reduces the chance of selecting outliers and allows for more accurate clustering of the remaining data.

Selecting k in K-means

K-means clustering is sensitive to the choice of the number of clusters (k). A poor selection of k can lead to inaccurate or poorly defined clusters.

SSE in K-means

K-means clustering aims to minimize the sum of squared errors (SSE) within clusters, which means it tries to group data points that are closest to the cluster centroid. The algorithm terminates at a local minimum of SSE, not necessarily the global minimum.

Sample variance of $z_{i1}$

The sample variance of the $z_{i1}$ can be expressed as the sum of the squared differences between each $z_{i1}$ value and the mean of the $z_{i1}$ values, divided by the number of data points, $n$.

Loading vector φ1

The loading vector $\phi_1$, with elements $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$, represents a direction in the feature space along which the data exhibit the most variation. Each element of the vector corresponds to a feature, indicating its contribution to that direction.

Principal component scores

Projecting the data points onto the loading vector $\phi_1$ results in the principal component scores $(z_{11}, \ldots, z_{n1})$, which are the values along the direction of maximum variability.

Second principal component

The second principal component is the linear combination of the features $(X_1, \ldots, X_p)$ that has maximum variance among all linear combinations uncorrelated with the first principal component ($Z_1$).

Second principal component scores

The second principal component scores $(z_{12}, z_{22}, \ldots, z_{n2})$ are calculated using the second principal component loading vector ($\phi_2$) and the original features $(X_1, \ldots, X_p)$.

Orthogonality of principal components

Constraining the second principal component (Z2) to be uncorrelated with the first component (Z1) is equivalent to ensuring that the direction of the second principal component loading vector (φ2) is perpendicular to the direction of the first (φ1).

Principal component directions from SVD

The principal component directions (φ1, φ2, φ3, ...) are represented by the right singular vectors of the matrix X, ordered from highest to lowest variability. The variances of the components are proportional to the squares of the singular values.

Principal component directions from eigen decomposition

The principal component directions can also be obtained from the eigenvectors of the matrix $X^TX$. The variances of the components are equal to the eigenvalues of this matrix.

First principal component

The direction in p-dimensional space that is closest to the n data points, minimizing the average squared Euclidean distance.

Plane spanned by the first two principal components

The plane that minimizes the average squared Euclidean distance to all data points.

Scaling variables in PCA

Scaling each variable to have a standard deviation of one, ensuring comparable scales across variables.

Proportion of variance explained (PVE) by a principal component

The proportion of variance explained by a principal component, indicating the amount of variability in the data captured by that component.

Total variance in PCA

The total variance in a data set, calculated as the sum of variances of all variables.

Variance explained by a principal component

The variance explained by a specific principal component, representing the variation captured by that component.

PVE formula

The variance explained by a principal component divided by the total variance, ranging from 0 to 1.

PVEs sum to 1

The sum of PVEs of all components equals 1, indicating that all the variance in the data is accounted for.

Study Notes

Introduction to Machine Learning AI 305

  • Unsupervised learning is a subset of machine learning.
  • Clustering is a technique in unsupervised learning.
  • Clustering finds similarity groups in data, called clusters.

Supervised Learning vs. Unsupervised Learning

  • Supervised learning discovers patterns relating data attributes to a target (class) attribute. These patterns help predict values of the target attribute in future data.
  • Unsupervised learning lacks a target attribute. The goal is to explore the data and find inherent structures within it.

Clustering

  • Clustering is a method for finding similarity groups in data, creating clusters of similar data points.
  • Data points similar to each other are grouped into the same cluster.
  • Data points far apart, dissimilar, are grouped into different clusters.
  • Clustering is often considered a form of unsupervised learning.

An Illustration

  • A dataset can have naturally occurring groups of data points—natural clusters.

What is Clustering For?

  • Example 1: Grouping people by size to make different sizes of clothing (small, medium, large).
  • Example 2: Marketing segments customers based on similarities, targeting specific subgroups with advertising or product recommendations.
  • Example 3: Organizing documents by topic similarities (e.g., news articles) to create a topic hierarchy. This is important for organizing large volumes of text data.
  • Clustering is used in many fields like medicine, psychology, and marketing.

Aspects of Clustering

  • Clustering Algorithm: includes partitional methods (such as K-means) and hierarchical methods.
  • Distance (Similarity or Dissimilarity) function: Determines how similar or dissimilar data points are.
  • Clustering Quality: Measures how good the clusters are, typically looking for minimized intra-cluster distance and maximized inter-cluster distances.

K-means Clustering

  • A partitional clustering algorithm.
  • Divides data points into k clusters.
  • Each cluster has a center called a centroid.
  • K is specified by the user.

K-means Algorithm Steps

  • Step 1: Randomly choose k data points to be the initial centroids (cluster centers).
  • Step 2: Assign each data point to its nearest centroid; the dataset is thereby divided (partitioned) into k clusters.
  • Step 3: Recalculate the centroids using the current assignments of data points.
  • Step 4: Repeat steps 2 and 3 until a convergence criterion is met (no more significant changes in assignments).
  • Different seed selections lead to differing cluster formation.

K-means Algorithm: Detailed Steps

  • Step 1: Choose k data points (seeds) and set them as the initial centroids.
  • Step 2 (iterate): For each data point, compute the distance to each centroid and assign the point to the nearest centroid.
  • Step 3: Re-compute the centroids from the current assignments; repeat steps 2 and 3 until a stopping criterion is met.
  • Stopping criteria: no (or minimal) re-assignment of points to different clusters, minimal change in the centroids, or minimal decrease in the sum of squared errors, $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$, where $m_j$ is the centroid of cluster $C_j$. (A minimal implementation sketch follows.)
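
Below is a minimal NumPy sketch of this loop. It assumes Euclidean distance and that no cluster ever becomes empty; the names `kmeans`, `X`, and `max_iter` are illustrative, not from the lesson.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points at random as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes no cluster is left empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stopping criterion: centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # SSE as defined above
    return labels, centroids, sse
```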

Stopping/Convergence Criterion

  • No further changes in data point assignments or centroid locations.
  • Minimum decrease in SSE (sum of squared errors) between iterations.

Strengths of K-means

  • Simple to understand and implement.
  • Efficient, with time complexity O(tkn) which is linear when t and k are relatively small.

Weaknesses of K-means

  • Dependent on choosing the right value of k (the number of clusters).
  • Sensitive to outliers, i.e., data points that are very far from the others.
  • Dependent on initial seed selection: different initial centroids can yield different clusterings.
  • Not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).

Selecting 'k'

  • Plots for k = 2, k = 3, and k = 4 show how the k-means clustering of the same data changes with the choice of k. A simulated data set with 150 observations in a 2-dimensional space is used for illustration; the comparison is sketched below.
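
A sketch of that comparison using scikit-learn: fit k-means for each candidate k on simulated data and print the within-cluster SSE (`inertia_`). The data here are randomly generated, not the lesson's figure.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(150, 2))  # simulated 150 points in 2-D

for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # inertia_ is the within-cluster SSE for this k
```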

Weaknesses of K-means: Outliers

  • Remove data points that are far from the centroids.
  • Monitor outliers over iterations for removal.
  • Use Random Sampling to reduce the likelihood of selecting outliers.
  • Assign the remaining data points via distance comparison or a classification model (a sketch of the distance-based filtering follows).
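
An illustrative sketch of the distance-based filtering idea: cluster once, drop points that sit unusually far from their centroid, then re-cluster. The 3-standard-deviation cutoff and the planted outlier are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])  # one planted outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance from each point to its own centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
keep = dist < dist.mean() + 3 * dist.std()  # flag far-away points as outliers
km_clean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[keep])
```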

Weaknesses of K-means: Initial Seeds

  • Random seed selection can produce very different clustering results.
  • Various methods can be used to help select good seeds; one common remedy, multiple restarts, is sketched below.
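
A sketch of the multiple-restarts remedy: run k-means from several random seedings and keep the run with the lowest SSE. (scikit-learn's `n_init` parameter does this internally; the explicit loop is shown only to make the idea visible.)

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(150, 2))

# Ten different seedings; keep the fit with the smallest within-cluster SSE.
best = min(
    (KMeans(n_clusters=3, n_init=1, random_state=s).fit(X) for s in range(10)),
    key=lambda km: km.inertia_,
)
print(best.inertia_)
```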

Clustering Algorithm Summary

  • K-means, despite weaknesses, is still popular.
  • Other algorithms may be better for specific tasks, needs, and data sets.
  • Comparing clustering algorithms is complex and subjective.

Common Ways to Represent Clusters

  • Use centroids to represent clusters
  • Calculate the spread of each cluster via its radius and standard deviation (computed in the sketch below).
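
A small sketch of representing a single cluster by its centroid plus spread measures; the cluster points are randomly generated for illustration.

```python
import numpy as np

cluster = np.random.default_rng(3).normal(size=(40, 2))  # points of one cluster
centroid = cluster.mean(axis=0)
dists = np.linalg.norm(cluster - centroid, axis=1)
radius = dists.max()   # radius: distance from the centroid to the farthest member
spread = dists.std()   # standard deviation of member-to-centroid distances
```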

Using Classification Model

  • All data points are assumed to share the same class label (cluster ID).

Use frequent values to represent cluster

  • Use for categorical data.
  • Often used in text clustering.

Clusters Of Arbitrary Shapes

  • Hyper-ellipsoidal or hyper-spherical clusters are easier to represent using centroids and spread measures.
  • Irregular shaped clusters are more complex to represent.

Hierarchical Clustering

  • Avoids pre-defining 'k'
  • Creates nested sequence of clusters, a tree (dendrogram).
  • Different methodologies exist (divisive and agglomerative).

Agglomerative Clustering

  • Data points initially form individual clusters (nodes).
  • The two clusters with the shortest distance between their data points are merged repeatedly until a single cluster remains.

Agglomerative Clustering Algorithm Steps

  1. Each data point in the dataset D starts as a separate cluster.
  2. Compute the distances between all pairs of clusters.
  3. Merge the two closest clusters into a new cluster.
  4. Repeat steps 2 and 3 until only a single cluster remains (see the sketch below).
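
A sketch of these steps using SciPy's agglomerative implementation rather than a hand-rolled loop; the linkage method and the cut height of 3.0 are arbitrary choices for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.random.default_rng(4).normal(size=(20, 2))
Z = linkage(X, method="complete")  # merge history: one row per pairwise merge
dendrogram(Z)                      # the tree of the nested sequence of clusters
plt.show()

labels = fcluster(Z, t=3.0, criterion="distance")  # cut the dendrogram at height 3.0
```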

Measuring Distance Between Clusters

  • There are different ways to measure distances between different clusters.
  • Different variations include complete link, single link, average link, and centroids.

Types of Hierarchical Clustering Methods

  • Agglomerative (bottom-up): starts with individual data points as clusters and merges them iteratively based on similarity.
  • Divisive (top-down): starts with all data points in one cluster and recursively splits them into smaller clusters based on dissimilarity.

Single Link Method

  • The distance between two clusters is the shortest distance between any two data points, one from each cluster.
  • It favors longer "chains" of clusters and is very sensitive to noise and outliers.
  • May form undesirable "chain effects".

Complete Link Method

  • The distance between two clusters is the maximum distance between any two data points, one from each cluster.
  • Less sensitive to outliers than single linkage, but can potentially miss clusters.

Average Link Method

  • A compromise between single-link and complete-link.
  • The distance between two clusters is the average distance between all pairs of data points, one from each cluster.

Centroid Method

  • Distance is computed based on the distance between the centroids.
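
The four inter-cluster distances just described, computed directly for two tiny example clusters; the point sets `A` and `B` are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])  # one small example cluster
B = np.array([[4.0, 0.0], [5.0, 1.0]])  # another

D = cdist(A, B)       # all point-to-point distances across the two clusters
single = D.min()      # single link: shortest pairwise distance
complete = D.max()    # complete link: largest pairwise distance
average = D.mean()    # average link: mean over all pairs
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid method
```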

Distance Function

  • "Similarity" and "dissimilarity" measures are central to the concept of clustering.
  • Measures exist for many types of data (nominal, numeric) and for differing applications.
  • Common metrics include Euclidean distance, squared Euclidean distance, Manhattan distance, and Chebyshev distance (computed below).
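
The named metrics for two example points, as a quick sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(((x - y) ** 2).sum())  # straight-line distance
manhattan = np.abs(x - y).sum()            # sum of absolute coordinate differences
chebyshev = np.abs(x - y).max()            # largest single coordinate difference
```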

How to Choose a Clustering Algorithm

  • No single best algorithm
  • Various algorithms exist each with strengths, weaknesses, suitable use cases.
  • Trial and error via testing is typically used.
  • Consider the type of data, desired characteristics of clusters.

Principal Components Analysis (PCA)

  • Data visualization or preprocessing for supervised learning.
  • Produces low-dimensional representation (reduced dimensionality).
  • Finds linear combinations of the variables that have maximal variance and are mutually uncorrelated.

Principal Components Analysis (PCA): Details

  • First principal component: the normalized linear combination of the features with the largest variance.
  • $Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$
  • The loadings of the first principal component are $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$ (the projection is sketched below).
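
A brief sketch of how scores arise from a loading vector; `phi1` here is an arbitrary unit-length vector chosen for illustration, not a fitted loading.

```python
import numpy as np

X = np.random.default_rng(5).normal(size=(100, 2))
X = X - X.mean(axis=0)       # PCA assumes mean-centered variables
phi1 = np.array([0.8, 0.6])  # illustrative unit-length loading vector
z1 = X @ phi1                # first principal component scores z_i1
```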

PCA: Example

  • Illustrates population size and ad spending for 100 different cities.
  • First and second Principal Components are depicted visually (in 2d).

Computation of Principal Components

  • Assumes variables have mean zero
  • Calculate the linear combination with the largest variance.
  • SVD (singular value decomposition) or eigen decomposition can do the calculation.
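
A minimal sketch of the SVD route on centered data; the rows of `Vt` (equivalently, the columns of `loadings`) play the role of the loading vectors.

```python
import numpy as np

X = np.random.default_rng(6).normal(size=(100, 4))
X = X - X.mean(axis=0)  # center each variable to mean zero
U, s, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt.T            # columns are phi_1, phi_2, ... in order of variance
scores = X @ loadings      # principal component scores
variances = s**2 / len(X)  # component variances, proportional to the squared singular values
```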

Geometry of PCA

  • Loading vector (Φ₁) represents direction in feature space with maximum variance
  • Projecting data points onto this vector generates principal component scores.

Further Components

  • Subsequent PCs (principal components) are orthogonal (uncorrelated with previous PCs) with maximum variance.

Further Principal Components Continued

  • The principal component directions ($\phi_i$) are the ordered sequence of right singular vectors of X.
  • The variances of the components are proportional to the squares of the singular values.
  • There are at most min(n − 1, p) principal components.

Illustration - Example

  • The example illustrates the USArrests data, including three crimes (Assault, Murder, Rape) and the percentage of urban population for each US state.

Example Details

  • The first two principal components of USArrests are examined.
  • Visualization of these components highlight relationships.
  • The first component is primarily related to overall crime rates among states.
  • The second component focuses primarily on the urbanization level of states.

Example Details - Continued

  • A biplot relates the component scores of the states to the loading-vector projections of the variables, showing their correlations.

Another Interpretation of Principal Components

  • The data points live in p-dimensional space, and PCA seeks the hyperplane closest to these observations.
  • The first principal component loading vector defines the line in p-dimensional space that is closest to the n observations, using average squared Euclidean distance as the measure of closeness.

Scaling the Variables

  • If variables are measured in different units, each should be scaled to have standard deviation one; otherwise variables with large variances dominate the principal components (see the sketch below).
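
A one-step sketch of that scaling with NumPy; the per-column multipliers simulate variables recorded in different units.

```python
import numpy as np

X = np.random.default_rng(7).normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)  # each column: mean 0, std dev 1
```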

Proportion Variance Explained (PVE)

  • To judge the strength of each component, calculate how much of the variance in the data each principal component explains.
  • This proportion is the PVE (proportion of variance explained).
  • The PVEs of all the principal components sum to 100%.

Proportion Variance Explained, continued

  • The PVE of a component is the ratio of that component's variance to the total variance.
  • Cumulative PVEs can be displayed and examined (computed in the sketch below).
  • Often the first few principal components capture most of the variation in the data.
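
A sketch computing each component's PVE and the cumulative PVE from the singular values of scaled data:

```python
import numpy as np

X = np.random.default_rng(8).normal(size=(100, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # scale as recommended above
s = np.linalg.svd(X, compute_uv=False)    # singular values only
pve = s**2 / (s**2).sum()                 # each component's share of the variance
print(pve, pve.cumsum())                  # the PVEs sum to 1
```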

How Many Components Should We Use?

  • Determining a sufficient number of principal components is not straightforward in unsupervised learning; cross-validation is generally not available.
  • A "scree plot" of the PVE can help: look for an elbow by visual examination (sketched below).
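
A sketch of a scree plot with matplotlib on simulated, scaled data; in practice one looks for the elbow in this curve.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.random.default_rng(9).normal(size=(100, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)
s = np.linalg.svd(X, compute_uv=False)
pve = s**2 / (s**2).sum()

plt.plot(range(1, len(pve) + 1), pve, marker="o")
plt.xlabel("Principal component")
plt.ylabel("PVE")
plt.show()
```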
