Cluster Analysis I
Carlson School of Management
Vivek Ajmani
Summary
This presentation introduces cluster analysis: its methods, applications, and the data types it handles. It explains how to group data observations into clusters in areas such as marketing, finance, and healthcare, and covers distance measures and data normalization.
Full Transcript
BA 3551 Cluster Analysis I
Vivek Ajmani
Information and Decision Sciences, Carlson School of Management
Email: [email protected]

Part 1

Descriptive and Predictive Analytics
Descriptive Analytics (Unsupervised Learning):
- Association Rules: Apriori Algorithm
- Cluster Analysis: Hierarchical Clustering, K-Means
Predictive Analytics (Supervised Learning):
- Classification: K-NN, Decision Trees
- Numeric Prediction: K-NN, Regression Trees
Supporting both: Data Preparation, Pre-Processing, Transformation, Visualization

Clustering
The grouping of observations together based on their (many) features: organizing data points/observations into "internally similar" and "meaningful" groups, called clusters.
https://towardsdatascience.com/ml-algorithms-one-sd-%CF%83-clustering-algorithms-746d06139bb5

Clustering
Customer | Age | Spending Score (1-100) | Income | …
A | 26 | 39 | $75,000 | …
B | 29 | 81 | $150,000 | …
C | 22 | 50 | $50,000 | …
D | 34 | 40 | $170,000 | …
… | … | … | … | …

Clustering
Cluster analysis is a form of unsupervised learning:
- We do not have a clear outcome in mind.
- Observations are not being "classified" into pre-defined categories/groups.
- Clusters are not pre-defined; they are discovered from the data.
- Once clusters are discovered by the algorithm being used, they can be evaluated and interpreted.
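To make "clusters are discovered, not pre-defined" concrete, here is a minimal sketch of running a clustering algorithm on the customer table above. K-Means is previewed in the Part 1 outline; the choice of pandas and scikit-learn is an assumption for illustration, not something the slides specify.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# The slide's customer table (first four rows)
df = pd.DataFrame({
    "age": [26, 29, 22, 34],
    "spending_score": [39, 81, 50, 40],
    "income": [75_000, 150_000, 50_000, 170_000],
}, index=list("ABCD"))

# The features live on very different scales, so rescale first
# (normalization is covered in Part 2 of this deck)
X = StandardScaler().fit_transform(df)

# Ask for 2 clusters; the labels are discovered from the data,
# not read from any pre-defined category column
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(df.index, labels)))  # e.g. {'A': 0, 'B': 1, 'C': 0, 'D': 1}
```

Nothing in the input says which customer belongs to which group; the algorithm proposes the grouping, and the analyst then evaluates and interprets it.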
Applications of Clustering
1. Discover natural groups and patterns in the data. Examples:
- Marketing Analytics: create groups of similarly behaving customers (segments).
- Finance Analytics: discover stocks with similar price fluctuations.
- Health Analytics: insurance companies cluster individuals based on claims patterns.
2. Facilitate the analysis of very large datasets: instead of looking at each individual data point, we can look at each cluster and study its features.

Application to Marketing
A company wants to segment its customer base to design more personalized marketing campaigns. It has data on customer demographics, purchasing behavior, and engagement with marketing channels. Variables include:
- Age
- Income
- Frequency of purchases
- Types of products bought
- Response to previous marketing campaigns (e.g., email opens, clicks)
- Credit risk score
- Etc.

Application to Marketing
After running the cluster analysis, the company finds three distinct customer segments:
- Cluster 1: Young, low-income customers who purchase frequently but buy low-cost items. They engage mostly through social media ads and respond well to discounts.
- Cluster 2: Middle-aged, high-income customers who make occasional but high-value purchases. They prefer email communication and are less price-sensitive.
- Cluster 3: Older customers with moderate income who make regular purchases. They engage with loyalty programs and respond to personalized email offers.
Actionable insights:
- For Cluster 1, the company can focus on social media advertising and offer discounts on frequently bought items.
- For Cluster 2, premium product marketing through email and value-based offers (e.g., exclusivity) can be prioritized.
- For Cluster 3, loyalty programs can be enhanced to reward consistent buying behavior, with email campaigns focusing on personalized product recommendations.

Application to Finance (1)
A bank wants to better understand its customer base to optimize its product offerings and marketing strategies. It has data on customer demographics, financial behavior, and usage of various banking services. The variables considered include:
- Age
- Income
- Total deposits
- Loan amounts
- Credit card usage
- Investment behavior (e.g., stocks, bonds, mutual funds)
- Risk tolerance (low, medium, high)
- Credit risk score
- Etc.

Application to Finance (1)
After performing the cluster analysis, the bank identifies four key customer segments:
- Cluster 1 (High-income investors): customers with high income, substantial deposits, and active investments in stocks and mutual funds. They exhibit a high risk tolerance and are primarily interested in wealth management services.
- Cluster 2 (Credit-heavy consumers): middle-aged customers with moderate income but significant loan amounts (e.g., mortgages, personal loans). They tend to use credit cards heavily and have medium risk tolerance.
- Cluster 3 (Young savers): younger customers with lower income, mostly engaged in savings accounts with little or no investment activity. Their financial goals are focused on saving for the future, and they are risk-averse.
- Cluster 4 (Older conservative investors): older customers who hold high deposits but prefer low-risk investment options like bonds and CDs. They tend to avoid high-risk ventures and are mostly focused on maintaining their wealth.
Actionable insights:
- For Cluster 1 (High-income investors), the bank could promote advanced financial products like portfolio management, retirement planning, and premium advisory services.
- For Cluster 2 (Credit-heavy consumers), targeted offers related to loan refinancing or credit card rewards can be made, along with debt management services.
- For Cluster 3 (Young savers), the bank could introduce educational campaigns focused on investing and savings growth, promoting starter investment accounts or …

Application to Finance (2)
Cluster analysis can be a valuable tool for clustering stocks based on their similarities in performance, risk profiles, or other financial metrics. By grouping similar stocks together, you can create diversified portfolios, identify trends, or conduct more targeted analyses.
To begin clustering stocks, you'll need to gather data that captures the key characteristics of each stock (see the sketch after this list). Common variables include:
- Price returns (daily, monthly, or yearly)
- Volatility (standard deviation of returns)
- Market capitalization
- P/E ratio (price-to-earnings ratio)
- Dividend yield
- Beta (a measure of a stock's volatility relative to the market)
- Sector (e.g., technology, healthcare, finance)
- Volume (trading activity)
- Etc.
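These variables have to be computed from raw market data before any clustering can happen. As a sketch of that step (the prices are invented and NumPy is an assumed choice; neither comes from the slides), here is how two of the listed features, price returns and volatility, might be derived from daily closing prices:

```python
import numpy as np

# Hypothetical daily closing prices; rows are days, columns are stocks
prices = np.array([
    [100.0, 50.0, 20.0],
    [101.0, 49.5, 20.4],
    [102.5, 49.0, 20.1],
    [101.8, 50.2, 20.6],
])

# Daily price returns: relative change between consecutive days
returns = np.diff(prices, axis=0) / prices[:-1]

# Two of the slide's variables, computed per stock
mean_return = returns.mean(axis=0)          # average daily return
volatility = returns.std(axis=0, ddof=1)    # standard deviation of returns

# One row per stock: a feature matrix ready to feed into clustering
features = np.column_stack([mean_return, volatility])
print(features)
```

After adding the remaining variables (market cap, P/E, beta, and so on) as extra columns, the distance and normalization machinery discussed later in this deck applies row by row.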
Application to Finance (2)
Once clustering is complete, you can interpret the resulting clusters. Stocks within the same cluster are considered similar, while stocks in different clusters are dissimilar. Some typical interpretations could be:
- Cluster 1 (High-growth, high-volatility stocks): stocks in this cluster may have high returns but also high volatility, making them suitable for aggressive portfolios.
- Cluster 2 (Blue-chip, low-volatility stocks): stocks with stable, consistent returns and lower risk, suitable for conservative investors.
- Cluster 3 (Dividend-paying stocks): stocks that provide regular dividend payments, appealing to income-focused investors.
- Cluster 4 (Small-cap stocks): smaller companies with high growth potential but higher risk.

Application to Finance (2)
Actionable insights. After clustering, you can use the results to make informed decisions about stock portfolios:
- Portfolio diversification: by selecting stocks from different clusters, you can build a well-diversified portfolio that reduces risk. For example, combining high-growth stocks with stable blue-chip stocks can balance risk and reward.
- Sector-based analysis: if your clusters align with sectors (e.g., technology, healthcare), you can assess which sectors are performing similarly and make investment decisions based on sector performance.
- Risk-based segmentation: you might group stocks based on risk characteristics (volatility, beta, etc.), allowing you to design portfolios that match different risk profiles (e.g., aggressive vs. conservative portfolios).

Application to Healthcare
A hospital system wants to improve the management of patients with chronic conditions (e.g., diabetes, hypertension, and heart disease) by tailoring treatments and allocating healthcare resources more efficiently. It has data on patient demographics, medical histories, and treatment outcomes. The variables include:
- Age
- BMI (body mass index)
- Blood pressure
- Blood sugar levels
- Cholesterol levels
- Number of hospital visits
- Medication adherence
- Smoking status
- Physical activity levels
- Etc.

Application to Healthcare
After running the cluster analysis, the hospital identifies four distinct patient segments:
- Cluster 1 (High-risk patients with poor control): older patients with high BMI, elevated blood sugar levels, high cholesterol, and frequent hospital visits. These patients tend to have poor medication adherence and low physical activity levels.
- Cluster 2 (Moderate-risk, compliant patients): patients with moderately controlled chronic conditions. They have slightly elevated blood pressure and cholesterol, but good medication adherence and moderate physical activity. They visit the hospital occasionally for routine checkups.
- Cluster 3 (Low-risk, healthy-lifestyle patients): younger patients who maintain a healthy lifestyle with low BMI, regular physical activity, and good medication adherence. Their blood sugar and cholesterol levels are under control, and they have infrequent hospital visits.
- Cluster 4 (Smokers with heart disease risk): middle-aged patients who are active smokers and show early signs of heart disease. They have elevated blood pressure and cholesterol but are less engaged in treatment adherence. Smoking cessation is a significant concern for this group.

Application to Healthcare
Actionable insights:
- For Cluster 1 (High-risk patients with poor control), the hospital can prioritize more intensive interventions like frequent monitoring, personalized care plans, and closer follow-ups. Health coaches and telehealth monitoring can be introduced to improve medication adherence.
- For Cluster 2 (Moderate-risk, compliant patients), a maintenance strategy focusing on continued monitoring and regular check-ins through digital health tools or routine visits could help manage their chronic conditions effectively.
- For Cluster 3 (Low-risk, healthy-lifestyle patients), minimal interventions are needed. A focus on preventive care and education around maintaining a healthy lifestyle can help prevent worsening conditions.
- For Cluster 4 (Smokers with heart disease risk), targeted interventions like smoking cessation programs, cardiovascular health monitoring, and regular screenings are recommended to reduce the risk of heart disease and improve long-term outcomes.
Clustering
Main idea: organizing data into the most natural groups, called clusters. Desired properties of a cluster:
1. High intra-similarity: data points in the same cluster should be similar to each other.
2. Low inter-similarity: data points in different clusters should be different "enough" from each other.

Why do we need algorithms?
If the data has 3 dimensions or fewer, clustering would be very straightforward.
- Observation/data point: each row of the dataset.
- Feature/dimension: each column that captures a feature.
A 2D dataset is a dataset with 2 dimensions, so 2 features; a 3D dataset has 3 dimensions, and so on. An observation or data point in our data can be represented as a point on a plane as a function of the respective features/dimensions. Then why do we need algorithms?

Customer | Age | Spending Score (1-100) | Income
A | 26 | 5 | $75,000
B | 29 | 8 | $150,000
C | 22 | 3 | $50,000
… | … | … | …

Why do we need algorithms?
Since clustering is based on the features available in the data, if the data is 3D or less, clustering would be very straightforward. In low-dimensional spaces, clusters can even emerge from simple plots.

Why do we need algorithms?
We usually must deal with N-dimensional spaces! In other words, our dataset will likely have more than 3 dimensions (features). Generally, a dataset of N columns (features) and M rows (observations) can be considered as having M observations of N dimensions. How do we cluster a space with hundreds or more dimensions?

How to Cluster
When the data is high-dimensional (we have a lot of columns), we need to:
- Understand how to measure similarity between individual data points and groups of points (clusters): the types of "similarity" measures.
- Decide which clustering method to use. Types of methods: Hierarchical Clustering, K-Means Clustering.
- Decide how many clusters we should have: stopping criteria.
- Evaluate the quality and meaningfulness of the clusters obtained.

Distance Measures
To measure similarity between individual data points we will use distance measures. Different distance measures exist for different data types:
- Numerical data: Euclidean Distance, Manhattan Distance
- Binary data (0-1, or data with only 2 categories): Matching Distance, Jaccard Distance
- Categorical data (more than 2 categories)

Distance Measures
Assume k-dimensional data, where k can be any number. We can represent each observation as a data point whose "coordinates" in the k-dimensional space are its feature values:
A = (a1, …, ak), B = (b1, …, bk), C = (c1, …, ck)
Example: consider the 3-dimensional dataset below. Each observation can be described by the 3 features Age, Years of Post-Secondary Education, and Income:

Customer | Age | Years of Post-Secondary Education | Income
A | 26 | 5 | $75,000
B | 29 | 8 | $150,000
C | 22 | 3 | $50,000

A = (26, 5, 75000)
B = (29, 8, 150000)
C = (22, 3, 50000)

Distance Measures: Numerical Data
Consider two data points with k dimensions: A = (a1, …, ak), B = (b1, …, bk).
Euclidean Distance: the length of the line segment between the two points, i.e., "the straight-line distance" (the Pythagorean theorem):
d(A,B) = sqrt((a1 - b1)^2 + … + (ak - bk)^2)
Example in 2D:
Customer | Dim1 | Dim2
A | 2 | 14
B | 7 | 2
d(A,B) = sqrt((2 - 7)^2 + (14 - 2)^2) = sqrt(25 + 144) = sqrt(169) = 13

Distance Measures: Numerical Data
Manhattan Distance: the distance between two points is the sum of the absolute differences of their coordinates. The absolute (or modulus) operator | | transforms any number inside it into a positive number. Used where rectilinear distance is relevant:
d(A,B) = |a1 - b1| + … + |ak - bk|
Example in 2D (same points as above):
d(A,B) = |2 - 7| + |14 - 2| = 5 + 12 = 17

Distance Measures: Numerical Data
Summary of the 2D example: Euclidean distance = sqrt(169) = 13; Manhattan distance = 5 + 12 = 17.
How to interpret them: the higher the distance, the more different the two points; the lower the distance, the more similar the two points.
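Both formulas are easy to verify in code. A minimal sketch reproducing the 2D example above, using only the Python standard library (the variable names are mine):

```python
import math

A = (2, 14)
B = (7, 2)

# Euclidean distance: straight-line distance (Pythagorean theorem)
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))

# Manhattan distance: sum of absolute coordinate differences
manhattan = sum(abs(a - b) for a, b in zip(A, B))

print(euclidean)  # 13.0, i.e. sqrt(25 + 144)
print(manhattan)  # 17, i.e. 5 + 12
```

Because both are written over zip(A, B), the same two lines work unchanged for k-dimensional points such as the 3-feature customers above.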
Distance Measures: Binary Data
A = (a1, …, ak), B = (b1, …, bk), where every ai and bi is 0 or 1. Define the following quantities:
- N00 = number of i's where ai = 0 and bi = 0 (attributes where both A and B are 0)
- N01 = number of i's where ai = 0 and bi = 1 (attributes where A = 0 and B = 1)
- N10 = number of i's where ai = 1 and bi = 0
- N11 = number of i's where ai = 1 and bi = 1

Student | Senior | MIS Major | BA Minor
A | 0 | 1 | 0
B | 1 | 1 | 1
C | 1 | 0 | 0
D | 0 | 1 | 1

Example, consider A and B in the table above: N00 = 0, N01 = 2, N10 = 0, N11 = 1.

Distance Measures: Binary Data
Matching Distance: d(A,B) = (N01 + N10) / (N00 + N01 + N10 + N11) = (N01 + N10) / k
Intuition: the number of mismatches divided by the total number of attributes (k). Used for symmetric binary data, where N00 and N11 are equally important.
Example: d(A,B) = (2 + 0) / 3 ≈ 0.66. The range is always [0, 1]; the higher the value, the more distant ("different") the data points.

Distance Measures: Binary Data
Jaccard Distance: d(A,B) = (N01 + N10) / (N01 + N10 + N11)
Intuition: excludes matches where ai = 0 and bi = 0. Used for asymmetric binary data, where N00 is not as important as N11: cases in which knowing mutual presences matters more than knowing mutual absences.
Example: d(A,B) = (2 + 0) / (2 + 0 + 1) ≈ 0.66.
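The four counts and both distances can be computed mechanically. A small sketch reproducing the A-versus-B example from the table (helper names like binary_counts are mine, not from the slides):

```python
def binary_counts(a, b):
    """Count N00, N01, N10, N11 over two equal-length 0/1 vectors."""
    n00 = sum(1 for x, y in zip(a, b) if (x, y) == (0, 0))
    n01 = sum(1 for x, y in zip(a, b) if (x, y) == (0, 1))
    n10 = sum(1 for x, y in zip(a, b) if (x, y) == (1, 0))
    n11 = sum(1 for x, y in zip(a, b) if (x, y) == (1, 1))
    return n00, n01, n10, n11

def matching_distance(a, b):
    """Mismatches / k, for symmetric binary data."""
    n00, n01, n10, n11 = binary_counts(a, b)
    return (n01 + n10) / len(a)

def jaccard_distance(a, b):
    """Like matching distance, but 0-0 matches are ignored."""
    n00, n01, n10, n11 = binary_counts(a, b)
    return (n01 + n10) / (n01 + n10 + n11)

A = (0, 1, 0)  # Senior, MIS Major, BA Minor for student A
B = (1, 1, 1)
print(binary_counts(A, B))      # (0, 2, 0, 1)
print(matching_distance(A, B))  # 0.666...
print(jaccard_distance(A, B))   # 0.666...
```

The two distances coincide here only because N00 = 0 for this pair; for a pair with a shared absence, such as A and C (both 0 on BA Minor), they would differ.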
Part 2

Dealing with Different Data Types
In reality, datasets have different types of attributes (numerical, binary, etc.): mixed data. Can we use any of the distance measures that we just learned directly? What's the consequence of doing so? Consider the dataset below as an example:

Individual | Has Driving License | Age | Income
A | 1 | 18 | 10,000
B | 1 | 27 | 80,000
C | 0 | 29 | 45,000

Dealing with Different Data Types
In reality, datasets have:
- Different types of attributes (numerical, binary, etc.): mixed data.
- Attributes that take values in very different ranges (for example, age and income have very different ranges): non-comparable data.
As such, we will need to transform our attributes so they take values from a common range.

Data Normalization
The objective of data normalization is to eliminate specific units of measurement and transform the attributes to a common scale. The term "normalization" is somewhat loosely used to refer to any method that can be used to scale attributes, and different methods scale attributes in different ways:
- Min-Max Normalization
- Standardization
Once your data is normalized, you can apply one of the numerical distance measures.

Data Normalization
Min-Max Normalization: rescale the attributes to have values between 0 and 1 using the min and max. A point with value X will be normalized to:
NewValue = (X - min) / (max - min)
Example, Income:
- 10,000: (10000 - 10000) / (80000 - 10000) = 0
- 80,000: (80000 - 10000) / (80000 - 10000) = 1
- 45,000: (45000 - 10000) / (80000 - 10000) = 0.5
"Has Driving License" is already between 0 and 1, so we do not need to apply Min-Max to it. Normalize Age as an exercise (complete solutions available on Canvas).

Data Normalization
Standardization: transform each attribute to have a mean of 0 and a standard deviation of 1. A point with value X will be standardized to:
NewValue = (X - Sample mean) / (Sample standard deviation)
Important: if using Standardization, we need to transform binary data as well.

Example: Standardization of Age
STEP 1:
1a. Compute the mean: (18 + 27 + 29) / 3 ≈ 24.67
1b. Compute the sample standard deviation:
- Sum of squared differences from the mean: (18 - 24.67)^2 + (27 - 24.67)^2 + (29 - 24.67)^2 = 44.49 + 5.43 + 18.75 = 68.67
- Divide by (N - 1), where N is the total number of observations, and take the square root: sqrt(68.67 / 2) = 5.86
STEP 2: Perform the standardization for each data point:
- A, 18: (18 - 24.67) / 5.86 = -1.138
- B, 27: (27 - 24.67) / 5.86 = 0.398
- C, 29: (29 - 24.67) / 5.86 = 0.739

Individual | Has Driving License | Age (Standardized) | Income
A | 1 | -1.138 | 10,000
B | 1 | 0.398 | 80,000
C | 0 | 0.739 | 45,000
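Both normalization methods reduce to one-line formulas; the sketch below reproduces the Income and Age examples above in plain Python (the sample standard deviation divides by N - 1, matching the slide's calculation):

```python
def min_max(values):
    """Min-Max normalization: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Standardization: shift to mean 0, scale to sample SD 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((x - mean) ** 2 for x in values) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in values]

income = [10_000, 80_000, 45_000]
age = [18, 27, 29]
print(min_max(income))   # [0.0, 1.0, 0.5]
print(standardize(age))  # approx. [-1.138, 0.398, 0.739]
```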
Summary
Distance measures:
- Numerical attributes: Euclidean Distance, Manhattan Distance
- Binary attributes: Matching Distance, Jaccard Distance
If the attributes are of different data types or take values in different ranges, normalize first:
- Min-Max Normalization: transform each feature to be between 0 and 1
- Standardization: transform each feature to have mean 0 and SD 1

Distance Between Clusters
We just discussed the distance between two rows (e.g., two individuals). But does the distance between two rows tell us the distance between two clusters? Any connections? Any differences?

Types of Similarity
Consider points A, B, and C. Is B more similar to A or to C? Based on direct distance, one might assume points B and C are more likely to be in the same cluster. But take the context into consideration: points A and B belong to the same cluster.

Distance Between Clusters: Linkage Methods
- Single Linkage
- Complete Linkage
- Average Linkage
- Centroid Linkage
- Ward's Method

Linkage Methods
Single Linkage: the minimum pairwise distance between points from two different clusters. Compute ALL the distances between points and pick the minimum one.
Complete Linkage: the maximum pairwise distance between points from two different clusters. Compute ALL the distances between points and pick the maximum one.
Average Linkage: the average pairwise distance between points from two different clusters. Compute ALL the distances between points from the two clusters and take the average.
Centroid Distance: the distance between the two clusters' centroids, i.e., the cluster means. Compute the clusters' centroids, then compute the distance between the centroids. (See the exercise on clusters' centroids.)

Cluster Centroids
The centroid of a cluster is the "mean point" of the cluster, whose coordinates are the mean values of each dimension/feature. Consider the following 4 observations in 2D:

Observations | Feature 1 | Feature 2
A | 80 | 56
B | 75 | 53
C | 60 | 50
D | 68 | 54

Assume points A and D are clustered together in C1, and points B and C are clustered together in C2:
- C1 = {(80, 56), (68, 54)}. Centroid: ((80 + 68)/2, (56 + 54)/2) = (74, 55)
- C2 = {(75, 53), (60, 50)}. Centroid: (67.5, 51.5)
A centroid does not have to be a data point already existing in your dataset; it is, rather, the "mean" of the data points. If your dataset has more than two features (columns), the coordinates of the centroid are the mean values of each feature: with 3D data the centroid has 3 coordinates, and so on.

Linkage Methods
Ward's method (minimum variance method): the objective of Ward's linkage is to minimize the within-cluster variance, i.e., the sum of squared distances between the cluster centroid and the cluster's points. When comparing two clusters, the method compares the within-cluster variance that would be obtained if the two clusters were merged into one against the sum of the within-cluster variances of the two clusters kept separate.

Summary
Clustering: grouping observations into clusters. We need to know how to:
- Measure distance between individual points
- Normalize data
- Measure distance between groups of points (clusters)
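To close the loop, here is a sketch that recomputes the centroid example above and then hands the same four points to a hierarchical-clustering routine. The slides do not name a library; SciPy is my assumption, but its linkage function exposes the same five methods listed in this deck ('single', 'complete', 'average', 'centroid', 'ward'):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The slide's 2D observations A, B, C, D
points = np.array([[80, 56], [75, 53], [60, 50], [68, 54]])

# Centroids: the per-feature mean of each cluster's points
c1 = points[[0, 3]].mean(axis=0)  # C1 = {A, D} -> [74.0, 55.0]
c2 = points[[1, 2]].mean(axis=0)  # C2 = {B, C} -> [67.5, 51.5]
print(c1, c2)

# Each linkage method is a different rule for turning point-to-point
# distances into cluster-to-cluster distances; swap the method string
# for 'complete', 'average', 'centroid', or 'ward' to compare
Z = linkage(points, method="single")
print(Z)  # the merge history used by hierarchical clustering
```

Hierarchical clustering itself appears later in the Part 1 outline; the point here is only how each linkage rule compares clusters rather than individual points.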