Data Science Summary Notes - Data Analysis Techniques
Summary
This document is a summary of data science concepts and techniques. It covers Key Performance Indicators (KPIs), stratification, target leakage, balanced datasets, and unstructured data, and describes supervised and unsupervised techniques including linear and generalized linear models, decision trees, clustering, and principal component analysis.
Key Performance Indicator (KPI)
Why a variable might be a good KPI:
- It is measurable given the data and assumptions
- It is in line with the objective or goal of the business problem
- It is directly usable as a measurement of progress toward the goal; for example, revenue is a good KPI if we are trying to see how profitable the business is

Stratification
To create a stratified sample:
- Determine the number of strata – the combinations of the desired predictor variables
- For each stratum, take a random sample whose size preserves the stratum's proportion of the original population (the entire training dataset)
To see whether stratification was successful:
- Compare the averages of the training set and validation set; if the averages are similar, stratification was successful

Target Leakage
Occurs when predictor variables include information that is directly related to the target variable
- Ex: the target variable is used as an input (predictor) during training

Balanced Dataset
Use oversampling to create a more balanced dataset
- Oversampling draws a higher proportion of the minority class than the majority class; it duplicates observations of the minority class until there is an equal number of minority and majority observations

Unstructured Data
- Advantage: provides qualitative information that cannot be captured in structured data
- Disadvantage: requires more complex methods to process the data into a predictive model, and takes more resources and time to analyze

Unsupervised Techniques
- Data analysis without a target variable
- Goal: find patterns in the data; there is no clear objective for the findings
- Useful in feature creation (PCA, clustering)
- Useful for finding patterns among predictors to inform model selection and improve prediction accuracy

Developing an Analysis to Estimate a Target Variable
- Decide what filter to use to block out irrelevant information
  o Ex: filter to only flights over 300 miles because that is the only group provided with beverages
- Identify what variables to measure for the problem
  o Identify the target – what we are trying to predict
  o Identify the predictors – the variables that would explain the target
  o Ex: measure average daily passengers on flights in and out of WA, since that is how we measure total market size

Missing Data
- Can replace missing values with the mean
- Remove rows with missing observations – if the number of observations is already large compared to the amount of missing data
- Remove columns (fields) – if the majority of the data for that variable is missing
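A minimal R sketch of these three missing-data options, using a small hypothetical data frame `df` with a numeric column `x`; the 50% threshold in option 3 is an arbitrary illustration, not a rule from these notes.

```r
# Hypothetical data frame with missing values in a numeric column x
df <- data.frame(x = c(4, NA, 7, 10, NA, 6), y = 1:6)

# Option 1: replace missing values with the column mean
df$x_mean_imputed <- ifelse(is.na(df$x), mean(df$x, na.rm = TRUE), df$x)

# Option 2: remove rows that contain any missing observation
df_rows_removed <- df[complete.cases(df), ]

# Option 3: drop a column when most of its values are missing
share_missing <- mean(is.na(df$x))
if (share_missing > 0.5) df$x <- NULL   # 0.5 cutoff is illustrative only
```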
Bias-Variance Tradeoff
- Variance measures how much the shape of the model changes with different training data
- Bias measures the average closeness between the modeled distribution and the actual distribution
- Higher flexibility => higher variance => lower bias
- Variance = expected error due to the model being too flexible
- Bias = expected error due to the model not being flexible enough
- The model with the lowest RMSE strikes a balance between the two

Overfitting vs. Underfitting
- If overfitted:
  o Use stepwise selection to identify and drop weak predictors, reducing flexibility
  o Use regularization to shrink coefficients and potentially drop predictors (lasso or elastic net)
- If underfitted:
  o Add an interaction term to describe patterns between two predictors, increasing flexibility

p-value
- If a coefficient estimate has p-value < 0.05, the coefficient is statistically significant

Numeric vs. Categorical Variable (aka Factor Variable)
Prefer a factor variable:
- When the numeric codes of the variable have no meaningful order or scale
  o The ordering of clusters is arbitrary, and so are their magnitudes; ex: cluster 3 is not "greater" or "higher" than cluster 1
  o Treating them as factors forces the model to ignore the numbers themselves and treat them as distinct groups
- When there is NOT a linear, monotonic relationship between the predictor and the target variable
  o A factor variable allows for this because the levels are treated as separate groups
Disadvantage of a factor variable:
- There may be too many levels, which complicates the model
To decide whether a predictor should be numeric or categorical (factor):
- Do a bivariate analysis between the predictor and the target; if there is a linear, monotonic relationship => numeric; if not => categorical

Combining Levels in a Factor Variable
- If a level has too few observations, its low representation may produce unreliable statistics
- After combining it with another level, we cannot tell whether a change in the target is due to the effect of that level or of the level it was combined with

Log Transformation
- Used for right-skewed distributions
- Compresses larger values and expands smaller values, producing a more symmetric distribution
- Cannot log-transform a variable with non-positive values
- Reduces the impact of outliers
- Cons of the log transform:
  o Makes the model coefficients harder to interpret
  o Does not necessarily improve model performance
  o The transformed variable might have a spike at 0

Binarization
In the context of clustering:
- Say there are 4 clusters; first choose a base level, say cluster 1
- Create dummy variables for the other 3 clusters
- Set cluster_x to 1 for observations in cluster x and 0 otherwise
- If an observation is in cluster 1, all the other cluster dummies are set to 0
In the context of stepwise selection:
- Binarization treats each level of a factor variable separately
- If there are 5 levels, it is possible that only 2 end up in the chosen model, meaning the other 3 do not have enough predictive power to be included
- Removing a dummy variable causes levels to merge: the original base level merges with the levels that no longer have a dummy variable
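A minimal R sketch of binarizing a cluster assignment into dummy variables, with cluster 1 as the base level; the cluster assignments here are hypothetical.

```r
# Hypothetical cluster assignments for eight observations
cluster <- factor(c(1, 3, 2, 1, 4, 2, 3, 1))

# model.matrix() creates one dummy column per non-base level;
# the first level (cluster 1) is the base and gets no column
dummies <- model.matrix(~ cluster)[, -1]   # drop the intercept column
head(dummies)

# An observation in cluster 1 has 0 in every dummy column;
# an observation in cluster 3 has cluster3 = 1 and 0 elsewhere
```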
Accuracy Metrics – Confusion Matrix
For classification problems, NOT regression
- Sensitivity = true positives / real positives = TP / (TP + FN)
  o High sensitivity means the model is good at identifying real positives and not misclassifying them as negative
- Specificity = true negatives / real negatives = TN / (TN + FP)
- Accuracy = % of correct predictions = (TN + TP) / all observations
- Classification error rate = % of wrong predictions = 1 – accuracy
- False positive rate = false positives / real negatives = FP / (FP + TN) = 1 – specificity
- Precision = % of positive predictions that are correct = TP / (TP + FP)
  o High precision means that when the model predicts a positive, it is usually actually positive
- If the goal is to correctly predict positives => focus on sensitivity; if the goal is to correctly predict negatives => focus on specificity
- Lowering the positive-response cutoff allows more positive predictions and fewer negative predictions => increases sensitivity, lowers specificity
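A minimal sketch computing these metrics from hypothetical confusion-matrix counts:

```r
# Hypothetical counts from a confusion matrix
TP <- 40; FN <- 10; FP <- 15; TN <- 85

sensitivity <- TP / (TP + FN)                    # true positive rate
specificity <- TN / (TN + FP)                    # true negative rate
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
error_rate  <- 1 - accuracy
fpr         <- FP / (FP + TN)                    # equals 1 - specificity
precision   <- TP / (TP + FP)

c(sensitivity = sensitivity, specificity = specificity,
  accuracy = accuracy, precision = precision)
```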
Exploratory Data Analysis – Univariate & Bivariate Techniques
- Univariate, numerical: mean, variance, quantiles, frequency
- Univariate, graphical: histogram, bar chart, box plot
- Bivariate, numerical: correlation, statistics by level, frequency
- Bivariate, graphical: scatterplot; side-by-side plots (histograms, bar charts, box plots)
- Bar chart vs. pie chart: the bar chart is better
- Bar chart vs. line graph
Bivariate analysis between a predictor and the target variable:
- Scatterplot
  o Pros: can see the full range of observations
  o Cons: with too many data points it is difficult to see how many points fall in a given range, and difficult to see the overall trend
- Side-by-side box plots
  o Pros: can see the distribution within each range of x, and can easily see outliers in each range
  o Cons: do not show how many observations are in each range, i.e., how dense the data is in any particular range
- If a predictor and the target are highly correlated, the predictor will be very predictive if included in the model
- If predictors are correlated with each other, they exhibit collinearity, which is a violation and causes problems if they are used together in a model

MLR Assumptions; Residuals vs. Predicted Values Plot; qq Plot
- OLS seeks the coefficient estimates that minimize SSE
- The residuals vs. predicted values plot is used to determine whether the model suffers from bias or heteroscedasticity
- MLR assumptions:
  o Mean of errors = 0
  o Variance of errors is constant (homoscedasticity)
  o Errors are independent
  o Errors are normally distributed
  o A predictor cannot be a linear combination of other predictors
- A healthy residual plot should show:
  o Residuals balanced around 0; if not, there is a bias issue
  o A constant spread of residuals
  o Points that appear random with no obvious trend
- If the spread of the residuals increases as the predicted values increase, more predictors may be needed to explain the variability in the errors, and a normal distribution with identity link is not appropriate
  o Recommend a gamma distribution with a log link: gamma because the standard deviation of the residuals is higher for higher predicted values, and the log link because we want positive predictions
- Violations:
  o Non-zero average of residuals
  o Heteroscedasticity
  o Dependent or non-normal errors
  o Outliers
  o Collinearity / perfect collinearity – predictor variables are highly correlated
  o High dimensions – too many predictors
- qq plot: check whether the points deviate from the superimposed line; points far from the line indicate outliers => the data are not normally distributed
- Shortcoming of MLR: it allows negative predictions, which may not be appropriate for some target variables
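A minimal sketch of the two diagnostic plots above for a hypothetical OLS fit:

```r
# Hypothetical data and OLS fit
set.seed(1)
df   <- data.frame(x = runif(100, 0, 10))
df$y <- 2 + 3 * df$x + rnorm(100)
fit  <- lm(y ~ x, data = df)

# Residuals vs. predicted values: look for residuals balanced around zero,
# constant spread, and no obvious pattern
plot(fitted(fit), resid(fit), xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# qq plot: points far from the reference line suggest non-normal errors or outliers
qqnorm(resid(fit))
qqline(resid(fit))
```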
Predicted vs. Actual Plot
- Points directly on the reference line mean prediction = actual: the predictions were accurate
- Points above the line mean prediction < actual: underestimated
- Points below the line mean prediction > actual: overestimated

Model Choice
Signs of a weak model:
- Predictor variables are highly correlated with each other
- Predictor variables have a non-linear relationship with the target variable
- Missing data for a large share of observations
  o Replacing a predictor with its constant mean can lead to perfect collinearity; ex: for an interaction B*x1*x2, replacing x2 with a constant mean c gives B*x1*c, which is perfectly collinear with x1

Interpreting a Scatterplot Between 2 Numeric Variables
- Check whether there is a linear, monotonic relationship between the two variables, or instead a quadratic relationship
- For two numeric variables we want a linear, monotonic relationship; for a GLM we still want a monotonic relationship
  o If not, the model will fit the values close to 0

Model Improvements
- Log-transform the target variable if its histogram is right skewed
- Change a numeric variable to a factor variable
  o Allows the model to fit non-linear, non-monotonic relationships
  o May add complexity if the factor has too many levels
- Add an interaction term between two predictor variables
- Add a variable to explain a pattern in the data
- Remove a level from a factor variable if it is deemed irrelevant or creates noise
  o Ex: removing data for 2020 because of the impact of COVID-19, a one-off occurrence

Transformed Variables
- Untransformed vs. transformed (shown in green in the original plot)
  o Untransformed – there is no linear, monotonic relationship in the scatterplot (purple dots); the fitted line shows only a slight uptrend
  o Transformed – a stepwise, V-shaped pattern in the predicted boardings, with a monotonic trend on either side; the transformation fits the nonlinearity of the data better, resulting in more accurate predictions

ROC vs. AUC in Model Performance Evaluation
- Both assess the performance of binary classifiers
- The ROC curve plots all possible combinations of TPR (sensitivity) and FPR (1 – specificity) as the cutoff changes
- AUC is the area under the ROC curve, representing the overall accuracy/performance of the classifier
  o If the AUCs of two models are similar, both models have a similar ability to distinguish between the two classes
  o AUC can be used to compare weighted vs. unweighted models

Training vs. Validation Data (Test Data)
- A metric calculated on training data is based on data the model has already seen; it is better to calculate metrics on validation (test) data, which the model has not seen
  o This better detects the degree of overfitting that occurred while fitting the training data
  o If there is not enough data for a validation split, cross-validation can be used instead
- Flexibility parameter vs. RMSE graph
  o Training error < test error at every level of flexibility
  o If a model performs excellently on the training set but significantly worse on the test set => the model is overfitted
  o In a boosted tree, subsequent trees try to correct the remaining errors of prior trees, so as the learning rate increases, the model fits the training data better (more flexible) and the training RMSE decreases; the test RMSE is calculated on the test set, so as the learning rate increases, overfitting causes the test RMSE to increase and we end up with worse results
- Using the test set to tune hyperparameters
  o We should NOT use the test set to tune hyperparameters, because the test set is used to assess the model after it has been built; it should be kept separate and should not influence model creation
  o Alternatively, use k-fold cross-validation, which tunes the hyperparameter using only the training data

Interaction Variable
- A scatterplot shows an interaction between Control and Cost to Attend: different slopes for the 2 levels
- Without the interaction, the slope would be the same whether the school is public or private => the model assumes the same effect of x on y for all levels and uses the same coefficient for both control levels
- Adding the interaction term allows each level to have its own slope coefficient
- The interaction term is the additional impact on top of the individual impact of each variable
  o Ex: both discount and passengers have a positive effect on the target in isolation; a negative interaction term reduces those individual positive effects when both discount and passengers increase
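A minimal sketch of fitting a linear model with and without an interaction term; the names `repayment`, `cost`, and `control` echo the example above, but the data are simulated and hypothetical.

```r
# Hypothetical data: cost to attend and a public/private factor
set.seed(2)
df <- data.frame(
  cost    = runif(200, 10000, 60000),
  control = factor(sample(c("Private", "Public"), 200, replace = TRUE))
)
df$repayment <- 0.40 + 0.000005 * df$cost -
  0.000002 * df$cost * (df$control == "Public") + rnorm(200, sd = 0.02)

# Without the interaction: one cost slope shared by both levels
fit_main <- lm(repayment ~ cost + control, data = df)

# With the interaction: each level of control gets its own cost slope
fit_int <- lm(repayment ~ cost * control, data = df)
summary(fit_int)$coefficients
```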
Model Coefficients Table
Linear model:
- If the school is public, the loan repayment rate is lower by 3.465% compared to a private non-profit school (the base level); if the school is private for-profit, the repayment rate is 10.587% lower than the base level
- Ex: public institution with 100% of undergrads receiving Pell grants: 1.009 – 0.549 × 1 – 0.03465 ≈ 42.5%
GLM:
- We can tell it is a Gamma GLM with log link because of family = Gamma(link = "log")
- Ex: private for-profit institution with 50% of undergrads receiving Pell grants: exp(0.0601 – 0.7704 × 0.5 – 0.1537) = 61.95%
- Any coefficient estimate with a p-value above 0.05 is NOT statistically significant
- Magnitude of coefficient estimates: if a coefficient is very small, consider changing the variable from numeric to factor
- NA values for some coefficients indicate perfect collinearity; this happens when:
  o There is a coefficient for every level of a factor variable (there should be one fewer than the number of levels)
  o The mean is used to replace missing values for a level that is 100% missing

Stepwise Selection – Forward & Backward Selection
Stepwise selection vs. shrinkage methods (regularization):
- Similarities
  o Both can be used for variable selection to reduce model complexity (variance)
  o Both help avoid overfitting, especially when the number of observations is small compared to the number of predictors
- Differences
  o Stepwise selection takes iterative steps:
    - Forward selection: start with no predictors, then add predictors based on a performance metric (AIC)
    - Backward selection: start with all predictors, then remove predictors based on a performance metric
  o Shrinkage fits all predictors simultaneously and optimizes a loss function; the loss function includes a penalty term with a parameter lambda that penalizes large coefficients
  o Shrinkage can reduce a coefficient without completely removing it (ridge)
Stepwise selection performance metrics – AIC and BIC:
- AIC is a measure of the goodness of fit of the model; it reflects the likelihood of the model given the data
- We want LOW values – they provide the best balance between model fit and flexibility
- AIC = SSE* + 2p; BIC = SSE* + (ln n)p
- As p increases, the first (error) term decreases and the second term increases
- BIC rises more quickly than AIC, so BIC favors models with fewer predictors
Forward selection:
- Starts with no predictors, then adds them based on the performance metric
- Only considers an interaction term if the individual terms are already in the model
- Works in high-dimensional settings (it can be used even when the number of predictors exceeds the number of observations)
- Might miss complementing predictors because it focuses on finding the single best variable each round
- Forward with BIC – fewer predictors; preferred when the focus is on finding the key predictors of the target variable
Backward selection:
- Starts with all predictors, then removes them based on the performance metric
- Only removes individual terms once the interaction term involving them has been removed first
- Maximizes the potential of finding complementing predictors
- Backward with AIC – more predictors; more flexible, higher accuracy, lower bias
Example – table from the first iteration of backward selection:
- Shows what the AIC would be if each predictor alone were dropped from the model; <none> is the row where no predictor is dropped
- AIC is lowest in three places (<none>, table, x); it is impossible to tell which of the three is best from this output alone
- All other predictors are useful because AIC would increase if any of them were excluded
If the focus is on optimizing predictive power over interpretability => choose backward selection with AIC:
- We want to retain as many useful features as possible to provide higher predictive power, even if the model is more complicated
- As the number of predictors increases, BIC rises more quickly than AIC, so it is quicker to eliminate predictors; BIC tends to favor smaller models with fewer predictors
- Backward selection tends toward larger models because it maximizes the potential of finding complementing factors; forward selection might miss such variables because it focuses on adding the next best variable each round
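A minimal sketch of forward and backward stepwise selection with `stats::step()` on hypothetical data; `k = 2` gives the AIC penalty and `k = log(n)` gives the BIC penalty.

```r
# Hypothetical data with several candidate predictors
set.seed(3)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 1 + 2 * df$x1 + 0.5 * df$x2 + rnorm(100)

null_fit <- lm(y ~ 1, data = df)    # no predictors
full_fit <- lm(y ~ ., data = df)    # all predictors

# Backward selection with AIC (k = 2 is the default AIC penalty)
back_aic <- step(full_fit, direction = "backward", k = 2)

# Forward selection with BIC (k = log(n))
fwd_bic <- step(null_fit,
                scope = list(lower = null_fit, upper = full_fit),
                direction = "forward", k = log(nrow(df)))
```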
Regularization
- A technique to address overfitting in linear regression models by shrinking coefficients, i.e., restricting the possible values of the coefficient estimates and reducing the impact of less important predictors
- The penalty term is the second term of the loss function, e.g. λ Σ_{j=1..p} b_j² for ridge
- Lambda (λ) is the shrinkage hyperparameter
  o As λ increases, variance and flexibility decrease
  o When λ = 0 => a fully flexible model with ALL predictors
- Ridge vs. lasso regression output:
  o Lasso has coefficients closer to 0 than ridge
  o Large coefficients => highly significant; near-zero coefficients => not as significant
- Ridge regression (alpha = 0):
  o Seeks coefficient estimates that minimize SSE + λ Σ_{j=1..p} b_j²
  o Only shrinks coefficients closer to 0, never exactly to 0
- Lasso regression (alpha = 1):
  o Seeks coefficient estimates that minimize SSE + λ Σ_{j=1..p} |b_j|
  o Can shrink coefficients all the way to 0 and remove predictors – "feature selection"
  o Could ignore the hierarchical principle – keep an interaction term while the individual variable's coefficient is reduced to 0
- Elastic net regression (alpha between 0 and 1):
  o Seeks coefficient estimates that minimize SSE plus a weighted average of the ridge and lasso penalties
  o Can also perform feature selection
- Standardization before regularization
  o Without standardizing, the coefficients of predictors with smaller ranges get shrunk more aggressively than the coefficients of predictors with larger ranges
- Finding the optimal λ – use k-fold cross-validation:
  o Choose a set of lambdas to test; for each lambda, calculate a CV error
  o Divide the data into k folds
  o For a specific lambda: fit a model on k − 1 folds, leaving one fold out; calculate an RMSE using the fitted equation and the left-out fold's observations – this is a test RMSE; repeat for each of the remaining folds to obtain k test RMSEs for that lambda; average the k test RMSEs to obtain the CV error for that lambda
  o The optimal lambda is the one with the lowest CV error
- One-standard-error rule: choose the largest lambda whose CV error is within 1 standard error of the lowest CV error
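A minimal sketch of cross-validated ridge/lasso/elastic net; this assumes the `glmnet` package and hypothetical data (`x` must be a numeric matrix of predictors).

```r
library(glmnet)

# Hypothetical predictor matrix and numeric target
set.seed(4)
x <- matrix(rnorm(100 * 5), ncol = 5)
y <- 1 + 2 * x[, 1] - x[, 2] + rnorm(100)

# alpha = 0 is ridge, alpha = 1 is lasso, values in between are elastic net;
# cv.glmnet() runs k-fold cross-validation over a grid of lambdas
cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

cv_lasso$lambda.min   # lambda with the lowest CV error
cv_lasso$lambda.1se   # largest lambda within 1 SE (one-standard-error rule)

# Coefficients at the chosen lambda (glmnet standardizes predictors by default)
coef(cv_lasso, s = "lambda.1se")
```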
GLM
Key assumptions of a GLM:
- All observations in the data are independent
- The distribution of the target variable is a member of the linear exponential family (can be non-normal)
- The model prediction is a transformation of a linear combination of the predictor variables
Choice of distribution:
- Normal: allows both positive and negative values; bell-shaped distribution
- Poisson: target takes non-negative values; count (integer) variables; right-skewed data
- Gamma: target takes positive values; continuous variables; variance increases with the predicted values
- Binomial: response variable is binary
Gamma distribution with log link:
- Gamma – good for continuous positive values, and for when the variance increases with the predicted values
- Log link – because we only want positive predictions
Link function – connects the mean of the target to the predictors:
- Determines how the mean (expected value of the target) changes in response to changes in a predictor
- The choice of link function depends on the error distribution
- Identity link: for the normal distribution; μ is real-valued; does not guarantee non-negative predictions. Interpretation: for every unit increase in x_j, the predicted target value changes by b_j
- Log link: when the target variable allows only positive values; μ is positive. Interpretation: for every unit increase in x_j, the predicted target value changes by a factor of e^(b_j)
- Logit link: when the target variable takes only probabilities as values (μ between 0 and 1). Interpretation: for every unit increase in x_j, the predicted odds change by a factor of e^(b_j)
- Gamma with log link, ex: for every unit increase in carat, the predicted price changes by a factor of exp(−1.2935) = 0.274; i.e., the predicted price decreases by 1 − 0.274 = 72.6%
Normal linear regression vs. GLM with log link:
- Normal linear regression estimates coefficients using OLS (minimizing the residual errors); a GLM estimates coefficients using MLE (maximizing the likelihood)
- Normal linear regression allows only one distribution and a linear, monotonic relationship; a GLM has the flexibility to choose a distribution that best fits the shape of the response variable
- In normal linear regression the variance of the response is constant; in a GLM the variance can be a function of the mean (the mean can change with each observation)
GLM vs. Tree-based Models
Why choose a GLM over tree-based models?
- If there is a linear relationship between the predictors and the target variable
- Trees tend to perform better with categorical predictor variables; if the predictors are continuous variables, choose a GLM
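A minimal sketch of these distribution and link choices with R's `glm()`; the data are simulated and the variable names (carat/price, claims/exposure, default/balance) are hypothetical stand-ins for the examples in these notes.

```r
set.seed(5)

# Gamma with log link: positive, continuous target, variance grows with the mean
gamma_df <- data.frame(carat = runif(200, 0.2, 2.5))
gamma_df$price <- rgamma(200, shape = 5, rate = 5 / exp(6 + 1.5 * gamma_df$carat))
fit_gamma <- glm(price ~ carat, data = gamma_df, family = Gamma(link = "log"))

# Poisson with log link: non-negative counts, exposure handled as an offset
pois_df <- data.frame(age = rnorm(200, 40, 10), exposure = runif(200, 0.5, 1))
pois_df$claims <- rpois(200, lambda = pois_df$exposure * exp(-2 + 0.02 * pois_df$age))
fit_pois <- glm(claims ~ age + offset(log(exposure)),
                data = pois_df, family = poisson(link = "log"))

# Binomial with logit link: binary target modeled through a probability
bin_df <- data.frame(balance = rnorm(200, 1000, 300))
bin_df$default <- rbinom(200, 1, plogis(-4 + 0.004 * bin_df$balance))
fit_logit <- glm(default ~ balance, data = bin_df, family = binomial(link = "logit"))

# With a log link, exponentiated coefficients are multiplicative effects on the mean
exp(coef(fit_gamma))
```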
Weights
- A weighting variable measures the relative importance of each observation
  o The model adjusts the contribution of each observation to the coefficient estimates based on its weight
  o The model does not estimate a coefficient for the weighting variable
  o Weights scale the variance of the target across observations
Weights vs. offsets:
- Weights assign relative importance to each observation in the model; they adjust the variance of the response variable and are used to handle heteroscedasticity (unequal variances)
- Offsets enter the model with a fixed coefficient of 1, so no coefficient is estimated
  o Used when a variable has a known relationship with the target variable
  o Acts as a scaling factor for the target variable
  o A way to incorporate exposure into a Poisson GLM
How weights affect AIC:
- AIC reflects the likelihood of the model, and weights can significantly affect the log-likelihood calculation
- Therefore weights contribute to the AIC of a weighted model, so AICs should not be compared between unweighted and weighted models

Decision Trees
Interpretation:
- Ex: the predicted value is 9.9, and this node accounts for 42% of the records in the training dataset
- There is an interaction between month and year because there is a subsequent split on year after the split on month
Advantages:
- No assumptions about the relationship between predictors and target
- Easily interpretable
- Handle qualitative predictors and missing data well
- Automatic feature selection – significant predictors are selected automatically
- Robust to outliers
- Automatically capture interactions
Disadvantages:
- Prone to overfitting
- Lower predictive accuracy
- Unstable – sensitive to small changes in the data
- Use a greedy algorithm that may not find the global optimum

Regression Trees
- Constructed using recursive binary splitting
- Algorithm:
  o At the root node, consider all partitions across all features in the dataset
  o For each potential partition, calculate the total SSE across the 2 groups that would result from the split
  o Choose the split that minimizes the total SSE (impurity)
  o Repeat the process at each subsequent node until a stopping criterion is met
- Stopping criteria – hyperparameters that limit tree growth (address overfitting):
  o Minimum number of observations required to attempt a split
  o Minimum number of observations required in each terminal node after a split
  o Maximum depth of any node in the tree
  o Minimum reduction in SSE required for a split to occur
- How impurity is calculated (regression trees only): calculate the SSE over all observations in the node
- Greedy algorithm: makes the split with the most information gain / lowest SSE at each immediate step, without considering future splits; this does not necessarily produce the best-fitting overall model

Classification Trees
- Predictions are based on the majority class in each node
- At each split, the aim is to maximize information gain – the change in the Gini index or entropy
- How information gain is calculated (classification trees only):
  o Calculate the splitting measure (Gini index or entropy) for each child node AND the parent node
  o Calculate the weighted average of the children's splitting measures
  o Information gain = parent node's splitting measure – weighted average

cp Table
- The cp value determines the threshold of improvement needed to produce an additional split; cp = 0 gives the largest (unpruned) tree, the last row of the table
- The cp table shows the impact that changing cp has on test metrics such as the CV error (xerror)
- The CV error (xerror) measures how the model performs on unseen data, which penalizes both overfit and underfit models
- Choose the cp value with the lowest CV error (in the example, line 6 of the table)
- The pruning objective is to minimize SSE_T + cp × |T| × SSE_0, where |T| is the number of splits and SSE_0 is the SSE of the tree with no splits
- As cp decreases, flexibility increases
  o cp = 0 => the largest tree
  o cp = 1 => a tree with no splits, just the root node
- One-standard-error rule: choose the simplest tree (the leftmost point on the cp plot) whose CV error is within one standard error of the lowest CV error
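A minimal sketch of growing a large tree, inspecting the cp table, and pruning to the lowest CV error; this assumes the `rpart` package and hypothetical data.

```r
library(rpart)

# Hypothetical regression data with one strong and one weak predictor
set.seed(6)
df   <- data.frame(x1 = runif(500), x2 = runif(500))
df$y <- ifelse(df$x1 > 0.5, 10, 3) + rnorm(500)

# Grow a large tree with a small cp, then look at the cp table
fit <- rpart(y ~ x1 + x2, data = df, method = "anova",
             control = rpart.control(cp = 0.001, xval = 10))
printcp(fit)   # columns: CP, nsplit, rel error, xerror (CV error), xstd

# Prune back to the cp with the lowest cross-validation error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```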
Single Decision Tree vs. Ensemble Models
Similarities:
- Both can be used to build regression and classification models, determining splits by minimizing impurity or maximizing information gain
- Neither assumes a specific distribution for the target or any particular relationship between the predictors and the target
Differences:
- A single decision tree splits the data into a single set of distinct regions; ensemble methods fit multiple trees and aggregate the trees' predictions to determine the model's predictions
- A single tree tends to have high variance and is likely to overfit the training data, while ensemble methods reduce overfitting

Ensemble Methods
- Tree models are resilient to outliers: because they split the data into groups, an outlier affects only its own group and not the others
- RMSE vs. MAE: MAE is more robust to outliers
  o Outliers tend to produce larger error terms, and RMSE exacerbates this because the errors are squared, while MAE just takes absolute values
  o RMSE = sqrt( (1/n) Σ_{i=1..n} (y_i − ŷ_i)² ); MAE = (1/n) Σ_{i=1..n} |y_i − ŷ_i|

Random Forest vs. Boosting
- Random forest: avoids overfitting; trees are built independently (simultaneously); individual trees are allowed to grow freely and become large; individual trees have high variance, and aggregating predictions across many trees reduces that variance; the number of trees is not a flexibility measure
- Boosting: prone to overfitting – regularization is more essential for GBM; trees are built sequentially using information from previous trees; the size of each individual tree is limited (small trees); individual trees have high bias and low variance, and aggregating predictions from multiple trees reduces the bias; the number of trees is a flexibility measure (more trees => more flexible)

Random Forest
- Random forests apply bagging (bootstrap aggregation) and take random feature subsets to construct multiple trees, which are averaged to produce a prediction
- Bagging = bootstrap aggregation:
  o Obtain multiple bootstrap samples from the original dataset; each bootstrap sample is formed by sampling from the training dataset with replacement
  o Fit a tree to each bootstrap sample
  o Each individual tree is trained on a different training dataset, so it has high variance and low bias
  o Bagging aggregates predictions from multiple individual trees, which reduces variance; aggregation has no clear impact on bias
  o The bagged model error is the average of the errors from the individual trees
- Tuning random forest hyperparameters:
  o mtry = number of predictors considered at each split
    - Considering random subsets of features at each split reduces the chance of correlated predictions, further reducing variance
    - Bagging corresponds to considering all predictors at every split, so a dominant predictor is likely to be the first split in every tree
    - Choose the mtry value where the CV error is lowest or the AUC is highest
  o ntree = number of trees
    - Increasing ntree does not cause overfitting; the drawback is computational time
    - Choose the ntree value where the CV error first reaches its lowest level

Boosted Tree
- Trees are built sequentially; each tree tries to correct the remaining errors after fitting the prior one, which naturally leads toward overfitting
- Each individual tree is small and has high bias, explaining only a little of the target; bias is reduced by summing the trees' predictions
- Algorithm:
  o First tree: fit to the training dataset
  o Second tree: modify the training dataset by emphasizing the data points explained poorly by the 1st tree
  o Third tree: modify the training dataset by emphasizing the data points explained poorly by the 2nd tree, and so on
- Tuning hyperparameters for a boosted tree:
  o Learning rate – controls how much information is gained by each subsequent tree, which affects how quickly the error is reduced by each subsequent tree
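A minimal sketch of fitting both ensemble types; this assumes the `randomForest` and `gbm` packages and uses simulated, hypothetical data.

```r
library(randomForest)
library(gbm)

set.seed(7)
df <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
df$y <- 2 * df$x1 - df$x2 + rnorm(300)

# Random forest: ntree independent trees on bootstrap samples,
# with mtry predictors considered at each split
rf_fit <- randomForest(y ~ ., data = df, ntree = 500, mtry = 2,
                       importance = TRUE)

# Boosted trees: small trees built sequentially; shrinkage is the learning rate
gbm_fit <- gbm(y ~ ., data = df, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 2,
               shrinkage = 0.01, cv.folds = 5)

# Number of trees with the lowest cross-validation error
best_iter <- gbm.perf(gbm_fit, method = "cv")
```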
Variable Importance Plot
- Shows how much impact each variable has on the model's predictions
- Variables with high incremental node impurity (reduction in impurity) have higher importance
- Ex: most significant predictors = Hour & Direction

Partial Dependence Plot
How the values in a partial dependence plot are calculated for a specific predictor in a random forest:
- Replace all values of that predictor in the dataset with the lowest value of that variable
- Calculate predictions for all observations using the trained model
- Average the predictions
- Repeat this process for each value of the predictor variable
- Plot the average predictions against the predictor values
Limitations:
- The plot does not depict the exact relationship between the predictors and the target; it focuses on one predictor at a time, ignoring possible interactions within subsets of the data
- Some of the constructed combinations with the other predictors' observed values may be unrealistic; ex: pairing a petal.length of 2 with a petal.width of 7 is unrealistic

Actual vs. Predicted Values Plot
- For a single tree, the predictions take one distinct value per terminal node; the example plot (not reproduced here) showed 4 terminal nodes

Principal Component Analysis
- PCA is an unsupervised learning technique that creates uncorrelated variables that maximize variance
- The first few PCs will explain most of the variation in the original variables
- The PCs can replace the original variables
When PCA is needed:
- High dimensionality – when there are many variables, PCA summarizes the high-dimensional data into fewer composite variables while retaining as much information as possible
- With low dimensionality, any information loss is unlikely to be outweighed by improvements in model performance or by capturing latent variables
- PCA only works with numeric variables; factor variables need to be binarized before applying PCA
Interpretation:
- Standard deviation of the PCs: the SD for PC1 is much larger than for PC2 and PC3 => strong correlation among the 3 features, and not much additional variation is captured after PC1
- Proportion of variance: 96% of the total variance is captured by PC1, and only 2.56% and 1.3% by PC2 and PC3; the 3 features are correlated, so PC1 can be used as a replacement for the 3 original variables
- Loadings of the PCs: represent the influence of each original variable on the PC
  o The larger the loading, the more influence the original variable has on the PC
  o Similar loadings for all original variables => all variables contribute equally to the PC; if one increases, the others increase as well
  o Different loadings, ex: in PC2 the loading for the 3rd variable is large in absolute value with a different sign => the variation not explained by PC1 is mostly explained by this variable in PC2
Ineffective PCA:
- The SDs of all PCs are similar
- No correlation between the 3 variables – they are almost orthogonal
- The proportions of variance are similar – still high for PC3
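A minimal sketch of PCA with `prcomp()` on hypothetical numeric data (two strongly correlated features and one independent one):

```r
# Hypothetical numeric data; factor variables would need to be binarized first
set.seed(8)
x <- data.frame(a = rnorm(100))
x$b <- x$a + rnorm(100, sd = 0.1)   # strongly correlated with a
x$c <- rnorm(100)                   # roughly uncorrelated with a and b

# Center and scale so every variable contributes on the same footing
pca <- prcomp(x, center = TRUE, scale. = TRUE)

summary(pca)    # standard deviation and proportion of variance for each PC
pca$rotation    # loadings: influence of each original variable on each PC
head(pca$x)     # scores: the new PC features, usable as model inputs
```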
Clustering
Clustering based on numeric data vs. categorical data:
- Advantage of numeric data: we can be sure which observations belong to which cluster based on exact measurements; more accurate and detailed than categorical data, which gives more of a general estimate
- Disadvantage: clustering on numeric data can be too granular and can introduce noise and patterns that do not generalize easily; the clustering results can be hard to explain

k-means Clustering vs. Hierarchical Clustering
Similarities:
- Both are unsupervised learning techniques: they group observations to reveal structures and relationships without reference to a target variable
- Both can be used to create new features from multiple predictor variables
Differences:
- k-means requires the number of clusters to be chosen before running the algorithm; hierarchical clustering determines the number of clusters after running the algorithm, by choosing a height at which to cut the dendrogram
- k-means only considers dissimilarity between observations (Euclidean distance), not between clusters; hierarchical clustering considers dissimilarity between clusters using a linkage function

k-means Clustering
Algorithm:
- Choose k, the number of clusters
- Randomly assign each observation to one of the clusters 1 to k => the initial assignment
- Calculate each cluster's centroid
- Reassign each observation to the closest centroid
- Repeat until the cluster assignments stop changing
The algorithm needs to be run multiple times with different initial cluster assignments to find the global minimum
Goal: minimize the within-cluster variation = the sum of squared distances from the centroid
Calculating the within-cluster sum of squares (example with 5 clusters of cities):
- Calculate the centroid of each cluster for each variable
- For the 1st cluster: latitude centroid = mean(latitude values in cluster 1); longitude centroid = mean(longitude values in cluster 1)
- Calculate the squared distance between each city and the centroid of its cluster: sqdistance = (city's latitude − latitude centroid)² + (city's longitude − longitude centroid)²
- Sum the squared distances over all cities: within-cluster sum of squares = sum(sqdistance)
Standardization:
- All variables should be standardized before running k-means clustering, so that all variables get equal weight in the clustering
- Without standardizing, variables on larger scales will dominate the clustering
Clustering on principal components vs. untransformed variables:
- Only useful if the variables are highly correlated; if not, it is not useful
Elbow plot of k-means clustering:
- There is a tradeoff between the percentage of variance explained and model complexity: a smaller k explains less of the data but is simpler than a larger k
- Recommend choosing k at the elbow of the plot, where the explained variance begins to level off; in the example, this is k = 2
Adding features to a k-means analysis:
- With 2 or 3 features, the scatterplot is easy to plot and interpret; with more features, it becomes more complicated to visualize and interpret
- Outliers need to be considered: outliers that are too far away might become their own clusters => suboptimal solutions
- If the added numeric features have different ranges, they need to be standardized before running the clustering algorithm
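A minimal sketch of k-means with an elbow plot, using standardized hypothetical data; `nstart` reruns the algorithm from different random initial assignments and keeps the best solution.

```r
# Hypothetical numeric features, standardized before clustering
set.seed(9)
x <- scale(data.frame(latitude = rnorm(200), longitude = rnorm(200)))

# Elbow plot: total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

# Fit the chosen k
km <- kmeans(x, centers = 2, nstart = 25)
km$centers        # cluster centroids
table(km$cluster) # cluster sizes
```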
Hierarchical Clustering
- Typically avoided for large datasets
Algorithm:
- Start with each observation as its own cluster
- Calculate the inter-cluster dissimilarity between all clusters
- Fuse the 2 clusters with the lowest dissimilarity
- Repeat until all observations are in one cluster
Dissimilarity measures:
- Correlation-based distance – focuses on patterns across variables
- Euclidean distance – focuses on the numerical closeness of values
Linkage methods – 4 ways to calculate dissimilarity between clusters: complete, single, average, and centroid linkage
Height of the dendrogram = measure of dissimilarity
- Final cluster assignment: choose a height at which to cut the dendrogram; all observations then fall into their clusters
Inversion – when two clusters join at a height lower than that of either individual cluster
- Can happen with centroid linkage
Deterministic
- The algorithm only needs to run once – it produces the same result every time it is run
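A minimal sketch of hierarchical clustering with Euclidean distance and complete linkage, on standardized hypothetical data:

```r
# Hypothetical numeric data, standardized before computing distances
set.seed(10)
x <- scale(matrix(rnorm(60 * 2), ncol = 2))

d  <- dist(x, method = "euclidean")    # pairwise dissimilarities
hc <- hclust(d, method = "complete")   # complete linkage (also: single, average, centroid)

plot(hc)                       # dendrogram; height = dissimilarity at each fusion
clusters <- cutree(hc, k = 3)  # cut the dendrogram to get final cluster assignments
table(clusters)
```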