Utrecht University Data Science Lecture Notes PDF
Document Details
Utrecht University
2024
Georg Krempl
Summary
These are lecture notes from Utrecht University on Predictive Analytics, part of a Data Science course. The notes briefly recap data integration (data warehousing) and descriptive analytics, and then cover predictive analytics methods: clustering, classification and regression, and association rule mining.
Full Transcript
Utrecht, 2024. Data Science (INFOMDSS), Lecture D: Predictive Analytics. Georg Krempl, Utrecht University.

Summary of the Previous Lecture(s)
▶ Data Integration: Data Warehousing
  See [Sharda et al., 2018, chapter 3]
▶ Descriptive Analytics
  See [Sharda et al., 2018, chapter 2]
▶ Predictive Analytics
  See [Sharda et al., 2018, chapters 4–5]
▶ Prescriptive Analytics
  See [Sharda et al., 2018, chapter 6]
Figure: Analytic Ascendancy Model (based on Gartner's model [Laney, 2012]). Descriptive analytics answers "What happened?" (hindsight, information), diagnostic analytics "Why did it happen?" (insight), predictive analytics "What will happen?" (foresight), and prescriptive analytics "How and why make it happen?" (optimisation); value and difficulty increase along this ascent.

Predictive Analytics: Data Mining
Core literature:
▶ Sharda, Delen, Turban & King (2018). Business Intelligence, Analytics & Data Science: A Managerial Perspective, 4th Global Edition, Pearson. ISBN-13: 9781292220567. [Sharda et al., 2018]
▶ Wu et al. (2008). Top 10 Algorithms in Data Mining. Knowledge and Information Systems, vol. 14. DOI: 10.1007/s10115-007-0114-2. [Wu et al., 2008]
Advanced literature (voluntary further reading):
▶ Manning, Raghavan, Schütze (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN-13: 978-0124114616. http://nlp.stanford.edu/IR-book/ [Manning et al., 2008]
▶ Hand, Mannila, Smyth (2001). Principles of Data Mining. The MIT Press. ISBN 978-0262082907. [Hand et al., 2001]

Outline and Summary
▶ Overview on Data Mining Tasks
  See [Sharda et al., 2018, chapter 4.2]
▶ Segmentation: Cluster Analysis, including Cluster Evaluation
  See [Sharda et al., 2018, ch. 4.5], [Wu et al., 2008], [Hand et al., 2001, ch. 9]
▶ Prediction: Classification and Regression
  See [Sharda et al., 2018, ch. 4.5], [Wu et al., 2008], and [Hand et al., 2001, ch. 10–11]
▶ Association: Frequent Itemset and Association Rule Mining
  See [Sharda et al., 2018, ch. 4.5], [Wu et al., 2008], [Hand et al., 2001, ch. 13]
▶ (Mining Semi- and "Unstructured" Data) (optional)
  See [Sharda et al., 2018, chapter 5]
Note: This lecture integrates knowledge from several further sources (cited where used, except if my own).

Section Objectives: Learning Objectives (1)
▶ Overview and categorisation of data science / data mining / machine learning tasks: un-/supervised; segmentation, prediction, association
▶ Tasks, exemplary applications and techniques
▶ Segmentation / unsupervised learning: clustering
  ▶ Clustering task
  ▶ Shape vs. density
  ▶ Algorithms: hierarchical (divisive, agglomerative), k-means, EM
  ▶ Distance / similarity measures
  ▶ Linkage
  ▶ Evaluation
Predictive Analytics: Overview on Data Mining Tasks
Segmentation
▶ Unsupervised (machine learning)
▶ Clustering, outlier analysis
Prediction
▶ Supervised (machine learning)
▶ Classification, regression, time-series analysis
Association
▶ Unsupervised (machine learning)
▶ Frequent itemset & association rule mining, link analysis, sequence analysis
Figure: Data Mining Tasks (Source: [Sharda et al., 2018, page 227]). Image not available due to copyright restrictions; please refer to the cited source.

Discussion of DM Approaches: Foreword
▶ In the following subsections, we will discuss each of the three DM problem types in detail
▶ Later, we will extend our view to data of different structure (i.e., approaches for so-called semi-/"un-"structured data)
▶ For each DM problem, we will follow this structure:
  1. Example of a problem instance
  2. Definition of the problem
  3. Discussion of ideas to solve the problem
  4. Overview on selected approaches & main ideas
  5. Evaluation methodology for models of this DM problem
▶ This part of the lecture will be more interactive:
  ▶ Less "author"-driven, more "reader"-driven
  ▶ Fewer slides, but more time to take your own notes!
▶ We will (mostly) use the scikit-learn library for Python: https://scikit-learn.org/stable

Segmentation: Exemplary Problem and Task Definition

Clustering: Exemplary Problem
Figure: Clustering task, shown as a scatter plot of instances over Feature 1 and Feature 2.

Clustering: Further Exemplary Problems
Figure: Scatter plots (a)–(f) of different data sets over features X1 and X2.

Segmentation: Overview on Selected Approaches
Distinctive characteristics:
Flat vs. hierarchical: is the output
▶ a hierarchy: there exists a hierarchy between the instances based on their proximity, or
▶ a flat partitioning: clusters are simple partitions (no hierarchy between instances)?
Hard vs. soft: are assignments between instances and clusters
▶ hard: every instance belongs to one and only one cluster, or
▶ soft: an instance belongs to all clusters with varying membership probability (also denoted a probabilistic clustering)?

Segmentation: Overview on Flat Partitioning Approaches
Shape-based flat clustering
▶ Starts with an assumption about the shape of the cluster(s)
▶ Examples:
  ▶ Spheres: k-means, in Python: KMeans (see https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)
  ▶ Gaussians or other distributions (Expectation-Maximisation for mixture models), in Python: GaussianMixture (see https://scikit-learn.org/stable/modules/mixture.html)
  ▶ ...
▶ Parameters:
  ▶ Shape (or proximity) and parameters of the components
  ▶ Number of components
Density-based flat clustering
▶ Assumes that clusters are separated by low-density regions
▶ Example algorithms:
  ▶ DBSCAN, in Python: sklearn.cluster.DBSCAN
  ▶ ...
▶ Parameters:
  ▶ Density estimation technique
  ▶ Density threshold
A comparison of these flat approaches in scikit-learn is sketched below.
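A minimal sketch (own illustration, not from the lecture) of the flat clustering approaches named above on synthetic data; the parameter values are purely illustrative.

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans, DBSCAN
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

    # Shape-based: spherical clusters (k-means) and Gaussian components (EM).
    km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    gmm_soft = gmm.predict_proba(X)        # soft assignments (probabilistic clustering)
    gmm_labels = gmm_soft.argmax(axis=1)   # hard assignment: most probable component

    # Density-based: clusters are high-density regions separated by low density.
    db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # label -1 marks noise

    print(km_labels[:10], gmm_labels[:10], db_labels[:10])

Note the hard vs. soft distinction from above: KMeans and DBSCAN return one cluster per instance, while the Gaussian mixture also provides membership probabilities.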
Segmentation: Overview on Hierarchical Approaches
Agglomerative hierarchical clustering
▶ Starts with each instance being its own cluster, then iteratively merges clusters
▶ Parameters:
  ▶ Proximity (similarity or distance) measure
  ▶ Linkage, i.e. the criterion for linking instances & clusters:
    Single linkage: given clusters A, B with instances xa ∈ A, xb ∈ B, consider the distance between the closest pair (xa*, xb*) of instances
    Complete linkage: given clusters A, B with instances xa ∈ A, xb ∈ B, consider the distance of the most distant pair (xa*, xb*) of instances
    Average linkage: given clusters A, B with instances xa ∈ A, xb ∈ B, consider the average distance over all instance pairs (xa, xb)
▶ R: agnes() in the package cluster
Divisive hierarchical clustering
▶ Starts with a single cluster containing all points; in each iteration, one cluster is split
▶ Python: scipy.cluster.hierarchy

Segmentation: Types of Linkage
Figure: Overview on types of linkage (table from [James et al., 2013]).

Segmentation: Measuring Distance & Similarity
Measuring proximity (for metric scales):
▶ Similarity measures: Q-correlation coefficient
▶ Distance measures: Minkowski distance (L1 norm / city-block metric, L2 norm / Euclidean distance), Mahalanobis distance

Minkowski Distance
Equation from [Backhaus et al., 2006, page 503]:

$d_{ij} = \left[ \sum_{m=1}^{M} |x_{im} - x_{jm}|^{r} \right]^{1/r}$    (1)

▶ $d_{ij}$: distance between the instances i and j
▶ $x_{im}$: value of attribute m for instance i
▶ M: number of attributes
▶ r: Minkowski constant
Special cases:
▶ r = 1: L1 norm, city-block metric. Simple addition of the absolute differences (on each dimension) between the instances
▶ r = 2: L2 norm, Euclidean distance. The Euclidean distance weights larger differences over-proportionally

Minkowski Distance: Example
Figure: Illustration of the L1 (city-block) and L2 (Euclidean) paths between two points.

Minkowski Distances: Scaling Issues
▶ Note that the Minkowski distances (Euclidean, city-block/Manhattan, ...) are scaling-dependent!
▶ Example: three data points A = (0, 0), B = (0, 1), C = (10, 0) over features X1, X2
▶ Question: Is B or C closer to A?
▶ It depends on the scaling of X1 relative to X2!
  ▶ If we rescale only X1 to the range [0; 1], then A and C are (seemingly) close
  ▶ If we rescale only X2 to the range [0; 1], then A and B are (seemingly) close
▶ Either adjust the scale, or use the scale-invariant Mahalanobis distance
Figure: Scale dependency illustrated (scatter plot of A, B, C).

Standardised Euclidean and Mahalanobis Distances
Standardised Euclidean distance:

$d_{ij} = \left( \sum_{m=1}^{M} \frac{(x_{im} - x_{jm})^2}{s_m^2} \right)^{1/2}$    (2)

▶ $d_{ij}$: distance between the instances i and j
▶ $x_{im}$: value of attribute m for instance i
▶ M: number of attributes
▶ $s_m$: standard deviation of the m-th attribute
For zero covariance between attributes, this equals the Mahalanobis distance.

Mahalanobis distance:

$d_{ij} = \left( (\vec{x}_i - \vec{x}_j)^{T} \, \Sigma^{-1} \, (\vec{x}_i - \vec{x}_j) \right)^{1/2}$    (3)

▶ $d_{ij}$: distance between the instances i and j
▶ $\vec{x}_i$: feature vector of instance i
▶ M: number of attributes
▶ $\Sigma$: covariance matrix of the data
For zero covariance between attributes, this equals the standardised Euclidean distance. These measures are illustrated in the short sketch below.
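A small sketch (own illustration) of the distance measures above, applied to the three points A, B, C from the scaling example; the SciPy functions used are standard, the data is the lecture's toy example.

    import numpy as np
    from scipy.spatial import distance

    A, B, C = np.array([0.0, 0.0]), np.array([0.0, 1.0]), np.array([10.0, 0.0])
    X = np.vstack([A, B, C])

    print(distance.cityblock(A, B), distance.cityblock(A, C))   # L1 (r = 1): 1.0, 10.0
    print(distance.euclidean(A, B), distance.euclidean(A, C))   # L2 (r = 2): 1.0, 10.0
    print(distance.minkowski(A, C, p=3))                        # general Minkowski, r = 3

    # Standardised Euclidean: divide each squared difference by the attribute variance.
    variances = X.var(axis=0, ddof=1)
    print(distance.seuclidean(A, B, variances), distance.seuclidean(A, C, variances))

    # Mahalanobis: uses the inverse covariance matrix, hence scale-invariant.
    VI = np.linalg.inv(np.cov(X, rowvar=False))
    print(distance.mahalanobis(A, B, VI), distance.mahalanobis(A, C, VI))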
Q-Correlation Coefficient
Equation from [Backhaus et al., 2006, page 505]:

$a_{ij} = \frac{\sum_{m=1}^{M} (x_{im} - \bar{x}_i)(x_{jm} - \bar{x}_j)}{\left[ \sum_{m=1}^{M} (x_{im} - \bar{x}_i)^2 \cdot \sum_{m=1}^{M} (x_{jm} - \bar{x}_j)^2 \right]^{1/2}}$    (4)

▶ $a_{ij}$: similarity between instances i and j
▶ $x_{im}$: value of attribute m of instance i
▶ M: number of attributes
▶ $\bar{x}_i$: arithmetic average of all of instance i's attribute values
▶ Numerator: compare with the covariance (as in the numerator of Pearson's correlation coefficient)
▶ Denominator: compare with the denominator of Pearson's correlation coefficient
▶ Similarity measure: it compares the similarity of the profiles, not their location!
▶ Particularly useful if one is interested in comparing the relative attribute values between instances rather than their absolute values.

Q-Correlation Coefficient: Example
Figure: Profiles of four instances A ("Alpha"), B ("Small Alpha"), C ("Inverted Alpha") and D ("Averaged") over the attributes math, english, sports, biology, music.

                              A→A    A→B    A→C    A→D
  Euclidean distance          0.00   2.24   7.28   3.80
  Q-correlation coefficient   1      1      -1     1

Jaccard Similarity
How similar are these texts?
▶ "this is not bad"
▶ "this is really bad"
Jaccard similarity:
▶ Bag-of-words representation
▶ $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$    (5)
▶ Common words: 3 (this, is, bad)
▶ Union: 5 (this, is, bad, not, really)
▶ Hence J(A, B) = 3/5

Segmentation: Selected Clustering Approaches

Clustering: k-Means Algorithm
Figure: k-means clustering of instances over Feature 1 and Feature 2.
k-Means algorithm (a NumPy sketch of this loop follows after the hierarchical-clustering illustrations below):
1. Parameters: k (the number of clusters), distance measure, stopping criterion
2. Initialise the cluster centres randomly
3. Compute the distances between instances & centres
4. Assign each instance to its nearest centre
5. Compute each new cluster centre as the mean of its instances (alternative: the median)
6. Repeat from step 3 until the stopping criterion is met

Clustering: Expectation-Maximisation Algorithm
EM algorithm:
▶ Assume the data to be generated by a mixture model, e.g., a Gaussian Mixture Model (GMM)
▶ Given: instances with features D = {x(1), x(2), ..., x(n)}
▶ Probabilistic clustering: instances are members of all clusters, but with varying probability
▶ Views the cluster assignment as a hidden variable Z, with H = {z(1), z(2), ..., z(n)} being the instances' hidden values
▶ Aim: determine the likelihood-maximising model. Likelihood of the observed data:

$l(\theta) = \log p(D \mid \theta) = \log \sum_{H} p(D, H \mid \theta)$

for the probabilistic model $p(D, H \mid \theta)$ with unknown parameters $\theta$; $Q(H)$ denotes a probability distribution over the missing data H
▶ Iterate between E(xpectation) and M(aximisation) steps:
  E-step: $Q^{k+1} = \arg\max_{Q} F(Q^{k}, \theta^{k})$
  M-step: $\theta^{k+1} = \arg\max_{\theta} F(Q^{k+1}, \theta^{k})$
▶ Also usable for missing-value imputation

Clustering: Divisive Hierarchical Clustering
Overview of hierarchical clustering approaches: divisive, and agglomerative with single linkage (nearest neighbour), complete linkage (furthest neighbour), average linkage, or Ward linkage (variance minimising).
Figure: Divisive clustering of ten points (0–9), shown over four slides: initialisation (one cluster), 1st iteration (2 clusters), 2nd iteration (3 clusters), 3rd iteration (4 clusters).
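A minimal NumPy sketch (own illustration, not the lecture's code) of the k-means loop listed above: assign each instance to the nearest centre, recompute the centres, repeat until the centres stop moving. Empty clusters are not handled here.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random init
        for _ in range(n_iter):
            # steps 3 and 4: Euclidean distances to each centre, assign to the nearest
            dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # step 5: new centre = mean of the instances assigned to it
            new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centres, centres):               # step 6: stopping criterion
                break
            centres = new_centres
        return labels, centres

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    labels, centres = kmeans(X, k=2)
    print(centres)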
Clustering: Agglomerative Hierarchical Clustering
Figure: Agglomerative clustering of the same ten points (0–9), shown over five slides: initialisation (N clusters), then N−1, N−2, N−3 and N−4 clusters after the 1st to 4th merge iterations.

Segmentation: Evaluation Methodology
(See [Manning et al., 2008, chapter 16.3, pp. 156 ff.], freely available at https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf)
Clustering objective: maximise
▶ intra-cluster homogeneity, i.e. the similarity of instances in the same cluster
▶ inter-cluster heterogeneity, i.e. the dissimilarity of instances from different clusters
Internal measures
▶ Based on the clustered data only, not on external class information
▶ Various measures exist, e.g., silhouette_score in Python's scikit-learn library
▶ Example: Silhouette index
External measures
▶ Based on an external class label
▶ Again, various measures exist, e.g., adjusted_rand_score in Python's scikit-learn library
▶ Examples: Rand index, purity, NMI
(A short scikit-learn sketch of these measures follows after the NMI definition below.)

Segmentation: Clustering Evaluation Measures
Silhouette index. For each instance $x_i$:
▶ for each cluster k, calculate the arithmetic average distance of $x_i$ to all instances in that cluster:

$d(x_i, k) = \frac{1}{n_k - 1} \sum_{x \in z_k, x \neq x_i} dist(x_i, x)$

where dist() is a distance function and $n_k$ is the number of instances in cluster k
▶ determine its average distance within its own cluster l: $a(x_i) = d(x_i, l)$
▶ from the average distances to the other clusters $k \neq l$, determine the minimum: $b(x_i) = \min_{k \neq l} d(x_i, k)$
▶ calculate the silhouette width of instance $x_i$ as

$s(x_i) = \frac{b(x_i) - a(x_i)}{\max(a(x_i), b(x_i))}$

Having computed the silhouette width of each instance, we now compute the aggregates:
▶ for each cluster k, calculate the mean silhouette width

$s_k = \frac{1}{n_k} \sum_{x_i \in z_k} s(x_i)$

▶ calculate the Silhouette Index as the mean of the clusters' mean silhouettes:

$SI = \frac{1}{|Z|} \sum_{k=1}^{|Z|} s_k$

Purity (see [Manning et al., 2008, Eq. 182]):

$purity(Y, Z) = \frac{1}{n} \sum_{i=1}^{|Z|} \max_j |z_i \cap y_j|$

▶ $Z = \{z_1, z_2, \ldots, z_{|Z|}\}$: set of clusters
▶ $Y = \{y_1, y_2, \ldots, y_{|Y|}\}$: set of classes
▶ n: number of instances (in all clusters/classes)
▶ Assigns each cluster to the most frequent class in it
▶ Purity is a value in [0; 1], with 1 being a perfect clustering
▶ Depends on the number of clusters (it equals 1 with one cluster per instance)

Normalised Mutual Information (NMI) (see [Manning et al., 2008, Eq. 183–187]):

$NMI(Y, Z) = \frac{I(Y, Z)}{\frac{1}{2}\left( H(Z) + H(Y) \right)}$

▶ $Z = \{z_1, \ldots, z_{|Z|}\}$: set of clusters; $Y = \{y_1, \ldots, y_{|Y|}\}$: set of classes
▶ Mutual information: $I(Y, Z) = \sum_{i=1}^{|Z|} \sum_{j=1}^{|Y|} \frac{|z_i \cap y_j|}{n} \log \frac{n \cdot |z_i \cap y_j|}{|z_i| \cdot |y_j|}$
▶ Entropy: $H(Z) = - \sum_{i=1}^{|Z|} \frac{|z_i|}{n} \log \frac{|z_i|}{n}$
▶ n: number of instances (in all clusters/classes)
▶ NMI is a value in [0; 1], with 1 being a perfect clustering
▶ Normalised for the number of clusters (i.e., independent of |Z|)
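A short sketch (own illustration) of the internal and external measures above, applied to a k-means clustering of labelled toy data; purity is computed by hand from the contingency matrix, the other measures come from scikit-learn.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, normalized_mutual_info_score, adjusted_rand_score
    from sklearn.metrics.cluster import contingency_matrix

    X, y = make_blobs(n_samples=300, centers=3, random_state=1)
    z = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

    # Internal: silhouette index (uses only the data and the cluster assignment).
    print("Silhouette:", silhouette_score(X, z))

    # External: purity, NMI and adjusted Rand index (use the known class labels y).
    C = contingency_matrix(y, z)             # rows: classes, columns: clusters
    purity = C.max(axis=0).sum() / C.sum()   # majority class per cluster
    print("Purity:", purity,
          "NMI:", normalized_mutual_info_score(y, z),
          "ARI:", adjusted_rand_score(y, z))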
Segmentation: Clustering Evaluation Measures
Rand Index (or accuracy) (see [Manning et al., 2008, Eq. 188–189]):

$RI(Y, Z) = \frac{TP + TN}{TP + FP + FN + TN}$

▶ Start by matching each cluster with the majority class in it, e.g., cluster 1 with class y1, cluster 2 with y2, cluster 3 with y3
▶ True positives TP: number of pairs in the same cluster and the same class
▶ True negatives TN: number of pairs in different clusters and different classes
▶ False positives FP: number of pairs in the same cluster but different classes
▶ False negatives FN: number of pairs in different clusters but the same class
▶ Measures the percentage of correct (pairwise) decisions
▶ RI is a value in [0; 1], with 1 being a perfect clustering

Segmentation: Clustering Evaluation Example
(See [Manning et al., 2008, chapter 16.3, pp. 156 ff.], freely available at https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf)
Figure: Clustering example from [Manning et al., 2008, Fig. 16.1], summarised by the following contingency table of clusters (rows) vs. classes (columns):

            y1: x   y2: o   y3    Σ row
  z1          5       1      0      6
  z2          1       4      1      6
  z3          2       0      3      5
  Σ column    8       5      4     17

▶ $TP = \sum_{i,j} \binom{|z_i \cap y_j|}{2} = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20$
▶ $TP + FP = \sum_i \binom{|z_i|}{2} = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40$
▶ $FP = (TP + FP) - TP = 40 - 20 = 20$
▶ $TP + FN = \sum_j \binom{|y_j|}{2} = \binom{8}{2} + \binom{5}{2} + \binom{4}{2} = 44$
▶ $FN = (TP + FN) - TP = 44 - 20 = 24$
▶ $TN = \binom{17}{2} - TP - FP - FN = 136 - 64 = 72$
▶ Purity = (5 + 4 + 3) / 17 ≈ 0.71
▶ NMI = 0.36
▶ $RI = \frac{20 + 72}{136} ≈ 0.676$

Segmentation: Choosing the Number of Clusters
Elbow method:
▶ Iterate over different values of k; for each k, perform several clusterings with different random initialisations and compute a performance measure for each of them
▶ E.g., compute the residual sum of squares

$RSS_k = \sum_{z_i \in Z} \sum_{x \in z_i} (x - \bar{z}_i)^2$

(where $\bar{z}_i$ is the i-th cluster's centre)
▶ Plot the averaged performance values against the number of clusters and inspect the curve for "knees" (points where it flattens)
▶ In the figure, the curve flattens at k = 4 and k = 9, i.e. these are candidates for k*
Figure: Estimated residual sum of squares as a function of k in k-means [Manning et al., 2008, Fig. 16.8].
▶ Note: compare this to selecting the number of components in a Principal Component Analysis (PCA)
A small scikit-learn sketch of this procedure follows below.
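A sketch of the elbow method described above (own illustration): run k-means for several values of k and plot the residual sum of squares, which scikit-learn exposes as inertia_.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=4, random_state=2)

    ks = range(1, 11)
    rss = [KMeans(n_clusters=k, n_init=10, random_state=2).fit(X).inertia_ for k in ks]

    plt.plot(list(ks), rss, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("RSS (within-cluster sum of squares)")
    plt.show()   # look for a 'knee' where the curve flattens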
Segmentation: Cluster Analysis. Questions?

Classification: Credit Scoring Example
Let the business objective be minimising the risk of loss due to unpaid orders by customers, i.e., by identifying risky orders and acting accordingly (accepting or declining).
▶ Each new order constitutes an instance
▶ For each instance, certain characteristics are known before the decision has to be taken to accept or decline the order. These characteristics correspond to the features (or explanatory variables) X
▶ Instances are classified by their outcome, e.g., default vs. non-default. This outcome corresponds to the class label (or dependent variable) Y
▶ Aim: a classifier (predictor) function f : X → Y that optimises a desired performance measure, e.g., the 0-1 loss

$L(y, f(x)) = \begin{cases} 0 & f(x) = y \\ 1 & \text{otherwise} \end{cases}$

over a set of instances.
Example of instances with features and class label:

  order value   product   ...   default?
      2.2          A      ...     no
      1.7          A      ...     yes
      1.3          B      ...     yes
      3.1          A      ...     no
      ...         ...     ...     ...

Classification in a Nutshell
Credit scoring: will a new order be paid?
▶ Historical data, e.g., previous customers' records
▶ Generate a training sample L with feature variables (e.g., order value X) and class label (e.g., default Y)
▶ Aim: a classifier function f : X → Y
▶ Discussion: how do we get there? Given training data such as

  X1    X2    default?
  2.2   52    no
  1.7   28    yes
  1.3   33    yes
  3.1   42    no
  ...   ...   ...

candidate approaches include:
▶ Nearest-neighbour classifier
▶ Bayes' classifier (histogram-based)
▶ Bayes' classifier (kernel-based)
▶ Naive Bayes (multivariate data)
▶ Multivariate Bayes
▶ Decision trees
▶ many more ...

Classification Approaches: k-Nearest Neighbour
Training:
▶ Memorise all training instances
Prediction:
▶ Upon arrival of a new instance (x, ?), calculate its distance to all training instances
▶ Sort the neighbours by their distance
▶ Assign the label shared by the majority of the k nearest neighbours (e.g., if the nearest neighbour is positive and k = 1, classify as +)
Characteristics:
▶ This is a lazy learner
▶ No model is built, no abstraction
▶ Fast training, slow prediction
(A scikit-learn sketch of this classifier follows at the end of this block.)

Classification Approaches: Univariate Bayes (Using Histogram/Discretisation)
▶ Calculate the posterior Pr(Y | X) using Bayes' theorem:

$Pr(Y \mid X) = \frac{Pr(X \mid Y) \cdot Pr(Y)}{Pr(X)}$

How do we derive the quantities therein?
▶ Class prior probability:

$Pr(Y = +) = \frac{|L^{+}|}{|L^{+} \cup L^{-}|}$

(In the running example with three positive and three negative instances: p(y = +) = 3/6 = 0.5 and p(y = −) = 1 − p(y = +) = 0.5.)

Probability Calculus Recap
▶ Explanatory variable X, e.g., a test score, with possible outcomes in the domain $\mathcal{X} = [0, 10]$
▶ Dependent variable (class label) Y, e.g., passing the final exam, with possible outcomes $\mathcal{Y} = \{+, -\}$
▶ Scaling: summing/integrating the probabilities of all possible (mutually exclusive) outcomes gives one:

$\sum_{y \in \mathcal{Y}} Pr(y) = Pr(+) + Pr(-) = 1$    (6)

$\int_{x \in \mathcal{X}} Pr(x) \, dx = 1$    (7)

All outcomes have non-negative probabilities:

$Pr(x) \geq 0 \quad \forall x \in \mathcal{X}$    (8)

▶ Additivity: the probability of observing any of two mutually exclusive events (e.g., event $y_1 = +$ and $y_2 = -$) is the sum of their individual probabilities:

$Pr(y_1 \cup y_2) = Pr(y_1) + Pr(y_2)$    (9)

▶ Product rule:

$Pr(X, Y) = Pr(X \mid Y) \cdot Pr(Y)$    (10)
$= Pr(Y \mid X) \cdot Pr(X)$    (11)

▶ Marginalisation:

$Pr(x) = \sum_{y \in \mathcal{Y}} Pr(x, y)$    (12)

$Pr(y) = \int_{x \in \mathcal{X}} Pr(x, y) \, dx$    (13)
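A minimal sketch of the k-nearest-neighbour classifier described earlier in this block (own illustration); the toy data mimics the credit-scoring table above. In practice the two features should first be rescaled, cf. the scaling discussion in the distance section.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X_train = np.array([[2.2, 52], [1.7, 28], [1.3, 33], [3.1, 42]])   # X1, X2
    y_train = np.array(["no", "yes", "yes", "no"])                     # default?

    # Lazy learner: "training" only stores the instances.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)

    # Prediction: distances to all training instances, majority vote of the k nearest.
    print(knn.predict([[2.0, 45]]))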
Classification Approaches: Univariate Bayes (Using Histogram/Discretisation), continued
Conditional feature probability. Calculate Pr(X = x | Y = +) by discretisation / histograms:
▶ Discretise the feature X into disjoint intervals
▶ In each interval, count the class' occurrences
▶ Use the frequencies as estimates for the probability (in the running example: 2 positives and 2 negatives in the 1st interval, 1 positive and 1 negative in the 2nd interval, ...)
▶ Repeat for the other class(es)

(Unconditional) feature probability. Calculate Pr(X = x) by marginalisation over y:

$Pr(X = x) = \sum_{y} Pr(X, Y = y)$    (14)

$= \sum_{y} Pr(X \mid Y = y) \cdot Pr(Y = y)$    (15)

(In the running example: p(x = 1st interval) = p(+, x = 1) + p(−, x = 1) = 3/6, p(x = 2nd interval) = 2/6, ...)

Posterior probability. Calculate the posterior Pr(Y | X) using Bayes' theorem:

$Pr(Y \mid X) = \frac{Pr(X \mid Y) \cdot Pr(Y)}{Pr(X)}$

(In the running example: p(+ | x = 1st interval) = 1/1 = 1, p(+ | x = 2nd interval) = 2/3, p(+ | x = 3rd interval) = 0/2 = 0.)
Prediction:
▶ Use the posterior for classification: assign the most probable label
▶ Challenge: the interval size
  ▶ Too big: loss of detail
  ▶ Too small: lack of data

Classification Approaches: Univariate Bayes (Using Kernel Density Estimation)
▶ Calculate the posterior Pr(Y | X) using Bayes' theorem:

$Pr(Y \mid X) = \frac{Pr(X \mid Y) \cdot Pr(Y)}{Pr(X)}$

▶ Derive the conditional feature probability Pr(X = x | Y = +) by kernel density estimation
▶ On the location of each instance $x_i$, place a (Gaussian) kernel $K(x, x_i)$:

$K(x, x_i) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( - \frac{(x - x_i)^2}{2 \sigma^2} \right)$

(cmp. [Hand et al., 2001, 10.2.2 Probabilistic Models for Classification])
▶ Calculate Pr(X = x | Y = +) by summing over the kernels:

$Pr(X = x \mid Y = +) = \frac{\sum_{x_i \in L^{+}} K(x, x_i)}{|L^{+}|}$

▶ Repeat the kernel density estimation for the other class(es)
▶ Unconditional density of the features: d(X) = d(X | Y = +) + d(X | Y = −)
▶ Derive the decision boundary at the intersections of the class-wise joint densities d(x, +) and d(x, −)
▶ Equivalently, derive the posterior using Bayes' theorem as above; the decision boundary then lies at the intersections of the posterior distributions
▶ Predict new data, i.e., apply f to the instances from a test sample U
(A small sketch of this kernel-density Bayes classifier follows below.)
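A sketch (own illustration, assuming Gaussian kernels and made-up feature values) of the kernel-density Bayes classifier above: one KDE per class, combined with the class priors via Bayes' theorem.

    import numpy as np
    from scipy.stats import gaussian_kde

    x_pos = np.array([1.0, 1.4, 1.7, 2.0])   # feature values of the '+' class (illustrative)
    x_neg = np.array([2.6, 3.0, 3.3, 3.8])   # feature values of the '-' class (illustrative)

    kde_pos, kde_neg = gaussian_kde(x_pos), gaussian_kde(x_neg)
    prior_pos = len(x_pos) / (len(x_pos) + len(x_neg))
    prior_neg = 1.0 - prior_pos

    def posterior_pos(x):
        # Pr(+|x) = p(x|+)Pr(+) / (p(x|+)Pr(+) + p(x|-)Pr(-))
        joint_pos = kde_pos(x) * prior_pos
        joint_neg = kde_neg(x) * prior_neg
        return joint_pos / (joint_pos + joint_neg)

    for x in (1.5, 2.3, 3.5):
        print(x, posterior_pos([x]))   # classify as '+' where the posterior exceeds 0.5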
Classification Approaches: Naive Bayes (on Multivariate Data)
▶ Posterior Pr(Y | X1, X2, ...) for multivariate data:

$Pr(Y \mid X_1, X_2) = \frac{Pr(X_2 \mid Y, X_1) \cdot Pr(X_1 \mid Y) \cdot Pr(Y)}{Pr(X_1, X_2)}$

▶ Calculating multivariate joint probabilities poses a challenge: the curse of dimensionality
▶ Conditional independence assumption: $X_i$ and $X_j$ are assumed to be conditionally independent given the class label:

$Pr(X_i \mid X_j, Y) = Pr(X_i \mid Y)$

▶ This simplifies the posterior into:

$Pr(Y \mid X_1, X_2) = \frac{Pr(X_2 \mid Y) \cdot Pr(X_1 \mid Y) \cdot Pr(Y)}{Pr(X_1, X_2)} \propto Pr(Y) \cdot \prod_{i=1}^{2} Pr(X_i \mid Y)$

Recap Chapter 2: Conditional Independence
(See [Hand et al., 2001, Section 4.3.1].)
Two variables $X_i$ and $X_j$ are conditionally independent given a third variable Y if for all values of $X_i$, $X_j$, Y it holds that

$Pr(X_i, X_j \mid Y) = Pr(X_i \mid Y) \cdot Pr(X_j \mid Y)$, equivalently $Pr(X_i \mid X_j, Y) = Pr(X_i \mid Y)$

Figure: In scatterplots (a, c), X1 and X2 are conditionally independent given Y; in (b), they are not.

Classification Approaches: Multivariate Bayes
▶ Posterior Pr(Y | X1, X2, ...) for multivariate data:

$Pr(Y \mid X_1, X_2) = \frac{Pr(X_2 \mid Y, X_1) \cdot Pr(X_1 \mid Y) \cdot Pr(Y)}{Pr(X_1, X_2)} = \frac{Pr(X_1, X_2 \mid Y) \cdot Pr(Y)}{Pr(X_1, X_2)}$

▶ An alternative to Naive Bayes and its assumption of conditional independence?
▶ Use a parametric model for density estimation, e.g., assume a cluster structure (e.g., a Gaussian Mixture)

Classification Approaches: Decision Trees
Training:
▶ Iteratively build a decision tree by splitting on different features
▶ Greedy choice of the split attribute, e.g., using Information Gain (in the example figure: first split on X1 > 2, then on X2 > 30)
Prediction:
▶ Upon arrival of a new instance (x, ?), assign it to a leaf and classify it based on the majority class there
(A short sketch of Naive Bayes and decision trees in scikit-learn follows below.)
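A short sketch (own illustration) of the Naive Bayes and decision-tree classifiers discussed above, using the four rows of the toy credit-scoring table (X1, X2, default?).

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = np.array([[2.2, 52], [1.7, 28], [1.3, 33], [3.1, 42]])
    y = np.array(["no", "yes", "yes", "no"])   # default?

    # Naive Bayes: one Gaussian per class and feature, conditional independence assumed.
    nb = GaussianNB().fit(X, y)
    print(nb.predict_proba([[2.0, 40]]))       # posterior Pr(Y | X1, X2)

    # Decision tree: greedy splits, here chosen by information gain (entropy criterion).
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["X1", "X2"]))
    print(tree.predict([[2.0, 40]]))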
Prediction: Evaluation Methodology
Classification objective:
▶ The most accurate prediction on new data
▶ Further aspects to consider in classifier evaluation:
  ▶ Interpretability of the model and its predictions
  ▶ Runtime (and memory consumption) for training, parameter tuning, and testing
  ▶ Parameter tuning and the sensitivity of the parameters
  ▶ Insights into the model's reliability/confidence and its use of instances/features
Design of experimental evaluations:
▶ Split into training and hold-out (validation and test) sets
▶ Example: (k-fold) cross-validation
Performance measures:
▶ Confusion matrix
▶ Accuracy, error rate
▶ Area under the Receiver Operating Characteristic curve (AUC of the ROC curve)

Classifier Evaluation: Experimental Design
Challenge:
▶ Which parameter (or classifier) is the best?
▶ How do we get a realistic performance estimation?
Overfitting:
▶ Occurs when the classifier models even noise/outliers (e.g., when the kernel bandwidth σ is chosen too small)
▶ Such a model does not generalise well to new data
Gold standard:
▶ Use separate sets for training, validation and testing

Classifier Evaluation: Experimental Design (Hold-Out Sets)
▶ Split the whole data set into training, validation and test sets
▶ Training set: used for training the classifier
▶ Validation set: used for evaluating the classifier with a given parametrisation, i.e., for comparing different parameter values (e.g., the kernel bandwidth σ)
▶ Test set: used for the final evaluation of the chosen classifier and parameter combination; it should provide a realistic estimate of the out-of-sample performance on new, unseen data

Classifier Evaluation: Experimental Design (k-Fold Cross-Validation)
Figure: k-fold cross-validation (Source: [Sharda et al., 2018]). Image not available due to copyright restrictions; please refer to the cited source.
▶ Split the whole data set into k subsets
▶ Iterate over the subsets; in each iteration, use the selected subset for testing and all remaining subsets for training

Classifier Performance Measures
Figure: Classification and error rates, illustrated by two class-conditional densities over feature X and two candidate decision boundaries, with regions of true positives, true negatives, false positives and false negatives.
Classification errors:
▶ True Positives (TP): correctly classified positives
▶ True Negatives (TN): correctly classified negatives
▶ False Positives (FP): actual negative instances classified as positive
▶ False Negatives (FN): actual positive instances classified as negative
▶ Bayes error rate: the unavoidable error of a Bayes-optimal classifier
▶ Question: Is it possible to achieve zero FPs (or FNs)?

Confusion matrix:
▶ A quadratic table where
  ▶ columns = true classes
  ▶ rows = predicted classes
▶ Diagonal elements: correct predictions
▶ Off-diagonal elements: misclassifications
▶ Might be weighted by multiplying with a cost matrix of the same size
Figure: Confusion matrix (predicted class vs. actual class: true positive, false positive, false negative, true negative).

Classifier Performance Measures
Accuracy, the share of correct classifications:

$Acc = \frac{TP + TN}{FP + FN + TP + TN}$

Error rate (= 1 − Acc):

$Err = \frac{FP + FN}{FP + FN + TP + TN}$

Sensitivity, or True Positive Rate (TPR):

$Sens = TPR = \frac{TP}{TP + FN}$

Specificity, or 1 − False Positive Rate (FPR):

$Spec = 1 - FPR = 1 - \frac{FP}{TN + FP} = \frac{TN}{TN + FP}$

Misclassification loss:
▶ Accuracy and error rate weight all errors equally
▶ For unequal misclassification costs:

$MLoss = FP \cdot Cost_{FP} + FN \cdot Cost_{FN}$

▶ Hereby, the FP cost and the FN cost are often normalised to sum to one, $Cost_{FP} + Cost_{FN} = 1$, such that $Cost_{FP}$ corresponds to the cost ratio.
(An evaluation sketch with a hold-out split, cross-validation and these measures follows below.)
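A sketch (own illustration, on synthetic data) of the evaluation workflow above: a hold-out split, k-fold cross-validation on the training part, and the confusion-matrix-based measures; the cost weights in the last line are illustrative.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import confusion_matrix

    X, y = make_classification(n_samples=400, n_features=5, random_state=3)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

    clf = GaussianNB()
    print("5-fold CV accuracy:", cross_val_score(clf, X_train, y_train, cv=5).mean())

    y_pred = clf.fit(X_train, y_train).predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)               # sensitivity / true positive rate
    spec = tn / (tn + fp)               # specificity
    mloss = fp * 0.8 + fn * 0.2         # misclassification loss, Cost_FP + Cost_FN = 1
    print(acc, sens, spec, mloss)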
Classifier Performance Measures: AUC and the ROC Curve
The Receiver Operating Characteristic (ROC) curve:
▶ Plots the false positive rate against the true positive rate
▶ The steeper the increase, the better
▶ Suited for problems with class imbalance
▶ Area Under the Curve (AUC): summarises the curve by integrating it
▶ Exemplary characteristics from the figure:
  ▶ Classifier D is a perfect classifier (AUC = 1); it dominates all other classifiers
  ▶ Classifier C is similar to random guessing (AUC = 1/2)
  ▶ Neither classifier A dominates B, nor does B dominate A; each is better for a different combination of FPR vs. TPR
Figure: Plot of exemplary Receiver Operating Characteristic (ROC) curves for classifiers A, B, C and D (true positive rate / sensitivity against false positive rate / 1 − specificity).
(A sketch computing a ROC curve and its AUC follows at the end of this block.)

Prediction: Classification (and Regression). Questions?

Classification: Interpretability and Explainability
LIME (Local Interpretable Model-agnostic Explanations)
Recommended papers:
▶ Ribeiro, Singh, and Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. Proceedings of the 22nd SIGKDD, 1135–1144, ACM, 2016.
▶ Došilović, Brčić, and Hlupić. Explainable artificial intelligence: A survey. Proceedings of MIPRO, 210–0215, IEEE, 2018.

Motivating Example
▶ Aim: classify wolf vs. dog
▶ Approach: a Deep Neural Network (DNN)
▶ Result: the classification seems good (small error rate)
▶ Question: Should we use this classifier? Should we trust it?
Figure: (a) a wrongly classified husky; (b) the explanation: snow in the background. (Source: [Ribeiro et al., 2016b, Fig. 11])

Motivating Example (2)
▶ Approach: use softmax scores as a confidence measure?
▶ No, they are not suited to identify misclassifications
Figure: Examples of misclassifications despite high confidence (≥ 99.6%). (Source: a DNN on ImageNet, [Nguyen et al., 2015, Fig. 1])

Motivating Examples: Explain Image Classifications
▶ Given: a trained black-box classifier
▶ Aim: explain a classification with examples
Figure: Image classification example: for a test image, the classifier outputs P(Frog | x) = 0.54, P(Billiard | x) = 0.07, P(Balloon | x) = 0.05; why frog, why billiard, why balloon? (Source: own work based on [Ribeiro et al., 2016a])

Motivating Examples: Explain Text Classifications and Predictions
▶ A text classifier for "Atheism" vs. "Christianity"
▶ Question: Which features (words) were relevant? (Source: [Ribeiro et al., 2016b])
Figure: From a model's prediction based on symptom features (sneeze, headache, no fatigue, active, ...), the explainer (LIME) produces an explanation, which a human then uses to make the decision. (Source: [Ribeiro et al., 2016b])

Post-Hoc Explainability
▶ Given: a trained black-box classification model
▶ Aim: based on examples,
  1. explain a single classification
  2. explain the model
(Image sources: [Ribeiro et al., 2016b])
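A sketch (own illustration, on imbalanced synthetic data) of computing the ROC curve and its AUC referred to in the performance-measure slides above; the logistic-regression scorer is just a placeholder classifier.

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score

    X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=4)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)

    scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_te, scores)
    print("AUC:", roc_auc_score(y_te, scores))

    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], linestyle="--")   # random-guessing baseline, AUC = 0.5
    plt.xlabel("False Positive Rate (1 - Specificity)")
    plt.ylabel("True Positive Rate (Sensitivity)")
    plt.show()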
Local Interpretable Model-Agnostic Explanations: LIME in a Nutshell
Figure: Illustration of LIME (source: own work based on [Ribeiro et al., 2016a]): (1) the original image, with P(Tree Frog | x) = 0.54, is segmented into superpixels, the interpretable components x'; (2) perturbations z' are generated and scored with the black-box model's posterior estimates P(Tree Frog | z'); (3) a locally weighted interpretable model (linear regression) is fitted; (4) the explanation consists of the superpixels with the highest positive weights.

LIME (1): Creating Interpretable Components
(Own work based on [Ribeiro et al., 2016a].)
Step 1: create interpretable components
▶ The features x themselves might be too complex, and hence difficult to explain
▶ Idea: create interpretable features x', so-called interpretable representations
▶ Implementation on structured data: use the features (columns)
▶ Implementation on images: segmentation; each segment is a superpixel
▶ Implementation on text data: single words (bag-of-words). Question: why not word embeddings? Because they are difficult to explain!

LIME (2): Creating Variations of the Data
(Own work based on [Ribeiro et al., 2016a].)
Step 2: create data for the local approximation
▶ Aim: local exploration of the black-box classification around the given instance x'
▶ Idea: create variations z' ∈ Z by varying the given instance, i.e., by (de-)activating single interpretable components (feature values)
▶ Implementation on
  ▶ structured data: missing values
  ▶ images: gray out segments (superpixels)
  ▶ texts: cut out words
▶ Classify each z' ∈ Z using the black-box model

LIME (3): Training an Explanation Model
(Own work based on [Ribeiro et al., 2016a].)
Step 3: learn a local, explaining model
▶ Local imitation of the black-box model's classification f(x)
▶ Aim: find the best model ξ in a family of explaining models G
▶ Approach:
  ▶ Weight the varied instances z' ∈ Z using a proximity measure πx
  ▶ Train a linear regression model g(z') ∈ G on these weighted instances
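A simplified sketch of the three LIME steps above (own illustration, not the lime library): binary interpretable components, perturbed samples weighted by a proximity kernel, and a weighted linear model whose coefficients serve as the local explanation. All function and parameter names here are hypothetical.

    import numpy as np
    from sklearn.linear_model import Ridge

    def explain_instance(predict_fn, n_components, n_samples=1000, kernel_width=0.75, seed=0):
        rng = np.random.default_rng(seed)
        # Step 2: perturbations z' that switch interpretable components on/off.
        Z = rng.integers(0, 2, size=(n_samples, n_components))
        Z[0] = 1                                         # the original instance: all components on
        preds = predict_fn(Z)                            # black-box posterior for each z'
        # Proximity weights pi_x: perturbations closer to the original weigh more.
        dist = 1.0 - Z.mean(axis=1)                      # fraction of switched-off components
        weights = np.exp(-(dist ** 2) / kernel_width ** 2)
        # Step 3: weighted linear surrogate g(z'); its coefficients are the explanation.
        g = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
        return g.coef_                                   # one weight per interpretable component

    # Hypothetical black box: component 2 strongly raises the predicted probability.
    def black_box(Z):
        return 0.2 + 0.6 * Z[:, 2] + 0.1 * Z[:, 0]

    print(explain_instance(black_box, n_components=5).round(2))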