ALL_SLIDES.pdf
LMU München
2018
Stefan Feuerriegel
Summary
This slide deck discusses AI in management, including applications such as health management and monitoring of the COVID-19 epidemic. It also covers challenges in integrating AI into business and provides an overview of a course on business analytics and machine learning.
Full Transcript
Part 1: Motivation
Prof. Dr. Stefan Feuerriegel, Institute of AI in Management, LMU Munich
https://www.ai.bwl.lmu.de
Sep 5, 2018

ABOUT OUR INSTITUTE
Solving management problems with artificial intelligence (AI)
What defines our research:
▪ 1 Information – We solve management problems of relevance by using data science
▪ 2 Innovation – We develop new algorithms from the area of AI (statistics, computer science, etc.)
▪ 3 Impact – We evaluate the added value of our tools rigorously in management practice

ABOUT ME
Examples of our research with impact – AI for health management
▪ 1 Effective police patrolling
▪ 2 Effective disease management
▪ 3 Early warnings for fake news in social media

OUR IMPACT
Our impact during the current COVID-19 epidemic
▪ Data: nationwide data on micro-level human mobility during the epidemic; ~1.5 bn movements of individuals
▪ Modeling: new artificial intelligence algorithm to link mobility and case growth
▪ Impact: a tool for public decision-makers – real-time monitoring of compliance with social distancing; knowledge transfer through membership in the COVID-19 Working Group of the World Health Organization; dissemination to the public through media appearances (WEF blog, etc.)
Persson, Parie, Feuerriegel @ PNAS 2021: Monitoring the COVID-19 epidemic with nationwide telecommunication data. https://doi.org/10.1073/pnas.2100664118

ABOUT OUR INSTITUTE
Combining AI technologies and real-world applications
▪ Management practice: real-world AI demonstrations; field experiments (with practitioners); financial implications; organizational and behavioral implications (fairness, accountability, etc.)
▪ AI for decision-making: focus on sequential settings; causal ML (e.g., sequential deconfounding, sequential ITE); off-policy learning (e.g., dynamic treatment regimens)
▪ AI for Good: healthcare applications (medical data); AI to support the Sustainable Development Goals (e.g., sustainability, inequality)
▪ AI & Web: digital traces; social media (e.g., fake news); clickstream data; mobility data
ABOUT OUR INSTITUTE
In joint industry collaborations, we strive for lasting impact in practice
▪ Partnership for a 3-year project; full funding for a PhD position by the company
Typical digital-transformation challenges that we address, from data to insight to decision:
▪ Data: insufficient data integration, missing policies for data use
▪ Analytics: no use of advanced analytics, often scattered and simple tools
▪ Software: no analytics packages in place, software not user-friendly enough
▪ People: missing trust in data, no capabilities in using or designing reports
▪ Process: analytics not embedded into regular decision-making processes
▪ Strategy: no overall strategy for using big data and advanced analytics for business decisions

OUR RESEARCH
We seek to publish in leading outlets from both artificial intelligence and domain applications
▪ Artificial intelligence: thought leadership in applied AI ensures research that is rigorous; provides state-of-the-art performance. Example outlets: SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), The Web Conference (WWW), EMNLP, ACL, CHI, …
▪ Application domains: contributes to research that is relevant; demonstrates impact in practice. Example outlets: PNAS, Management Science, Marketing Science
▪ Research collaborations

VISION
We see 3 challenges for bringing AI into business management
▪ 01 Missing accountability of AI decisions → clarify boundaries of AI interventions; implement governance structures; establish risk management frameworks
▪ 02 Incomplete frameworks for human-in-the-loop analytics → develop human-in-the-loop frameworks from a human-centered perspective; promote frameworks for human learning and exploration; advance prescriptive algorithms
▪ 03 Organizational inertia → appoint a transformation workforce; encourage managers to explore and experiment; incentivize adoption
Feuerriegel, Shrestha, von Krogh, Zhang: Bringing AI into business management

COURSE
Course Overview
▪ 4 SWS / 6 ECTS
▪ Date: 4-day block course (adapted for online use) with optional Q&A
▪ Grading: exam with programming (demanding!) focusing on implementing an analytics solution
▪ Constraints: BSc BWL
▪ Requirements: programming skills and basic maths (regression)
▪ Expectations: practical application of machine learning is an integral element; the course has little focus on mathematical proofs, rather intuition and overarching concepts; nevertheless, there is a strong focus on methods

COURSE
Books
▪ James, Witten, Hastie & Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, 2013. PDF: http://www-bcf.usc.edu/~gareth/ISL/
▪ Wickham & Grolemund. R for Data Science. O'Reilly, 2017. Online version: http://r4ds.had.co.nz/

Homework / Reading

Course Outline
▪ 1 Motivation: organizational details; applications of business analytics
▪ 2 Predictive modeling: definition of machine learning, y = f_θ(x); taxonomy of predictive modeling; performance assessments
▪ 3 Linear modeling: linear model (ordinary least squares), linear f: y = α + βᵀx; regularization (lasso, ridge regression, elastic net)
▪ 4 Nonlinear modeling: non-linear f; decision trees, random forest; boosting; neural networks
▪ 5 Model tuning (how is θ estimated?): train/test split; cross-validation
▪ 6 Bringing to practice: management challenges; pitfalls in practice

Mastering business analytics promises value creation
Technologies for how information is collected, analyzed, and visualized to achieve better decisions
▪ Understand business: obtain business understanding in order to define goals and derive KPIs
▪ Prepare data: identify and collect internal and external data sources that potentially contain interesting information
▪ Apply analytics: apply a model from predictive analytics and evaluate its performance for decision-making
Related terms: predictive analytics, machine learning, artificial intelligence, …

Illustrative
Naïve predictions can fuel decision-making – but only sometimes
Example: forecasted sales volume → input to production quota
▪ Calibration phase, then go-live
▪ Replicates observed patterns from past data, Y = f(x1, …, xn), with e.g. neural networks
▪ Leverages external predictors

Predictive analytics anticipates outcomes for a given input
▪ Descriptive (human input): What happened?
▪ Diagnostic (human input): Why did it happen?
▪ Predictive (human input): What will happen?
▪ Prescriptive (decision automation): What should happen?
Business value and complexity increase along this path from data to decision to action.
▪ Predictive analytics learns by "demonstration" and thus replicates past decisions (and errors)
▪ Prescriptive analytics promises to identify optimal decisions

Prediction of churn is based on several input variables feeding statistical propensity models
Different variables are identified to build stable predictive models, e.g. customer behavior 6 months prior to churning (bad debt in days, number of inbound calls to customer care) and customer characteristics 6 months prior to churning (demographics, contact type, calls, …).
▪ Data mart → analytical model → predicted churn vs. actual churn
▪ Continuous improvement of models through "learning" from actual customer behavior
▪ Example 1: logistic regression model → probability to churn
▪ Example 2: decision tree, e.g. "calls >= 5 and bad debt >= 30 and …" → probability to churn
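The propensity-modeling idea above can be sketched in R. This is a minimal illustration, not course material: the data frame customers and its columns (bad_debt_days, inbound_calls, tenure_months) are simulated stand-ins for the behavioral variables named on the slide.

```r
# Minimal sketch (illustrative, simulated data): churn propensity models in R.
set.seed(1)
customers <- data.frame(
  bad_debt_days = rpois(1000, 12),             # bad debt (in days)
  inbound_calls = rpois(1000, 2),              # calls to customer care
  tenure_months = sample(1:120, 1000, TRUE)    # customer since ...
)
# Simulate a churn label so the example is self-contained
p <- plogis(-3 + 0.08 * customers$bad_debt_days + 0.5 * customers$inbound_calls)
customers$churn <- rbinom(1000, 1, p)

# Example 1: logistic regression propensity model
fit_glm <- glm(churn ~ bad_debt_days + inbound_calls + tenure_months,
               data = customers, family = binomial)
customers$p_churn <- predict(fit_glm, type = "response")  # probability to churn

# Example 2: a single decision tree yields rules such as "calls >= 5 and bad debt >= 30"
library(rpart)
fit_tree <- rpart(factor(churn) ~ bad_debt_days + inbound_calls + tenure_months,
                  data = customers, method = "class")
head(predict(fit_tree, type = "prob"))
```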
COLLECT AND COMBINE DATA
Banking example – A set of typical variables is considered in this context
▪ Customer specifics – current and dynamic (changes in …): age, gender, marital status, household size, ZIP code, income and job type, segment (value/behavior), tenure (customer since …). Desired but usually hard to get: changes in life stages, personal and professional events (marriage, promotion)
▪ Product holding: product holdings (yes/no per product), average balances, number of accounts, total assets under management, total liabilities
▪ Product usage/transactions: number of transactions, transaction volumes, transaction channels, inflow vs. outflow. Desired but usually hard to get: transaction details, i.e., detailed information on channel and purpose (e.g., recipient of transaction is another bank, a luxury retailer, or a utility)
▪ Contact history and other: date of last contact, last offer (product or service), last campaign and channel, inbound contacts (call center, etc.), internet logins/usage (web analytics). Desired but usually hard to get: online data, e.g. clicks on the internet page, logins, origination pages, browser used, mobile used, …

EXAMPLE
Predicting customers that respond positively to marketing campaigns
Instead of simply treating all customers, treat only those that are "persuadable":
▪ Persuadables: purchase if treated, but not if untreated
▪ Sure buyers: purchase whether treated or not
▪ Lost causes: do not purchase either way
▪ Sleeping dogs: purchase only if not treated
A clear business KPI is needed in order to focus on "persuadables" and not target those who would have purchased anyway ("sure buyers"). The solution are AI algorithms that directly predict to which subgroup a customer belongs, instead of targeting all.

Targeting the right customers at the right time to prevent churn
For customers that show a high likelihood of churning, churn early warning systems are implemented based on analytical propensity models, which yield concrete customer interactions in the call center.
AI use for customer lifetime management:
▪ Alert by early warning system: the system raises an alert because the propensity to churn for that customer has gone above a predefined threshold for the segment the customer belongs to → immediate action to prevent churn
▪ Customer put on watchlist: the AI provides not only the churn probability but also the variables/reasons; the system can be enriched with traditional database information, e.g. CLV, segment, billing history, contracts, etc. → customer-specific action
▪ Call with individualized script: the customer is called with the best available script at the earliest possible time to avoid churn; the success rate of a call script can again be predicted
▪ Management defines (and experiments with) thresholds for different actions; the best action might differ per customer segment, e.g. for the best customers the thresholds should be lower and the actions probably more expensive (e.g. calling a customer instead of sending an email)

Group work: Pitch your idea of business analytics!

EXAMPLE
Managerial decision support for effective police management
Data → crime risk estimation → patrol unit routing → crime risk reduction
▪ The system periodically identifies areas and times of elevated risk
▪ An operator assesses and prioritizes alerts
▪ Officers on the ground follow instructions → prevented crime
Theory – Analytics – Optimization

EXAMPLE
Dynamic estimations of crime risk
Current practice: hotspot map of historic crime. Our approach: dynamic risk map.
Kadar, Maculan & Feuerriegel (2019): Public decision support for low population density areas. Decision Support Systems.

EXAMPLE
Data-driven risk model
P(crime_it = 1) = f(spatial_i, temporal_t, crime_{i,t−1}; θ)
Kadar, Maculan & Feuerriegel (2019): Public decision support for low population density areas. Decision Support Systems.
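As a rough illustration of the risk formula above, here is a minimal sketch with simulated data; the actual model in the cited paper is more elaborate. One spatial feature, a monthly seasonality term, and lagged crime feed a logistic regression per grid cell and week:

```r
# Minimal sketch (simulated, hypothetical features): P(crime_it = 1) as a logistic model.
set.seed(2)
risk_data <- expand.grid(cell = 1:50, week = 1:52)
risk_data$pop_density <- rep(runif(50, 0, 5000), times = 52)    # spatial_i (illustrative)
risk_data$month       <- factor((risk_data$week - 1) %/% 4 + 1) # temporal_t (illustrative)
risk_data$crime_lag   <- rbinom(nrow(risk_data), 1, 0.2)        # crime_{i,t-1}
risk_data$crime       <- rbinom(nrow(risk_data), 1,
                                plogis(-2 + 0.0003 * risk_data$pop_density +
                                         1 * risk_data$crime_lag))

risk_model <- glm(crime ~ pop_density + month + crime_lag,
                  data = risk_data, family = binomial)
risk_data$risk <- predict(risk_model, type = "response")  # estimated P(crime_it = 1) per cell and week
```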
EXAMPLE
Effectiveness

PRIMER

MODELING BASICS
There are 2 approaches to modeling – supervised and unsupervised learning – but only the former uses historical data to predict the future
▪ Supervised learning, Y = f(X): take historical data X, Y to learn what happened in the past; then take the most recent data to predict the future
▪ Unsupervised learning: identify patterns in historical data. Caveats: unclear management implications of clusters; no ground truth for what the "true" labels are; not designed to generalize to unseen instances

Agenda
▪ 1 Models
▪ 2 Performance
▪ 3 Training & evaluation
▪ 4 Implementation
▪ 5 Managing AI

The general idea is to map input and output into a mathematical space
Example: k-nearest neighbor approach with predictors "purchase volume" and "time since last purchase" – for a new customer, do the k = 3 or k = 5 nearest neighbors suggest churn or no churn?

Discriminatory approaches are needed that directly model the output variable
▪ The k-nearest neighbor approach has a number of limitations: uneven frequency of classes; robustness to noise
▪ Instead, discriminatory approaches are preferred

Model choice depends on the trade-off between prediction power and interpretability
▪ A plethora of classifiers exists (linear model, random forest, XGBoost, neural network, …), but alternatives have not proven reliably superior¹
▪ Unstructured data (e.g. images, text) require custom models
Recommendations for practice:
▪ Small datasets (< 1,000 observations) are usually well served with linear models
▪ Only large-scale datasets (~1 million observations) benefit from neural networks
▪ Ensembles combine multiple classifiers, but the benefit is ~1 %
1) www.kaggle.com

Linear models are straightforward to fit and interpret
Linear models combine several predictors via an additive scheme, y = α + βᵀx
▪ x: predictor, feature, regressor, independent variable
▪ y: outcome, prediction, label, dependent variable
▪ α, β: intercept and coefficients → need to be estimated
Example: PRICE_WINE = 3 + 10 × ALCOHOL + …

Decision trees describe a flowchart-like structure for deriving predictions
Example: predict whether ice cream will sell well
▪ Follow the tree downwards to arrive at a prediction
▪ At each node, choose the appropriate branching (e.g. cloudy vs. sunny, rain vs. no rain)

Combining decision trees yields a random forest
▪ Decision tree: flowchart-like structure to display decision criteria; end nodes of each branch indicate outcomes. Pro: easy to understand and apply; reduces the complexity of big data; common in explanatory tasks
▪ Random forest: combines an ensemble of trees by majority vote; averages out errors and handles non-linearity. Pro: good out-of-the-box performance for prediction without parameter tuning; highly suitable for prediction

Neural networks are inspired by the human nervous system
▪ Artificial neural networks are stacked generalized linear models (= layers)
▪ The universal approximation theorem ensures that any f can be modeled
▪ Neuron: z = A(w₁x₁ + … + w_N x_N)
▪ Example: layers mapping "purchase volume" and "time since last purchase" to churn vs. no churn
Visual demonstration: https://playground.tensorflow.org
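To make the interpretability trade-off from the preceding slides concrete, here is a minimal sketch fitting the same small regression task with a linear model, a random forest, and a one-hidden-layer neural network. It uses the built-in mtcars data rather than the churn example, and assumes the randomForest and nnet packages are installed.

```r
# Minimal sketch (illustrative): interpretable vs. flexible models on the same task.
library(randomForest)
library(nnet)
set.seed(3)

fit_lm <- lm(mpg ~ wt + hp, data = mtcars)            # interpretable coefficients
fit_rf <- randomForest(mpg ~ wt + hp, data = mtcars)  # ensemble of trees, predictions averaged
fit_nn <- nnet(mpg ~ wt + hp, data = mtcars, size = 3,
               linout = TRUE, trace = FALSE)           # single hidden layer, z = A(w1*x1 + ... )

coef(fit_lm)                    # direct effect sizes (easy to interpret)
importance(fit_rf)              # only variable importances, no coefficients
predict(fit_nn, mtcars[1:3, ])  # black-box predictions
```

In practice, inputs are usually scaled before fitting a neural network; this sketch skips that step for brevity.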
Deep neural networks stack multiple hidden layers
Challenge: finding an appropriate network architecture
▪ How many layers? Which type of layers? Which size of layers?
▪ Which optimizer? Which activation function?
▪ How to achieve speed? Number of epochs, batch size, …
▪ How to prevent overfitting? E.g. dropout, batch normalization, …
Example: VGGNet classifier for predicting image content

Why deep learning has become so widespread
Performance grows with the amount of data (example: sales forecasting with ~1 million samples)
Ng, A. (2016). Machine Learning Yearning: Technical Strategy for AI Engineers, In the Era of Deep Learning, Draft Version 0.5
Kraus, M., Feuerriegel, S., & Oztekin, A. (2019). Deep learning in business analytics and operations research: Models, applications and managerial implications. European Journal of Operational Research. https://doi.org/10.1016/j.ejor.2019.09.018

Managing unstructured data requires tailored modeling approaches
▪ Images: convolutional neural network (CNN)
▪ Text (natural language processing; word0, word1, …, wordτ of variable length): recurrent neural network (RNN), long short-term memory (LSTM), language models (e.g. BERT)
▪ Spatial problems: Gaussian processes (for sparse data points), deep spatio-temporal residual network (for dense data)
Kraus, Feuerriegel, Oztekin (2019): Deep learning in business analytics and operations research: Models, applications and managerial implications. EJOR. https://doi.org/10.1016/j.ejor.2019.09.018

Terminology
▪ Artificial intelligence: in the old days, generating rules like "if X, then predict Y"; nowadays a buzzword covering all aspects of ML, but also other optimization techniques concerned with data-driven decision-making (Markov decision processes)
▪ Machine learning (ML): supervised + unsupervised learning (next slides) → train, then deploy; reinforcement learning (sequential/continuous!)
▪ Predictive analytics ≈ supervised machine learning
▪ Reinforcement learning: one form of prescriptive analytics
▪ Statistical learning: specific statistics-driven types of ML (like lasso, ridge, …; mostly regression)

Agenda: 2 Performance

In classification, performance is measured via the confusion matrix
Note: in regression, we simply compute the deviation |true Y − predicted Y|

Being aware of class imbalances is imperative for a rigorous performance evaluation
▪ 1 Use imbalance-aware metrics: receiver operating characteristic & area under the curve, F1 score, balanced accuracy → but these are more difficult to interpret
▪ 2 Compare the classifier against a majority vote as a naïve baseline
▪ 3 Where possible, translate the confusion matrix into a financial KPI (= custom loss function)

Agenda: 3 Training & evaluation

Training and evaluating models is based on the following 3-step standard process
▪ 1 Split your existing data into a training and a test set
▪ 2 Use the training set to fit all model parameters
▪ 3 Evaluate the performance on unseen data via the test set → test error, out-of-sample performance; ensures that the model is tested according to its ability to generalize
Rule-of-thumb: 80 % for training, 20 % for testing (or 90/10)

The train-test procedure avoids the phenomenon of overfitting, so that the model generalizes well to unseen data
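A minimal sketch of this 3-step procedure in R (the customers data frame is simulated here so the snippet is self-contained; column names are illustrative, not from the course):

```r
# Minimal sketch: 80/20 train-test split and out-of-sample evaluation.
set.seed(4)
customers <- data.frame(bad_debt_days = rpois(1000, 12),
                        inbound_calls = rpois(1000, 2))
customers$churn <- rbinom(1000, 1, plogis(-3 + 0.08 * customers$bad_debt_days +
                                            0.5 * customers$inbound_calls))

n         <- nrow(customers)
train_idx <- sample(n, size = floor(0.8 * n))   # step 1: 80/20 split
train     <- customers[train_idx, ]
test      <- customers[-train_idx, ]

model <- glm(churn ~ bad_debt_days + inbound_calls,
             data = train, family = binomial)   # step 2: fit on the training set only

pred <- as.integer(predict(model, newdata = test, type = "response") > 0.5)
mean(pred == test$churn)                        # step 3: out-of-sample accuracy on the test set
```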
Agenda: 4 Implementation

Still, most AI implementations are custom-made, as few general-purpose tools exist
▪ There is no "one-fits-all" approach, as all implementations require problem-specific customizations
▪ Common are programming languages that specify the ML pipeline: R (for prototyping); Python (scikit-learn, TensorFlow, Keras, PyTorch); packages for other languages are being developed (e.g. ML.NET)
▪ Template-based wrappers for existing APIs: existing tools are limited in their capabilities, e.g. no time series, no merging with external data, low prediction performance

Software, platforms, and packages for business analytics
▪ General systems: AzureML (Microsoft), WEKA, Alteryx, SAS/SPSS
▪ Preprocessing: dplyr, R Markdown, pandas/NumPy
▪ Machine learning: scikit-learn, R/caret, Julia; deep learning: PyTorch, Keras, TensorFlow
▪ Probabilistic: Pyro, Edward, scikit-learn

Managing AI projects requires an iterative approach
Cross Industry Standard Process for Data Mining (CRISP-DM)
▪ Requires an interdisciplinary team with data translators
▪ Data preparation usually consumes 80 % of the time
▪ Modeling consists of another iterative process
▪ Re-deployment is beneficial if (A) substantially more data is available or (B) the environment is highly dynamic

The step of modeling requires multiple iterations
Modeling: example process. Recommendations:
▪ Align the process for rapid deployment
▪ Build in-house experience
▪ Keep in mind that parallelism in AI development is limited, thereby limiting scalability
▪ Leverage tools for reproducibility
▪ Follow a lean paradigm: priorities are on minimum viable products; pause the process early, as modeling is expensive
▪ High performance is usually only achieved by experts
EDA: exploratory data analysis
Kuhn, Johnson: Feature Engineering and Selection: A Practical Approach for Predictive Models. http://www.feat.engineering/

Data is the most important thing, yet methods can tweak the last inches
"Using the right data is more important than using the right modeling technique" – a practitioner's view
▪ Improving the model (same data, better model): ~1 %
▪ Improving the data (same model, better data): ~10 %
▪ Linear regressions or tree-based models are sufficient in 90 % of the cases
▪ Spend your time improving the data first; later try more sophisticated models

COLLECT AND COMBINE DATA
The final customer view includes multiple sets of internal and external data
Collect & combine data from multiple sources into an integrated 360° customer view:
▪ Customer journey (stages, needs, touchpoints)
▪ Digital marketing (display, paid search, affiliates, SEO)
▪ Purchase behaviors (e.g., value, products purchased, longitudinal migrations)
▪ Online browsing (e.g., visit, browse, conversion, feature usage)
▪ Social (e.g., linkage of Facebook 'likes', sources of traffic)
▪ Mobile usage (e.g., value, shopping behaviors, feature usage)
▪ Ethnographies (e.g., attitudes, perceptions, sources of shopping inspiration)
▪ 3rd-party payments (e.g., competitor shopping, what they buy, how much they spend, when)
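A minimal dplyr sketch of such a combination step (all table and column names are hypothetical, not from the slides): per-customer aggregates from two sources are joined into one customer-level view.

```r
# Minimal sketch: combining data sources into an integrated customer view with dplyr.
library(dplyr)

purchases  <- data.frame(customer_id = c(1, 1, 2), value = c(20, 35, 50))
web_visits <- data.frame(customer_id = c(1, 2, 2, 3), pages = c(5, 2, 8, 1))

customer_view <- purchases %>%
  group_by(customer_id) %>%
  summarise(total_value = sum(value), n_purchases = n()) %>%
  full_join(
    web_visits %>% group_by(customer_id) %>% summarise(avg_pages = mean(pages)),
    by = "customer_id"
  )
customer_view  # one row per customer, combining purchase and browsing behavior
```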
Avoiding pitfalls with AI
▪ 1 Before implementation: ensure that the rules for good AI cases are all satisfied; define your expected/desired level of performance; implement a baseline (random guess, current performance level, etc.)
▪ 2 During implementation: follow rapid prototyping – a first model within the first day (similar to a minimum viable product), otherwise "kill" the project; at a later stage, consider fusing in additional predictors from public data
▪ 3 After deployment: consider re-training your model from time to time if you have a dynamic environment

Agenda: 5 Managing AI

A holistic approach is required, yet faces various challenges along several dimensions

Personal experience on what makes successful AI implementations
The goal of the AI value chain is to identify and prioritize areas of improvement in analytics and big data along a structured framework (a matrix of AI value chain dimensions and application areas).
Key principles:
▪ Decision backwards: build the capability by starting with the business decisions you want to drive and working backwards
▪ Step-by-step: focus on specific topics and set each element in place – a chain is only as strong as its weakest link
▪ Test and learn: move from data to decision and from decision back to the data with which to measure the outcome


Predictive Modeling
Business Analytics
Stefan Feuerriegel

Today's Lecture Objectives
▪ 1 Learn common concepts of machine learning
▪ 2 Be able to evaluate the predictive performance
▪ 3 Distinguish predictive and explanatory power

Outline
▪ 1 Concepts of Machine Learning
▪ 2 k-Nearest Neighbor Classifier
▪ 3 Prediction Performance
▪ 4 Management Guidelines
▪ 5 Wrap-Up

Machine Learning
Goal: Learning (2) to perform (3) a task (1) from experience (4)
Examples:
▪ Speech recognition (e.g. Siri, speed-dialing)
▪ Hand-writing recognition (e.g. letter delivery)
▪ Fraud detection (e.g. credit cards)
▪ Text filtering (e.g. spam filters)
▪ Image processing (e.g. object tracking, Kinect)
▪ Robotics (e.g. Google driverless car)

Machine Learning
▪ 1 Task
  ▪ Often expressed as a mathematical function y = f(x, w)
  ▪ Input x, output y, parameter w (this is what is "learned")
  ▪ Output y is either discrete or continuous
  ▪ Curse of dimensionality: complexity increases exponentially with the number of dimensions
▪ 2 Learning
  ▪ We do not want to encode the knowledge ourselves
  ▪ The machine should learn the relevant criteria automatically from past observations and adapt to the given situation
  ▪ Most often, learning = optimization
  ▪ Search the hypothesis space for the "best" function and model parameter w
  ▪ Maximize the performance measure for y = f(x, w)

Task Learning: Examples
▪ Regression with continuous output: automatic control of a vehicle (closed-loop controller: input x, actuating signal/error, output y = f(x; w), feedback)
▪ Classification with discrete output: email filtering, x ∈ [a–z]⁺ ↦ y ∈ {important, spam}; character recognition, x ∈ ℝⁿ ↦ y ∈ {a, …, z}
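The "learning = optimization" idea from the slide above can be made concrete with a minimal sketch on simulated data: here f is a straight line with parameter vector w = (intercept, slope), and learning searches for the w that minimizes squared error via optim().

```r
# Minimal sketch (illustrative data): learning as optimization for y = f(x, w).
set.seed(5)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.3)   # past observations with unknown true w

loss  <- function(w) sum((y - (w[1] + w[2] * x))^2)  # performance measure (to be minimized)
w_hat <- optim(par = c(0, 0), fn = loss)$par
w_hat  # learned parameters, close to (2, 3)
```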
Machine Learning (continued)
▪ 3 Performance
  ▪ Typically measured as one number → e.g. % correctly classified letters, % games won
  ▪ "99 % correct classification" → Of what? Characters, words or sentences? Speaker/writer independent? Over what data set?
  ▪ Example: "The car drives without human intervention 99 % of the time on country roads"
▪ 4 Experience → what data is available?
  ▪ Supervised learning (→ data with labels)
  ▪ Unsupervised learning (→ data without labels)
  ▪ Reinforcement learning (→ with feedback/rewards)

Supervised vs. Unsupervised Learning
▪ Supervised learning
  ▪ Machine learning task of inferring a function from labeled training data
  ▪ Training data includes both the input and the desired results → correct results (target values) are given
▪ Unsupervised learning
  ▪ Methods try to find hidden structure in unlabeled data
  ▪ The model is not provided with the correct results during training
  ▪ No error or reward signal to evaluate a potential solution
  ▪ Examples: clustering (e.g. by the k-means algorithm) → group into classes only on the basis of their statistical properties; dimensionality reduction (e.g. by principal component analysis); hidden Markov models with unsupervised learning

Statistical Data Types
▪ The data type specifies the semantic content of the variable and controls which (predictive) model can be used
▪ Discrete: the variable can take on one of a limited (and usually fixed) number of possible values
  ▪ Ordinal: with a natural ordering, e.g. grades (A, …, F)
  ▪ Nominal: without this ordering, e.g. blood type (A, B, AB, 0)
▪ Continuous: numbers from an interval I ⊆ ℝ
▪ If the data cannot be described by a single number, it is called multivariate → e.g. vectors, matrices, sequences, networks

Taxonomy of Machine Learning
▪ Machine learning estimates the function and parameter in y = f(x, w); the type of method varies depending on the nature of what is predicted
▪ Regression: the predicted value refers to a real number; continuous y
▪ Classification: the predicted value refers to a class label; discrete y (e.g. class membership)
▪ Clustering: group points into clusters based on how "near" they are to one another; identify structure in data
Examples?

Multi-Class Prediction
▪ 2-class problem with binary target values yᵢ ∈ {0, 1}
▪ K-class problem with a 1-of-K coding scheme, e.g. yᵢ = [0, 1, 0, 0, 0]ᵀ

Outline: 2 k-Nearest Neighbor Classifier

k-Nearest Neighbor (k-NN) Classification
▪ Input: training examples as vectors in a multidimensional feature space, each with a class label
▪ No training phase to calculate internal parameters
▪ Testing: assign to a class according to the k nearest neighbors → what label to assign to the new point?
▪ Classification as majority vote
▪ Problems: skewed data; uneven frequency of classes; robustness to noise; scalability (with N)
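A minimal sketch of k-NN majority voting with the class package on the built-in iris data (an illustrative dataset, not from the slides):

```r
# Minimal sketch: k-NN classification by majority vote over the k nearest training examples.
library(class)
set.seed(6)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]

pred_k3  <- knn(train, test, cl, k = 3)    # assign each test point by its 3 nearest neighbors
pred_k15 <- knn(train, test, cl, k = 15)   # a larger k smooths the decision boundary
mean(pred_k3 == iris$Species[-idx])        # test-set accuracy
```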
Distance Metrics
▪ Distance metrics measure the distance between two points p = [p₁, …, p_N] ∈ ℝᴺ and q = [q₁, …, q_N] ∈ ℝᴺ; an arbitrary distance metric is denoted d(p, q)
▪ Euclidean distance: d₂(p, q) = ‖p − q‖₂ = sqrt( Σᵢ (qᵢ − pᵢ)² )
▪ Manhattan distance: d₁(p, q) = ‖p − q‖₁ = Σᵢ |qᵢ − pᵢ|
(Blue → Euclidean, black → Manhattan)

Choosing the Number of Nearest Neighbors k
5-nearest neighbor vs. 15-nearest neighbor decision boundaries

Outline: 3 Prediction Performance
Model Choice – Training & Testing – Performance Metrics – Statistical Model Comparison – Overfitting

Assessment of Models
▪ 1 Predictive performance (measured by accuracy, recall, F1, ROC, …)
▪ 2 Computation time for both model building and predicting
▪ 3 Robustness to noise in predictor values
▪ 4 Interpretability → transparency, ease of understanding
Questions: What could be reasons why one chooses one over the other? Where and why could interpretability be demanded?

Prediction Power vs. Interpretability
From high interpretability to high flexibility: lasso, OLS, decision tree, boosting / random forest, support vector machine, deep neural network

Training and Test Set
▪ Datasets in machine learning are usually split into disjoint sets for training and testing
  ▪ 1 The training set is used to fit and calibrate the model parameters
  ▪ 2 The test set is used to measure the predictive performance on unseen data
▪ Each measures a different error, i.e. the training and the test error
▪ Rule-of-thumb: 80 % for training and 20 % for testing (or 90 % vs. 10 %)
▪ The training error results from applying the model to the training data
▪ The test error is the average error when predicting on unseen observations
▪ Alternative terms refer to in-sample and out-of-sample performance

Prediction Performance
The metric depends on the type of the outcome variable:
▪ 1 Classification: confusion matrix; accuracy; …
▪ 2 Regression: mean squared error; root mean squared error; …
Note: performance improvements are often small and thus subject to statistical significance testing
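A minimal sketch of the regression error metrics listed above, written as small helper functions (own code, not from the course); the exact formulas follow on the "Regression Error" slide below.

```r
# Minimal sketch: common regression error metrics for predictions y_hat vs. true values y.
mse  <- function(y, y_hat) mean((y - y_hat)^2)
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
mae  <- function(y, y_hat) mean(abs(y - y_hat))
mape <- function(y, y_hat) mean(abs((y - y_hat) / y))  # unstable when y is close to 0

y     <- c(10, 20, 30)
y_hat <- c(12, 18, 33)
c(MSE = mse(y, y_hat), RMSE = rmse(y, y_hat), MAE = mae(y, y_hat), MAPE = mape(y, y_hat))
```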
Confusion Matrix
The confusion matrix (also named contingency table or error matrix) displays predictive performance:
▪ True Positive (TP): positive outcome, condition true
▪ False Positive (FP): positive outcome, condition false → Type I error / false alarm
▪ False Negative (FN): negative outcome, condition true → Type II error / miss
▪ True Negative (TN): negative outcome, condition false
Derived metrics:
▪ Precision (positive predictive value) = TP / (TP + FP)
▪ Sensitivity (= TP rate; equivalent to hit rate and recall) = TP / (TP + FN)
▪ Specificity (= TN rate) = TN / (FP + TN)
▪ Accuracy = (TP + TN) / Total

Confusion Matrix
Example: a new machine learning system providing early warnings of loan defaults
▪ TP: default correctly predicted
▪ FP: solvent person, yet default assumed
▪ FN: default not predicted
▪ TN: solvent person predicted as solvent
Different loss functions: missed revenue in the FP case, but an actual financial loss in the FN case

Assessing Prediction Performance
Imagine the following confusion matrix for a blood test detecting cancer, with an accuracy of 65 %:
▪ Positive blood test outcome: TP = 60, FP = 5
▪ Negative blood test outcome: FN = 30, TN = 5
▪ Precision = TP / (TP + FP) ≈ 0.92; sensitivity = TP / (TP + FN) ≈ 0.67; specificity = TN / (FP + TN) = 0.50; accuracy = (TP + TN) / Total = 0.65
Question: Would you bet money on this predictor?
No – because of the unevenly distributed data, a model which always guesses "true" will score an accuracy of 90 %

Regression Error
Common choices given prediction ŷ and true value y:
▪ Mean squared error: MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
▪ Root mean squared error: RMSE = sqrt( (1/n) Σᵢ (yᵢ − ŷᵢ)² )
▪ Mean absolute error: MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
Question: Relative errors such as the mean absolute percentage error, MAPE = (1/n) Σᵢ |(yᵢ − ŷᵢ) / yᵢ|, seem easily interpretable – but why is their use discouraged? Hint: yᵢ → 0

Trade-Off: Sensitivity vs. Specificity/Precision
▪ Performance goals frequently place more emphasis on either sensitivity or specificity/precision
▪ Example: airport scanners are triggered by low-risk items like belts (low precision), but reduce the risk of missing risky objects (high sensitivity)
▪ Trade-off: the F1 score is the harmonic mean of precision and sensitivity, F1 = 2 TP / (2 TP + FP + FN)
▪ Visualized by the receiver operating characteristic (ROC) curve
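The cancer blood-test example above can be reproduced directly from its four cell counts; a minimal base-R sketch:

```r
# Minimal sketch: confusion-matrix metrics for TP = 60, FP = 5, FN = 30, TN = 5.
TP <- 60; FP <- 5; FN <- 30; TN <- 5

accuracy    <- (TP + TN) / (TP + FP + FN + TN)   # 0.65
precision   <- TP / (TP + FP)                    # ~0.92
sensitivity <- TP / (TP + FN)                    # ~0.67 (hit rate / recall)
specificity <- TN / (FP + TN)                    # 0.50
f1          <- 2 * TP / (2 * TP + FP + FN)       # harmonic mean of precision and sensitivity

round(c(accuracy = accuracy, precision = precision, sensitivity = sensitivity,
        specificity = specificity, f1 = f1), 2)
```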
Receiver Operating Characteristic (ROC)
Homework: read the following paper
▪ Fawcett (2004). ROC graphs: Notes and practical considerations for researchers. Pattern Recognition Letters 27(8):882–891. DOI: 10.1080/00031305.2016.1154108
▪ What are the benefits of the ROC over other metrics?

Receiver Operating Characteristic (ROC)
The ROC illustrates the performance of a binary classifier as its discrimination threshold on y(x) is varied (sensitivity vs. 1 − specificity)
Interpretation:
▪ Curve A is random guessing (50 % correct guesses)
▪ The curve from model B performs better than A, but worse than C
▪ Curve C results from perfect prediction
The area south-east of the curve is named the area under the curve (AUC) and should be maximized

Outline: 3 Prediction Performance – Statistical Model Comparison

Model Selection
Task: Which model should we select?
▪ 1 Model A, consisting of 10 explanatory variables with R² = 0.6
▪ 2 Model B, consisting of 6 explanatory variables with R² = 0.4
Explanatory modeling: information criterion
▪ Deals with the trade-off between complexity and goodness of fit
▪ Cannot tell anything about how well a model fits the data in an absolute sense
▪ Prefer the model with the minimum information criterion value
▪ Examples: Akaike Information Criterion, Bayesian Information Criterion
Predictive modeling: out-of-sample performance

Predictive vs. Explanatory Power
There is a significant difference between predicting and explaining:
▪ 1 Empirical models for prediction: empirical predictive models (e.g. statistical models, methods from data mining) designed to predict new/future observations; predictive analytics describes the evaluation of the predictive power, such as accuracy or precision
▪ 2 Empirical models for explanation: any type of statistical model used for testing causal hypotheses; use methods for evaluating the explanatory power, such as statistical tests or measures like R²

Predictive vs. Explanatory Power
[Figure: fitted curves for N = 15 and N = 100 observations; red is the best explanatory model, gray the best predictive model]
▪ Explanatory power does not imply predictive power
▪ In particular, dummies do not translate well to predictive models
▪ Do not write something like "the regression proves the predictive power of regressor xᵢ"

Overfitting
▪ When the learning algorithm is run for too long, the learner may adjust to very specific random features that are not related to the target function
▪ Overfitting: the performance on the training data (in gray) still increases, while the performance on unseen data (in red) becomes worse
[Figure: fitted curve on the training sample and mean squared error vs. model flexibility, with the training error decreasing while the test error rises again]
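A minimal simulation sketch of this overfitting pattern (illustrative data, not the slide's figure): polynomial regressions of increasing degree are fit on a small training set, and training vs. test mean squared error are compared.

```r
# Minimal sketch: training error keeps falling with flexibility, test error eventually worsens.
set.seed(7)
x <- runif(230)
y <- sin(2 * pi * x) + rnorm(230, sd = 0.3)
dat   <- data.frame(x = x, y = y)
train <- dat[1:30, ]      # small training set, prone to overfitting
test  <- dat[31:230, ]    # held-out data

errors <- t(sapply(1:12, function(d) {
  fit <- lm(y ~ poly(x, degree = d), data = train)
  c(degree    = d,
    train_mse = mean((train$y - fitted(fit))^2),
    test_mse  = mean((test$y - predict(fit, newdata = test))^2))
}))
errors  # training MSE shrinks monotonically; test MSE typically deteriorates for high degrees
```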