Machine Learning for Business Analytics: Concepts, Techniques, and Applications in R. Galit Shmueli, Peter C. Bruce, Peter Gedeck, Inbal Yahav, and Nitin R. Patel. Wiley, 2023.
Table of Contents

Cover
Title Page
Copyright
Dedication
Foreword by Ravi Bapna
Foreword by Gareth James
Preface to the Second R Edition
Acknowledgments

PART I: Preliminaries
CHAPTER 1: Introduction
1.1 WHAT IS BUSINESS ANALYTICS? 1.2 WHAT IS MACHINE LEARNING? 1.3 MACHINE LEARNING, AI, AND RELATED TERMS 1.4 BIG DATA 1.5 DATA SCIENCE 1.6 WHY ARE THERE SO MANY DIFFERENT METHODS? 1.7 TERMINOLOGY AND NOTATION 1.8 ROAD MAPS TO THIS BOOK
CHAPTER 2: Overview of the Machine Learning Process
2.1 INTRODUCTION 2.2 CORE IDEAS IN MACHINE LEARNING 2.3 THE STEPS IN A MACHINE LEARNING PROJECT 2.4 PRELIMINARY STEPS 2.5 PREDICTIVE POWER AND OVERFITTING 2.6 BUILDING A PREDICTIVE MODEL 2.7 USING R FOR MACHINE LEARNING ON A LOCAL MACHINE 2.8 AUTOMATING MACHINE LEARNING SOLUTIONS 2.9 ETHICAL PRACTICE IN MACHINE LEARNING PROBLEMS NOTES

PART II: Data Exploration and Dimension Reduction
CHAPTER 3: Data Visualization
3.1 USES OF DATA VISUALIZATION 3.2 DATA EXAMPLES 3.3 BASIC CHARTS: BAR CHARTS, LINE CHARTS, AND SCATTER PLOTS 3.4 MULTIDIMENSIONAL VISUALIZATION 3.5 SPECIALIZED VISUALIZATIONS 3.6 SUMMARY: MAJOR VISUALIZATIONS AND OPERATIONS, BY MACHINE LEARNING GOAL PROBLEMS NOTES
CHAPTER 4: Dimension Reduction
4.1 INTRODUCTION 4.2 CURSE OF DIMENSIONALITY 4.3 PRACTICAL CONSIDERATIONS 4.4 DATA SUMMARIES 4.5 CORRELATION ANALYSIS 4.6 REDUCING THE NUMBER OF CATEGORIES IN CATEGORICAL VARIABLES 4.7 CONVERTING A CATEGORICAL VARIABLE TO A NUMERICAL VARIABLE 4.8 PRINCIPAL COMPONENT ANALYSIS 4.9 DIMENSION REDUCTION USING REGRESSION MODELS 4.10 DIMENSION REDUCTION USING CLASSIFICATION AND REGRESSION TREES PROBLEMS NOTE

PART III: Performance Evaluation
CHAPTER 5: Evaluating Predictive Performance
5.1 INTRODUCTION 5.2 EVALUATING PREDICTIVE PERFORMANCE 5.3 JUDGING CLASSIFIER PERFORMANCE 5.4 JUDGING RANKING PERFORMANCE 5.5 OVERSAMPLING PROBLEMS NOTES

PART IV: Prediction and Classification Methods
CHAPTER 6: Multiple Linear Regression
6.1 INTRODUCTION 6.2 EXPLANATORY VS. PREDICTIVE MODELING 6.3 ESTIMATING THE REGRESSION EQUATION AND PREDICTION 6.4 VARIABLE SELECTION IN LINEAR REGRESSION PROBLEMS NOTES
CHAPTER 7: k-Nearest Neighbors (k-NN)
7.1 THE k-NN CLASSIFIER (CATEGORICAL OUTCOME) 7.2 k-NN FOR A NUMERICAL OUTCOME 7.3 ADVANTAGES AND SHORTCOMINGS OF k-NN ALGORITHMS PROBLEMS NOTES
CHAPTER 8: The Naive Bayes Classifier
8.1 INTRODUCTION 8.2 APPLYING THE FULL (EXACT) BAYESIAN CLASSIFIER 8.3 SOLUTION: NAIVE BAYES 8.4 ADVANTAGES AND SHORTCOMINGS OF THE NAIVE BAYES CLASSIFIER PROBLEMS
CHAPTER 9: Classification and Regression Trees
9.1 INTRODUCTION 9.2 CLASSIFICATION TREES 9.3 EVALUATING THE PERFORMANCE OF A CLASSIFICATION TREE 9.4 AVOIDING OVERFITTING 9.5 CLASSIFICATION RULES FROM TREES 9.6 CLASSIFICATION TREES FOR MORE THAN TWO CLASSES 9.7 REGRESSION TREES 9.8 ADVANTAGES AND WEAKNESSES OF A TREE 9.9 IMPROVING PREDICTION: RANDOM FORESTS AND BOOSTED TREES PROBLEMS NOTES
CHAPTER 10: Logistic Regression
10.1 INTRODUCTION 10.2 THE LOGISTIC REGRESSION MODEL 10.3 EXAMPLE: ACCEPTANCE OF PERSONAL LOAN 10.4 EVALUATING CLASSIFICATION PERFORMANCE 10.5 VARIABLE SELECTION 10.6 LOGISTIC REGRESSION FOR MULTI-CLASS CLASSIFICATION 10.7 EXAMPLE OF COMPLETE ANALYSIS: PREDICTING DELAYED FLIGHTS PROBLEMS NOTES
CHAPTER 11: Neural Nets
11.1 INTRODUCTION 11.2 CONCEPT AND STRUCTURE OF A NEURAL NETWORK 11.3 FITTING A NETWORK TO DATA 11.4 REQUIRED USER INPUT 11.5 EXPLORING THE RELATIONSHIP BETWEEN PREDICTORS AND OUTCOME 11.6 DEEP LEARNING 11.7 ADVANTAGES AND WEAKNESSES OF NEURAL NETWORKS PROBLEMS NOTES
CHAPTER 12: Discriminant Analysis
12.1 INTRODUCTION 12.2 DISTANCE OF A RECORD FROM A CLASS 12.3 FISHER'S LINEAR CLASSIFICATION FUNCTIONS 12.4 CLASSIFICATION PERFORMANCE OF DISCRIMINANT ANALYSIS 12.5 PRIOR PROBABILITIES 12.6 UNEQUAL MISCLASSIFICATION COSTS 12.7 CLASSIFYING MORE THAN TWO CLASSES 12.8 ADVANTAGES AND WEAKNESSES PROBLEMS NOTES
CHAPTER 13: Generating, Comparing, and Combining Multiple Models
13.1 ENSEMBLES 13.2 AUTOMATED MACHINE LEARNING (AUTOML) 13.3 EXPLAINING MODEL PREDICTIONS 13.4 SUMMARY PROBLEMS NOTES

PART V: Intervention and User Feedback
CHAPTER 14: Interventions: Experiments, Uplift Models, and Reinforcement Learning
14.1 A/B TESTING 14.2 UPLIFT (PERSUASION) MODELING 14.3 REINFORCEMENT LEARNING 14.4 SUMMARY PROBLEMS NOTES

PART VI: Mining Relationships Among Records
CHAPTER 15: Association Rules and Collaborative Filtering
15.1 ASSOCIATION RULES 15.2 COLLABORATIVE FILTERING 15.3 SUMMARY PROBLEMS NOTES
CHAPTER 16: Cluster Analysis
16.1 INTRODUCTION 16.2 MEASURING DISTANCE BETWEEN TWO RECORDS 16.3 MEASURING DISTANCE BETWEEN TWO CLUSTERS 16.4 HIERARCHICAL (AGGLOMERATIVE) CLUSTERING 16.5 NON-HIERARCHICAL CLUSTERING: THE k-MEANS ALGORITHM PROBLEMS

PART VII: Forecasting Time Series
CHAPTER 17: Handling Time Series
17.1 INTRODUCTION 17.2 DESCRIPTIVE VS. PREDICTIVE MODELING 17.3 POPULAR FORECASTING METHODS IN BUSINESS 17.4 TIME SERIES COMPONENTS 17.5 DATA PARTITIONING AND PERFORMANCE EVALUATION PROBLEMS NOTES
CHAPTER 18: Regression-Based Forecasting
18.1 A MODEL WITH TREND 18.2 A MODEL WITH SEASONALITY 18.3 A MODEL WITH TREND AND SEASONALITY 18.4 AUTOCORRELATION AND ARIMA MODELS PROBLEMS NOTES
CHAPTER 19: Smoothing and Deep Learning Methods for Forecasting
19.1 SMOOTHING METHODS: INTRODUCTION 19.2 MOVING AVERAGE 19.3 SIMPLE EXPONENTIAL SMOOTHING 19.4 ADVANCED EXPONENTIAL SMOOTHING 19.5 DEEP LEARNING FOR FORECASTING PROBLEMS NOTES

PART VIII: Data Analytics
CHAPTER 20: Social Network Analytics
20.1 INTRODUCTION 20.2 DIRECTED VS. UNDIRECTED NETWORKS 20.3 VISUALIZING AND ANALYZING NETWORKS 20.4 SOCIAL DATA METRICS AND TAXONOMY 20.5 USING NETWORK METRICS IN PREDICTION AND CLASSIFICATION 20.6 COLLECTING SOCIAL NETWORK DATA WITH R 20.7 ADVANTAGES AND DISADVANTAGES PROBLEMS NOTES
CHAPTER 21: Text Mining
21.1 INTRODUCTION 21.2 THE TABULAR REPRESENTATION OF TEXT: TERM-DOCUMENT MATRIX AND "BAG-OF-WORDS" 21.3 BAG-OF-WORDS VS. MEANING EXTRACTION AT DOCUMENT LEVEL 21.4 PREPROCESSING THE TEXT 21.5 IMPLEMENTING MACHINE LEARNING METHODS 21.6 EXAMPLE: ONLINE DISCUSSIONS ON AUTOS AND ELECTRONICS 21.7 EXAMPLE: SENTIMENT ANALYSIS OF MOVIE REVIEWS 21.8 SUMMARY PROBLEMS NOTES
CHAPTER 22: Responsible Data Science
22.1 INTRODUCTION 22.2 UNINTENTIONAL HARM 22.3 LEGAL CONSIDERATIONS 22.4 PRINCIPLES OF RESPONSIBLE DATA SCIENCE 22.5 A RESPONSIBLE DATA SCIENCE FRAMEWORK 22.6 DOCUMENTATION TOOLS 22.7 EXAMPLE: APPLYING THE RDS FRAMEWORK TO THE COMPAS EXAMPLE 22.8 SUMMARY PROBLEMS NOTES

PART IX: Cases
CHAPTER 23: Cases
23.1 CHARLES BOOK CLUB 23.2 GERMAN CREDIT 23.3 TAYKO SOFTWARE CATALOGER 23.4 POLITICAL PERSUASION 23.5 TAXI CANCELLATIONS 23.6 SEGMENTING CONSUMERS OF BATH SOAP 23.7 DIRECT-MAIL FUNDRAISING 23.8 CATALOG CROSS-SELLING 23.9 TIME SERIES CASE: FORECASTING PUBLIC TRANSPORTATION DEMAND 23.10 LOAN APPROVAL NOTES

References
R Packages Used in the Book
Data Files Used in the Book
Index
End User License Agreement

List of Tables

Chapter 1: TABLE 1.1 ORGANIZATION OF MACHINE LEARNING METHODS IN THIS BOOK, ACCORDING...
Chapter 2: TABLE 2.1 FIRST 10 RECORDS IN THE WEST ROXBURY HOME VALUES DATASET; TABLE 2.2 DESCRIPTION OF VARIABLES IN WEST ROXBURY (BOSTON) HOME VALUE DATA...; TABLE 2.3 WORKING WITH FILES IN R; TABLE 2.4 SAMPLING IN R; TABLE 2.5 REVIEWING VARIABLES IN R; TABLE 2.6 CREATING DUMMY VARIABLES IN R; TABLE 2.7 IMPUTING MISSING DATA; TABLE 2.8 HYPOTHETICAL DATA ON ADVERTISING EXPENDITURES AND SUBSEQUENT SALE...; TABLE 2.9 DATA PARTITIONING IN R; TABLE 2.10 OUTLIER IN WEST ROXBURY DATA; TABLE 2.11 CLEANING AND PREPROCESSING DATA; TABLE 2.12 TRAINING A REGRESSION MODEL AND GENERATING PREDICTIONS (FITTED V...; TABLE 2.13 GENERATING PREDICTIONS FOR THE HOLDOUT DATA; TABLE 2.14 PREDICTION ERROR METRICS FOR TRAINING AND HOLDOUT DATA (ERROR FI...; TABLE 2.15 DATA FRAME WITH THREE RECORDS TO BE SCORED; TABLE 2.16 SAMPLE FROM A DATABASE OF CREDIT APPLICATIONS; TABLE 2.17 SAMPLE FROM A BANK DATABASE; TABLE 2.18
Chapter 3: TABLE 3.1 DESCRIPTION OF VARIABLES IN BOSTON HOUSING DATASET; TABLE 3.2 FIRST NINE RECORDS IN THE BOSTON HOUSING DATA
Chapter 4: TABLE 4.1 DESCRIPTION OF VARIABLES IN THE BOSTON HOUSING DATASET; TABLE 4.2 FIRST NINE RECORDS IN THE BOSTON HOUSING DATA; TABLE 4.3 SUMMARY STATISTICS FOR THE BOSTON HOUSING DATA; TABLE 4.4 CORRELATION TABLE FOR BOSTON HOUSING DATA; TABLE 4.5 NUMBER OF NEIGHBORHOODS THAT BOUND THE CHARLES RIVER VS. THOSE TH...; TABLE 4.6 AVERAGE MEDV BY CHAS AND RM; TABLE 4.7 PIVOT TABLES IN R; TABLE 4.8 DESCRIPTION OF THE VARIABLES IN THE BREAKFAST CEREAL DATASET; TABLE 4.9 CEREAL CALORIES AND RATINGS; TABLE 4.10 PCA ON THE TWO VARIABLES CALORIES AND RATING; TABLE 4.11 PCA OUTPUT USING ALL 13 NUMERICAL VARIABLES IN THE BREAKFAST CER...; TABLE 4.12 PCA OUTPUT USING ALL NORMALIZED 13 NUMERICAL VARIABLES IN THE BR...; TABLE 4.13 PRINCIPAL COMPONENTS OF NON-NORMALIZED WINE DATA
Chapter 5: TABLE 5.1 PREDICTION ERROR METRICS FROM A MODEL FOR TOYOTA CAR PRICES. TRAI...; TABLE 5.2 CONFUSION MATRIX BASED ON 3000 RECORDS AND TWO CLASSES; TABLE 5.3 CONFUSION MATRIX: MEANING OF EACH CELL; TABLE 5.4 24 RECORDS WITH THEIR ACTUAL CLASS AND THE PROBABILITY (PROPENSIT...; TABLE 5.5 CONFUSION MATRICES BASED ON THRESHOLDS OF 0.5, 0.25, AND 0.75 (RI...; TABLE 5.6 RECORDS SORTED BY PROPENSITY OF OWNERSHIP (HIGH TO LOW) FOR THE M...; TABLE 5.7 PROPENSITIES AND ACTUAL CLASS MEMBERSHIP FOR HOLDOUT DATA
Chapter 6: TABLE 6.1 VARIABLES IN THE TOYOTA COROLLA EXAMPLE; TABLE 6.2 PRICES AND ATTRIBUTES FOR USED TOYOTA COROLLA CARS (SELECTED ROWS...; TABLE 6.3 LINEAR REGRESSION MODEL OF PRICE VS. CAR ATTRIBUTES; TABLE 6.4 PREDICTED PRICES (AND ERRORS) FOR 20 CARS IN HOLDOUT SET AND SUMM...; TABLE 6.5 CROSS-VALIDATION IN TOYOTA COROLLA EXAMPLE; TABLE 6.6 EXHAUSTIVE SEARCH FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMP...; TABLE 6.7 SUBSET SELECTION ALGORITHMS REDUCING PREDICTORS IN TOYOTA COROLLA...; TABLE 6.8 RIDGE REGRESSION FOR SHRINKING PREDICTORS IN TOYOTA COROLLA EXAMP...; TABLE 6.9 LASSO REGRESSION FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPL...; TABLE 6.10 COMPARING PERFORMANCE OF THE DIFFERENT VARIABLE SELECTION METHOD...; TABLE 6.11 DESCRIPTION OF VARIABLES FOR BOSTON HOUSING EXAMPLE; TABLE 6.12 DESCRIPTION OF VARIABLES FOR TAYKO SOFTWARE EXAMPLE; TABLE 6.13 DESCRIPTION OF VARIABLES FOR AIRFARE EXAMPLE
Chapter 7: TABLE 7.1 LOT SIZE, INCOME, AND OWNERSHIP OF A RIDING MOWER FOR 24 HOUSEHOL...; TABLE 7.2 RUNNING k-NN; TABLE 7.3 ACCURACY (OR CORRECT RATE) OF k-NN PREDICTIONS IN VALIDATION SET...; TABLE 7.4 CLASSIFYING A NEW HOUSEHOLD USING THE "BEST K" = 7
Chapter 8: TABLE 8.1 PIVOT TABLE FOR FINANCIAL REPORTING EXAMPLE; TABLE 8.2 INFORMATION ON 10 COMPANIES; TABLE 8.3 DESCRIPTION OF VARIABLES FOR FLIGHT DELAYS EXAMPLE; TABLE 8.4 NAIVE BAYES CLASSIFIER APPLIED TO FLIGHT DELAYS (TRAINING) DATA; TABLE 8.5 PIVOT TABLE OF FLIGHT STATUS BY DESTINATION AIRPORT (TRAINING DAT...; TABLE 8.6 SCORING THE EXAMPLE FLIGHT (PROBABILITY AND CLASS); TABLE 8.7 CONFUSION MATRICES FOR FLIGHT DELAY USING A NAIVE BAYES CLASSIFIE...; TABLE 8.8 MEANS AND STANDARD DEVIATIONS OF DISTANCE (A CONTINUOUS PREDICTOR...
Chapter 9: TABLE 9.1 LOT SIZE, INCOME, AND OWNERSHIP OF A RIDING MOWER FOR 24 HOUSEHOL...; TABLE 9.2 SAMPLE OF DATA FOR 20 CUSTOMERS OF UNIVERSAL BANK; TABLE 9.3 CONFUSION MATRICES AND ACCURACY FOR THE DEFAULT (SMALL) AND DEEPE...; TABLE 9.4 TABLE OF COMPLEXITY PARAMETER (CP) VALUES AND ASSOCIATED TREE ERROR...; TABLE 9.5 EXTRACTING THE RULES FROM THE BEST-PRUNED TREE; TABLE 9.6 SPECIFICATIONS FOR A PARTICULAR TOYOTA COROLLA
Chapter 10: TABLE 10.1 DESCRIPTION OF PREDICTORS FOR ACCEPTANCE OF PERSONAL LOAN EXAMPL...; TABLE 10.2 LOGISTIC REGRESSION MODEL FOR LOAN ACCEPTANCE (TRAINING DATA); TABLE 10.3 PROPENSITIES FOR THE FIRST FIVE CUSTOMERS IN HOLDOUT DATA; TABLE 10.4 ORDINAL AND NOMINAL MULTINOMIAL REGRESSION IN R; TABLE 10.5 DESCRIPTION OF PREDICTORS FOR FLIGHT DELAYS EXAMPLE; TABLE 10.6 SAMPLE OF 20 FLIGHTS; TABLE 10.7 ESTIMATED LOGISTIC REGRESSION MODEL FOR DELAYED FLIGHTS (BASED O...; TABLE 10.8 NUMBER OF FLIGHTS BY CARRIER AND ORIGIN; TABLE 10.9 LOGISTIC REGRESSION MODEL WITH FEWER PREDICTORS; TABLE 10.10 REGULARIZED LOGISTIC REGRESSION MODEL FOR DELAYED FLIGHTS
Chapter 11: TABLE 11.1 TINY EXAMPLE ON TASTING SCORES FOR SIX CONSUMERS AND TWO PREDICT...; TABLE 11.2 NEURAL NETWORK WITH A SINGLE HIDDEN LAYER (THREE NODES) FOR THE...; TABLE 11.3 CONFUSION MATRIX FOR THE TINY EXAMPLE; TABLE 11.4 SUBSET FROM THE ACCIDENTS DATA, FOR A HIGH-FATALITY REGION; TABLE 11.5 DESCRIPTION OF VARIABLES FOR AUTOMOBILE ACCIDENT EXAMPLE; TABLE 11.6 A NEURAL NETWORK WITH TWO NODES IN THE HIDDEN LAYER (ACCIDENTS D...; TABLE 11.7 DATA PREPROCESSING OF FASHION MNIST FOR DEEP LEARNING; TABLE 11.8 MODEL DEFINITION AND TRAINING OF FASHION MNIST CONVOLUTIONAL NET...; TABLE 11.9 PREDICTING IMAGES; TABLE 11.10 DATA FOR CREDIT CARD EXAMPLE AND VARIABLE DESCRIPTIONS
Chapter 12: TABLE 12.1 DISCRIMINANT ANALYSIS FOR RIDING-MOWER DATA, DISPLAYING THE ESTI...; TABLE 12.2 CLASSIFICATION SCORES, PREDICTED CLASSES, AND PROBABILITIES FOR...; TABLE 12.3 DISCRIMINANT ANALYSIS FOR RIDING-MOWER DATA WITH MODIFIED PRIOR...; TABLE 12.4 SAMPLE OF 20 AUTOMOBILE ACCIDENTS FROM THE 2001 DEPARTMENT OF TR...; TABLE 12.5 DISCRIMINANT ANALYSIS FOR THE THREE-CLASS INJURY EXAMPLE: CLASSI...; TABLE 12.6 CLASSIFICATION SCORES, MEMBERSHIP PROBABILITIES, AND CLASSIFICAT...; TABLE 12.7 CLASSIFICATION AND MEMBERSHIP PROBABILITIES
Chapter 13: TABLE 13.1 EXAMPLE OF BAGGING AND BOOSTING CLASSIFICATION TREES ON THE PERS...; TABLE 13.2 DATA PREPROCESSING AND MAKING AVAILABLE IN H2O.AI; TABLE 13.3 TRAINING AND COMPARING AUTOML MODELS; TABLE 13.4 OVERVIEW OF AUTO MODEL RESULTS: COMPARING ACCURACY AND RUN TIME...
Chapter 14: TABLE 14.1 EXAMPLE OF RAW DATA RESULTING FROM A/B TEST; TABLE 14.2 RESULTS FROM A/B TEST AT PHOTOTO, EVALUATING THE EFFECT OF A NEW...; TABLE 14.3 COMPUTING P-VALUES FOR T-TESTS IN R; TABLE 14.4 DATA ON VOTERS (SMALL SUBSET OF VARIABLES AND RECORDS) AND DATA...; TABLE 14.5 RESULTS OF SENDING A PRO-DEMOCRATIC MESSAGE TO VOTERS; TABLE 14.6 OUTCOME VARIABLE (MOVED_AD) AND TREATMENT VARIABLE (MESSAGE) ADD...; TABLE 14.7 CLASSIFICATIONS AND PROPENSITIES FROM PREDICTIVE MODEL (SMALL EX...; TABLE 14.8 CLASSIFICATIONS AND PROPENSITIES FROM PREDICTIVE MODEL (SMALL EX...; TABLE 14.9 UPLIFT: CHANGE IN PROPENSITIES FROM SENDING MESSAGE VS. NOT SEND...; TABLE 14.10 UPLIFT IN R APPLIED TO THE VOTERS DATA; TABLE 14.11 USING A CONTEXTUAL MULTI-ARMED BANDIT (DATA PREPARATION)...
Chapter 15: TABLE 15.1 TRANSACTIONS DATABASE FOR PURCHASES OF DIFFERENT-COLORED CELLULA...; TABLE 15.2 PHONE FACEPLATE DATA IN BINARY INCIDENCE MATRIX FORMAT; TABLE 15.3 ITEMSETS WITH SUPPORT COUNT OF AT LEAST TWO; TABLE 15.4 BINARY INCIDENCE MATRIX, TRANSACTIONS DATABASE, AND RULES FOR FA...; TABLE 15.5 FIFTY TRANSACTIONS OF RANDOMLY ASSIGNED ITEMS; TABLE 15.6 ASSOCIATION RULES OUTPUT FOR RANDOM DATA; TABLE 15.7 SUBSET OF BOOK PURCHASE TRANSACTIONS IN BINARY MATRIX FORMAT; TABLE 15.8 RULES FOR BOOK PURCHASE TRANSACTIONS; TABLE 15.9 SCHEMATIC OF MATRIX FORMAT WITH RATING DATA; TABLE 15.10 SAMPLE OF RECORDS FROM THE NETFLIX PRIZE CONTEST, FOR A SUBSET...; TABLE 15.11 DESCRIPTION OF VARIABLES IN MOVIELENS DATASET; TABLE 15.12 DATA PREPROCESSING FOR THE MOVIELENS DATA; TABLE 15.13 TRAIN RECOMMENDERLAB MODELS AND PREDICT NEW USERS; TABLE 15.14 EVALUATE PERFORMANCE OF RECOMMENDERLAB MODELS; TABLE 15.15 SAMPLE OF DATA ON SATELLITE RADIO CUSTOMERS; TABLE 15.16 DATA ON PURCHASES OF ONLINE STATISTICS COURSES; TABLE 15.17 EXCERPT FROM DATA ON COSMETICS PURCHASES IN BINARY MATRIX FORM...; TABLE 15.18 ASSOCIATION RULES FOR COSMETICS PURCHASES DATA; TABLE 15.19 RATINGS OF ONLINE STATISTICS COURSES: 4 = BEST, 1 = WORST, BLAN...
Chapter 16: TABLE 16.1 DATA ON 22 PUBLIC UTILITIES; TABLE 16.2 DISTANCE MATRIX BETWEEN PAIRS OF UTILITIES, USING EUCLIDEAN DIST...; TABLE 16.3 ORIGINAL AND NORMALIZED MEASUREMENTS FOR SALES AND FUEL COST; TABLE 16.4 DISTANCE MATRIX BETWEEN PAIRS OF UTILITIES, USING EUCLIDEAN DIST...; TABLE 16.5 DISTANCE MATRIX AFTER ARIZONA AND COMMONWEALTH CONSOLIDATION CLU...; TABLE 16.6 COMPUTING CLUSTER MEMBERSHIP BY "CUTTING" THE DENDROGRAM...; TABLE 16.7 DISTANCE OF EACH RECORD FROM EACH CENTROID; TABLE 16.8 DISTANCE OF EACH RECORD FROM EACH NEWLY CALCULATED CENTROID; TABLE 16.9 k-MEANS CLUSTERING OF 22 UTILITIES INTO CLUSTERS (SORTED BY CLU...; TABLE 16.10 CLUSTER CENTROIDS AND SQUARED DISTANCES FOR k-MEANS WITH...; TABLE 16.11 EUCLIDEAN DISTANCE BETWEEN CLUSTER CENTROIDS
Chapter 17: TABLE 17.1 PREDICTIVE ACCURACY OF NAIVE AND SEASONAL NAIVE FORECASTS IN THE...
Chapter 18: TABLE 18.1 OUTCOME VARIABLE (MIDDLE) AND PREDICTOR VARIABLE (RIGHT) USED TO...; TABLE 18.2 SUMMARY OF OUTPUT FROM A LINEAR REGRESSION MODEL APPLIED TO THE...; TABLE 18.3 NEW CATEGORICAL VARIABLE (RIGHT) TO BE USED (VIA DUMMIES) AS PRE...; TABLE 18.4 SUMMARY OF OUTPUT FROM FITTING ADDITIVE SEASONALITY TO THE AMTRA...; TABLE 18.5 SUMMARY OF OUTPUT FROM FITTING TREND AND SEASONALITY TO AMTRAK R...; TABLE 18.6 FIRST 24 MONTHS OF AMTRAK RIDERSHIP SERIES WITH LAG-1 AND LAG-2...; TABLE 18.7 OUTPUT FOR AR(1) MODEL ON RIDERSHIP RESIDUALS; TABLE 18.8 COMPUTING AUTOCORRELATIONS OF LAG-1 DIFFERENCED S&P500 MONTHLY C...; TABLE 18.9 OUTPUT FOR AR(1) MODEL ON S&P500 MONTHLY CLOSING PRICES; TABLE 18.10 REGRESSION MODEL FITTED TO TOYS "R" US TIME SERIES AND ITS PRED...; TABLE 18.11 OUTPUT FROM REGRESSION MODEL FIT TO DEPARTMENT STORE SALES IN T...
Chapter 19: TABLE 19.1 APPLYING MA TO THE RESIDUALS FROM THE REGRESSION MODEL (WHICH LA...; TABLE 19.2 SUMMARY OF HOLT-WINTERS EXPONENTIAL SMOOTHING APPLIED TO THE AMT...; TABLE 19.3 CONVERTING A TIME SERIES OF LENGTH t INTO A SEQUENCE OF SUB-SERI...; TABLE 19.4 PREPARATION OF TRAINING DATA FOR FORECASTING A SERIES WITH LSTM...; TABLE 19.5 DEFINING AN LSTM MODEL FOR FORECASTING A SERIES; TABLE 19.6 FORECASTS FOR TEST SERIES USING EXPONENTIAL SMOOTHING
Chapter 20: TABLE 20.1 EDGE LIST EXCERPT CORRESPONDING TO THE DRUG-LAUNDERING NETWORK I...; TABLE 20.2 ADJACENCY MATRIX EXCERPT CORRESPONDING TO THE TWITTER DATA IN FI...; TABLE 20.3 COMPUTING CENTRALITY IN R; TABLE 20.4 DEGREE DISTRIBUTION OF TINY LINKEDIN NETWORK; TABLE 20.5 COMPUTING NETWORK MEASURES IN R; TABLE 20.6 FOUR MEASUREMENTS FOR USERS A, B, C, AND D; TABLE 20.7 NORMALIZED MEASUREMENTS FOR USERS A, B, C, AND D; TABLE 20.8 EUCLIDEAN DISTANCE BETWEEN EACH PAIR OF USERS; TABLE 20.9 NETWORK METRICS; TABLE 20.10 COMBINING THE NETWORK AND NON-NETWORK METRICS; TABLE 20.11 TWITTER INTERFACE IN R
Chapter 21: TABLE 21.1 TERM-DOCUMENT MATRIX REPRESENTATION OF WORDS IN SENTENCES S1-S3...; TABLE 21.2 TERM-DOCUMENT MATRIX REPRESENTATION OF WORDS IN SENTENCES S1-S4...; TABLE 21.3 TOKENIZATION OF S1-S4 EXAMPLE; TABLE 21.4 STOPWORDS IN R; TABLE 21.5 TEXT REDUCTION OF S1-S4 (AFTER TOKENIZATION); TABLE 21.6 TF-IDF MATRIX FOR S1-S4 EXAMPLE (AFTER TOKENIZATION AND TEXT RED...; TABLE 21.7 IMPORTING AND LABELING THE RECORDS, PREPROCESSING TEXT, AND PROD...; TABLE 21.8 FITTING A PREDICTIVE MODEL TO THE AUTOS AND ELECTRONICS DISCUSSI...; TABLE 21.9 PREPARE DATA FOR SENTIMENT ANALYSIS OF MOVIE REVIEWS DATA; TABLE 21.10 CREATE WORD AND SENTENCE VECTORS USING GLOVE; TABLE 21.11 TRAIN SENTIMENT ANALYSIS MODEL
Chapter 22: TABLE 22.1 LOGISTIC REGRESSION MODEL FOR COMPAS DATA; TABLE 22.2 LOGISTIC REGRESSION AND RANDOM FOREST MODELS FOR COMPAS DATA; TABLE 22.3 OVERALL AND BY-RACE PERFORMANCE OF LOGISTIC REGRESSION AND RANDO...; TABLE 22.4 TOP 6 MOST IMPORTANT FEATURES USING PERMUTATION FEATURE IMPORTAN...; TABLE 22.5 JIGSAW TOXICITY SCORES FOR CERTAIN PHRASES
Chapter 23: TABLE 23.1 LIST OF VARIABLES IN CHARLES BOOK CLUB DATASET; TABLE 23.2 FIRST FOUR RECORDS FROM GERMAN CREDIT DATASET; TABLE 23.3 OPPORTUNITY COST TABLE (DEUTSCHE MARKS); TABLE 23.4 AVERAGE NET PROFIT (DEUTSCHE MARKS); TABLE 23.5 VARIABLES FOR THE GERMAN CREDIT DATASET; TABLE 23.6 FIRST 10 RECORDS FROM TAYKO DATASET; TABLE 23.7 DESCRIPTION OF VARIABLES FOR TAYKO DATASET; TABLE 23.8 DESCRIPTION OF VARIABLES FOR EACH HOUSEHOLD; TABLE 23.9 DESCRIPTION OF VARIABLES FOR THE FUNDRAISING DATASET

List of Illustrations

Chapter 1: FIGURE 1.1 TWO METHODS FOR SEPARATING OWNERS FROM NONOWNERS; FIGURE 1.2 MACHINE LEARNING FROM A PROCESS PERSPECTIVE. NUMBERS IN PARENTHES...; FIGURE 1.3 RSTUDIO SCREEN
Chapter 2: FIGURE 2.1 SCHEMATIC OF THE DATA MODELING PROCESS; FIGURE 2.2 SCATTER PLOT FOR ADVERTISING AND SALES DATA; FIGURE 2.3 OVERFITTING: THIS FUNCTION FITS THE DATA WITH NO ERROR; FIGURE 2.4 THREE DATA PARTITIONS AND THEIR ROLE IN THE MACHINE LEARNING PROC...; FIGURE 2.5 LAYERS OF TOOLS SUPPORTING MACHINE LEARNING AUTOMATION
Chapter 3: FIGURE 3.1 BASIC PLOTS: LINE GRAPH (TOP LEFT), SCATTER PLOT (TOP RIGHT), BAR...; FIGURE 3.2 DISTRIBUTION PLOTS FOR NUMERICAL VARIABLE MEDV. LEFT: HISTOGRAM A...; FIGURE 3.3 SIDE-BY-SIDE BOXPLOTS FOR EXPLORING THE CAT.MEDV OUTPUT VARIABLE...; FIGURE 3.4 HEATMAP OF A CORRELATION TABLE. DARK RED VALUES DENOTE STRONG POS...; FIGURE 3.5 HEATMAP OF MISSING VALUES IN A DATASET ON MOTOR VEHICLE COLLISION...; FIGURE 3.6 ADDING CATEGORICAL VARIABLES BY COLOR-CODING AND MULTIPLE PANELS...; FIGURE 3.7 SCATTER PLOT MATRIX FOR MEDV AND THREE NUMERICAL PREDICTORS; FIGURE 3.8 RESCALING CAN ENHANCE PLOTS AND REVEAL PATTERNS. LEFT: ORIGINAL S...; FIGURE 3.9 TIME SERIES LINE CHARTS USING DIFFERENT AGGREGATIONS (RIGHT PANEL...; FIGURE 3.10 SCATTER PLOT WITH LABELED POINTS; FIGURE 3.11 SCATTER PLOT OF LARGE DATASET WITH REDUCED MARKER SIZE, JITTERIN...; FIGURE 3.12 PARALLEL COORDINATES PLOT FOR BOSTON HOUSING DATA. EACH OF THE V...; FIGURE 3.13 MULTIPLE INTER-LINKED PLOTS IN A SINGLE VIEW. NOTE THE MARKED OB...; FIGURE 3.14 NETWORK PLOT OF EBAY SELLERS (DARK BLUE CIRCLES) AND BUYERS (LIG...; FIGURE 3.15 TREEMAP SHOWING NEARLY 11,000 EBAY AUCTIONS, ORGANIZED BY ITEM C...; FIGURE 3.16 MAP CHART OF STATISTICS.COM STUDENTS' AND INSTRUCTORS' LOCATIONS...; FIGURE 3.17 WORLD MAPS COMPARING "WELL-BEING" (TOP) TO GDP (BOTTOM). SHADING...
Chapter 4: FIGURE 4.1 DISTRIBUTION OF CAT.MEDV (BLACK DENOTES CAT.MEDV = 0) BY ZN. SIMI...; FIGURE 4.2 QUARTERLY REVENUES OF TOYS "R" US, 1992-1995; FIGURE 4.3 SAMPLE FROM THE 77 BREAKFAST CEREALS DATASET; FIGURE 4.4 SCATTER PLOT OF RATING VS. CALORIES FOR 77 BREAKFAST CEREALS, WIT...; FIGURE 4.5 SCATTER PLOT OF THE SECOND VS. FIRST PRINCIPAL COMPONENTS SCORES...
Chapter 5: FIGURE 5.1 HISTOGRAMS AND BOXPLOTS OF TOYOTA PRICE PREDICTION ERRORS, FOR TR...; FIGURE 5.2 CUMULATIVE GAINS CHART (LEFT) AND DECILE LIFT CHART (RIGHT) FOR C...; FIGURE 5.3 HIGH (TOP) AND LOW (BOTTOM) LEVELS OF SEPARATION BETWEEN TWO CLAS...; FIGURE 5.4 PLOTTING ACCURACY AND OVERALL ERROR AS A FUNCTION OF THE THRESHOL...; FIGURE 5.5 ROC CURVE FOR RIDING-MOWERS EXAMPLE; FIGURE 5.6 PRECISION, RECALL, AND F1-SCORE AS A FUNCTION OF THE THRESHOLD VA...; FIGURE 5.7 CUMULATIVE GAINS CHARTS FOR THE MOWER EXAMPLE USING ROCR, CARET,...; FIGURE 5.8 DECILE-WISE LIFT CHART; FIGURE 5.9 CUMULATIVE GAINS CURVE INCORPORATING COSTS; FIGURE 5.10 CLASSIFICATION ASSUMING EQUAL COSTS OF MISCLASSIFICATION; FIGURE 5.11 CLASSIFICATION ASSUMING UNEQUAL COSTS OF MISCLASSIFICATION; FIGURE 5.12 CLASSIFICATION USING OVERSAMPLING TO ACCOUNT FOR UNEQUAL COSTS; FIGURE 5.13 DECILE-WISE LIFT CHART FOR TRANSACTION DATA; FIGURE 5.14 CUMULATIVE GAINS AND DECILE-WISE LIFT CHARTS FOR SOFTWARE SERVIC...
Chapter 6: FIGURE 6.1 HISTOGRAM OF MODEL ERRORS (BASED ON HOLDOUT SET); FIGURE 6.2 CROSS-VALIDATED RMSE OF TOYOTA PRICE PREDICTION MODELS AS A FUNCT...
Chapter 7: FIGURE 7.1 SCATTER PLOT OF LOT SIZE VS. INCOME FOR THE 14 HOUSEHOLDS IN THE...
Chapter 8: FIGURE 8.1 LIFT CHART OF NAIVE BAYES CLASSIFIER APPLIED TO FLIGHT DELAYS DAT...
Chapter 9: FIGURE 9.1 EXAMPLE OF A TREE FOR CLASSIFYING BANK CUSTOMERS AS LOAN ACCEPTOR...; FIGURE 9.2 SCATTER PLOT OF LOT SIZE VS. INCOME FOR 24 OWNERS AND NONOWNERS O...; FIGURE 9.3 SPLITTING THE 24 RECORDS BY INCOME VALUE OF 60; FIGURE 9.4 VALUES OF THE GINI INDEX AND ENTROPY MEASURE FOR A TWO-CLASS CASE...; FIGURE 9.5 SPLITTING THE 24 RECORDS FIRST BY INCOME VALUE OF 60 AND THEN BY...; FIGURE 9.6 FINAL STAGE OF RECURSIVE PARTITIONING; EACH RECTANGLE CONSISTING...; FIGURE 9.7 TREE REPRESENTATION OF FIRST SPLIT (CORRESPONDS TO FIGURE 9.3); FIGURE 9.8 TREE REPRESENTATION AFTER ALL SPLITS (CORRESPONDS TO FIGURE 9.6)...; FIGURE 9.9 DEFAULT CLASSIFICATION TREE FOR THE LOAN ACCEPTANCE DATA USING TH...; FIGURE 9.10 A FULL TREE FOR THE LOAN ACCEPTANCE DATA USING THE TRAINING SET...; FIGURE 9.11 ERROR RATE AS A FUNCTION OF THE NUMBER OF SPLITS FOR TRAINING VS...; FIGURE 9.12 PRUNED CLASSIFICATION TREE FOR THE LOAN ACCEPTANCE DATA USING CP...; FIGURE 9.13 BEST-PRUNED TREE OBTAINED BY FITTING A FULL TREE TO THE TRAINING...; FIGURE 9.14 BEST-PRUNED REGRESSION TREE FOR TOYOTA COROLLA PRICES; FIGURE 9.15 SCATTER PLOT DESCRIBING A TWO-PREDICTOR CASE WITH TWO CLASSES. T...; FIGURE 9.16 VARIABLE IMPORTANCE PLOT FROM RANDOM FOREST (PERSONAL LOAN EXAMP...; FIGURE 9.17 ROC CURVES COMPARING PERFORMANCE OF BOOSTED TREE, RANDOM FOREST,...; FIGURE 9.18 ROC CURVES OF BOOSTED TREE AND FOCUSED BOOSTED TREE (RIGHT PLOT...
Chapter 10: FIGURE 10.1 (A) ODDS AND (B) LOGIT AS A FUNCTION OF P; FIGURE 10.2 PLOT OF DATA POINTS (PERSONAL LOAN AS A FUNCTION OF INCOME) AND...; FIGURE 10.3 (LEFT) CUMULATIVE GAINS CHART AND (RIGHT) DECILE-WISE LIFT CHART...; FIGURE 10.4 PROPORTION OF DELAYED FLIGHTS BY EACH OF THE SIX PREDICTORS. TIM...; FIGURE 10.5 PERCENT OF DELAYED FLIGHTS (DARKER = HIGHER %DELAYS) BY DAY OF W...; FIGURE 10.6 CONFUSION MATRIX AND CUMULATIVE GAINS CHART FOR THE FLIGHT DELAY...; FIGURE 10.7 GAINS CHART FOR THE LOGISTIC REGRESSION MODEL WITH FEWER PREDICT...
Chapter 11: FIGURE 11.1 MULTILAYER FEEDFORWARD NEURAL NETWORK; FIGURE 11.2 NEURAL NETWORK FOR THE TINY EXAMPLE. CIRCLES REPRESENT NODES ("N...; FIGURE 11.3 COMPUTING NODE OUTPUTS (ON THE RIGHT WITHIN EACH NODE) USING THE...; FIGURE 11.4 NEURAL NETWORK FOR THE TINY EXAMPLE WITH FINAL WEIGHTS FROM R OU...; FIGURE 11.5 LINE DRAWING, FROM AN 1893 FUNK AND WAGNALLS PUBLICATION; FIGURE 11.6 FOCUSING ON THE LINE OF THE MAN'S CHIN; FIGURE 11.7 PIXEL REPRESENTATION OF LINE ON MAN'S CHIN USING SHADING (A) A...; FIGURE 11.8 CONVOLUTION NETWORK PROCESS, SUPERVISED LEARNING: THE REPEATED F...; FIGURE 11.9 AUTOENCODER NETWORK PROCESS: THE REPEATED FILTERING IN THE NETWO...; FIGURE 11.10 FASHION MNIST DATA: SAMPLE OF 10 IMAGES FROM EACH CLASS; FIGURE 11.11 LEARNING CURVE FOR THE DEEP LEARNING NETWORK: TRAINING SET (BLU...
Chapter 12: FIGURE 12.1 SCATTER PLOT OF LOT SIZE VS. INCOME FOR 24 OWNERS AND NONOWNERS...; FIGURE 12.2 PERSONAL LOAN ACCEPTANCE AS A FUNCTION OF INCOME AND CREDIT CARD...; FIGURE 12.3 CLASS SEPARATION OBTAINED FROM THE DISCRIMINANT MODEL COMPARED T...
Chapter 13: FIGURE 13.1 EXAMPLE OF BAGGING AND BOOSTING CLASSIFICATION TREES ON THE PERS...; FIGURE 13.2 EXPLAINING DECISIONS USING LIME. TOP: TWO CASES WHERE THE MODEL...
Chapter 14: FIGURE 14.1 REINFORCEMENT LEARNING USING A CONTEXTUAL MULTI-ARM BANDIT (SIMU...; FIGURE 14.2 SCHEMATIC OF REINFORCEMENT LEARNING, WHERE THE SOFTWARE AGENT "L...
Chapter 15: FIGURE 15.1 RECOMMENDATIONS UNDER "FREQUENTLY BOUGHT TOGETHER" ARE BASED ON...
Chapter 16: FIGURE 16.1 SCATTER PLOT OF FUEL COST VS. SALES FOR THE 22 UTILITIES; FIGURE 16.2 TWO-DIMENSIONAL REPRESENTATION OF SEVERAL DIFFERENT DISTANCE MEA...; FIGURE 16.3 DENDROGRAM: SINGLE LINKAGE (TOP) AND AVERAGE LINKAGE (BOTTOM) FO...; FIGURE 16.4 HEATMAP FOR THE 22 UTILITIES (IN ROWS). ROWS ARE SORTED BY THE S...; FIGURE 16.5 VISUAL PRESENTATION (PROFILE PLOT) OF CLUSTER CENTROIDS; FIGURE 16.6 COMPARING DIFFERENT CHOICES OF k IN TERMS OF OVERALL AVERAGE WIT...
Chapter 17: FIGURE 17.1 MONTHLY RIDERSHIP ON AMTRAK TRAINS (IN THOUSANDS) FROM JANUARY 1...; FIGURE 17.2 TIME PLOTS OF THE DAILY NUMBER OF VEHICLES PASSING THROUGH THE B...; FIGURE 17.3 PLOTS THAT ENHANCE THE DIFFERENT COMPONENTS OF THE TIME SERIES...; FIGURE 17.4 NAIVE AND SEASONAL NAIVE FORECASTS IN A 3-YEAR TEST SET FOR AMTR...; FIGURE 17.5 AVERAGE ANNUAL WEEKLY HOURS SPENT BY CANADIAN MANUFACTURING WORK...
Chapter 18: FIGURE 18.1 A LINEAR TREND FIT TO AMTRAK RIDERSHIP; FIGURE 18.2 A LINEAR TREND FIT TO AMTRAK RIDERSHIP IN THE TRAINING PERIOD AN...; FIGURE 18.3 EXPONENTIAL (AND LINEAR) TREND USED TO FORECAST AMTRAK RIDERSHIP; FIGURE 18.4 QUADRATIC TREND MODEL USED TO FORECAST AMTRAK RIDERSHIP. PLOTS O...; FIGURE 18.5 REGRESSION MODEL WITH SEASONALITY APPLIED TO THE AMTRAK RIDERSHI...; FIGURE 18.6 REGRESSION MODEL WITH TREND AND SEASONALITY APPLIED TO AMTRAK RI...; FIGURE 18.7 AUTOCORRELATION PLOT FOR LAGS 1-12 (FOR FIRST 24 MONTHS OF AMTRA...; FIGURE 18.8 AUTOCORRELATION PLOT OF FORECAST ERRORS SERIES FROM FIGURE 18.6...; FIGURE 18.9 FITTING AN AR(1) MODEL TO THE RESIDUAL SERIES FROM FIGURE 18.6; FIGURE 18.10 AUTOCORRELATIONS OF RESIDUALS-OF-RESIDUALS SERIES; FIGURE 18.11 S&P500 MONTHLY CLOSING PRICES SERIES; FIGURE 18.12 SEASONALLY ADJUSTED PRE-SEPTEMBER-11 AIR SERIES; FIGURE 18.13 AVERAGE ANNUAL WEEKLY HOURS SPENT BY CANADIAN MANUFACTURING WOR...; FIGURE 18.14 QUARTERLY REVENUES OF TOYS "R" US, 1992-1995...; FIGURE 18.15 DAILY CLOSE PRICE OF WALMART STOCK, FEBRUARY 2001-2002; FIGURE 18.16 (TOP) AUTOCORRELATIONS OF WALMART STOCK PRICES AND (BOTTOM) AUT...; FIGURE 18.17 DEPARTMENT STORE QUARTERLY SALES SERIES; FIGURE 18.18 FIT OF REGRESSION MODEL FOR DEPARTMENT STORE SALES; FIGURE 18.19 MONTHLY SALES AT AUSTRALIAN SOUVENIR SHOP IN DOLLARS (TOP) AND...; FIGURE 18.20 QUARTERLY SHIPMENTS OF US HOUSEHOLD APPLIANCES OVER 5 YEARS; FIGURE 18.21 MONTHLY SALES OF SIX TYPES OF AUSTRALIAN WINES BETWEEN 1980 AND...
Chapter 19: FIGURE 19.1 SCHEMATIC OF CENTERED MOVING AVERAGE (TOP) AND TRAILING MOVING A...; FIGURE 19.2 CENTERED MOVING AVERAGE (GREEN LINE) AND TRAILING MOVING AVERAGE...; FIGURE 19.3 TRAILING MOVING AVERAGE FORECASTER WITH APPLIED TO AMTRAK RIDER...; FIGURE 19.4 OUTPUT FOR SIMPLE EXPONENTIAL SMOOTHING FORECASTER WITH , APPLI...; FIGURE 19.5 OUTPUT FOR HOLT-WINTERS EXPONENTIAL SMOOTHING APPLIED TO AMTRAK...; FIGURE 19.6 SCHEMATIC OF RECURRENT NEURAL NETWORK (RNN); FIGURE 19.7 ACTUAL AND FORECASTED VALUES USING THE DEEP LEARNING LSTM MODEL...; FIGURE 19.8 SEASONALLY ADJUSTED PRE-SEPTEMBER-11 AIR SERIES; FIGURE 19.9 DEPARTMENT STORE QUARTERLY SALES SERIES; FIGURE 19.10 FORECASTS AND ACTUALS (TOP) AND FORECAST ERRORS (BOTTOM) USING...; FIGURE 19.11 QUARTERLY SHIPMENTS OF US HOUSEHOLD APPLIANCES OVER 5 YEARS; FIGURE 19.12 MONTHLY SALES OF A CERTAIN SHAMPOO; FIGURE 19.13 QUARTERLY SALES OF NATURAL GAS OVER 4 YEARS; FIGURE 19.14 MONTHLY SALES OF SIX TYPES OF AUSTRALIAN WINES BETWEEN 1980 AND...
Chapter 20: FIGURE 20.1 TINY HYPOTHETICAL LINKEDIN NETWORK; THE EDGES REPRESENT CONNECTI...; FIGURE 20.2 TINY HYPOTHETICAL TWITTER NETWORK WITH DIRECTED EDGES (ARROWS) S...; FIGURE 20.3 EDGE WEIGHTS REPRESENTED BY LINE THICKNESS; FIGURE 20.4 DRUG-LAUNDERING NETWORK IN SAN ANTONIO, TX; FIGURE 20.5 TWO DIFFERENT LAYOUTS OF THE TINY LINKEDIN NETWORK PRESENTED IN...; FIGURE 20.6 THE DEGREE 1 (LEFT) AND DEGREE 2 (RIGHT) EGOCENTRIC NETWORKS FOR...; FIGURE 20.7 A RELATIVELY SPARSE NETWORK; FIGURE 20.8 A RELATIVELY DENSE NETWORK; FIGURE 20.9 THE NETWORKS FOR SUSPECT A (TOP), SUSPECT AA (MIDDLE), AND SUSPE...; FIGURE 20.10 NETWORK FOR LINK PREDICTION EXERCISE
Chapter 21: FIGURE 21.1 DECILE-WISE LIFT CHART FOR AUTOS-ELECTRONICS DOCUMENT CLASSIFICA...; FIGURE 21.2 ROC CURVE FOR THE SENTIMENT ANALYSIS MODEL TRAINED USING GLOVE W...
Chapter 22: FIGURE 22.1 MODEL ACCURACY AND AUC FOR DIFFERENT GROUPS; FIGURE 22.2 FALSE POSITIVE AND FALSE NEGATIVE RATES FOR DIFFERENT GROUPS; FIGURE 22.3 PARTIAL DEPENDENCE PLOTS (PDP) FOR PREDICTOR PRIORS_COUNT

MACHINE LEARNING FOR BUSINESS ANALYTICS
Concepts, Techniques, and Applications in R
Second Edition

GALIT SHMUELI, National Tsing Hua University
PETER C. BRUCE, statistics.com
PETER GEDECK, Collaborative Drug Discovery
INBAL YAHAV, Tel Aviv University
NITIN R. PATEL, Cytel, Inc.

This edition first published 2023.
© 2023 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Galit Shmueli, Peter C. Bruce, Peter Gedeck, Inbal Yahav, and Nitin R. Patel to be identified as the authors of this work has been asserted in accordance with law.

Registered Office: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products, visit us at www.wiley.com. Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty

The publisher and the authors make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of fitness for a particular purpose. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or website is referred to in this work as a citation and/or potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.

Library of Congress Cataloging-in-Publication Data: Applied for
Hardback ISBN: 9781119835172

Cover Design: Wiley
Cover Image: © Hiroshi Watanabe/Getty Images

The beginning of wisdom is this: Get wisdom, and whatever else you get, get insight.
–Proverbs 4:7

Foreword by Ravi Bapna

Converting data into an asset is the new business imperative facing modern managers. Each day, the gap widens between what analytics capabilities make possible and companies' absorptive capacity for creating value from such capabilities. In many ways, data is the new gold, and mining this gold to create business value in today's context of a highly networked and digital society requires a skill set that we haven't traditionally delivered in business, statistics, or engineering programs on their own.

For those businesses and organizations that feel overwhelmed by today's Big Data, the phrase "you ain't seen nothing yet" comes to mind. Yesterday's three major sources of Big Data (the 20+ years of investment in enterprise systems such as ERP, CRM, and SCM; the three billion plus people on the online social grid; and the close to five billion people carrying increasingly sophisticated mobile devices) are going to be dwarfed by tomorrow's smarter physical ecosystems fueled by the Internet of Things (IoT) movement. The idea that we can use sensors to connect physical objects such as homes, automobiles, roads, and even garbage bins and streetlights to digitally optimized systems of governance goes hand in glove with bigger data and the need for deeper analytical capabilities. We are not far away from a smart refrigerator sensing that you are short on, say, eggs, populating your grocery store's mobile app's shopping list, and arranging a TaskRabbit errand to do a grocery run for you, or from that refrigerator negotiating a deal with an Uber driver to deliver an evening meal to you.
Nor are we far away from sensors embedded in roads and vehicles that can compute traffic congestion, track roadway wear and tear, record vehicle use, and factor these into dynamic usage-based pricing, insurance rates, and even taxation. This brave new world is going to be fueled by analytics and the ability to harness data for competitive advantage.

Business Analytics is an emerging discipline that is going to help us ride this new wave. This new Business Analytics discipline requires individuals who are grounded in the fundamentals of business such that they know the right questions to ask; who have the ability to harness, store, and optimally process vast datasets from a variety of structured and unstructured sources; and who can then use an array of techniques from machine learning and statistics to uncover new insights for decision-making. Such individuals are a rare commodity today, but their creation has been the focus of this book for a decade now. This book's forte is that it relies on explaining the core set of concepts required for today's business analytics professionals using real-world, data-rich cases in a hands-on manner, without sacrificing academic rigor. It provides a modern-day foundation for Business Analytics: the notion of linking the x's to the y's of interest in a predictive sense. I say this with the confidence of someone who was probably the first adopter of the zeroth edition of this book (Spring 2006 at the Indian School of Business).

The updated R version is much awaited. R is used by a wide variety of instructors in our MS-Business Analytics program. The open-innovation paradigm used by R is one key part of the analytics perfect storm, the other components being the advances in computing and the business appetite for data-driven decision-making.

The new edition also covers causal analytics via experimentation (often called A/B testing in the industry), which is now becoming mainstream in the tech companies. Further, the authors have added a new chapter on Responsible Data Science, new material on AutoML, expanded coverage of deep learning, and beefed-up deep learning examples in the text mining and forecasting chapters. These updates make this new edition "state of the art" with respect to modern business analytics and AI. I look forward to using the book in multiple fora: in executive education, in MBA classrooms, in MS-Business Analytics programs, and in Data Science bootcamps. I trust you will too!

RAVI BAPNA
Carlson School of Management, University of Minnesota, 2022

Foreword by Gareth James

The field of statistics has existed in one form or another for 200 years, and by the second half of the 20th century it had evolved into a well-respected and essential academic discipline. However, its prominence expanded rapidly in the 1990s with the explosion of new, and enormous, data sources. For the first part of this century, much of this attention was focused on biological applications, in particular genetics data generated as a result of the sequencing of the human genome. However, the last decade has seen a dramatic increase in the availability of data in the business disciplines and a corresponding interest in business-related statistical applications.

The impact has been profound. Fifteen years ago, when I was able to attract a full class of MBA students to my new statistical learning elective, my colleagues were astonished because our department struggled to fill most electives.
Today, we offer a Masters in Business Analytics, which is the largest specialized masters program in the school and has application volume rivaling that of our MBA programs. Our department's faculty size and course offerings have increased dramatically, yet the MBA students are still complaining that the classes are all full. Google's chief economist, Hal Varian, was indeed correct in 2009 when he stated that "the sexy job in the next 10 years will be statisticians."

This demand is driven by a simple but undeniable fact: business analytics solutions have produced significant and measurable improvements in business performance, on multiple dimensions and in numerous settings, and as a result there is a tremendous demand for individuals with the requisite skill set. However, training students in these skills is challenging given that, in addition to the obvious required knowledge of statistical methods, they need to understand business-related issues, possess strong communication skills, and be comfortable dealing with multiple computational packages. Most statistics texts concentrate on abstract training in classical methods, without much emphasis on practical, let alone business, applications.

This book has by far the most comprehensive review of business analytics methods that I have ever seen, covering everything from classical approaches such as linear and logistic regression to modern methods like neural networks, bagging and boosting, and even much more business-specific procedures such as social network analysis and text mining. If not the bible, it is at the least a definitive manual on the subject. However, just as important as the list of topics is the way they are all presented in an applied fashion using business applications. Indeed, the last chapter is entirely dedicated to 10 separate cases where business analytics approaches can be applied.

In this latest edition, the authors have added an important new dimension in the form of the R software package. Easily the most widely used and influential open source statistical software, R has become the go-to tool for such purposes. With literally hundreds of freely available add-on packages, R can be used for almost any business analytics related problem. The book provides detailed descriptions and code involving applications of R in numerous business settings, ensuring that the reader will actually be able to apply their knowledge to real-life problems.

I would strongly recommend this book. I'm confident that it will be an indispensable tool for any MBA or business analytics course.

GARETH JAMES
Goizueta Business School, Emory University, 2022

Preface to the Second R Edition

This textbook first appeared in early 2007 and has been used by numerous students and practitioners and in many courses, including our own experience teaching this material both online and in person for more than 15 years. The first edition, based on the Excel add-in Analytic Solver Data Mining (previously XLMiner), was followed by two more Analytic Solver editions, a JMP edition, an R edition, a Python edition, a RapidMiner edition, and now this new R edition, with its companion website, www.dataminingbook.com.

This new R edition, which relies on the free and open source R software, presents output from R, as well as the code used to produce that output, including specification of a variety of packages and functions.
Unlike computer-science or statistics-oriented textbooks, the focus in this book is on machine learning concepts and how to implement the associated algorithms in R. We assume a basic familiarity with R.

For this new R edition, a new co-author, Peter Gedeck, comes on board, bringing extensive data science experience in business. The new edition provides significant updates both in terms of R and in terms of new topics and content. In addition to updating R code and routines that have changed or become available since the first edition, the new edition provides the following:

- A stronger focus on model selection using cross-validation, with the use of the caret package (a minimal sketch of this workflow follows this list)
- Streamlined data preprocessing using tidyverse style
- Data visualization using ggplot
- Names of R packages, functions, and arguments highlighted in the text, for easy readability
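To give a flavor of the cross-validation workflow that the caret package supports, here is a minimal sketch. It is illustrative only, not code from the book; the data frame housing.df and the outcome column MEDV are placeholder names.

    library(caret)

    # Placeholder data: housing.df, a data frame with numeric outcome MEDV
    # and a set of predictor columns
    set.seed(1)
    train.control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

    # Fit a linear regression, scoring each fold on its held-out records
    model <- train(MEDV ~ ., data = housing.df,
                   method = "lm", trControl = train.control)
    model$results  # cross-validated RMSE, R-squared, and MAE

Swapping the method argument (e.g., "rpart" or "knn") reuses the same resampling setup for other models, which is what makes this style convenient for model selection.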
This edition also incorporates updates and new material based on feedback from instructors teaching MBA, MS, undergraduate, diploma, and executive courses, and from their students. Importantly, this edition includes several new topics:

- A dedicated section on deep learning in Chapter 11, with additional deep learning examples in text mining (Chapter 21) and time series forecasting (Chapter 19)
- A new chapter on Responsible Data Science (Chapter 22), covering topics of fairness, transparency, model cards and datasheets, legal considerations, and more, with an illustrative example
- An expanded Performance Evaluation exposition in Chapter 5, which now includes further metrics (precision and recall, F1)
- A new chapter on Generating, Comparing, and Combining Multiple Models (Chapter 13) that covers ensembles, AutoML, and explaining model predictions
- A new chapter dedicated to Interventions and User Feedback (Chapter 14) that covers A/B tests, uplift modeling, and reinforcement learning
- A new case (Loan Approval) that touches on regulatory and ethical issues

A note about the book's title: The first two editions of the book used the title Data Mining for Business Intelligence. Business intelligence today refers mainly to reporting and data visualization ("what is happening now"), while business analytics has taken over the "advanced analytics," which include predictive analytics and data mining. Later editions were therefore renamed Data Mining for Business Analytics. However, the recent AI transformation has made the term machine learning more popularly associated with the methods in this textbook. In this new edition, we therefore use the updated terms Machine Learning and Business Analytics.

Since the appearance of the (Analytic Solver-based) second edition, the landscape of the courses using the textbook has greatly expanded: whereas initially the book was used mainly in semester-long elective MBA-level courses, it is now used in a variety of courses in business analytics degrees and certificate programs, ranging from undergraduate programs to postgraduate and executive education programs. Courses in such programs also vary in their duration and coverage. In many cases, this textbook is used across multiple courses.

The book is designed to continue supporting the general "predictive analytics" or "data mining" course as well as supporting a set of courses in dedicated business analytics programs. A general "business analytics," "predictive analytics," or "machine learning" course, common in MBA and undergraduate programs as a one-semester elective, would cover Parts I-III and choose a subset of methods from Parts IV and V. Instructors can choose to use cases as team assignments, class discussions, or projects. For a two-semester course, Part VII might be considered, and we recommend introducing Part VIII (Data Analytics).

For a set of courses in a dedicated business analytics program, here are a few courses that have been using our book:

- Predictive Analytics (Supervised Learning): In a dedicated business analytics program, the topic of predictive analytics is typically instructed across a set of courses. The first course would cover Parts I-III, and instructors typically choose a subset of methods from Part IV according to the course length. We recommend including Part VIII: Data Analytics.
- Predictive Analytics (Unsupervised Learning): This course introduces data exploration and visualization, dimension reduction, mining relationships, and clustering (Parts II and VI). If this course follows the Predictive Analytics (Supervised Learning) course, then it is useful to examine examples and approaches that integrate unsupervised and supervised learning, such as Part VIII on Data Analytics.
- Forecasting Analytics: A dedicated course on time series forecasting would rely on Part VII.
- Advanced Analytics: A course that integrates the learnings from predictive analytics (supervised and unsupervised learning) can focus on Part VIII: Data Analytics, where social network analytics and text mining are introduced and responsible data science is discussed. Such a course might also include Chapter 13, Generating, Comparing, and Combining Multiple Models, from Part IV, as well as Part V, which covers experiments, uplift modeling, and reinforcement learning. Some instructors choose to use the cases (Chapter 23) in such a course.

In all courses, we strongly recommend including a project component, where data are either collected by students according to their interest or provided by the instructor (e.g., from the many machine learning competition datasets available). From our experience and other instructors' experience, such projects enhance the learning and provide students with an excellent opportunity to understand the strengths of machine learning and the challenges that arise in the process.

GALIT SHMUELI, PETER C. BRUCE, PETER GEDECK, INBAL YAHAV, AND NITIN R. PATEL
2022

Acknowledgments

We thank the many people who assisted us in improving the book from its inception as Data Mining for Business Intelligence in 2006 (using XLMiner, now Analytic Solver), through its reincarnation as Data Mining for Business Analytics, and now as Machine Learning for Business Analytics, including translations in Chinese and Korean and versions supporting Analytic Solver Data Mining, R, Python, SAS JMP, and RapidMiner.

Anthony Babinec, who has been using earlier editions of this book for years in his data mining courses at Statistics.com, provided us with detailed and expert corrections. Dan Toy and John Elder IV greeted our project with early enthusiasm and provided detailed and useful comments on initial drafts. Ravi Bapna, who used an early draft in a data mining course at the Indian School of Business, and later at the University of Minnesota, has provided invaluable comments and helpful suggestions since the book's start.

Many of the instructors, teaching assistants, and students using earlier editions of the book have contributed invaluable feedback both directly and indirectly, through fruitful discussions, learning journeys, and interesting data mining projects that have helped shape and improve the book.
These include MBA students from the University of Maryland, MIT, the Indian School of Business, National Tsing Hua University, and Statistics.com. Instructors from many universities and teaching programs, too numerous to list, have supported and helped improve the book since its inception. Several professors have been especially helpful with the first R edition: Hayri Tongarlak, Prashant Joshi (UKA Tarsadia University), Jay Annadatha, Roger Bohn, Sridhar Vaithianathan, Travis Greene, and Dianne Cook provided detailed comments and/or R code files for the companion website; Scott Nestler has been a helpful friend of this book project from the beginning.

Kuber Deokar, instructional operations supervisor at Statistics.com, has been unstinting in his assistance, support, and detailed attention. We also thank Anuja Kulkarni, Poonam Patil, and Shweta Jadhav, assistant teachers. Valerie Troiano has shepherded many instructors and students through the Statistics.com courses that have helped nurture the development of these books.

Colleagues and family members have been providing ongoing feedback and assistance with this book project. Vijay Kamble at UIC and Travis Greene at NTHU have provided valuable help with the section on reinforcement learning. Boaz Shmueli and Raquelle Azran gave detailed editorial comments and suggestions on the first two editions; Bruce McCullough and Adam Hughes did the same for the first edition. Noa Shmueli provided careful proofs of the third edition. Ran Shenberger offered design tips. Ken Strasma, founder of the microtargeting firm HaystaqDNA and director of targeting for the 2004 Kerry campaign and the 2008 Obama campaign, provided the scenario and data for the section on uplift modeling. We also thank Jen Golbeck, Professor in the College of Information Studies at the University of Maryland and author of Analyzing the Social Web, whose book inspired our presentation in the chapter on social network analytics. Randall Pruim contributed extensively to the chapter on visualization. Marietta Tretter at Texas A&M shared comments and thoughts on the time series chapters, and Stephen Few and Ben Shneiderman provided feedback and suggestions on the data visualization chapter and overall design tips. Susan Palocsay and Mia Stephens have provided suggestions and feedback on numerous occasions, as has Margret Bjarnadottir. We also thank Catherine Plaisant at the University of Maryland's Human-Computer Interaction Lab, who helped out in a major way by contributing exercises and illustrations to the data visualization chapter. Gregory Piatetsky-Shapiro, founder of KDNuggets.com, was generous with his time and counsel in the early years of this project.

We thank colleagues at the Sloan School of Management at MIT for their support during the formative stage of this book: Dimitris Bertsimas, James Orlin, Robert Freund, Roy Welsch, Gordon Kaufmann, and Gabriel Bitran. As teaching assistants for the data mining course at Sloan, Adam Mersereau gave detailed comments on the notes and cases that were the genesis of this book, Romy Shioda helped with the preparation of several cases and exercises used here, and Mahesh Kumar helped with the material on clustering.

Colleagues at the University of Maryland's Smith School of Business, Shrivardhan Lele, Wolfgang Jank, and Paul Zantek, provided practical advice and comments. We thank Robert Windle and University of Maryland MBA students Timothy Roach, Pablo Macouzet, and Nathan Birckhead for invaluable datasets.
We also thank MBA students Rob Whitener and Daniel Curtis for the heatmap and map charts. Anand Bodapati provided both data and advice. Jake Hofman from Microsoft Research and Sharad Borle assisted with data access. Suresh Ankolekar and Mayank Shah helped develop several cases and provided valuable pedagogical comments. Vinni Bhandari helped write the Charles Book Club case.

We are grateful to colleagues at UMass Lowell's Manning School of Business for their encouragement and support in developing data analytics courses at the undergraduate and graduate levels that led to the development of this edition: Luvai Motiwalla, Harry Zhu, Thomas Sloan, Bob Li, and Sandra Richtermeyer. We also thank Michael Goul (late), Dan Power (late), Ramesh Sharda, Babita Gupta, Ashish Gupta, and Haya Ajjan from the Association for Information Systems' Decision Support and Analytics (SIGDSA) community for ideas and advice that helped the development of the book.

We would like to thank Marvin Zelen, L. J. Wei, and Cyrus Mehta at Harvard, as well as Anil Gore at Pune University, for thought-provoking discussions on the relationship between statistics and data mining. Our thanks to Richard Larson of the Engineering Systems Division, MIT, for sparking many stimulating ideas on the role of data mining in modeling complex systems. Over two decades ago, they helped us develop a balanced philosophical perspective on the emerging field of machine learning.

Lastly, we thank the folks at Wiley for this successful journey of nearly two decades. Steve Quigley at Wiley showed confidence in this book from the beginning and helped us navigate through the publishing process with great speed. Curt Hinrichs' vision, tips, and encouragement helped bring the first edition of this book to the starting gate. Jon Gurstelle guided us through additional editions and translations. Brett Kurzman has taken over the reins and is now shepherding the project. Becky Cowan, Sarah Lemore, Kathleen Pagliaro, Katrina Maceda, and Kavya Ramu greatly assisted us in pushing ahead and finalizing this and the earlier R edition. We are also especially grateful to Amy Hendrickson, who assisted with typesetting and making this book beautiful.

PART I: Preliminaries

CHAPTER 1: Introduction

1.1 WHAT IS BUSINESS ANALYTICS?

Business Analytics (BA) is the practice and art of bringing quantitative data to bear on decision-making. The term means different things to different organizations. Consider the role of analytics in helping newspapers survive the transition to a digital world. One tabloid newspaper with a working-class readership in Britain had launched a web version of the paper and ran tests on its home page to determine which images produced more hits: cats, dogs, or monkeys. This simple application, for this company, was considered analytics. By contrast, the Washington Post has a highly influential audience that is of interest to big defense contractors: it is perhaps the only newspaper where you routinely see advertisements for aircraft carriers. In the digital environment, the Post can track readers by time of day, location, and user subscription information. In this fashion, the display of the aircraft carrier advertisement in the online paper may be focused on a very small group of individuals, say, the members of the House and Senate Armed Services Committees who will be voting on the Pentagon's budget.

Business Analytics, or more generically, analytics, includes a range of data analysis methods. Many powerful applications involve little more than counting, rule-checking, and basic arithmetic. For some organizations, this is what is meant by analytics.

The next level of business analytics, now termed Business Intelligence (BI), refers to data visualization and reporting for understanding "what happened and what is happening." This is done by use of charts, tables, and dashboards to display, examine, and explore data. BI, which earlier consisted mainly of generating static reports, has evolved into more user-friendly and effective tools and practices, such as creating interactive dashboards that allow the user not only to access real-time data, but also to directly interact with it. Effective dashboards are those that tie directly into company data and give managers a tool to quickly see what might not readily be apparent in a large, complex database. One such tool for industrial operations managers displays customer orders in a single two-dimensional display, using color and bubble size as added variables, showing customer name, type of product, size of order, and length of time to produce.
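As an illustration of this kind of display, the ggplot sketch below draws a bubble chart from a small invented dataset; the data frame orders.df and all of its columns are hypothetical, not taken from the book or from any vendor's tool.

    library(ggplot2)

    # Invented example data: one row per customer order
    orders.df <- data.frame(
      customer = c("Acme", "Baxter", "Colton", "Dynamo"),
      product = c("Valve", "Pump", "Valve", "Gasket"),
      order.size = c(1200, 800, 450, 2000),   # units ordered
      days.to.produce = c(14, 30, 7, 45)      # production lead time
    )

    # A single two-dimensional display: production time vs. order size,
    # with color encoding product type and bubble size encoding order size
    ggplot(orders.df, aes(x = days.to.produce, y = order.size,
                          color = product, size = order.size)) +
      geom_point(alpha = 0.7) +
      geom_text(aes(label = customer), vjust = -1.2, size = 3) +
      labs(x = "Days to produce", y = "Order size (units)")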
Many powerful applications involve little more than counting, rule‐checking, and basic arithmetic. For some organizations, this is what is meant by analytics.

The next level of business analytics, now termed Business Intelligence (BI), refers to data visualization and reporting for understanding "what happened and what is happening." This is done by use of charts, tables, and dashboards to display, examine, and explore data. BI, which earlier consisted mainly of generating static reports, has evolved into more user‐friendly and effective tools and practices, such as creating interactive dashboards that allow the user not only to access real‐time data, but also to directly interact with it. Effective dashboards are those that tie directly into company data and give managers a tool to quickly see what might not readily be apparent in a large complex database. One such tool for industrial operations managers displays customer orders in a single two‐dimensional display, using color and bubble size as added variables, showing customer name, type of product, size of order, and length of time to produce.

Business Analytics now typically includes BI as well as sophisticated data analysis methods, such as statistical models and machine learning algorithms used for exploring data, quantifying and explaining relationships between measurements, and predicting new records. Methods like regression models are used to describe and quantify "on average" relationships (e.g., between advertising and sales), to predict new records (e.g., whether a new patient will react positively to a medication), and to forecast future values (e.g., next week's web traffic).

Readers familiar with earlier editions of this book may have noticed that the book title has changed from Data Mining for Business Intelligence to Data Mining for Business Analytics and, finally, in this edition to Machine Learning for Business Analytics. The first change reflected the advent of the term BA, which overtook the earlier term BI to denote advanced analytics. Today, BI is used to refer to data visualization and reporting. The second change reflects how the term machine learning has overtaken the older term data mining.

WHO USES PREDICTIVE ANALYTICS?

The widespread adoption of predictive analytics, coupled with the accelerating availability of data, has increased organizations' capabilities throughout the economy. A few examples are as follows:

Credit scoring: One long‐established use of predictive modeling techniques for business prediction is credit scoring. A credit score is not some arbitrary judgment of creditworthiness; it is based mainly on a predictive model that uses prior data to predict repayment behavior.

Future purchases: A more recent (and controversial) example is Target's use of predictive modeling to classify sales prospects as "pregnant" or "not‐pregnant." Those classified as pregnant could then be sent sales promotions at an early stage of pregnancy, giving Target a head start on a significant purchase stream.

Tax evasion: The US Internal Revenue Service found it was 25 times more likely to find tax evasion when enforcement activity was based on predictive models, allowing agents to focus on the most‐likely tax cheats (Siegel, 2013).

The Business Analytics toolkit also includes statistical experiments, the most common of which is known to marketers as A/B testing. These are often used for pricing decisions: Orbitz, the travel site, found that it could price hotel options higher for Mac users than for Windows users. Staples' online store found it could charge more for staplers if a customer lived far from a Staples store.
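The arithmetic behind a basic A/B test is simple enough to sketch in a few lines of R. The counts below are made up for illustration; base R's prop.test() compares the response rates of two page versions and reports whether the observed difference is larger than chance variation alone would explain.

  # Hypothetical A/B test: two page versions, 5000 visitors each
  conversions <- c(A = 200, B = 260)    # visitors who responded
  visitors    <- c(A = 5000, B = 5000)  # visitors shown each version

  conversions / visitors            # observed rates: 4.0% vs. 5.2%
  prop.test(conversions, visitors)  # two-sample test of equal proportions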
Beware the organizational setting where analytics is a solution in search of a problem: a manager, knowing that business analytics and machine learning are hot areas, decides that her organization must deploy them too, to capture that hidden value that must be lurking somewhere. Successful use of analytics and machine learning requires both an understanding of the business context where value is to be captured and an understanding of exactly what the machine learning methods do.

1.2 WHAT IS MACHINE LEARNING?

In this book, machine learning (or data mining) refers to business analytics methods that go beyond counts, descriptive techniques, reporting, and methods based on business rules. While we do introduce data visualization, which is commonly the first step into more advanced analytics, the book focuses mostly on the more advanced data analytics tools. Specifically, it includes statistical and machine learning methods that inform decision‐making, often in an automated fashion. Prediction is typically an important component, often at the individual level. Rather than "what is the relationship between advertising and sales," we might be interested in "what specific advertisement, or recommended product, should be shown to a given online shopper at this moment?" Or we might be interested in clustering customers into different "personas" that receive different marketing treatment and then assigning each new prospect to one of these personas. The era of Big Data has accelerated the use of machine learning. Machine learning methods, with their power and automaticity, have the ability to cope with huge amounts of data and extract value.

1.3 MACHINE LEARNING, AI, AND RELATED TERMS

The field of analytics is growing rapidly, both in terms of the breadth of applications and in terms of the number of organizations using advanced analytics. As a result, there is considerable overlap and inconsistency of definitions. Terms have also changed over time. The older term data mining itself means different things to different people. To the general public, it may have a general, somewhat hazy and pejorative meaning of digging through vast stores of (often personal) data in search of something interesting. Data mining, as it refers to analytic techniques, has largely been superseded by the term machine learning. Other terms that organizations use are predictive analytics, predictive modeling, and, most recently, machine learning and artificial intelligence (AI). Many practitioners, particularly those from the IT and computer science communities, use the term AI to refer to all the methods discussed in this book. AI originally referred to the general capability of a machine to act like a human, and, in its earlier days, existed mainly in the realm of science fiction and the unrealized ambitions of computer scientists. More recently, it has come to encompass the methods of statistical and machine learning discussed in this book, as the primary enablers of that grand vision, and sometimes the term is used loosely to mean the same thing as machine learning. More broadly, it includes generative capabilities such as the creation of images, audio, and video.
Statistical Modeling vs. Machine Learning

A variety of techniques for exploring data and building models have been around for a long time in the world of statistics: linear regression, logistic regression, discriminant analysis, and principal components analysis, for example. However, the core tenets of classical statistics—computing is difficult and data are scarce—do not apply in machine learning applications where both data and computing power are plentiful. This gives rise to Daryl Pregibon's description of "data mining" (in the sense of machine learning) as "statistics at scale and speed" (Pregibon, 1999).

Another major difference between the fields of statistics and machine learning is the focus in statistics on inference from a sample to the population regarding an "average effect"—for example, "a $1 price increase will reduce average demand by 2 boxes." In contrast, the focus in machine learning is on predicting individual records—"the predicted demand for person i given a $1 price increase is 1 box, while for person j it is 3 boxes." The emphasis that classical statistics places on inference (determining whether a pattern or interesting result might have happened by chance in our sample) is absent from machine learning. Note also that the term inference is often used in the machine learning community to refer to the process of using a model to make predictions for new data, also called scoring, in contrast to its meaning in the statistical community.

In comparison with statistics, machine learning deals with large datasets in an open‐ended fashion, making it impossible to put the strict limits around the question being addressed that classical statistical inference would require. As a result, the general approach to machine learning is vulnerable to the danger of overfitting, where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data, but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal.

In this book, we use the term machine learning algorithm to refer to methods that learn directly from data, especially local patterns, often in layered or iterative fashion. In contrast, we use statistical models to refer to methods that apply global structure to the data and that can be written as a simple mathematical equation. A simple example is a linear regression model (statistical) vs. a k‐nearest neighbors algorithm (machine learning). A given record would be treated by linear regression in accord with an overall linear equation that applies to all the records. In k‐nearest neighbors, that record would be classified in accord with the values of a small number of nearby records.
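The contrast is easy to see in a few lines of R. The numbers below are made up for illustration: the linear regression produces one global equation and applies it to the new record, while the k‐nearest neighbors prediction (computed by hand here for transparency) depends only on the few training records closest to it.

  # Toy training data: income ($000s) and annual spending ($000s)
  train <- data.frame(income = c(40, 55, 60, 75, 90, 120),
                      spend  = c(1.2, 1.9, 2.1, 2.6, 3.3, 4.1))
  new <- data.frame(income = 70)  # a new record to be predicted

  # Statistical model: one global linear equation fit to all records
  lm_fit <- lm(spend ~ income, data = train)
  predict(lm_fit, new)

  # Machine learning algorithm: k-NN, driven by nearby records only
  k <- 3
  nearest <- order(abs(train$income - new$income))[1:k]
  mean(train$spend[nearest])  # average outcome of the 3 closest records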
1.4 BIG DATA

Machine learning and Big Data go hand in hand. Big Data is a relative term—data today are big by reference to the past and to the methods and devices available to deal with them. The challenge Big Data presents is often characterized by the four V's—volume, velocity, variety, and veracity. Volume refers to the amount of data. Velocity refers to the flow rate—the speed at which it is being generated and changed. Variety refers to the different types of data being generated (time stamps, location, numbers, text, images, etc.). Veracity refers to the fact that data is being generated by organic distributed processes (e.g., millions of people signing up for services or free downloads) and not subject to the controls or quality checks that apply to data collected for a study.

Most large organizations face both the challenge and the opportunity of Big Data because most routine data processes now generate data that can be stored and, possibly, analyzed. The scale can be visualized by comparing the data in a traditional statistical analysis (say, 15 variables and 5000 records) to the Walmart database. If you consider the traditional statistical study to be the size of a period at the end of a sentence, then the Walmart database is the size of a football field. Moreover, that probably does not include other data associated with Walmart—social media data, for example, which comes in the form of unstructured text.

If the analytical challenge is substantial, so can be the reward: OKCupid, the online dating site, uses statistical models with their data to predict what forms of message content are most likely to produce a response. Telenor, a Norwegian mobile phone service company, was able to reduce subscriber turnover by 37% by using models to predict which customers were most likely to leave and then lavishing attention on them. Allstate, the insurance company, tripled the accuracy of predicting injury liability in auto claims by incorporating more information about vehicle type. The above examples are from Eric Siegel's book Predictive Analytics (2013, Wiley).

Some extremely valuable tasks were not even feasible before the era of Big Data. Consider web searches, the technology on which Google was built. In the early days, a search for "Ricky Ricardo Little Red Riding Hood" would have yielded various links to the I Love Lucy TV show, other links to Ricardo's career as a band leader, and links to the children's story of Little Red Riding Hood. Only once the Google database had accumulated sufficient data (including records of what users clicked on) would the search yield, in the top position, links to the specific I Love Lucy episode in which Ricky enacts, in a comic mixture of Spanish and English, Little Red Riding Hood for his infant son.

1.5 DATA SCIENCE

The ubiquity, size, value, and importance of Big Data have given rise to a new profession: the data scientist. Data science is a mix of skills in the areas of statistics, machine learning, math, programming, business, and IT. The term itself is thus broader than the other concepts we discussed above, and it is a rare individual who combines deep skills in all the constituent areas. In their book Analyzing the Analyzers (Harris et al., 2013), the authors describe the skill sets of most data scientists as resembling a "T"—deep in one area (the vertical bar of the T) and shallower in other areas (the top of the T). At a large data science conference session (Strata+Hadoop World, October 2014), most attendees felt that programming was an essential skill, though there was a sizable minority who felt otherwise. Also, although Big Data is the motivating power behind the growth of data science, most data scientists do not actually spend most of their time working with terabyte‐size or larger data.

Data of the terabyte or larger size would be involved at the deployment stage of a model. There are manifold challenges at that stage, most of them IT and programming issues related to data handling and tying together different components of a system. Much work must precede that phase. It is that earlier piloting and prototyping phase on which this book focuses—developing the statistical and machine learning models that will eventually be plugged into a deployed system. What methods do you use with what sorts of data and problems?
How do the methods work? What are their requirements, their strengths, their weaknesses? How do you assess their performance?

1.6 WHY ARE THERE SO MANY DIFFERENT METHODS?

As can be seen in this book or any other resource on machine learning, there are many different methods for prediction and classification. You might ask yourself why they coexist and whether some are better than others. The answer is that each method has advantages and disadvantages. The usefulness of a method can depend on factors such as the size of the dataset, the types of patterns that exist in the data, whether the data meet some underlying assumptions of the method, how noisy the data are, and the particular goal of the analysis. A small illustration is shown in Figure 1.1, where the goal is to find a combination of household income level and household lot size that separates buyers (solid circles) from nonbuyers (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal and vertical lines to separate buyers from nonbuyers, whereas the second method (right panel) looks for a single diagonal line.

FIGURE 1.1 TWO METHODS FOR SEPARATING OWNERS FROM NONOWNERS

Different methods can lead to different results, and their performance can vary. It is therefore customary in machine learning to apply several different methods and select the one that appears most useful for the goal at hand.
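Boundaries of the two kinds shown in Figure 1.1 are produced by methods covered later in the book: a classification tree splits on one variable at a time (horizontal and vertical lines), while discriminant analysis draws a single linear (diagonal) boundary. The sketch below fits both to a handful of made‐up records standing in for the riding‐mower data (the book's version of this dataset accompanies the companion mlba package); each model is then asked to classify the same new household, and the two need not agree.

  library(rpart)  # classification tree: axis-parallel splits
  library(MASS)   # lda(): a single linear (diagonal) boundary

  # Made-up records: income ($000s), lot size (000s sq ft), ownership
  mowers <- data.frame(
    Income   = c(60, 85, 65, 110, 45, 82, 40, 100, 58, 95),
    Lot_Size = c(18, 22, 21, 24, 16, 20, 15, 23, 17, 21),
    Owner    = factor(c("no", "yes", "no", "yes", "no",
                        "yes", "no", "yes", "no", "yes")))

  tree_fit <- rpart(Owner ~ Income + Lot_Size, data = mowers,
                    method = "class", minsplit = 2)
  lda_fit <- lda(Owner ~ Income + Lot_Size, data = mowers)

  new_household <- data.frame(Income = 70, Lot_Size = 19)
  predict(tree_fit, new_household, type = "class")  # rectangle-based rule
  predict(lda_fit, new_household)$class             # diagonal-line rule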
1.7 TERMINOLOGY AND NOTATION

Because of the hybrid parentage of data science, its practitioners often use multiple terms to refer to the same thing. For example, in the machine learning and artificial intelligence fields, the variable being predicted is the output variable or target variable. A categorical target variable is often called a label. To a statistician or social scientist, the variable being predicted is the dependent variable or the response. Here is a summary of terms used:

Algorithm A specific procedure used to implement a particular machine learning technique: classification tree, discriminant analysis, and the like.

Attribute see Predictor.

Case see Observation.

Categorical variable A variable that takes on one of several fixed values, e.g., a flight could be on‐time, delayed, or canceled.

Confidence A performance measure in association rules of the type "IF A and B are purchased, THEN C is also purchased." Confidence is the conditional probability that C will be purchased IF A and B are purchased. Confidence also has a broader meaning in statistics (confidence interval), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.

Dependent variable see Response.

Estimation see Prediction.

Factor variable see Categorical variable.

Feature see Predictor.

Holdout data (or Holdout set) A sample of data not used in fitting a model, but instead used only at the end of the model building and selection process to assess how well the final model might perform on new data. This book uses the term holdout set instead of validation set and test set.

Inference In statistics, the process of accounting for chance variation when making estimates or drawing conclusions based on samples; in machine learning, the term often refers to the process of using a model to make predictions for new data (see Score).

Input variable see Predictor.

Label A categorical variable being predicted in supervised learning.

Model An algorithm as applied to a dataset, complete with its settings (many of the algorithms have parameters that the user can adjust).

Observation The unit of analysis on which the measurements are taken (a customer, a transaction, etc.), also called instance, sample, example, case, record, pattern, or row. In spreadsheets, each row typically represents a record; each column, a variable. Note that the use of the term "sample" here is different from its usual meaning in statistics, where it refers to a collection of observations.

Outcome variable see Response.

Output variable see Response.

P(A|B) The conditional probability of event A occurring given that event B has occurred, read as "the probability that A will occur given that B has occurred."

Prediction The prediction of the numerical value of a continuous output variable; also called estimation.

Predictor A variable, usually denoted by X, used as an input into a predictive model, also called a feature, input variable, independent variable, or, from a database perspective, a field.

Profile A set of measurements on an observation (e.g., the height, weight, and age of a person).

Record see Observation.

Response A variable, usually denoted by Y, which is the variable being predicted in supervised learning, also called dependent variable, output variable, target variable, or outcome variable.

Sample In the statistical community, "sample" means a collection of observations. In the machine learning community, "sample" means a single observation.

Score A predicted value or class. Scoring new data means using a model developed with training data to predict output values in new data.

Success class The class of interest in a binary outcome (e.g., purchasers in the outcome purchase/no purchase); the outcome need not be favorable.

Supervised learning The process of providing an algorithm (logistic regression, classification tree, etc.) with records in which an output variable of interest is known, and the algorithm "learns" how to predict this value for new records where the output is unknown.

Target see Response.

Test data (or Test set) Sometimes used to refer to the portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on new data. This book uses the term holdout set instead and uses the term validation to refer to certain validation checks (e.g., cross‐validation) during the model‐tuning phase.

Training data (or Training set) The portion of the data used to fit a model.

Unsupervised learning An analysis in which one attempts to learn patterns in the data other than predicting an output value of interest.

Validation data (or Validation set) The portion of the data used to assess how well the model fits, to adjust models, and to select the best model from among those that have been tried.

Variable Any measurement on the records, including both the input (X) variables and the output (Y) variable.

1.8 ROAD MAPS TO THIS BOOK

The book covers many of the widely used predictive and classification methods as well as other machine learning tools. Figure 1.2 outlines machine learning from a process perspective and shows where the topics in this book fit in. Chapter numbers are indicated beside the topic. Table 1.1 provides a different perspective: it organizes supervised and unsupervised machine learning procedures according to the type and structure of the data.

FIGURE 1.2 MACHINE LEARNING FROM A PROCESS PERSPECTIVE. NUMBERS IN PARENTHESES INDICATE CHAPTER NUMBERS
TABLE 1.1 ORGANIZATION OF MACHINE LEARNING METHODS IN THIS BOOK, ACCORDING TO THE NATURE OF THE DATA (a)

Continuous predictors
  Supervised, continuous response: Linear regression (6), Neural nets (11), k‐Nearest neighbors (7), Ensembles (13)
  Supervised, categorical response: Logistic regression (10), Neural nets (11), Discriminant analysis (12), k‐Nearest neighbors (7), Ensembles (13)
  Unsupervised, no response: Principal components (4), Cluster analysis (16), Collaborative filtering (15)

Categorical predictors
  Supervised, continuous response: Linear regression (6), Neural nets (11), Regression trees (9), Ensembles (13)
  Supervised, categorical response: Neural nets (11), Classification trees (9), Logistic regression (10), Naive Bayes (8), Ensembles (13)
  Unsupervised, no response: Association rules (15), Collaborative filtering (15)

(a) Numbers in parentheses indicate the chapter number.

Order of Topics

The book is divided into nine parts: Part I (Chapters 1 and 2) gives a general overview of machine learning and its components. Part II (Chapters 3 and 4) focuses on the early stages of data exploration and dimension reduction.

Part III (Chapter 5) discusses performance evaluation. Although it contains only one chapter, we discuss a variety of topics, from predictive performance metrics to misclassification costs. The principles covered in this part are crucial for the proper evaluation and comparison of supervised learning methods.

Part IV includes eight chapters (Chapters 6–13), covering a variety of popular supervised learning methods (for classification and/or prediction). Within this part, the topics are generally organized according to the level of sophistication of the algorithms, their popularity, and ease of understanding. The final chapter introduces ensembles and combinations of methods.

Part V (Chapter 14) introduces the notions of experiments, intervention, and user feedback. This single chapter starts with A/B testing, moves to its use in uplift modeling, and finally expands into reinforcement learning, explaining the basic ideas and formulations that utilize user feedback for learning the best treatment assignments.

Part VI focuses on unsupervised mining of relationships. It presents association rules and collaborative filtering (Chapter 15) and cluster analysis (Chapter 16).

Part VII includes three chapters (Chapters 17–19), with the focus on forecasting time series. The first chapter covers general issues related to handling and understanding time series. The next two chapters present two popular forecasting approaches: regression‐based forecasting and smoothing methods.

Part VIII presents two broad data analytics topics: social network analysis (Chapter 20) and text mining (Chapter 21). These methods apply machine learning to specialized data structures: social networks and text. The final chapter on responsible data science (Chapter 22) introduces key issues to consider when carrying out a machine learning project in a responsible way.

Finally, Part IX includes a set of cases. Although the topics in the book can be covered in the order of the chapters, each chapter stands alone. We advise, however, reading Parts I–III before proceeding to chapters in Parts IV–V. Similarly, Chapter 17 should precede the other chapters in Part VII.

USING R AND RSTUDIO

To facilitate a hands‐on machine learning experience, this book uses R, a free software environment for statistical computing and graphics, and RStudio, an integrated development environment (IDE) for R.
The R programming language is widely used in academia and industry for machine learning and data analysis. R offers a variety of methods for analyzing data, provided by a large collection of separate packages. Among the numerous packages, R has extensive coverage of statistical and machine learning techniques for classification, prediction, mining associations and text, forecasting, and data exploration and reduction. It offers a variety of supervised machine learning tools: neural nets, classification and regression trees, k‐nearest‐neighbor classification, naive Bayes, logistic regression, linear regression, and discriminant analysis, all for predictive modeling. R's packages also cover unsupervised algorithms: association rules, collaborative filtering, principal components analysis, k‐means clustering, and hierarchical clustering, as well as visualization tools and data‐handling utilities. Often, the same method is implemented in multiple packages, as we will discuss throughout the book. The illustrations, exercises, and cases in this book are written in relation to R.

Download: To download R and RStudio, visit www.r-project.org and www.rstudio.com/products/RStudio and follow the instructions there.

Installation: Install both R and RStudio. Note that R releases new versions fairly often. When a new version is released, some packages might require a new installation of R (this is rare).

Use: To start using R, open RStudio, then open a new script under File > New File > R Script. RStudio contains four panels, as shown in Figure 1.3: Script (top left), Console (bottom left), Environment (top right), and additional information, such as plots and help (bottom right). To run a selected code line from the Script panel, press Ctrl+Enter (Cmd+Return on a Mac). Code lines starting with # are comments.

Package Installation: To start using an R package, you will first need to install it. Installation is done via the Packages tab in the information panel or using the command install.packages(). New packages might not support old R versions and might require a new R installation.

Source Code Availability: The code is available from the accompanying websites https://www.dataminingbook.com/ and https://github.com/gedeck/mlba-R-code; data are available in the R package mlba.

FIGURE 1.3 RSTUDIO SCREEN
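As a quick orientation, a first session might look like the following sketch (the package name is just an example; any CRAN package is installed and loaded the same way). Each line can be run from the Script panel, with output appearing in the Console and plots in the Plots tab.

  install.packages("ggplot2")  # one-time download from CRAN
  library(ggplot2)             # load the package in each new session

  x <- rnorm(100)              # generate 100 random values
  summary(x)                   # numerical summary, printed in the Console
  hist(x)                      # histogram, shown in the Plots tab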
CHAPTER 2
Overview of the Machine Learning Process

In this chapter, we give an overview of the steps involved in machine learning (ML), starting from a clear goal definition and ending with model deployment. The general steps are shown schematically in Figure 2.1. We also discuss issues related to data collection, cleaning, and preprocessing. We introduce the notion of data partitioning, where methods are trained on a set of training data and then their performance is evaluated on a separate set of holdout data, and explain how this practice helps avoid overfitting. Finally, we illustrate the steps of model building by applying them to data.

FIGURE 2.1 SCHEMATIC OF THE DATA MODELING PROCESS

2.1 INTRODUCTION

In Chapter 1, we saw some very general definitions of business analytics and machine learning. In this chapter, we introduce a variety of machine learning methods. The core of this book focuses on what has come to be called predictive analytics, the tasks of classification and prediction as well as pattern discovery, which have become key elements of a "business analytics" function in most large firms. These terms are described next.

2.2 CORE IDEAS IN MACHINE LEARNING

Classification

Classification is perhaps the most basic form of predictive analytics. The recipient of an offer can respond or not respond. An applicant for a loan can repay on time, repay late, or declare bankruptcy. A credit card transaction can be normal or fraudulent. A packet of data traveling on a network can be benign or threatening. A bus in a fleet can be available for service or unavailable. The victim of an illness can recover, still be ill, or be deceased.

A common task in machine learning is to examine data where the classification is unknown or will occur in the future, with the goal of predicting what that classification is or will be. Similar data where the classification is known are used to develop rules, which are then applied to the data with the unknown classification.

Prediction

Prediction is similar to classification, except that we are trying to predict the value of a numerical variable (e.g., amount of purchase) rather than a class (e.g., purchaser or nonpurchaser). Of course, in classification we are trying to predict a class, but the term prediction in this book refers to the prediction of the value of a continuous numerical variable. Sometimes in the machine learning literature, the terms estimation and regression are used to refer to the prediction of the value of a continuous variable, and prediction may be used for both continuous and categorical data.

Association Rules and Recommendation Systems

Large databases of customer transactions lend themselves naturally to the analysis of associations among items purchased, or "what goes with what." Association rules, or affinity analysis, are designed to find such general association patterns between items in large databases. The rules can then be used in a variety of ways. For example, grocery stores can use such information for product placement. They can use the rules for weekly promotional offers or for bundling products. Association rules derived from a hospital database on patients' symptoms during consecutive hospitalizations can help find "which symptom is followed by what other symptom" to help predict future symptoms for returning patients.

Online recommendation systems, such as those used on Amazon and Netflix, use collaborative filtering, a method that uses individual users' preferences and tastes given their historic purchase, rating, browsing, or any other measurable behavior indicative of preference, as well as other users' history. In contrast to association rules that generate rules general to an entire population, collaborative filtering generates "what goes with what" at the individual user level. Hence, collaborative filtering is used in many recommendation systems that aim to deliver personalized recommendations to users with a wide range of preferences.
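To make the rule-mining idea concrete, here is a small sketch using the arules package (one of several R packages that implement association rules; the market baskets below are made up). The apriori() call searches for rules that meet minimum support and confidence thresholds, notions discussed in Chapter 15.

  library(arules)  # assumes the arules package is installed

  # Five made-up market baskets
  baskets <- list(c("bread", "milk"),
                  c("bread", "diapers", "beer"),
                  c("milk", "diapers", "beer"),
                  c("bread", "milk", "diapers", "beer"),
                  c("bread", "milk", "diapers"))
  trans <- as(baskets, "transactions")

  # Keep rules holding in at least 40% of baskets, with confidence >= 70%
  rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.7))
  inspect(sort(rules, by = "confidence"))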
Predictive Analytics

Classification, prediction, and, to some extent, association rules and collaborative filtering constitute the analytical methods employed in predictive analytics. The term predictive analytics is sometimes used to also include data pattern identification methods such as clustering.

Data Reduction and Dimension Reduction

The performance of some machine learning algorithms is often improved when the number of variables is limited, and when large numbers of records can be grouped into homogeneous groups. For example, rather than dealing with thousands of product types, an analyst might wish to group them into a smaller number of groups and build separate models for each group. Or a marketer might want to classify customers into different "personas" and must therefore group customers into homogeneous groups to define the personas. This process of consolidating a large number of records (or cases) into a smaller set is termed data reduction. Methods for reducing the number of cases are often called clustering. Reducing the number of variables is typically called dimension reduction. Dimension reduction is a common initial step before deploying machine learning methods, intended to improve predictive power, manageability, and interpretability.

Data Exploration and Visualization

One of the earliest stages of engaging with data is exploring it. Exploration is aimed at understanding the global landscape of the data and detecting unusual values. Exploration is used for data cleaning and manipulation as well as for visual discovery and "hypothesis generation." Methods for exploring data include looking at various data aggregations and summaries, both numerically and graphically. This includes looking at each variable separately as well as looking at relationships among variables. The purpose is to discover patterns and exceptions. Exploration by creating charts and dashboards is called data visualization or visual analytics. For numerical variables, we use histograms and boxplots to learn about the distribution of their values, to detect outliers (extreme observations), and to find other information that is relevant to the analysis task. Similarly, for categorical variables, we use bar charts. We can also look at scatter plots of pairs of numerical variables to learn about possible relationships, the type of relationship, and, again, to detect outliers. Visualization can be greatly enhanced by adding features such as color and interactive navigation.

Supervised and Unsupervised Learning

A fundamental distinction among machine learning techniques is between supervised and unsupervised methods. Supervised learning algorithms are those used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known. Such data are also called "labeled data," since they contain the label (outcome value) for each record. The use of the term "label" reflects the fact that the outcome of interest for a record may often be a characterization applied by a human: a document may be labeled as relevant, or an object in an X‐ray may be labeled as malignant.

These training data are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable. Once the algorithm has learned from the training data, it is then applied to another sample of data (the validation data) where the outcome is known, to see how well it does in comparison to other models (either a different algorithm or different parameter values of the same algorithm). If many different models are being tried out, it is prudent to save a third sample, which also includes known outcomes (the holdout data), to use with the model finally selected to predict how well it will do. The model can then be used to classify or predict the outcome of interest in new cases where the outcome is unknown.

Simple linear regression is an example of a supervised learning algorithm (although it is rarely called that in the introductory statistics course where you probably first encountered it). The Y variable is the (known) outcome variable, and the X variable is a predictor variable. A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by this line. The regression line can now be used to predict Y values for new values of X for which we do not know the Y value.
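A three‐way split of this kind takes only a few lines of R. The sketch below uses the built‐in mtcars data frame purely as a stand‐in for any dataset, and the 50/30/20 proportions are illustrative rather than a recommendation.

  df <- mtcars      # any data frame of records would do
  set.seed(1)       # make the random split reproducible
  n <- nrow(df)
  idx <- sample(n)  # shuffle the record indices

  train   <- df[idx[1:floor(0.5 * n)], ]                    # fit models here
  valid   <- df[idx[(floor(0.5 * n) + 1):floor(0.8 * n)], ] # compare and tune here
  holdout <- df[idx[(floor(0.8 * n) + 1):n], ]              # final assessment only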
Unsupervised learning algorithms are those used where there is no outcome variable to predict or classify. Hence, there is no "learning" from cases where such an outcome variable is known. Association rules, dimension reduction methods, and clustering techniques are all unsupervised learning methods.

Supervised and unsupervised methods are sometimes used in conjunction. For example, unsupervised clustering methods are used to separate loan applicants into several risk‐level groups. Then, supervised algorithms are applied separately to each risk‐level group for predicting propensity of loan default.

SUPERVISED LEARNING REQUIRES GOOD SUPERVISION

In some cases, the value of the outcome variable (the "label") is known because it is an inherent component of the data. Web logs will show whether a person clicked on a link or not. Bank records will show whether a loan was paid on time or not. In other cases, the value of the known outcome must be supplied by a human labeling process to accumulate enough data to train a model. E‐mail must be labeled as spam or legitimate; documents in legal discovery must be labeled as relevant or irrelevant. In either case, the machine learning algorithm can be led astray if the quality of the supervision is poor.

Gene Weingarten reported in the January 5, 2014 Washington Post magazine how the strange phrase "defiantly recommend" is making its way into English via autocorrection. "Defiantly" is closer to the common misspelling definatly than definitely, so Google, in the early days, offered it as