AC488-AC651-AC685_Chapter.2.pdf
University of Alabama in Huntsville
DATA MINING & ANALYTICS
CHAPTER 2
Tommie Singleton, Ph.D, CPA, CITP, CISA
(256) 762-5252
[email protected]

CHAPTER 2: DATA MINING DEFINITION

Types of Data Mining Questions (2.1)
- What is contained in the databases?
- What kinds of patterns can be discerned from the complex maze of data?
- How can all these data be used to our future benefit?

Population & Sample (2.1.1)
- Data sets can be extremely large in volume: millions of records/transactions.
- Different types of industries will have different ranges of data volume:
  - Web apps: XXL
  - Loyalty clubs: S
  - CRM: M/L
- Laws and industry customs vary, but often one can purchase, rent, or freely access detailed or summary information/datasets.
- Data mining uses the scientific method of exploration and application.
- We are usually dealing with a mass of potential data and sometimes need to consider the whole population.
- Other times you may only have access to a (hopefully) large sample dataset.
- If the dataset is < 10k, it is probably best to use the whole set and not a sample.
- If the dataset is very large, we may choose to work with a subset or sample.
- If we use a sample, we must take precautions so that results can effectively be extrapolated to the population (which depends on the constraints of the model/stat test).
- The sample must be representative of the population and unbiased.
- Random samples are sometimes the best way to collect a sample (vs. directed sample, or 2-phase sample-PPC).
- Sampling is a discipline in itself.
- Sometimes we consider a portion, or slice, of the population:
  - Buying behavior around Christmas, summer, etc.
  - In this case, the subset is referred to as a “sampling frame”, from which further samples will be selected.

Data Preparation (2.1.2)
- Data is often considered homogeneous when, for mining purposes, it is not.
- Data is AUTOMATICALLY seen as matter of fact, concrete, reliable and suitable.
- But data (numbers) always have SOME inherent variation.
  - Example: two transactions for the sale of the same product on the same day, BUT they were sold at different prices.
- There are nuances to data that should be considered.
- Some data preparation may be needed to make the data MEANINGFUL and RELIABLE for the purpose of the mining/analysis.

MODELS: Supervised vs. Unsupervised (2.1.3)
- Data mining/analytics uses a variety of data analysis methods to discover patterns and relationships that are:
  - Unknown
  - Unexpected
  - Interesting
  - And (most importantly) relevant
- The results may be used to make valid and accurate predictions.
- Generally, there are 2 methods for data analysis:
  - SUPERVISED
  - UNSUPERVISED
- For both methods, a sample of empirical (observed) data is required.
  - It is termed the “training sample”.
  - The training sample is used by data mining/analytics to learn patterns in the data.
- Independent variables: factoids/data we know that SHOULD or MIGHT have an impact on the TARGET variable we want to affect.
- TARGET variable: a.k.a. dependent variable (the value depends on its relationship to the independent variables).
- The “Learning System” is a suitable method to determine relationships between the independent variables and the dependent (target) variable.
- The Learning System output is then compared to the output from the sample.
  - Using a sample of actual data, what is the OUTPUT variable (target variable) to PREDICT (who will respond to the sales campaign)?
- The difference between the learning system OUTPUT and the sample OUTPUT is the ERROR (deviation).
  - The error is used to adjust the learning system until the error rate is acceptable (C.I.).

[Figure: Output Variable]

Sample case
- Management has asked us to develop a model for predicting the total sales amount for a particular flyer campaign being planned.
- In the brainstorming session with data scientists and management, the following variables with known values were identified as predictors of sales:
  - How much the existing customer spent with us in the last 6 months
  - Whether the customer responded to a prior flyer campaign
  - For customers who bought in the category of the item being promoted (last 6 months), how many items they bought
- We surveyed the experts in the room and developed percentages of how much each variable affects purchases. In order: 40%, 40%, 20%.

SUPERVISED: TRAINING PHASE
Y (Total Sales) = a1*x1 + a2*x2 + a3*x3
- Learning system: Supervised DM
- Dependent variable: Y = $$ from sales campaign
- a1 = .40, a2 = .40, a3 = .20
- x1: $ customer spent in the last 6 months
- x2: customer response (1 = did, 0 = did not)
- x3: how many items bought in the category in the last 6 months
(Dr. S / Stat Version)
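The training-phase formula can be tried out numerically. The following is a minimal sketch, assuming the expert weights of 40%, 40% and 20% and the formula exactly as written on the slide (the inputs are not rescaled, even though x1 is in dollars and x2 is 0/1). The function names and the three sample customers are hypothetical and only illustrate how the learning-system output is compared with the sample output to obtain an error rate.

```python
# Expert weights from the brainstorming session: a1 = .40, a2 = .40, a3 = .20.
WEIGHTS = (0.40, 0.40, 0.20)


def predict_sales(x1, x2, x3, weights=WEIGHTS):
    """Learning-system output: Y = a1*x1 + a2*x2 + a3*x3."""
    a1, a2, a3 = weights
    return a1 * x1 + a2 * x2 + a3 * x3


def error_rate(sample, weights=WEIGHTS):
    """Relative deviation between predicted and actual total sales."""
    predicted = sum(predict_sales(x1, x2, x3, weights) for x1, x2, x3, _ in sample)
    actual = sum(y for *_, y in sample)
    return abs(predicted - actual) / actual


# Hypothetical training sample (values invented for illustration):
# (x1 = $ spent in last 6 months, x2 = responded before, x3 = items bought, actual Y)
sample = [
    (200.0, 1, 3, 90.0),
    (50.0, 0, 1, 25.0),
    (500.0, 1, 6, 180.0),
]

rate = error_rate(sample)
print(f"Error rate with expert weights: {rate:.1%}")
```

If the error rate is judged too high, the weights are adjusted and the same comparison is repeated; what counts as acceptable is a project decision that this sketch does not settle.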
- The “Learning System” is a suitable method to determine relationships between the independent variables and the dependent (target) variable.
- The Learning System output is then compared to the output from a historical sample.
  - Using a sample of actual data, what is the OUTPUT variable (target variable) to PREDICT (who will respond to the sales campaign)?
- The difference between the learning system OUTPUT and the sample OUTPUT is the ERROR (deviation).
  - The error is used to adjust the learning system until the error rate is acceptable (C.I.).
- CASE: A sample is tested with the (prediction) formula and is off by 15% (error rate) from the actual data. What do we do next?

SUPERVISED: TRAINING PHASE
Y (Total Sales) = a1*x1 + a2*x2 + a3*x3
- Learning system: Supervised DM
- Dependent variable: Y = $$ from sales campaign
- a1: .40 to .35
- a2: .40 to .35
- a3: .20 to .30
- x1: $ customer spent in the last 6 months
- x2: customer response (1 = did, 0 = did not)
- x3: how many items bought in the category in the last 6 months
(Dr. S / Stat Version)

[Figure: six independent variables feeding the Supervised DM (a.k.a. Learning System), which outputs the dependent variable. Dr. S / Stat Version]
Independent variables:
1. Item purchased
2. Location bought
3. Date of purchase
4. Cost of item
5. Season of year
6. Age of customer
DV = Respond to sales campaign

MODELS: Supervised vs. Unsupervised (2.1.3)
- Unsupervised does NOT use the training and adjusting phase of SUPERVISED.
- Therefore, the relationships or patterns discovered in the learning system ARE the result.
- That is, there is no TARGET variable, just a model, formula, or other output from the learning system.
- The intent is to discover unseen “structures”, relationships and patterns in the data.
  - E.g., Xn to Y, X1 to X2
- That is, instead of asking experts for the relationships and then cyclically testing and tweaking, use stat testing to determine WHAT THE RELATIONSHIPS ARE!
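As an illustration of letting a statistic surface the relationships (X1 to X2) without a target variable, here is a minimal sketch using pairwise Pearson correlation. Pearson correlation is only one possible choice of test and is an assumption here, as are the 0.7 threshold, the function names and the made-up values; the column names borrow from the independent variables listed above.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def discover_relationships(columns, threshold=0.7):
    """Return every pair of variables whose |correlation| meets the threshold."""
    found = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            found.append((name_a, name_b, round(r, 3)))
    return found


# Hypothetical data: no column is designated as the target.
columns = {
    "cost_of_item": [10, 20, 30, 40, 50, 60],
    "age_of_customer": [25, 31, 38, 44, 52, 59],
    "season_code": [1, 3, 2, 4, 1, 3],
}

strong_pairs = discover_relationships(columns)
for name_a, name_b, r in strong_pairs:
    print(f"{name_a} <-> {name_b}: r = {r}")
```

No expert weights and no adjustment loop are involved: the output of the procedure is itself the finding.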
Therefore, supervised mining will generally require more time than unsupervised, all other things being equal.

Knowledge-Discovery Techniques (2.1.4)
- Statistical methods: multiple regression, logistic regression, analysis of variance, log-linear regression models and Bayesian inference
- Decision trees and decision rules: Classification And Regression Tree (CART) algorithms and pruning algorithms
- Cluster analysis: divisive algorithms, agglomerative algorithms, hierarchical clustering, partitional clustering and incremental clustering
- Association rules: market basket analysis, Apriori algorithm, sequence patterns and social network analysis
- Artificial neural networks: multilayer perceptrons with back-propagation learning, radial networks, Self-Organizing Maps (SOM) and Kohonen networks
- Genetic algorithms: used as a methodology for solving hard optimization problems
- Fuzzy inference systems: based on the theory of fuzzy sets and fuzzy logic
- N-dimensional visualization methods: geometric, icon-based, pixel-oriented and hierarchical techniques
- Case-Based Reasoning (CBR): based on comparing new cases with stored cases; uses similarity measurements and can be used when only a few cases are available

MEAN, STANDARD DEVIATION, CONFIDENCE INTERVAL THEORY / Dr. S.

Data Mining Process (2.2)
- Directly related to solving business needs or problems.
- The logical first step is to understand business needs and to identify and prioritize areas needing attention.
- Examples:
  - Too many dropout customers
  - Disappointing sales
  - Geographic areas with unnecessarily poor returns
  - Areas with quality issues
  - How to turn potential customers into customers
  - How to develop areas of business with opportunities (new products, new customers)

GENERAL DATA MINING PROCESS STEPS
1. Clarification of the objective/question
2. Provisioning and processing of the data
3. Analysis of the data
4. Evaluation and validation during analysis
5. Application of DM results and learning from the experience

Business Task (step 1): Clarifying the Problem (2.3)
- Subjects to discuss and understand:
  - Planned target group or object
  - Budgeted or planned production
  - Extent and kind of promotion or mailshot (number of pages, with good presentation, coupons, discounts, etc.)
  - Involved industries/departments
  - Goods/items involved in the promotion
  - Presentation scenario, for example, ‘Garden party’
  - Transmitted image, for example, aggressive pricing, brand competence or innovation
  - Pricing structure
- Example
  - PROBLEM: reactivate frequent buyers who have not bought during the last year
  - TARGET:
    - What is “frequent”?
    - Who is a “buyer”?
    - Do we include buy & return, or buy and not pay?
    - Which goods/items are included?
    - Is there a price window or cut-off?
    - Does the channel matter?
    - Does the location of purchase matter?
    - How do we classify someone who was a frequent buyer 10 years ago but quit 3 years ago, or who bought 3 times and quit recently?
  - All of these affect the target group and the model chosen, or the use of the model.
- Necessary information
  - Common specifications for the primary objective:
    - Turnover activation
    - Reactivation of inactive customers
    - Cross selling
  - Clarification of the different possible applications (goals):
    - To estimate a potential target group
    - To estimate for a mailout
  - Commitment to the action period and application period
  - Consideration of any seasonal influences to be noted
  - Consideration of any comparable actions in the past
  - You will need to become adept at gathering such information; it is not necessarily easy or obvious.
- Common pitfalls:
  - The client has not fixed all the details in time for the initial discussion.
  - Things change between the briefing and the action without the data miner being informed.
  - Marketing colleagues prefer not to be seen to be too precise, as they may feel that it limits their flexibility (this leads to inaccurate or incomplete information).
  - Problem definition may take a long time but is worth it: this step is decisive in adding value and determining the level of success for the project.
  - Failure to establish a reliable baseline. KPIs need to be known and accurately measured:
    - Response rate
    - Cost of mailouts
    - Purchase “frequency”
  - Measurable goals should be defined and agreed upon by all.
  - Exact goals may not apply in data mining hypothesis testing (learning testing).
- Adequate, accurate and timely communication is a critical success factor in any project (mgmt.).
- The problem definition process can take a lot of time, but it will be worth it in the long run.
- The problem definition process is “decisive in adding value and determining whether they will be successful or not”.
- A bit of psychology can be useful.

Data (step 2): Provisioning & Processing the Data (2.4)
- To determine the required data for mining & analytics, decide:
  - The analysis period of time
  - The basic unit of interest
  - Estimation methods
  - The variables needed
  - The data partition that will generate the learning/testing data
  - The data partition that will generate the appropriate random samples

The analysis time (2.4.1)
- In deployment, there will most likely be a gap between using the model and carrying out the activity.
  - Example: we determine a target group for a mailout, but those people do not receive the mailout until several hours, days or weeks after the target group is identified.
- The analysis period includes two periods:
  - Base Period: for the input variables and testing
  - Target Period: for the output (target) variable and deployment of results
- Therefore, there is a time gap between the BASE PERIOD (running a model) and the TARGET PERIOD (using results).
- From past historical data, decide how long (big) the gap is likely to be, then include it in the modeling data.
  - For example, input variables (age, location, segment, purchase behavior) need to be from a time period earlier than that of the target variables (purchasing action).
  - For instance, period 10 for input variables and period 14 for target variables results in a gap of 3 periods (where a period is days, weeks, months or quarters).

The analysis time (2.4.1): Example of gap application
- The objective is a Christmas season mailing, so the APPLICATION PERIOD is next Dec. 1-31.
- The TARGET PERIOD is often about 1 year earlier to capture the effects of seasonality (Dec. 1-31 of the prior year).
- Printing, handling and delivery of the mailout take about 4 weeks, which would end at the end of November.
- The end of the BASE PERIOD would be Oct. 31 last year.
- Therefore, we use input variables up to October last year and test target variables from December 1-31 last year.
- In the application period, we use input variables from the current October to determine who should receive promotional material this year in December (November is left for printing, processing and mailing).

The analysis time (2.4.1): Gotchas
- In the application step, one or more data sets are not available yet.
- Major components (industry, department, new products that replaced those sold last year) have changed between the analysis time (BASE PERIOD and TARGET PERIOD) and the APPLICATION PERIOD.

Basic Unit of Interest (2.4.2)
- Person, place or thing:
  - Customers, prospects
  - Company, location
  - Invoice
- In marketing, the unit is usually a person.
- A UNIT (case) could be a day’s worth of data (base and target periods may be simultaneous).
- A UNIT could be the materials making up a manufactured product, where the target is the quality of the product.

Target Variables (2.4.3)
- Sometimes an effective target variable is not readily available from the data and must be derived:
  - Purchase amount ($) or quantity
  - Not specific items (pink cups) but generic ones (all cups, or all cookery)
- The target variable is decided in planning, but it depends on the available data and on which available data fits the data mining model.
- The target variable must be measurable, precise and robust as well as relevant.
  - Too much precision could be irrelevant.
- In predictive models, less variation in the target variable is preferred.
  - Binary (dichotomous) variables and categorical variables work best (these require more data for stat analysis).
- Stat models prefer more variation, and thus a continuous variable works better (less data needed).

Input/Explanatory Variables (2.4.4)
- Only apply in the BASE PERIOD.
- Need to be used in the data mining process as they were at the end of the BASE PERIOD.
  - This can be problematic when variables are not static but subject to change (e.g., address).
  - These variables should be used with caution even if they are usually static or slow to change.
- More stable models are obtained by classifying continuous variables.
  - When variables such as turnover or purchase amount are classified, it stresses the differences in the business process.
  - If someone spends $100 compared to someone spending $200, it has different implications.
  - Mathematically, these quantities tend to be similar, but in our business application, $200 shows more interest in our company.
  - $0 could really mean no interest.
  - Without classification, the difference between no purchase and a small purchase would be undervalued (it is NOT dichotomous).
  - At the other end, whether a customer belongs to the top 10% of buyers is more important than the difference between $500 and $200 (it is likely to be significant).

Modeling (step 3): Analysis of the Data (2.5)
- The core of data modeling is choosing the most EFFECTIVE method/model.
- A model with a shorter timeline (more efficient) is likely to be better than a model that takes longer to develop, employ and predict (i.e., one that is technically more effective).
- Data mining tools are relatively easy to use.
- The process for effectual data mining and analysis is NOT relatively easy; it is a challenge.
- There are many DM software tools available, some even freeware.
  - Some are more visual and user friendly, and may require minimal programming skills.
- Look for DM software that includes sound tools for data preparation and transformation.
  - Data warehouse tools are likely to be helpful here.

Evaluation & Validation (step 4): During Analysis Phase (2.6)
- Three ways to assess the quality of the calculated model:
  - Using a test sample having the same split (between target=0 and target=1) as the training sample (“normalize”)
  - Using a test sample that has the same split as the whole dataset
  - Using a test sample that has a different stratification
- Generate a number of candidate models using regressions, decision trees, etc., then compare the models by applying each model to the test sample and comparing the results.
- Some DM software does this automatically or provides a tool to do the comparisons.

[Figure: Typical chart, a LIFT chart to compare models]

Figure 2.7 shows model 1 from Figure 2.5 in finer detail. There may be an unstable area around 40%. If that is the area of interest, then the model is unsuitable and should NOT be used. It has three unstable areas in the middle. However, if we just need the top 20% (or bottom 40%) of cases, the model is still stable enough to use.

Evaluation & Validation (step 4): During Analysis Phase (2.6)
- The BEST model depends on the BUSINESS QUESTION.
- Using Figure 2.6, two models have similar results.
- If we want good discrimination of the BEST customers, we choose the dark line because the first 20% of customers have the higher response …
- If we are interested in good discrimination for half of the people, both models are similar.
- If we are interested in the worst 10%, then both are similar.

[Figure 2.6]

[Figure 2.8: Confusion Matrix, model comparison]
In this “Confusion Matrix”, the values are similar, which is good. A slight difference can be OK, but a model with a big difference is not desirable.

[Figure 2.9: Confusion Matrix (using Excel)]

Evaluation & Validation (step 4): During Analysis Phase (2.6)
- Sometimes, the ability of the model to rank the customers in a relevant way (such as category, rankings) is more important than the statistical quality of the models.
  - Example: the difference between ratings and rankings (e.g., fraud detection methods)
- A useful model is one that gives a credible rank ordering of customers in terms of associated variables.
- Another way to validate models is cross-validation (where prediction is the critical factor):
  - It assesses how the results of a stat model will generalize to an independent data set.
  - A.k.a. out-of-sample testing, rotation estimation
  - Multiple subset samples (known data) are used (as training samples) to validate the model.
  - Especially appropriate for small datasets or highly reliable data
- The most important aspect of validation is to make sure the model makes sense for the business and its question, and that the results are reliable and useful for the benefit of the business.

Application (step 5): Analysis Results & Learning (2.7)
What is the point of data mining?
- Use the results and derive action(s) that solve problems or meet needs for the entity:
  - Find the best customers (for an advertising flyer campaign)
  - Score/rate-rank relevant factors that influence the target group
- All variables must be carried across the time periods between analysis and application of model results with careful thinking.
  - If the period between analysis and application is one year, we would transform the age variable by creating a new variable that represents the corresponding age of the person during the training analysis period (subtract 1 year from the age variable or recalculate using DOB).
  - The same is true for all other variables that are affected by the 1 year.
- Some information/data available during the training period may not be available in the application period.
  - If this is known in advance, then variables related to such information should be omitted from the model.
  - If a feature of a variable is likely to change, replace it with a more generic variable (“yellow pen” to “pen”).
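The age adjustment described under Application (step 5) can be sketched as follows, using the "recalculate using DOB" option. This is a minimal illustration only: the function name, the date of birth and the two reference dates (chosen one year apart, ending on Oct. 31 to echo the Christmas mailing example) are hypothetical.

```python
from datetime import date


def age_on(dob: date, reference: date) -> int:
    """Age in completed years on the reference date, recalculated from DOB."""
    years = reference.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (dob.month, dob.day):
        years -= 1
    return years


# Hypothetical customer and reference dates, one year apart.
dob = date(1980, 11, 15)
training_reference = date(2023, 10, 31)     # end of the base period used for training
application_reference = date(2024, 10, 31)  # end of the base period used for application

age_training = age_on(dob, training_reference)
age_application = age_on(dob, application_reference)
print(age_training, age_application)
```

Recalculating from the date of birth avoids relying on a stored age field whose reference date is unknown; simply subtracting one year from a stored age gives the same result only if that field is current as of the application date.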