IS4242 Intelligent Systems & Techniques Lecture Notes PDF
Document Details

National University of Singapore
Aditya Karanam
Summary
This document contains lecture notes on intelligent systems and techniques, focusing on targeting current customers. Topics include customer lifetime value, logistic regression, support vector machines, and how to choose an optimal classification threshold for marketing campaigns. It also discusses different types of marketing and why targeting existing customers is important.
Full Transcript
IS4242 INTELLIGENT SYSTEMS & TECHNIQUES
L3 – Targeting Current Customers
Aditya Karanam
© Copyright National University of Singapore. All Rights Reserved.

Announcements
▸ Programming Assignment 1 will be released today
‣ Due: September 10, 11:59 PM
‣ There is a penalty for late submission, so please start as early as possible
▸ Three members per group
‣ We have 78 students – 26 groups
‣ Please form your groups: https://piazza.com/class/lzjrnf1agra66a/post/5
▸ We will use SR3 for both the lecture (from Week 3) and the tutorial (from Week 4)

In this Class
▸ Customer Lifetime Value
▸ Logistic Regression
▸ Support Vector Machine
▸ Optimal threshold for classification in the context of marketing campaigns

Different Types of Marketing
▸ Mass marketing treats all customers as one group
▸ One-to-one marketing focuses on one customer at a time
▸ Target marketing addresses selected groups of customers or market segments
‣ It lies between mass marketing and one-to-one marketing
▸ Target marketing involves direct marketing to those customers who are most likely to buy
‣ Target marketing increases customer expenditures with the firm

Why Target Current Customers?
▸ Extracting profit from an existing customer is much easier than acquiring a new customer
‣ "Acquiring a new customer can cost five to seven times more than retaining an old one"
▸ Retaining customers is expensive as well
‣ Mailings, phone calls, Google or Facebook targeting, etc.
▸ Target and retain valuable customers

Who is a Target?
▸ A target is a customer who is worth pursuing
‣ A profitable customer – sales revenue from the target exceeds the costs of sales and support
▸ A customer with a positive lifetime value
‣ Over the course of the company's relationship with the customer, more money comes into the business than goes out of it
▸ How do we calculate Customer Lifetime Value (LTV)?

Customer Lifetime Value
▸ Lifetime value is the expected net present value of future profit contributions by a customer after acquisition
‣ \( \mathrm{LTV} = \sum_{t=0}^{\infty} \frac{E(V_t)}{(1+\delta)^t} = \sum_{t=0}^{\infty} \frac{E(R_t - C_t)}{(1+\delta)^t} \)
‣ \( \delta \): discount rate
‣ \( R_t \) and \( C_t \): revenue from and cost of serving the customer at time t, respectively
‣ The customer subscript is omitted in the notation above
▸ \( E(V_t) \) depends on whether the customer stays with the company until time t
‣ \( E(V_t) = (R_t - C_t)\, P(\text{customer survives until } t) = (R_t - C_t)\, S(t) \)

Customer Lifetime Value
▸ Let T be a random variable representing the time at which the customer attrites (leaves the company)
‣ f(t): probability density function; F(t): cumulative distribution function
▸ Let S(t) be the probability that the customer attrites after time t
‣ \( S(t) = P(T > t) = 1 - P(T \le t) = 1 - F(t) \)

Customer Lifetime Value: Geometric Distribution
▸ Most often, T is assumed to follow a geometric distribution
‣ It gives the probability that the first success occurs on a given trial
‣ Success, in this case, is the customer leaving the company!
▸ \( f(t) = p(1-p)^{t-1} \)
▸ \( S(t) = P(T > t) = \sum_{i=t+1}^{\infty} p(1-p)^{i-1} = (1-p)^{t} \)
▸ Retention rate: \( r = 1 - p \) (a constant, hence the t subscript is dropped)
‣ \( S(t) = r^{t} \)

Customer Lifetime Value with a Constant Retention Rate
▸ \( \mathrm{LTV} = \sum_{t=0}^{\infty} \frac{E(V_t)}{(1+\delta)^t} = \sum_{t=0}^{\infty} \frac{(R_t - C_t)\, r^{t}}{(1+\delta)^t} \)
▸ Assuming constant revenue and cost over time,
‣ LTV over an infinite horizon: \( \frac{(R-C)(1+\delta)}{1+\delta-r} \)
▸ LTV for a customer: \( \sum_{t=0}^{\infty} \frac{(R_t - C_t)\, r^{t}}{(1+\delta)^t} = \sum_{t=0}^{\infty} \frac{m_t\, r^{t}}{(1+\delta)^t} \)
‣ \( m_t \): the customer's profit contribution at time t
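The LTV formulas above can be computed directly. The following minimal Python sketch (not part of the original slides) approximates the discounted, survival-weighted sum and checks it against the infinite-horizon closed form for the geometric case; the revenue, cost, retention, and discount figures are made up for illustration.

# Minimal sketch: LTV as a discounted sum of per-period margin weighted by S(t).
def lifetime_value(margin, survival, delta, horizon=500):
    """Approximate LTV = sum over t of margin(t) * S(t) / (1 + delta)**t."""
    return sum(margin(t) * survival(t) / (1 + delta) ** t for t in range(horizon + 1))

R, C, r, delta = 100.0, 40.0, 0.80, 0.10          # made-up revenue, cost, retention, discount
approx = lifetime_value(lambda t: R - C, lambda t: r ** t, delta)
closed_form = (R - C) * (1 + delta) / (1 + delta - r)   # infinite-horizon formula above
print(round(approx, 2), round(closed_form, 2))          # both are about 220.0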
Increasing Marginal Revenues through Targeting
▸ By targeting current customers, we can improve marginal revenues in two ways: cross-selling and up-selling
▸ Cross-selling: the firm sells different products to its existing customers
‣ For example, a customer uses Intuit's TurboTax software, and the company tries to sell the customer Quicken
▸ Up-selling: selling "more" (higher volume, upgrades) of the products customers already buy from the company
‣ For example, a customer has $300,000 in term life insurance, and the company tries to sell the customer a $500,000 policy

Models for Targeting Current Customers
▸ Models that focus on what product the customer is likely to buy next
▸ Models that consider when the product is likely to be bought
▸ Models that consider how likely the customer is to respond to cross-selling or up-selling offers, such as membership offers
‣ Our focus today
‣ Techniques: regression, classification, etc.

Targeting Current Customers: Classification Techniques

Application: Modeling Response to Superstore Marketing
▸ A superstore is planning its year-end sale.
▸ It wants to launch a new offer – a gold membership that gives a 20% discount on all purchases.
▸ The offer is valid only for existing customers, and a campaign through phone calls is currently being planned for them.
▸ Management feels the best way to reduce the cost of the campaign is to build a predictive model that identifies the customers who are likely to purchase the offer.

Data Description: Attributes
▸ Outcome: Response (target) – 1 if the customer accepted the offer in the last campaign, 0 otherwise
▸ Predictors:
‣ Customer characteristics: Year_Birth (age), Education, Marital status, Income, Kidhome (number of small children), Teenhome (number of teenagers in the customer's household)
‣ Customer purchase behaviour: amount spent on fruits, fish and meat products, sweets, wines, and gold; number of catalog purchases, web purchases, website visits, and deal purchases
‣ Number of complaints, Recency, etc.

Application
▸ Build a model to predict the probability that a customer will respond positively.
‣ Task: classification
▸ Management will use this model to target customers through phone calls.

Data Mining
[Diagram: Business Question → Data → preprocessing and feature extraction → Features → Data Mining Task (Analyze / Explore / Predict) → Model]

Classification Tasks
▸ Outcome: Response (binary variable)
▸ A simple approach: look at the distribution of each predictor within each class and identify the variables whose distributions differ significantly between classes (see the sketch below)
▸ This is good for identifying the variables that matter
‣ But it is difficult to use for making predictions
▸ Classification models: Logistic Regression and SVM
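As a quick illustration of the per-class-distribution idea above, here is a minimal pandas sketch. The file name and column names ("superstore_marketing.csv", "Response", "MntMeatProducts") are assumptions about the superstore dataset, not confirmed by the notes.

# Sketch: compare predictor distributions across the two Response classes.
import pandas as pd

df = pd.read_csv("superstore_marketing.csv")   # assumed file name

# Mean of every numeric predictor within each Response class; predictors whose
# means (and spreads) differ sharply between classes are candidate signals.
numeric = df.select_dtypes("number")
print(numeric.groupby(df["Response"]).mean().T)

# Zoom in on one predictor, e.g., spending on meat products, by class.
print(df.groupby("Response")["MntMeatProducts"].describe())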
Why not Linear Regression?
▸ We want \( P(\text{Response} = \text{Yes} \mid \text{predictor variables}) \)
[Figure: Response plotted against spending on meat products, fitted two ways.]
▸ Left panel: linear regression → predicted probabilities can be negative!
▸ Right panel: all predicted probabilities lie between 0 and 1

Logistic Regression
▸ Odds: \( \frac{P(Y)}{1-P(Y)} = e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p} \;\Rightarrow\; P(Y) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}} \) (the sigmoid)
▸ Logit or log-odds: \( \log \frac{P(Y)}{1-P(Y)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \)
▸ The logistic (sigmoid) function lies between 0 and 1
▸ A unit increase in \( X_1 \) changes the log-odds by \( \beta_1 \), i.e., multiplies the odds by \( e^{\beta_1} \)

Estimating the Regression Coefficients
▸ Approach: maximum likelihood estimation
▸ Example: logistic regression with one variable: \( \log \frac{P(Y)}{1-P(Y)} = \beta_0 + \beta_1 X_1 \)
‣ \( L(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(\boldsymbol{x}_i) \prod_{j:\, y_j = 0} \big(1 - p(\boldsymbol{x}_j)\big) \)
‣ Notation: \( \boldsymbol{x}_i \) is the feature vector of data point i, and \( p(\boldsymbol{x}_i) \) is the sigmoid function evaluated at \( \boldsymbol{x}_i \)
‣ Find the coefficients that maximize the likelihood
▸ Prediction: \( \hat{p}_i = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}} \)
‣ By default: \( \hat{y}_i = 1 \) if \( \hat{p}_i > 0.5 \), else 0
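A minimal sketch of fitting the logistic regression above with scikit-learn follows. The file and column names are the same assumptions as in the earlier sketch, and scikit-learn is used here only for illustration; it is not necessarily the course's reference tooling.

# Sketch: logistic regression on the (assumed) superstore data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("superstore_marketing.csv")              # assumed file name
X = df.select_dtypes("number").drop(columns=["Response"])
X = X.fillna(X.median())                                  # guard against missing values
y = df["Response"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

logit = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Each coefficient is the change in log-odds for a unit increase in that predictor;
# exponentiating it gives the multiplicative change in the odds (e^beta).
print(pd.Series(logit.coef_[0], index=X.columns).sort_values())

# Default rule: predict 1 when the estimated probability exceeds 0.5.
p_hat = logit.predict_proba(X_test)[:, 1]
print((p_hat > 0.5).mean(), "fraction predicted as responders")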
SVM and Generations of ML Algorithms
▸ Pre-1980:
‣ Almost all learning methods learned linear decision surfaces – strong theoretical properties
▸ 1980s:
‣ Decision trees and neural networks allowed efficient learning of non-linear decision surfaces
‣ Little theoretical basis, and both suffer from local minima
▸ 1990s:
‣ Efficient learning algorithms for non-linear functions, based on computational learning theory
‣ Strong theoretical properties
▸ 2010s:
‣ Deep neural networks allow extremely efficient learning of non-linear decision surfaces
‣ There is little theoretical basis, and they suffer from local minima

Support Vector Machine
▸ Constructs a maximum-margin separator: a decision boundary with the largest possible distance to the data points
‣ This separator is linear and is also called a hyperplane: \( W \cdot X + b = 0 \)
‣ The margin can be thought of as the width of the "street" separating the positive and negative training data points
‣ Notation: \( X_1, X_2 \) are the features of the data

Support Vector Machine: Notation
▸ Notation in SVM:
‣ Class labels are +1 and −1 instead of 1 and 0
‣ The intercept is a separate parameter: b
▸ W: the vector perpendicular to the separator
▸ For all positive data points (vectors): \( f(X_+) = W \cdot X_+ + b \ge +1 \)
▸ For all negative vectors: \( f(X_-) = W \cdot X_- + b \le -1 \)
▸ Combined, for all observations: \( y_i (W \cdot X_i + b) \ge 1 \)
‣ Notation: \( X_i \) is data point i

What is the Margin?
▸ Let \( X_+ \) and \( X_- \) be the data points closest to the separator
‣ \( W \cdot X_+ + b = +1 \)
‣ \( W \cdot X_- + b = -1 \)
‣ Subtracting: \( W \cdot (X_+ - X_-) = 2 \)
▸ Dividing by the length of W gives the distance between the two margin lines:
‣ \( \frac{W}{\lVert W \rVert} \cdot (X_+ - X_-) = \frac{2}{\lVert W \rVert} \)
▸ We want to maximize this distance

SVM: Honoring the Constraints
▸ Maximize \( \frac{2}{\lVert W \rVert} \)
‣ Equivalently, minimize \( \frac{\lVert W \rVert^2}{2} \)
‣ Why this form? For mathematical convenience; minimizing \( \lVert W \rVert \) maximizes the distance from the hyperplane to the nearest data points of each class, which is equivalent to maximizing the margin
▸ Constraints: \( y_i (W \cdot X_i + b) \ge 1 \)
▸ How do we solve it?
‣ The constrained optimization problem is solved using the Lagrange (multiplier) method

SVM: Optimization Problem
▸ Using the Lagrange method, with \( \alpha_i \) as the Lagrange multiplier for each observation, we obtain a quadratic optimization problem:
‣ \( \arg\max_{\alpha} \; \sum_j \alpha_j - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k (X_j \cdot X_k) \)
‣ Subject to: \( \alpha_j \ge 0 \) for all j and \( \sum_j \alpha_j y_j = 0 \)
▸ At the maximum, only the \( \alpha_j \) corresponding to the support vectors (observations closest to the separating hyperplane) are non-zero
‣ These are the observations that "matter" in fixing the maximum margin

Case 1: Two Completely Dissimilar Vectors
▸ Two dissimilar (orthogonal) vectors \( X_j, X_k \) have \( X_j \cdot X_k \approx 0 \), so the pair does not count at all

Case 2: Two Alike Vectors from Different Classes
▸ Two very similar vectors \( X_j, X_k \) that predict different classes are the pairs that matter most in maximizing the margin width

Case 3: Two Alike Vectors from the Same Class
▸ Two vectors \( X_j, X_k \) that are similar but predict the same class are redundant

SVM: Prediction
▸ \( W = \sum_j \alpha_j y_j X_j \)
▸ For a test vector X:
‣ Class: \( \operatorname{sign}(W \cdot X + b) = \operatorname{sign}\big(\sum_j \alpha_j y_j (X_j \cdot X) + b\big) \)

What if the Data Are Not Linearly Separable?
▸ Transform the data into a different space
‣ Gain linear separation by mapping the data to a higher-dimensional space
▸ Example: data that can be separated by a quadratic transformation
‣ Notation: \( x_1, x_2 \) are the features here

Objective Function in the Transformed Space
▸ What we have:
‣ \( \arg\max_{\alpha} \; \sum_j \alpha_j - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k \big(\varphi(X_j) \cdot \varphi(X_k)\big) \)
‣ Subject to: \( \alpha_j \ge 0 \) for all j and \( \sum_j \alpha_j y_j = 0 \)
▸ We can simply define a kernel function \( K(X_j, X_k) = \varphi(X_j) \cdot \varphi(X_k) \)
▸ More generally, a kernel is a function that computes similarity in the transformed space
‣ Polynomial kernel: \( K(X_j, X_k) = (1 + X_j \cdot X_k)^d \)
‣ Radial basis function (RBF): \( K(X_j, X_k) = e^{-\gamma \lVert X_j - X_k \rVert^2} \)
▸ SVMs with non-linear kernels are helpful when the data are not linearly separable

SVM with Non-Linear Kernels and Noisy Data
▸ We can allow some errors by slightly changing the constraints
▸ Soft margin:
‣ \( \arg\max_{\alpha} \; \sum_j \alpha_j - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k K(X_j, X_k) \)
‣ Subject to: \( 0 \le \alpha_j \le C \) for all j and \( \sum_j \alpha_j y_j = 0 \)
‣ C is a tuning parameter that is generally chosen via cross-validation
‣ We will discuss cross-validation techniques in a later lecture

Evaluation Metrics for a Binary Classifier
▸ Binary classification problem – counts of:
‣ True Positives (TP)
‣ False Positives (FP)
‣ True Negatives (TN)
‣ False Negatives (FN)

                      Truth
                  True      False
Prediction True    TP        FP
           False   FN        TN

▸ Accuracy: \( \frac{TP + TN}{TP + TN + FP + FN} \)

Sensitivity and Specificity
▸ Sensitivity: \( \frac{TP}{TP + FN} \)
▸ Specificity: \( \frac{TN}{TN + FP} \)

Precision and Recall
▸ Precision (P): \( \frac{TP}{TP + FP} \)
▸ Recall (R): \( \frac{TP}{TP + FN} \) (same as sensitivity)
▸ F1-measure: \( \frac{2 \cdot P \cdot R}{P + R} \)
‣ The F1-measure is used when the data are imbalanced

Performance of Logistic Regression and SVM
▸ Both models perform poorly on class 1
‣ That is, they are poor at identifying the customers who will respond
‣ One remedy: change the classification threshold
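The following minimal sketch ties the pieces above together: an SVM with an RBF kernel whose soft-margin parameter C is chosen via cross-validation, evaluated with the metrics just defined on a held-out test set. It reuses the same assumed superstore file and column names as the earlier sketches; this is an illustration under those assumptions, not the course's reference implementation.

# Sketch: RBF-kernel SVM with C tuned by cross-validation, plus evaluation metrics.
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("superstore_marketing.csv")              # assumed file name
X = df.select_dtypes("number").drop(columns=["Response"])
X = X.fillna(X.median())
y = df["Response"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scale the features (SVMs are distance-based), then search over the soft-margin C.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(svm, {"svc__C": [0.1, 1, 10, 100]}, cv=5, scoring="f1")
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy   :", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", tp / (tp + fn))                     # recall on class 1
print("specificity:", tn / (tn + fp))
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)
print("precision  :", precision, "recall:", recall, "F1:", f1)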
What is the Optimal Threshold for Classification?
▸ The average response rate in the test data is 0.17
▸ A default cut-off value such as 0.5 does not work given the low base rate of responses
▸ Instead, base the threshold on the financial performance of the model, using:
‣ the lift value
‣ Youden's index
▸ Notes:
‣ Lift measures how much better the model is at predicting positive outcomes than random guessing. Lift > 1: better than random; lift = 1: as good as random; lift < 1: worse than random.
‣ Youden's index = sensitivity + specificity − 1, where TPR = sensitivity (recall) = TP/(TP + FN) and TNR = specificity = TN/(TN + FP). It ranges from 0 to 1: 0 means the model is no better than random guessing, 1 indicates perfect sensitivity and specificity, and higher values indicate a better trade-off between identifying positive cases and correctly rejecting negative cases.

Optimal Threshold for Classification: Lift Value
▸ Lift compares the response rate that the predictive model provides against the average response rate in the data
▸ Form deciles of customers ordered by their predicted probability of responding, from highest to lowest
▸ Calculate the lift for each decile
‣ The ratio of the decile's response rate to the average response rate
▸ Use the probability value corresponding to a lift of 2 as the threshold (a worked sketch of both threshold rules appears at the end of these notes)
‣ These customers are twice as likely to respond as the average customer in the data

Optimal Threshold for Classification: Youden's Index
▸ The Receiver Operating Characteristic (ROC) curve gives the classification performance at all thresholds
‣ True positive rate: sensitivity
‣ False positive rate: 1 − specificity
▸ Identify the threshold at which the ROC curve is farthest from the random-classifier diagonal
‣ Youden's index, or Youden's J statistic
‣ Calculated as: true positive rate − false positive rate = sensitivity + specificity − 1
‣ Optimal threshold: the threshold at which Youden's index is maximum
‣ Notation: TPR = sensitivity, TNR = specificity, FPR = 1 − specificity, FNR = 1 − sensitivity

Classification with Different Thresholds
[Figure: results for Logistic Regression and SVM with a linear kernel at the lift-based threshold (top) and at the Youden's-index threshold (bottom).]

Thank You
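To close, here is the sketch of the two threshold rules referenced above: lift per decile and Youden's index. The predicted probabilities and labels are synthetic stand-ins with a base rate of roughly 0.17; in practice, use a fitted model's predicted probabilities on the test set.

# Sketch: lift-by-decile and Youden's-index threshold selection on synthetic scores.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
p_hat = rng.beta(1, 5, size=1000)            # stand-in predicted probabilities
y_true = rng.binomial(1, p_hat)              # stand-in outcomes, mean response ~ 0.17

# Lift by decile: sort by predicted probability (highest first) and compare each
# decile's response rate with the overall average response rate.
scores = (pd.DataFrame({"p": p_hat, "y": y_true})
            .sort_values("p", ascending=False)
            .reset_index(drop=True))
scores["decile"] = scores.index // (len(scores) // 10)   # decile 0 = highest probabilities
base_rate = scores["y"].mean()
table = scores.groupby("decile").agg(response_rate=("y", "mean"), min_p=("p", "min"))
table["lift"] = table["response_rate"] / base_rate
print(table)

# Rule from the notes: use the probability value at which lift reaches 2.
eligible = table[table["lift"] >= 2]
print("lift-based threshold:", eligible["min_p"].min() if len(eligible) else None)

# Youden's index: the threshold where TPR - FPR (= sensitivity + specificity - 1) peaks.
fpr, tpr, thresholds = roc_curve(y_true, p_hat)
print("Youden-index threshold:", thresholds[np.argmax(tpr - fpr)])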