Base Analysis Handout PDF
Summary
This document is a handout on base analysis in machine learning. It covers assessing data quality, understanding feature distributions, and conducting exploratory data analysis (EDA).
What is Base Analysis?

In machine learning, base analysis is the initial step of evaluating and understanding the foundational components of a dataset before applying more complex modelling techniques. It involves assessing data quality, understanding feature distributions, and conducting exploratory data analysis to prepare for effective machine learning.

Assessing Data Quality

What It Means: Checking whether your data is reliable and useful. Just as a recipe needs good ingredients, a machine learning model needs good data.

Key Checks:
- Missing Data: Are there gaps in the data? For example, when analyzing customer data, do any records have missing entries?
- Inconsistencies: Are there errors or inconsistent values? If a customer's age is listed as 200, that is clearly a mistake.
- Duplicates: Are there repeated entries? Duplicate data can skew your results, so it is important to identify and remove them.

Understanding Feature Distributions

What It Means: Features are the variables or attributes in a dataset that are used to make predictions. Understanding their distributions means knowing how the data is spread out and what patterns it shows.

Why It's Important:
- Identifying Outliers: An outlier is a data point that is very different from the others. For example, if most customers spend between Rs. 10 and Rs. 100 but one customer spends Rs. 10,000, that customer is an outlier.
- Normal vs. Skewed Distributions: When you plot a feature like "age" or "income," it may follow a normal distribution (a bell curve), but sometimes it is skewed. Knowing this helps you decide whether the data needs to be transformed before modeling.

Conducting Exploratory Data Analysis (EDA)

What It Means: EDA is like detective work on your data. You explore and visualize the data to uncover patterns, trends, and relationships between different features.

Steps in EDA:
- Visualizing Data: Create plots and charts such as histograms, scatter plots, or box plots to see how the data is distributed and how features relate to each other.
- Finding Correlations: Check whether any features are strongly related. For example, income and spending are positively correlated: when one goes up, so does the other.
- Checking for Bias: Look for biases in the data that could affect the model. If most of the data comes from a particular region or demographic, the model might not work well for others.

Preparing the Data for Modeling

What It Means: Once you have a good understanding of the data, you need to prepare it for modeling. This step is like cleaning and organizing your workspace before starting a project.

Key Tasks:
- Handling Missing Data: Decide how to deal with gaps. You can fill them with averages, medians, or other methods.
- Feature Scaling: Ensure that features are on a similar scale, especially if the model depends on distance calculations, as in K-means clustering. For example, if one feature is in the range 1-10 and another is in the thousands, you might scale them to a similar range.
- Encoding Categorical Variables: If the dataset has categories like "male" and "female," you need to convert them into numbers (e.g., 0 and 1) so the model can process them.

Importance of Base Analysis in B2C and B2B

B2C:
- Tailored consumer insights
- Improved segmentation

B2B:
- Informed business decisions
- Enhanced relationship management

Steps of Base Analysis

1. Understanding the Dataset: Assess data quality, identify missing values, and examine distributions.
2. Feature Analysis: Evaluate feature importance and correlations to target variables.
3. Baseline Model Evaluation: Create simple models to establish performance benchmarks.
4. Model Selection and Comparison: Choose appropriate algorithms based on data characteristics and objectives.
5. Hyperparameter Tuning: Optimize model parameters for improved performance.
6. Cross-Validation: Validate models using techniques like k-fold to ensure robustness.
7. Documentation and Reporting: Record findings and methodologies for transparency and future reference.

Understanding the Dataset

- B2C Context: Focus on customer demographics, purchase history, and behaviour to identify trends and preferences.
- B2B Context: Analyse business profiles, transaction history, and industry metrics to understand client relationships and market dynamics.

Key Considerations: Ensure data quality by addressing missing values and detecting outliers to maintain accuracy in analysis.

Feature Analysis

Feature Importance:
o B2C: Determine the essential factors that affect consumer behaviour, including preferences and purchasing habits.
o B2B: Understand relationships between companies, focusing on factors like partnerships and interactions.

Correlation Analysis:
o B2C: Analyse price sensitivity, marketing channels, and customer satisfaction to optimize strategies.
o B2B: Examine client size, order frequency, and contract length to enhance business relationships and forecasts.

Correlation Coefficient: The main output of correlation analysis is the correlation coefficient, usually denoted "r". This value ranges between -1 and 1.
o Positive Correlation (r > 0): As one variable increases, the other also increases.
o Negative Correlation (r < 0): As one variable increases, the other decreases.
o No Correlation (r ≈ 0): No linear relationship exists between the two variables.
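The sign interpretation above can be tried out on a few feature pairs. The following is a minimal sketch, not the handout's own code: it computes Pearson's r from its standard definition using only the Python standard library. The client records, the feature names, and the `tol` cut-off for "approximately zero" are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of summed squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


def direction(r, tol=0.1):
    """Label the sign of r; `tol` is an arbitrary cut-off for 'r close to 0'."""
    if r > tol:
        return "positive"
    if r < -tol:
        return "negative"
    return "none"


# Hypothetical B2B client records (values invented for illustration).
features = {
    "client_size": [20, 50, 80, 120, 200, 350],      # employees
    "orders_per_year": [4, 6, 9, 12, 18, 30],
    "contract_months": [12, 6, 24, 12, 36, 24],
}

# Report r and its direction for every pair of features.
for a, b in combinations(features, 2):
    r = pearson_r(features[a], features[b])
    print(f"{a} vs {b}: r = {r:+.2f} ({direction(r)})")
```

In practice a library routine such as `numpy.corrcoef` or `scipy.stats.pearsonr` would usually replace the hand-written function. The explicit version is shown only to make the calculation visible.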
Formula of Correlation:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$

Where:
- $r$: correlation coefficient
- $x_i$: i-th value of x
- $\bar{x}$: mean of x
- $y_i$: i-th value of y
- $\bar{y}$: mean of y

[Figure: three scatter plots of Y against X, showing positive correlation, negative correlation, and zero correlation]

Key Concepts

Strength of Correlation:

| Correlation Coefficient | Correlation Strength | Correlation Type |
|---|---|---|
| -0.7 to -1 | Very strong | Negative |
| -0.5 to -0.7 | Strong | Negative |
| -0.3 to -0.5 | Moderate | Negative |
| 0 to -0.3 | Weak | Negative |
| 0 | None | Zero |
| 0 to 0.3 | Weak | Positive |
| 0.3 to 0.5 | Moderate | Positive |
| 0.5 to 0.7 | Strong | Positive |
| 0.7 to 1 | Very strong | Positive |

Example Scenario: Emma is a data analyst at a retail company that sells products both directly to consumers (B2C) and to other businesses (B2B). She wants to understand how different factors correlate with sales performance in order to optimize the company's marketing and sales strategies.

B2C Context: Emma has data on customer satisfaction scores and the number of repeat purchases. She wants to analyze whether there is a correlation between customer satisfaction and repeat purchases.

Solution:

Data:
- Customer Satisfaction Scores (on a scale of 1 to 10)
- Number of Repeat Purchases

Objective: Determine if higher customer satisfaction scores are associated with more repeat purchases.

Steps:
1. Collect Data: Gather a dataset with customer satisfaction scores and repeat purchase numbers.
2. Calculate Correlation: Use the Pearson correlation coefficient to measure the strength and direction of the linear relationship.
3. Visualize: Create a scatter plot to visualize the relationship between the two variables.

[Python code and output not included in the transcript]

Example

B2B Context: Emma has data on the volume of orders placed by businesses and the length of their contracts. She wants to analyze whether there is a correlation between order volume and contract length.

Solution:

Data:
- Volume of Orders (in dollars)
- Contract Length (in months)

Objective: Determine if a higher order volume is associated with a longer contract length.
Steps:
1. Collect Data: Gather a dataset with order volumes and contract lengths.
2. Calculate Correlation: Use the Pearson correlation coefficient to measure the strength and direction of the linear relationship between the two variables.
3. Visualize: Create a scatter plot to visualize the relationship.

[Python code and output not included in the transcript]

Business Model Evaluation

Sigmoid function:

$$f(x) = \frac{1}{1 + e^{-x}}$$

Where:
- $e$ = base of natural logarithms
- $x$ = numerical value one wishes to transform

Formula of Logistic Regression:

$$y = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$$

Where:
- $x$ = input value
- $y$ = predicted output
- $b_0$ = bias or intercept term
- $b_1$ = coefficient for input (x)

Model Selection

Information Gain (IG): The reduction in entropy after a dataset is split on an attribute.

$$Gain(S, A) = Entropy(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)$$

Where:
- $S$ is the set of samples.
- $A$ is the attribute.
- $S_v$ is the subset of S for which attribute A has the value v.
- $values(A)$ is the set of all possible values for attribute A.
- $|S|$ is the number of samples in set S.
- $|S_v|$ is the number of samples in subset $S_v$.

Gini Impurity: Another measure of impurity.

$$Gini(S) = 1 - \sum_{i=1}^{n} (p_i)^2$$

Where:
- $S$ is the set of samples.
- $n$ is the number of classes.
- $p_i$ is the proportion of samples belonging to class i.

Example

Case Scenario: A B2C e-commerce company wants to improve customer retention by identifying which customers are likely to churn (stop buying).

Problem Statement: The company needs to classify customers into "likely to churn" and "likely to stay" categories based on their behavior.

Process:
1. Data Collection: The company collects data on customer behavior, for example purchase frequency, average order value, and time since the last purchase.
2. Training the Model: The decision tree is trained on historical data, learning patterns like "customers who haven't purchased in 3 months and have low average order value are likely to churn".
3. Making Predictions: For each customer, the model asks questions (e.g., Has the customer purchased in the last 3 months?) and classifies them based on the answers.

Given: The company has data on 100 customers. 30 customers have churned, and 70 customers have stayed. One important feature is the "time since last purchase" (e.g., whether a customer purchased in the last 3 months).

Question 1: What is the initial entropy of the dataset with respect to customer churn?

p(churned) = 30/100 = 0.3
p(stayed) = 70/100 = 0.7

$$Entropy(S) = -[0.7 \times \log_2(0.7) + 0.3 \times \log_2(0.3)]$$

Values of the logarithms: $\log_2 0.7 \approx -0.515$ and $\log_2 0.3 \approx -1.737$.

Putting the values into the formula:

Entropy(S) = -[(0.7 × -0.515) + (0.3 × -1.737)]
Entropy(S) = -(-0.3605 - 0.5211)
Entropy(S) = 0.8816

So the entropy is approximately 0.8816.

Question 2: If we split the data based on whether customers purchased in the last 3 months, how does this affect entropy, and what is the information gain?

Let's assume the split results in:
- Group 1: 40 customers purchased in the last 3 months (10 churned, 30 stayed).
- Group 2: 60 customers did not purchase in the last 3 months (20 churned, 40 stayed).

Calculate Entropy for Each Group:

For Group 1:

p(churned) = 10/40 = 0.25
p(stayed) = 30/40 = 0.75

$$Entropy(\text{Group 1}) = -[0.75 \times \log_2(0.75) + 0.25 \times \log_2(0.25)]$$

Values of the logarithms: $\log_2 0.75 \approx -0.415$ and $\log_2 0.25 = -2$.

Entropy(Group 1) = -(0.75 × -0.415 + 0.25 × -2)
Entropy(Group 1) = 0.311 + 0.5
Entropy(Group 1) = 0.811

For Group 2:

p(churned) = 20/60 ≈ 0.333
p(stayed) = 40/60 ≈ 0.667

$$Entropy(\text{Group 2}) = -[0.667 \times \log_2(0.667) + 0.333 \times \log_2(0.333)]$$

Values of the logarithms: $\log_2 0.667 \approx -0.585$ and $\log_2 0.333 \approx -1.585$.

Entropy(Group 2) = -(0.667 × -0.585 + 0.333 × -1.585)
Entropy(Group 2) = 0.390 + 0.528
Entropy(Group 2) = 0.918

Calculate Weighted Average of Entropy After Split:

Entropy(after split) = 40/100 × 0.811 + 60/100 × 0.918
Entropy(after split) = 0.3244 + 0.5508
Entropy(after split) = 0.8752

Calculate Information Gain:

Gain(S, split) = Entropy(S) - Entropy(after split)
Gain(S, split) = 0.8816 - 0.8752
Gain(S, split) = 0.0064

In this case, the information gain is very close to zero (about 0.006), which suggests that this particular split is not very useful. The chosen feature may not be the best one to split on, and another feature might provide a higher gain.

Case Scenario Business Application and Growth: The company uses the decision tree model to identify at-risk customers and targets them with personalized offers or reminders. By focusing on retaining these customers, the company can reduce churn and increase revenue.

Let's take a case scenario for a B2B model: the regression model.

What is a Regression Model?

Regression analysis is a statistical method for examining the relationship between a dependent variable and one or more independent variables. It is widely used for prediction and forecasting.

Dependent variable: The dependent variable is the outcome or main factor that you are trying to predict or understand in a study or model. It is also known as the response variable, target variable, or output variable.

Independent variables: An independent variable is one that influences or causes changes in the dependent variable. It is also known as the predictor, explanatory, or input variable.

Formula of Regression:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \epsilon$$

Where:
- $Y$ is the dependent variable.
- $X_1, X_2, X_3$ are the independent variables.
- $\beta_0$ is the intercept of the regression line.
- $\beta_1, \beta_2, \beta_3$ are the slopes.
- $\epsilon$ is the error term.

Case Scenario: A real estate company wants to predict house prices in a certain area based on various factors such as the size of the house, the number of bedrooms, and the distance from the city center.
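Before working through the case, it may help to see how the intercept and slope of such a model are estimated. The sketch below is a simplification under stated assumptions: it fits only the one-variable form Y = β0 + β1·X1 by ordinary least squares, and the listings (sizes and prices) are invented for illustration rather than taken from the handout.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x must vary for the slope to be defined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx               # slope: change in y per unit of x
    b0 = mean_y - b1 * mean_x    # intercept: line passes through the means
    return b0, b1


# Hypothetical listings: size in square feet, price in thousands of dollars.
sizes = [1000, 1500, 1800, 2200, 2600, 3000]
prices = [250, 340, 400, 470, 560, 640]

b0, b1 = fit_simple_regression(sizes, prices)

# Residuals play the role of the error term for each observed listing.
residuals = [y - (b0 + b1 * x) for x, y in zip(sizes, prices)]

print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.4f}")
print(f"predicted price for 2000 sq ft = {b0 + b1 * 2000:.1f} (thousand dollars)")
```

With several independent variables the same idea applies, but the coefficients are normally estimated with a linear-algebra routine or a statistics library rather than by hand.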
Problem Statement: The company needs to build a regression model that predicts the price of a house based on these factors, to better price homes and guide potential buyers.

Process:
1. Data Collection: The company gathers data on house prices, house sizes, bedroom counts, and distances from the city center.
2. Model Training: A linear regression model is trained to understand how each factor influences price.
3. Coefficient Interpretation: The model provides coefficients that indicate the impact of each factor.

Given:

Dependent Variable ($Y$): House Price (in thousands of dollars)

Independent Variables ($X_1, X_2, X_3$):
- $X_1$: Size of the house in square feet
- $X_2$: Number of bedrooms
- $X_3$: Distance from the city center in miles

Understanding the Regression Formula:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$$

Where:
- $Y$ is the predicted house price.
- $X_1$ is the size of the house.
- $X_2$ is the number of bedrooms.
- $X_3$ is the distance from the city center.
- $\beta_0, \beta_1, \beta_2, \beta_3$ are the coefficients that represent the impact of each variable.
- $\epsilon$ is the error term, representing the difference between the actual and predicted values.

Collecting House Prices:

Now, let's predict the price of a house that:
- is 2000 square feet,
- has 4 bedrooms,
- is 4 miles from the city center.

Using the regression equation with coefficients $\beta_0 = 50$, $\beta_1 = 0.2$, $\beta_2 = 30$, and $\beta_3 = -15$:

Predicted Price = 50 + 0.2(2000) + 30(4) - 15(4)
Predicted Price = 50 + 400 + 120 - 60
Predicted Price = 510

Because Y is measured in thousands of dollars, the predicted price is 510,000 dollars.

Business Decisions: The real estate company can use this model to predict house prices in various scenarios, helping it to:
- Set accurate listing prices.
- Advise sellers on how certain factors (like adding a bedroom) might increase their property value.
- Assist buyers in understanding how location and size affect prices.

Hyperparameter Tuning

B2C Context: Adjusting parameters to enhance the accuracy of marketing models.

B2B Context: Optimizing supply chain models to improve efficiency and reduce costs.
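To make the idea of tuning concrete, here is a minimal sketch of two of the search strategies this section covers: grid search, which tries every combination of listed values, and random search, which samples a fixed number of combinations. Everything specific in it is an assumption for illustration: `validation_score` is a made-up stand-in for training a model and scoring it on a validation set, and the hyperparameter names and ranges are hypothetical.

```python
import itertools
import math
import random


def validation_score(learning_rate, max_depth):
    """Stand-in for 'train a model, then score it on a validation set'.

    Hypothetical surface that peaks at learning_rate=0.01, max_depth=5.
    """
    return (1.0
            - 0.05 * (math.log10(learning_rate) + 2) ** 2
            - 0.01 * (max_depth - 5) ** 2)


def grid_search(grid):
    """Score every combination in the grid; return (params, score, count)."""
    names = list(grid)
    best_params, best_score, evaluated = None, -math.inf, 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_score(**params)
        evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, evaluated


def random_search(n_iter, seed=0):
    """Score n_iter randomly sampled combinations; return (params, score)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    best_params, best_score = None, -math.inf
    for _ in range(n_iter):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, 0),  # log-uniform on [1e-4, 1]
            "max_depth": rng.randint(2, 10),            # uniform integer, inclusive
        }
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Two hyperparameters with three values each: 3 ** 2 = 9 combinations.
grid = {"learning_rate": [0.001, 0.01, 0.1], "max_depth": [3, 5, 7]}
grid_best, grid_score, n_evaluated = grid_search(grid)
print(f"grid search: {n_evaluated} evaluations, best = {grid_best}")

rand_best, rand_score = random_search(n_iter=9)
print(f"random search: 9 evaluations, best score = {rand_score:.3f}")
```

The grid run always costs the full product of the list lengths, while the random run costs whatever budget `n_iter` allows. That trade-off is the main practical difference between the two.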
Techniques:
- Grid Search: Conducts a comprehensive search across a defined grid of parameters.
- Random Search: Samples parameter combinations randomly within defined ranges.
- Bayesian Optimization: Uses probabilistic models to identify the best parameters efficiently.

Grid Search

Concept: Grid Search involves defining a grid of hyperparameter values and then training the model for each combination of these values. It exhaustively searches through all possible combinations to find the best-performing set of hyperparameters.

Process:
1. Define the Grid: Specify ranges or lists of values for each hyperparameter you want to tune.
2. Train Models: Train a model for every combination of hyperparameters.
3. Evaluate Models: Use a performance metric (e.g., accuracy, F1-score) to evaluate each model on a validation set.
4. Select Best Parameters: Choose the hyperparameters that yield the best performance metric.

Explanation: Grid Search evaluates all possible combinations of hyperparameters, so if you have k hyperparameters each with n values, it requires n^k models to be trained. This can be computationally expensive for large grids.

Random Search

Concept: Random Search samples hyperparameter values from specified distributions rather than evaluating every possible combination. This method is less exhaustive but can be more efficient.

Process:
1. Define Distributions: Specify distributions for hyperparameters (e.g., uniform, log-uniform).
2. Sample Hyperparameters: Randomly sample combinations from these distributions.
3. Train Models: Train models for each sampled set of hyperparameters.
4. Evaluate Models: Assess performance on a validation set and select the best combination.

Explanation: Random Search does not guarantee finding the global optimum but can be effective if the search space is large. It typically requires fewer evaluations than Grid Search but might miss the optimal set of hyperparameters.

Bayesian Op