
Lecture 7 & 8: Data Mining




Data Mining

We will cover:
– Data mining concepts and applications
– Process
– Methods
– Data mining myths and blunders

Meaning
Data mining is the process of digging through data to discover hidden connections and predict future trends: finding patterns and correlations within large data sets to predict outcomes. Using a broad range of techniques, businesses can use this information to increase revenues, cut costs, improve customer relationships, reduce risks and more. In short, data mining sorts through large data sets to identify patterns and relationships that help solve business problems through data analysis, enabling enterprises to predict future trends and make more informed business decisions.

Example of Data Mining
Scenario: A retail company wants to increase sales by understanding customer purchasing behavior.
1. Data Collection: The company collects data on customer purchases over the last two years, including customer ID, items purchased, purchase date, price, and payment method.
2. Data Preparation: The raw data is cleaned to remove duplicates, handle missing values, and normalize the data. For instance, purchases with missing product IDs or prices are corrected or removed from the dataset.
3. Data Mining Techniques: The company applies different data mining techniques to find patterns. For example:
– Association Rule Learning: This technique might reveal that customers who buy diapers often also buy baby wipes (e.g., 60% of the time). (A worked sketch of this calculation follows after the grocery example below.)
– Classification: The data is used to build a model that predicts whether a customer is likely to purchase a product based on their past behavior; for example, the company could predict whether a customer is likely to buy a new phone within the next six months.
– Clustering: Customers are grouped into clusters based on their buying habits. For instance, one cluster may represent customers who frequently buy electronic gadgets, while another may represent those who buy groceries regularly.
4. Pattern Evaluation: The company evaluates the patterns found to ensure they are meaningful. For example, the association rule between diapers and baby wipes could lead to the company placing these items closer together in stores or offering bundle deals.
5. Decision Making: Based on the insights gained, the company implements strategies to increase sales, such as targeted marketing campaigns for different customer segments, personalized discounts, and optimized store layouts.
Result: After implementing these strategies, the company observes an increase in sales and customer satisfaction, showing that data mining helped them make better decisions.

Example of Data Mining
Grocery stores are well-known users of data mining techniques. Many supermarkets offer free loyalty cards that give customers access to reduced prices not available to non-members. The cards make it easy for stores to track who is buying what, when they are buying it and at what price. After analyzing the data, stores can use it to offer customers coupons targeted to their buying habits and decide when to put items on sale or when to sell them at full price.
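To make the association-rule idea concrete, here is a minimal sketch in Python (pandas) that computes the support and confidence of a "diapers → baby wipes" rule from a toy transaction table. The transactions, column names and resulting numbers are invented for illustration and are not part of the lecture example.

```python
import pandas as pd

# Hypothetical transactions: one row per (transaction, item) pair.
purchases = pd.DataFrame({
    "transaction_id": [1, 1, 2, 2, 3, 4, 4, 5],
    "item": ["diapers", "baby wipes", "diapers", "milk",
             "baby wipes", "diapers", "baby wipes", "bread"],
})

# Basket matrix: rows = transactions, columns = items, True if the item was bought.
baskets = pd.crosstab(purchases["transaction_id"], purchases["item"]).astype(bool)

support_diapers = baskets["diapers"].mean()                         # share of baskets with diapers
support_both = (baskets["diapers"] & baskets["baby wipes"]).mean()  # share with both items
confidence = support_both / support_diapers                         # P(wipes | diapers)

print(f"support(diapers, wipes) = {support_both:.2f}")
print(f"confidence(diapers -> wipes) = {confidence:.2f}")
```

In practice, a dedicated association-rule implementation (Apriori or FP-Growth) would enumerate all frequent itemsets rather than checking a single rule by hand.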
Applications

Data Mining Applications
Customer Relationship Management
– Maximize return on marketing campaigns
– Improve customer retention (churn analysis)
– Maximize customer value (cross-, up-selling)
– Identify and treat most valued customers
Real-life scenario: A telecommunications company uses data mining to identify customers who are likely to churn. By analyzing call records, customer complaints, and payment history, the company creates personalized retention strategies, offering special deals to those at risk of leaving.

Banking and Other Financial Services
– Automate the loan application process
– Detect fraudulent transactions
– Maximize customer value (cross-, up-selling)
– Optimize cash reserves with forecasting
Real-life scenario, Credit Scoring: The problem is assessing the creditworthiness of loan applicants. By analyzing past repayment behaviors, employment history, and financial data, banks can create models that predict the likelihood of a customer defaulting on a loan. This helps in making informed lending decisions. (A minimal modelling sketch appears at the end of this applications overview.)

Market Basket Analysis
Application: Retailers use data mining to understand the purchasing patterns of customers. This helps in organizing store layouts, optimizing product placement, and developing targeted marketing strategies.
Real-life scenario: A supermarket chain uses data mining to analyze transaction data from its stores. It discovers that customers who buy bread often purchase butter and milk. This insight leads to bundling these products together in promotions, resulting in increased sales.

Data Mining Applications (cont.)
Retailing and Logistics
– Optimize inventory levels at different locations
– Improve the store layout and sales promotions
– Optimize logistics by predicting seasonal effects
– Minimize losses due to limited shelf life
Real-life scenario, Demand Forecasting: A clothing retailer needs to anticipate demand for seasonal products. Historical sales data, along with external factors like weather forecasts and trends, can be mined to predict future demand. This helps in inventory management, reducing overstock and stockouts.

Manufacturing and Maintenance
– Predict/prevent machinery failures
– Identify anomalies in production systems to optimize the use of manufacturing capacity
– Discover novel patterns to improve product quality
Supply Chain Optimization: Manufacturers use data mining to optimize their supply chains by analyzing data from suppliers, production, and distribution. By understanding patterns in demand, lead times, and supplier reliability, manufacturers can optimize inventory levels, reduce lead times, and improve the overall efficiency of the supply chain. Outcome: lower inventory costs, improved customer satisfaction, and more efficient use of resources.

Data Mining Applications
Brokerage and Securities Trading
– Predict changes in certain bond prices
– Forecast the direction of stock fluctuations
– Assess the effect of events on market movements
– Identify and prevent fraudulent activities in trading
Real-life scenario, Risk Management: Brokerage firms use data mining techniques to assess the risk associated with different investments or trading strategies. Analyzing historical data on market fluctuations, economic indicators, and client portfolios helps firms identify potential risks and adjust their strategies accordingly.
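As a sketch of the credit-scoring application described above (the lecture does not prescribe a tool or algorithm), a logistic regression fitted to a few invented applicant records might look like this in scikit-learn; every feature value and label below is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical applicant features: [annual income (k$), years employed, past late payments]
X = np.array([
    [45, 2, 3], [80, 10, 0], [30, 1, 5], [60, 5, 1],
    [95, 12, 0], [25, 0, 4], [70, 8, 1], [50, 3, 2],
])
# 1 = defaulted, 0 = repaid (invented labels for illustration only)
y = np.array([1, 0, 1, 0, 0, 1, 0, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new applicant: the estimated probability of default informs the lending decision.
new_applicant = np.array([[55, 4, 2]])
print("Estimated P(default):", round(model.predict_proba(new_applicant)[0, 1], 2))
```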
Insurance
– Forecast claim costs for better business planning
– Determine optimal rate plans
– Optimize marketing to specific customers
– Identify and prevent fraudulent claim activities
Real-life scenario, Fraud Detection: An insurance company is concerned about the rising number of fraudulent claims and wants to minimize losses due to fraud.

Data Mining Applications (cont.)
Highly popular application areas for data mining:
– Computer hardware and software
– Science and engineering
– Government and defense
– Homeland security and law enforcement
– Travel industry
– Healthcare
– Medicine
– Entertainment industry
– Sports
– Etc.

Review questions
1. Which of the following is NOT a common application of data mining?
A) Market Basket Analysis  B) Customer Segmentation  C) Predictive Text Input  D) Weather Forecasting
2. In data mining, which technique is commonly used for classifying data into predefined categories?
A) Clustering  B) Regression  C) Classification  D) Association Rule Mining
3. Which of the following is an application of data mining in the retail industry?
A) Predicting customer churn  B) Identifying fraudulent transactions  C) Recommending products to customers  D) All of the above
4. In healthcare, what is one of the primary uses of data mining?
A) Drug discovery  B) Web content filtering  C) Image enhancement  D) Document summarization
5. What is the main goal of using data mining techniques in social media analytics?
A) To improve hardware performance  B) To understand user sentiments and trends  C) To increase storage capacity  D) To enhance network security

Process

Data Mining Process
A systematic way to conduct DM projects. Different groups have different versions; the most common standard processes are:
– CRISP-DM (Cross-Industry Standard Process for Data Mining)
– SEMMA (Sample, Explore, Modify, Model, and Assess)
– KDD (Knowledge Discovery in Databases)

CRISP-DM Phases
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a widely used methodology for data mining and analytics projects. It provides a structured approach to planning, executing, and evaluating data mining tasks. CRISP-DM is industry-neutral, meaning it can be applied to various sectors and domains.

Data Mining Process: CRISP-DM
The six phases, all revolving around the underlying data sources: (1) Business Understanding, (2) Data Understanding, (3) Data Preparation, (4) Model Building, (5) Testing and Evaluation, (6) Deployment.

Data Mining Process: CRISP-DM
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation
(Steps 1–3 account for roughly 85% of total project time.)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment
The process is highly repetitive and experimental.

Case Study: Reducing Customer Churn for a Telecommunications Company Using CRISP-DM
Background: A large telecommunications company faces high customer churn rates, which are impacting its revenue. The company decides to use data mining to identify customers at risk of churning and to implement strategies to retain them. The company employs the CRISP-DM methodology to guide this data mining project.

1. Business Understanding
Objective: Understand the project objectives and requirements from a business perspective.
Key activities:
– Define the problem or business objective.
– Assess the current situation, including resources, risks, and potential impacts.
– Determine the success criteria from a business standpoint.
– Develop a project plan.
Outputs: business objectives; project plan and success criteria; detailed project requirements and constraints.
Objective: Define the project objectives and requirements from a business perspective and translate them into a data mining problem definition.
Steps:
– Identify the primary business objective (e.g., reduce churn).
– Determine success criteria (e.g., reduce the churn rate by 10% within six months).
– Translate the business goal into a data mining goal (e.g., develop a predictive model to identify customers likely to churn).

1. Business Understanding (case study)
Objective: The company's business goal is to reduce churn by 15% within a year by identifying customers likely to leave and proactively engaging them with retention offers.
Steps:
– Business objectives: Reduce customer churn, increase customer retention, and improve customer satisfaction.
– Success criteria: A 15% reduction in churn within 12 months, leading to increased customer lifetime value.
– Data mining goals: Develop a predictive model to identify customers at risk of churning.
Key questions: What factors contribute most to customer churn? How can the company intervene to prevent churn?

2. Data Understanding
Objective: Collect, describe, explore, and verify the data.
Key activities:
– Initial data collection: Gather the data necessary for the project.
– Data description: Identify the data format, structure, and properties.
– Data exploration: Use statistical and visualization techniques to understand patterns, anomalies, and initial insights.
– Data quality verification: Check for missing values, inconsistencies, and outliers.
Outputs: data description report; initial exploration findings; data quality report.

Objective: Collect initial data and become familiar with it. Identify data quality issues, uncover initial insights, and detect interesting subsets.
Steps:
– Gather data from various sources (e.g., transaction databases, customer surveys).
– Explore the data to understand its structure, quality, and potential insights (e.g., analyze purchase frequency, identify missing values).
– Assess the relevance of the data to the business goal.

2. Data Understanding (case study)
Objective: Collect and explore the data to gain insights into factors contributing to churn.
Data sources:
– Customer demographics: age, gender, location, tenure with the company.
– Service usage: call records, data usage, SMS usage.
– Customer interactions: customer service calls, complaints, billing issues.
– Contract information: contract type, payment methods, billing cycle.
Steps:
– Data collection: Gather data from various sources like CRM systems, billing databases, and customer service logs.
– Exploratory data analysis (EDA): Understand distributions, correlations, and patterns in the data; identify common characteristics of customers who have churned.
– Data quality assessment: Identify missing values, outliers, and inconsistencies in the data.

3. Data Preparation
Objective: Prepare the final dataset that will be used for modeling.
Key activities:
– Data selection: Choose the relevant data for the analysis.
– Data cleaning: Address any data quality issues, such as handling missing values or outliers.
– Data transformation: Apply transformations like normalization, aggregation, or creating new derived attributes.
– Data integration: Combine data from different sources.
– Data formatting: Arrange the data into a structure that is suitable for the modeling tools.
Outputs: final dataset; data transformation scripts; documented data preparation steps. (A minimal preparation sketch follows below.)
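The preparation activities listed above (imputation, encoding, scaling) can be expressed as a single preprocessing pipeline. The sketch below uses pandas and scikit-learn with a tiny, made-up churn table; the column names and values are assumptions, not the case study's actual data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical slice of the churn dataset described in the case study.
df = pd.DataFrame({
    "tenure":          [1, 24, 60, 12, None],
    "monthly_charges": [70.0, 55.5, 20.0, None, 89.9],
    "contract_type":   ["month-to-month", "one year", "two year",
                        "month-to-month", "month-to-month"],
})

numeric = ["tenure", "monthly_charges"]
categorical = ["contract_type"]

prepare = ColumnTransformer([
    # Numeric columns: impute missing values with the median, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # Categorical columns: one-hot encode; ignore unseen categories at scoring time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X = prepare.fit_transform(df)
print(X.shape)  # 5 rows x (2 scaled numeric columns + 3 contract-type dummies)
```

The same fitted transformer would then be applied to new data before scoring, keeping preparation consistent between training and deployment.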
Data Preparation: A Critical DM Task
From real-world data to well-formed data:
– Data consolidation: collect data; select data; integrate data.
– Data cleaning: impute missing values; reduce noise in data; eliminate inconsistencies.
– Data transformation: normalize data; discretize/aggregate data; construct new attributes.
– Data reduction: reduce the number of variables; reduce the number of cases; balance skewed data.

Objective: Prepare the final dataset that will be used in the modeling process. This phase involves selecting relevant data, cleaning it, constructing necessary features, and integrating data from different sources.
Steps:
– Clean the data (e.g., handle missing values, remove duplicates).
– Transform data into a format suitable for analysis (e.g., normalize numeric features, encode categorical variables).
– Select the most relevant features for modeling (e.g., the variables that are most predictive of churn).

3. Data Preparation (case study)
Objective: Prepare the data for modeling by cleaning, transforming, and selecting relevant features.
Steps:
– Data cleaning: Handle missing values (e.g., impute or remove); remove duplicate records; correct inconsistencies (e.g., standardize data formats).
– Feature engineering: Create new variables, such as "complaints per month" or "average data usage"; transform categorical variables into numerical formats (e.g., one-hot encoding).
– Feature selection: Select the most relevant features for predicting churn (e.g., contract type, customer service calls); remove irrelevant or redundant features.
Prepared dataset: A clean, structured dataset ready for modeling, with key features like tenure, contract type, service usage, and customer interactions.

4. Modeling
Objective: Apply various modeling techniques to the prepared data and select the best models.
Key activities:
– Select modeling techniques: Choose the appropriate algorithms based on the problem type (e.g., classification, regression, clustering).
– Model building: Train the models on the prepared dataset.
– Model assessment: Evaluate the models using relevant performance metrics.
– Model refinement: Optimize model parameters and select the best-performing model.
Outputs: trained models; model performance evaluation metrics; model documentation.

Objective: Apply various modeling techniques to the prepared data and calibrate the models to achieve optimal results.
Steps:
– Select appropriate modeling techniques (e.g., classification for churn prediction).
– Train the models using the prepared dataset.
– Fine-tune model parameters to improve performance (e.g., adjusting the depth of a decision tree).

4. Modeling (case study)
Objective: Develop and train models to predict customer churn.
Steps:
– Model selection: Evaluate various models like logistic regression, decision trees, random forests, and gradient boosting; use techniques such as cross-validation to assess model performance.
– Model training: Train models using the prepared dataset; adjust hyperparameters to optimize model performance.
– Model evaluation: Use metrics like accuracy, precision, recall, F1-score, and AUC-ROC to assess the model. (A minimal training-and-evaluation sketch follows at the end of this case study.)
Selected model: The random forest model is chosen for its ability to handle complex interactions and its high recall, meaning it effectively identifies customers likely to churn.

5. Evaluation
Objective: Ensure that the models meet the business objectives and are suitable for deployment.
Key activities:
– Review model results: Compare the models' performance against the business success criteria.
– Determine next steps: Decide whether the models are ready for deployment or if further iteration is needed.
– Validate the models: Ensure the models perform well not only on training data but also on unseen data (e.g., through cross-validation).
Outputs: model evaluation report; business evaluation report; go/no-go decision on deployment.

Objective: Assess the models to ensure they meet the business objectives and are suitable for deployment.
Steps:
– Evaluate the model against the success criteria defined in the Business Understanding phase.
– Compare models to select the best one (e.g., the model with the highest accuracy).
– Validate the model using unseen data or cross-validation to ensure generalization.

5. Evaluation (case study)
Objective: Assess the model's effectiveness in predicting churn and ensure it aligns with business objectives.
Steps:
– Performance evaluation: Evaluate the model's predictive power using the test set; compare performance against business success criteria (e.g., reduction in churn rate).
– Model validation: Validate the model using unseen data or through A/B testing; assess the impact of false positives (incorrectly predicting churn for loyal customers).
Outcome: The model successfully identifies over 80% of the customers who are likely to churn, meeting the business objective of reducing churn by 15%.

6. Deployment
Objective: Implement the models in a production environment and monitor their performance.
Key activities:
– Deployment planning: Develop a strategy for deploying the models into the business process.
– Model integration: Embed the models into the operational systems.
– Model monitoring: Set up processes to monitor the performance of the models over time.
– Maintenance: Update models as new data becomes available or business requirements change.
– Reporting: Provide final documentation and reports to stakeholders.
Outputs: deployed models; monitoring and maintenance plans; final project report; lessons learned and recommendations.

Objective: Implement the model in the production environment where it can be used to make decisions or automate processes, and monitor and maintain the model's performance over time.
Steps:
– Deploy the model in a production environment (e.g., integrate it with CRM systems).
– Create dashboards or reports for business stakeholders to monitor model predictions.
– Continuously monitor the model's performance and retrain it as needed.

6. Deployment (case study)
Objective: Deploy the model into the production environment to start predicting and reducing churn.
Steps:
– Model deployment: Integrate the model into the company's CRM system; use the model to score customers on their likelihood to churn in real time.
– Retention strategies: Implement retention strategies based on model predictions, such as targeted offers, discounts, or personalized communication.
– Monitoring: Monitor the model's performance over time to ensure it remains effective; regularly update the model with new data to adapt to changing customer behavior.
Business impact: Within six months of deployment, the company sees a 12% reduction in churn, with ongoing improvements as the model continues to be refined. This results in significant cost savings and higher customer satisfaction.

Conclusion: CRISP-DM as an Iterative Process
CRISP-DM is not strictly linear; it is an iterative process. You might need to revisit earlier phases based on findings in later stages. For example, insights during the modeling phase might lead to further data preparation or even a reassessment of business objectives.
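To tie the Modeling, Evaluation and Deployment phases of the case study together, here is a minimal scikit-learn sketch: train a random forest, report the metrics named above, and score "new" customers. The feature matrix and labels are randomly generated stand-ins, so the printed numbers will not match the case study's figures.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical prepared features (e.g. tenure, charges, complaints, contract dummies).
X = rng.normal(size=(1000, 6))
# Hypothetical churn labels; in the case study these come from the prepared dataset.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluation: precision, recall, F1 and accuracy on held-out data, plus AUC-ROC.
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, pred))
print("AUC-ROC:", round(roc_auc_score(y_test, proba), 3))

# "Deployment": score incoming customers and flag those above a churn-risk threshold.
new_customers = rng.normal(size=(5, 6))
risk = model.predict_proba(new_customers)[:, 1]
print("Customers flagged for retention offers:", np.where(risk > 0.5)[0].tolist())
```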
Importance and Benefits of CRISP-DM
– Flexibility: It can be adapted to various industries and problem types.
– Structure: Provides a clear, organized framework for executing data mining projects.
– Focus on business goals: Emphasizes the alignment of data mining efforts with business objectives.
– Iterative process: Allows for refinement and optimization throughout the project lifecycle.

Challenges and Considerations
– Resource intensive: Requires careful planning and sufficient resources to manage the entire process.
– Complexity: May become complex with large datasets or advanced modeling techniques.
– Changing requirements: Business objectives or data availability may change, requiring adjustments to the process.
CRISP-DM is a foundational methodology for data scientists and analysts, ensuring that data mining projects are carried out effectively and aligned with business goals.

Review questions
1. What does CRISP-DM stand for?
A) Cross Industry Standard Process for Data Mining  B) Critical Industry Standard Process for Data Modeling  C) Comprehensive Research in Statistical Process Data Mining  D) Centralized Risk Industry Standard Process for Data Mining
2. What is the main focus of the Data Preparation phase in CRISP-DM?
A) Defining the business problem  B) Analyzing data quality and exploring data  C) Transforming raw data into the final dataset for modeling  D) Evaluating model performance
3. Which of the following is a key activity in the Evaluation phase of CRISP-DM?
A) Understanding the business context  B) Cleaning the data  C) Assessing model accuracy and generalization  D) Implementing the model into the production environment
4. During which phase might you return to a previous phase for refinement in CRISP-DM?
A) Only in the Business Understanding phase  B) Only in the Modeling phase  C) In any phase, as CRISP-DM is iterative  D) None of the above
5. Which phase involves translating the business objectives into a data mining problem definition?
A) Data Preparation  B) Modeling  C) Business Understanding  D) Evaluation
6. The primary goal of the Deployment phase is to:
A) Explore the data and identify patterns  B) Prepare the data for analysis  C) Evaluate the model's performance  D) Implement the model in a real-world setting

Many organizations across various industries use the CRISP-DM methodology for data mining and analytics projects. Notable adopters include:
1. IBM (technology and consulting): IBM has been a significant proponent of CRISP-DM and has used it in various data mining projects, including customer analytics, fraud detection, and business intelligence. IBM also offers tools like IBM SPSS Modeler that are designed around the CRISP-DM framework.
2. Royal Bank of Scotland (RBS) (banking and financial services): RBS uses CRISP-DM for credit scoring, fraud detection, and customer relationship management. The bank relies on data mining to enhance decision-making and improve the accuracy of its predictive models.

Conclusion
By following the CRISP-DM methodology, the telecommunications company was able to systematically approach the problem of customer churn, leading to a successful reduction in churn rates. The iterative nature of CRISP-DM allowed the company to continuously refine the model and its business strategies, resulting in a positive impact on the bottom line. One telecom company known for using CRISP-DM to address customer churn is Vodafone.
Vodafone has utilized the CRISP-DM methodology in its data analytics projects to predict and reduce customer churn. By applying CRISP-DM, Vodafone has been able to analyze customer behavior, identify factors that contribute to churn, and implement targeted retention strategies to improve customer loyalty and reduce turnover.

Example: Predicting Customer Churn Using CRISP-DM
1. Business Understanding:
Objective: A telecom company wants to reduce customer churn. The goal is to build a model that predicts whether a customer will leave the company based on their usage patterns, demographic information, and other relevant factors.
Key questions: What are the main factors leading to customer churn? How can the company use these insights to retain customers?
Success criteria: The model should achieve at least 80% accuracy in predicting customer churn.
2. Data Understanding:
Data characteristics: The dataset includes demographic information (gender, age), usage metrics (monthly charges, total charges, tenure), contract type, and whether the customer churned.
3. Data Preparation:
– Cleaning: Handle missing values (e.g., replace missing total charges with the average); convert categorical variables (gender, contract type) into numerical values.
– Feature selection: Select relevant features: age, monthly charges, total charges, tenure, contract type.
– Data transformation: Normalize numerical features like monthly charges, total charges, and tenure to bring them to a common scale.
4. Modeling:
– Model selection: Use a decision tree classifier to predict churn.
– Model training: Train the model using 80% of the data and test on the remaining 20%.
– Model parameters: Use default parameters for the decision tree in this basic example.
5. Evaluation:
– Metrics: Evaluate the model using accuracy, precision, recall, and F1-score.
– Results: Suppose the model achieves 85% accuracy, with a precision of 80% and a recall of 75%.
– Analysis: The model is performing well, but there is room for improvement, especially in recall. The company should focus on reducing false negatives (customers predicted not to churn who actually churn).
6. Deployment:
– Application: Deploy the model in a production environment where it can be used to predict churn for new customers.
– Action: The telecom company can use the predictions to target customers at high risk of churning with retention offers or improved services.
– Monitoring: Continuously monitor the model's performance and retrain it periodically to adapt to new customer behavior patterns.
Conclusion: By following the CRISP-DM process, the telecom company can effectively build and deploy a model that predicts customer churn, allowing it to take proactive measures to retain customers. This example illustrates how each phase of CRISP-DM can be applied to a real-world data mining project, even with illustrative (fake) data.

Data Mining Process: SEMMA
The SEMMA cycle: Sample → Explore → Modify → Model → Assess (each step is described below). SEMMA was developed by SAS Institute as part of their data mining software offerings, stands for Sample, Explore, Modify, Model, Assess, and is primarily associated with the SAS software suite.
Sample: Generate a representative sample of the data.
Explore: Visualization and basic description of the data.
Modify: Select variables, transform variable representations.
Model: Use a variety of statistical and machine learning models.
Assess: Evaluate the accuracy and usefulness of the models.

1. Sample
Purpose: The first step involves sampling the data: selecting a subset of data that is representative of the entire dataset. Sampling is crucial to ensure that the dataset is manageable and can be processed efficiently. It helps in reducing the volume of data while preserving the integrity and characteristics of the original dataset. (A minimal sampling sketch follows after the summary below.)
Activities:
– Extract a representative sample from the large dataset.
– Ensure that the sample size is sufficient to make inferences about the whole dataset.

2. Explore
Purpose: After sampling, the next step is to explore the data: visualizing and examining the data to detect patterns, trends, and relationships. Exploration helps in understanding the nature of the data, identifying anomalies, and getting insights that may not be immediately apparent.
Activities:
– Use statistical techniques and visualization tools.
– Identify trends, patterns, and relationships in the data.
– Understand the data distribution and spot anomalies or outliers.

3. Modify
Purpose: The Modify phase involves preparing the data for modeling. This includes cleaning the data, transforming variables, creating new variables, and selecting the most relevant features. Modifying the data ensures that it is in the best possible format for building predictive models.
Activities:
– Cleanse the data by handling missing values and correcting inconsistencies.
– Transform variables (e.g., normalization, scaling).
– Feature engineering: create new features or select the most relevant ones.
– Integrate or aggregate data from multiple sources if necessary.

4. Model
Purpose: This step is where predictive modeling takes place. Various statistical or machine learning models are applied to the prepared data to find patterns or predict outcomes. The goal is to identify the best model that provides the most accurate predictions or insights.
Activities:
– Apply different modeling techniques (e.g., regression, classification, clustering).
– Train models using the modified data.
– Tune model parameters to improve performance.

5. Assess
Purpose: The final step involves assessing the models created in the previous step. This includes evaluating the performance of the models using various metrics, such as accuracy, precision, recall, and other relevant performance indicators. The goal is to ensure that the model meets the business objectives and can be deployed effectively.
Activities:
– Evaluate model performance using statistical metrics and business criteria.
– Compare different models to select the best one.
– Validate the model using testing data or cross-validation techniques.
– Assess whether the model is robust and reliable for deployment.

Summary
SEMMA provides a structured approach to data mining, guiding the process from initial data sampling to model assessment. This methodology ensures that data is processed and analyzed systematically, leading to more accurate and actionable insights.
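A minimal sketch of the Sample step, assuming a pandas DataFrame with a Churn column: a stratified sample keeps the churn rate of the subset close to that of the full table. The table size, churn rate and column names below are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical full customer table (a stand-in for millions of real records).
full = pd.DataFrame({
    "CustomerID": np.arange(200_000),
    "Churn": rng.choice(["Yes", "No"], size=200_000, p=[0.2, 0.8]),
})

# Draw roughly 50,000 rows, sampling within each Churn group (stratified sampling)
# so the sample stays representative of the whole dataset.
frac = 50_000 / len(full)
sample = full.groupby("Churn", group_keys=False).sample(frac=frac, random_state=1)

print(len(sample))
print(sample["Churn"].value_counts(normalize=True))  # churn rate matches the full table
```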
Case Study: TeleCom's Customer Churn Prediction
Let's consider a case study involving a telecommunications company, TeleCom, which wants to reduce customer churn by predicting which customers are likely to leave the service. We'll walk through the SEMMA (Sample, Explore, Modify, Model, Assess) process to see how the company can use data mining to tackle this issue.

Background: TeleCom, a large telecommunications provider, is facing high customer churn rates, which significantly impact its revenue. The company wants to use its customer data to predict which customers are most likely to churn and implement targeted retention strategies to reduce churn.

Step 1: Sample
The first step in the SEMMA process is to create a representative sample from the company's customer database. TeleCom has data on millions of customers, but for analysis, they choose a manageable sample of 50,000 customers.
Sample data (50,000 customers) with the following features: CustomerID, Age, Gender, Tenure (number of months with the company), MonthlyCharges, TotalCharges, ContractType (month-to-month, one year, two year), InternetService (DSL, fiber optic, none), TechSupport (yes/no), StreamingTV (yes/no), PaymentMethod (electronic check, mailed check, bank transfer, credit card), Churn (yes/no).

Step 2: Explore
Next, TeleCom explores the sample data to understand the underlying patterns and relationships, using statistical tools and visualizations.
Visualizations:
– Histogram of Tenure: Reveals that customers who churn tend to have shorter tenures.
– Boxplot of MonthlyCharges: Shows that customers with higher monthly charges are more likely to churn.
– Bar chart of ContractType: Indicates that customers on month-to-month contracts are more likely to churn than those on longer contracts.
Statistical insights:
– Correlation matrix: Shows a strong positive correlation between Churn and MonthlyCharges and a negative correlation with Tenure.
– Customer segmentation: Segmenting customers by contract type and payment method reveals that customers on month-to-month contracts paying by electronic check have the highest churn rates.

Step 3: Modify
After exploring the data, the company modifies the data to make it suitable for modeling. This includes cleaning the data, transforming variables, and creating new features.
– Handling missing values: TotalCharges has some missing values; these are filled by calculating the product of MonthlyCharges and Tenure.
– Feature engineering: Create a new feature AverageChargePerMonth by dividing TotalCharges by Tenure; create dummy variables for categorical variables like ContractType, InternetService, and PaymentMethod.
– Normalization: Normalize continuous variables like MonthlyCharges, Tenure, and TotalCharges to ensure they are on the same scale.

Step 4: Model
With the data now ready, TeleCom applies various modeling techniques to predict customer churn. They test several models to find the best predictor.
– Model 1: Logistic regression. Purpose: to predict the probability of churn based on all the variables. Variables used: all features, including ContractType, MonthlyCharges, Tenure, etc.
– Model 2: Decision tree. Purpose: to create a tree-based model that segments customers into different groups based on their likelihood to churn.
– Model 3: Random forest. Purpose: an ensemble method that builds multiple decision trees and combines their predictions to improve accuracy.
– Model 4: Support vector machine (SVM). Purpose: to find the hyperplane that best separates churners from non-churners.

Step 5: Assess
Finally, the company assesses the performance of the different models to determine which one best predicts churn.
Model assessment
Evaluation metrics:
– Accuracy: the percentage of correct predictions.
– Precision: the ratio of true positive churn predictions to all customers predicted to churn.
– Recall: the ratio of true positives to all actual positives.
– F1-score: the harmonic mean of precision and recall.
– AUC-ROC curve: evaluates the trade-off between sensitivity and specificity.
Results:
– Logistic regression: accuracy = 78%, precision = 75%, recall = 70%, AUC = 0.80.
– Decision tree: accuracy = 75%, precision = 72%, recall = 68%, AUC = 0.75.
– Random forest: accuracy = 82%, precision = 80%, recall = 78%, AUC = 0.85.
– SVM: accuracy = 79%, precision = 76%, recall = 72%, AUC = 0.81.
Model selection: The random forest is selected as it has the highest accuracy and a good balance between precision and recall. It also handles the complexities and interactions of the variables well. (A minimal model-comparison sketch appears at the end of this section.)
Implementation: TeleCom uses the random forest model to predict which customers are likely to churn. The marketing team then targets these customers with retention campaigns such as special discounts, contract renewals, or personalized offers.
Outcome: After implementing the model, TeleCom monitors the results and finds a significant reduction in churn rates, leading to higher customer retention and improved revenue.

Summary
In this case study, TeleCom used the SEMMA process to systematically analyze and model customer churn. By following the steps of Sample, Explore, Modify, Model, and Assess, the company was able to develop a predictive model that helped it effectively reduce customer churn.

Review questions
1. What does SEMMA stand for in the context of data mining?
a) Sample, Explore, Model, Modify, Assess  b) Sample, Explore, Modify, Model, Assess  c) Sample, Evaluate, Model, Measure, Assess  d) Select, Explore, Modify, Model, Assess
2. Which of the following activities is NOT typically associated with the "Explore" step in SEMMA?
a) Identifying trends and patterns in the data  b) Creating new features through feature engineering  c) Detecting outliers and anomalies  d) Visualizing the data using graphs and charts
3. During the "Modify" step of the SEMMA process, what is the primary goal?
a) To finalize the selection of the best model  b) To apply transformations and create new features  c) To explore the relationships in the data  d) To sample the data for model training
4. Which of the following best describes the purpose of the "Model" step in the SEMMA process?
a) To assess the performance of different models  b) To explore the data and find patterns  c) To apply statistical or machine learning techniques to make predictions  d) To clean and preprocess the data
5. Which of the following is a typical output of the "Assess" step in SEMMA?
a) A cleaned and modified dataset  b) A representative sample of the data  c) Performance metrics like accuracy, precision, and recall  d) Visualizations like histograms and scatter plots
6. When would you typically create dummy variables in the SEMMA process?
a) During the Sample step  b) During the Explore step  c) During the Modify step  d) During the Assess step

Summary
CRISP-DM is a broader, more comprehensive framework, ideal for managing data mining projects with a strong focus on aligning with business objectives. SEMMA is a more technical, model-centric methodology, closely associated with the SAS system and focused primarily on the statistical aspects of data mining.
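Returning to TeleCom's Assess step, the model comparison can be sketched with cross-validated AUC-ROC in scikit-learn. The data below is synthetic, so the printed scores are illustrative only and will not reproduce the percentages reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for TeleCom's prepared features and churn labels.
X = rng.normal(size=(2000, 8))
y = (X[:, 0] - X[:, 1] + rng.normal(size=2000) > 0).astype(int)

candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# 5-fold cross-validated AUC-ROC for each candidate; the best scorer would be
# carried forward into the retention campaign, as in the case study.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```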
Data Mining Myths and Blunders

Data Mining Myths
Data mining …
– provides instant solutions/predictions
– is not yet viable for business applications
– requires a separate, dedicated database
– can only be done by those with advanced degrees
– is only for large firms that have lots of customer data
– is just another name for good old statistics

Common Data Mining Mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is and what it really can and cannot do
3. Not leaving sufficient time for data acquisition, selection and preparation
4. Looking only at aggregated results and not at individual records/predictions
5. Being sloppy about keeping track of the data mining procedure and results
6. Ignoring suspicious (good or bad) findings and quickly moving on
7. Running mining algorithms repeatedly and blindly, without thinking about the next stage
8. Naively believing everything you are told about the data
9. Naively believing everything you are told about your own data mining analysis
10. Measuring your results differently from the way your sponsor measures them

Data Mining Videos
https://youtu.be/R-sGvh6tI04
https://youtu.be/W44q6qszdqY
