Document Details

Uploaded by TrustedIris

Mindanao State University - Iligan Institute of Technology

Tags

business analytics data mining CRISP-DM business framework

Summary

This document presents a business analytics framework, specifically the CRISP-DM methodology, for data mining. It covers various stages of the process, including data understanding, preparation, and modeling. It also examines examples of data mining tasks and emphasizes the need for a standardized approach to ensure reliability and repeatability.

Full Transcript

Business Analytics Framework
CRoss-Industry Standard Process for Data Mining
Department of Information Technology

The Hard Reality of Data
• Enormous amounts of data are being stored in databases
• Businesses are increasingly becoming data-rich, yet, paradoxically, they remain knowledge-poor
• “We are drowning in information, but starving for knowledge” - John Naisbitt
• Unless it is used to improve business practices, data is a liability, not an asset
• Standard data analysis techniques are useful but insufficient and may miss valuable insights

Examples
Consider the enormous amounts of data generated by:
• Credit card transactions
• Searches on Google and other search engines
• Social media

What is Data Mining?
Deployment of business processes, supported by adequate analytical techniques, to:
• Take further advantage of data
• Discover relevant knowledge
• Act on the results

Data Mining Tasks
• Summarization
• Classification/Prediction
  ○ Classification, Concept Learning, Regression
• Clustering
• Dependency modeling
• Anomaly detection

Summarization
To find a compact description for a subset of the data.
  ○ Producing the average downtime of all plant equipment in a given month; computing the total income generated by each sales representative per region per year
Techniques:
  ○ Statistics, Information theory, etc.

Prediction
To learn a function that associates a data item with the value of a response variable. If the response variable is discrete, we talk of classification learning; if the response variable is continuous, we talk of regression learning.
  ○ Assessing creditworthiness in a loan underwriting business; assessing the probability of response to a direct marketing campaign
Techniques:
  ○ Decision trees, Neural networks, Naive Bayes, etc.
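To make the difference between classification and regression concrete, here is a minimal Python sketch. It uses k-nearest neighbours, which is not among the techniques named on the slide and is chosen only because it fits in a few lines. The toy data, the function names and the choice of k=3 are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def k_nearest(train, x, k=3):
    """Return the k training rows whose single feature is closest to x."""
    return sorted(train, key=lambda row: abs(row[0] - x))[:k]


def classify(train, x, k=3):
    """Discrete response: majority vote among the k nearest neighbours."""
    labels = [label for _, label in k_nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]


def regress(train, x, k=3):
    """Continuous response: mean of the k nearest neighbours' values."""
    return mean(value for _, value in k_nearest(train, x, k))


# Invented toy data: (annual income in thousands, response).
credit = [(20, "reject"), (25, "reject"), (30, "reject"),
          (60, "approve"), (70, "approve"), (80, "approve")]
spend = [(20, 1.0), (25, 1.5), (30, 2.0),
         (60, 5.0), (70, 6.0), (80, 7.0)]

print(classify(credit, 65))  # discrete label -> classification
print(regress(spend, 65))    # continuous value -> regression
```

The same neighbour search serves both tasks; only the way the neighbours' responses are combined changes, which is exactly the discrete versus continuous distinction drawn above.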

Clustering
To identify a set of (meaningful) categories or clusters to describe the data. Clustering relies on some notion of similarity among data items and strives to maximize intra-cluster similarity while minimizing inter-cluster similarity.
  ○ Segmenting a business's customer base; building a taxonomy of animals in a zoological application
Techniques:
  ○ K-Means, Hierarchical clustering, Kohonen SOM, etc.

Dependency Modeling
To find a model that describes significant dependencies, associations or affinities among variables.
  ○ Analyzing market baskets in consumer goods retail; uncovering cause-effect relationships in medical treatments
Techniques:
  ○ Association rules, Graphical modeling, etc.

Anomaly Detection
To discover the most significant changes in the data from previously measured or normative values.
  ○ Detecting fraudulent credit card usage; detecting anomalous turbine behavior in nuclear plants
Techniques:
  ○ Novelty detectors, Probability density models, etc.

WHY SHOULD THERE BE A STANDARD PROCESS?
The data mining process must be reliable and repeatable by people with little data mining background.
• Framework for recording experience
  ○ Allows projects to be replicated
• Aid to project planning and management
• “Comfort factor” for new adopters

The CRISP-DM framework is an invaluable tool for anyone looking to undertake a data mining project. Its structured approach to planning, executing, and evaluating such projects provides a clear roadmap for success. By following the CRISP-DM process, data miners can ensure that their projects are well-defined, well-executed, and well-documented.

CRoss-Industry Standard Process for Data Mining

CRISP-DM Phases

Phase I: Business Understanding
Focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
1. Determine business objective - a business goal states objectives in business terminology
   Example: Increase catalog sales to existing customers.
2. Assess the situation - current business process
3. Determine data mining objective - a data mining goal states project objectives in technical terms
   Example: Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city) and the price of the item.
4. Produce a project plan - describe the intended plan for achieving the data mining goals and the business goals

Phase II: Data Understanding
Starts with an initial data collection and proceeds with activities to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
1. Collect initial data
2. Describe data
3. Explore data
4. Verify data quality

Phase III: Data Preparation
Covers all activities to construct the final dataset from the initial raw data. Usually takes over 90% of the time.
  ○ Collection, Assessment, Consolidation and Cleaning, Data selection, Transformations
1. Select data - decide on the data to be used for analysis
   a. criteria include relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types
   b. covers selection of attributes as well as selection of records in a table
2. Clean data - may involve selection of clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modeling
3. Construct data - constructive data preparation operations such as the production of derived attributes, entire new records, or transformed values for existing attributes
4. Integrate data - methods whereby information is combined from multiple tables or records to create new records or values
5. Format data - formatting transformations are primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool

Phase IV: Modeling
Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values.
1. Select the modeling technique - based upon the data mining objective
2. Generate test design - before actually building a model, generate a procedure or mechanism to test the model's quality and validity
3. Build model - parameter settings
4. Assess model - rank the models

Phase V: Evaluation
Thoroughly evaluate the model and review the steps executed to construct it, to be certain it properly achieves the business objectives. A key objective is to determine whether there is some important business issue that has not been sufficiently considered.
1. Evaluate the results - assess the degree to which the model meets the business objectives
2. Review process - do a more thorough review of the data mining engagement to determine whether any important factor or task has somehow been overlooked; review the quality assurance issues
3. Determine next steps - decide how to proceed at this stage
   ○ decide whether to finish the project and move on to deployment if appropriate, or whether to initiate further iterations or set up new data mining projects

Phase VI: Deployment
The knowledge gained will need to be organized and presented in a way that the customer can use.
1. Plan deployment - in order to deploy the data mining result(s) into the business, take the evaluation results and determine a strategy for deployment
2. Plan monitoring and maintenance - important if the data mining results become part of the day-to-day business and its environment; needs a detailed plan for the monitoring process
3. Produce final report - the project leader and their team write up a final report
   ○ may be only a summary of the project and its experiences
   ○ may be a final and comprehensive presentation of the data mining result(s)

UNDERSTANDING CRISP-DM METHODOLOGY BY ANALYSING SEATTLE AIRBNB DATA

Business Understanding
With respect to Airbnb's main business, we are interested in answering the following questions:
1. What is the distribution of the listing prices?
2. Which neighborhood has the most listings?
3. Can we predict the price of a new listing based on some of its attributes?

Data Understanding
The Seattle Airbnb dataset includes 3 CSV files:
• calendar.csv: detailed calendar data for listings in Seattle, including the listing id and the price and availability for each day
• listings.csv: detailed listings data for Seattle, containing listing ids, details about the listing, details about the host, the location, and the price of each listing
• reviews.csv: a unique id for each reviewer and detailed comments
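The "describe data" and "verify data quality" steps of Data Understanding can be sketched as follows. This is a minimal illustration using only Python's standard library. The in-memory sample merely stands in for listings.csv, and its column names and values are assumptions, not taken from the real dataset.

```python
import csv
import io

# Tiny in-memory stand-in for listings.csv (hypothetical columns and values).
SAMPLE = """id,neighbourhood,price,bathrooms
1,Capitol Hill,$85.00,1
2,Ballard,"$1,000.00",
3,Capitol Hill,$20.00,1
4,Belltown,,2
"""


def describe(csv_text):
    """Report the row count, column names and share of missing values per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    missing = {
        col: sum(1 for r in rows if not (r[col] or "").strip()) / len(rows)
        for col in columns
    }
    return {"rows": len(rows), "columns": columns, "missing": missing}


summary = describe(SAMPLE)
print(summary["rows"])
print(summary["columns"])
print(summary["missing"])
```

With the real files, the same function could be fed the text of each CSV in turn. The per-column missing share it reports is the kind of figure a later decision on dropping or filling columns would rely on.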

Data Preparation
This part of the methodology focuses on preprocessing the data: transforming it into a more usable form and filtering out irrelevant data. The preprocessing deals with missing values, currency columns, and categorical columns.
• Missing values: drop every column in which more than 30% of the values are missing; then, for the remaining numerical columns, fill in missing values with the column mean.
• Currency columns: remove the dollar sign ($) and the thousands separator (,) from the value string, then convert the column to a numerical data type.
• Categorical columns: create dummy columns for each categorical column.

Data Analysis
What is the distribution of the listing prices?
[Figure: listings price histogram, with a table of listing price statistics]
As the histogram and the accompanying table show, the highest price is $1000, the lowest is $20, and the median is $100.

Data Analysis
Which neighborhood has the most listings?
[Figure: total number of listings for neighborhoods]
The bar chart shows the total number of listings for the top 10 neighborhoods. Capitol Hill has the most, with 351 listings, followed by Ballard and Belltown with 213 and 204 listings respectively.

Data Modeling
Can we predict the price of a new listing based on some of its attributes?
For this problem, a linear regression model was used to predict the price of a new listing. First, select all numerical columns that are possibly related to the price column, then compute a correlation matrix to show how strongly these columns correlate with each other.

Evaluation and Development
The diagram shows how the predicted prices deviate from the true prices; the r² score for the model on the test dataset is 0.56.
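The preparation rules and the r² score described for the Airbnb example can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the code behind the slides: the four toy rows, the column names (beds, room_type, square_feet) and all helper names are invented, and the line is fitted on a single feature and scored on the same rows rather than on a held-out test set.

```python
from statistics import mean


def parse_currency(text):
    """Turn a string such as '$1,000.00' into a float; blank values become None."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


def drop_sparse_columns(rows, threshold=0.30):
    """Drop columns whose share of missing (None) values exceeds the threshold."""
    keep = [c for c in rows[0].keys()
            if sum(r[c] is None for r in rows) / len(rows) <= threshold]
    return [{c: r[c] for c in keep} for r in rows]


def fill_with_mean(rows, column):
    """Replace missing numeric values in one column with the column mean."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def add_dummies(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    out = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        out.append(new)
    return out


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    avg = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented toy rows; square_feet is missing in 75% of them, beds in 25%.
raw = [
    {"price": "$400.00", "beds": 1, "room_type": "Private room", "square_feet": None},
    {"price": "$700.00", "beds": 2, "room_type": "Entire home", "square_feet": None},
    {"price": "$800.00", "beds": None, "room_type": "Entire home", "square_feet": 900},
    {"price": "$1,100.00", "beds": 3, "room_type": "Entire home", "square_feet": None},
]

rows = [{**r, "price": parse_currency(r["price"])} for r in raw]
rows = drop_sparse_columns(rows)       # removes square_feet
rows = fill_with_mean(rows, "beds")    # fills the missing bed count
rows = add_dummies(rows, "room_type")  # adds room_type_* indicator columns

xs = [r["beds"] for r in rows]
ys = [r["price"] for r in rows]
slope, intercept = fit_line(xs, ys)
predicted = [slope * x + intercept for x in xs]
print(rows[0])
print(slope, intercept, r_squared(ys, predicted))
```

An r² of 1 would mean the predictions match the true prices exactly, while 0 means the model does no better than always predicting the mean price. The reported 0.56 therefore indicates that the model explains roughly half of the variance in price.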

Group Reporting
Instructions:
• The dataset will be sent to your email addresses.
• Online consultations will be available and will be channeled ONLY through our email thread (from the email that contains the dataset).
• Feel free to improvise and be creative, but base your analyses only on the data that will be given to you.

Important Dates:
• October 15 - 19: work with your group on finalizing your assigned CRISP-DM report
• October 22 - 25: lab break
• October 26: Reporting Proper
• October 28 - November 3: Asynchronous Week (review week for the Major Exam)
• November 13: Major Exam, multiple choice, 100 items
