Questions and Answers
Which statement accurately describes unsupervised learning?
What is a primary characteristic of supervised learning compared to unsupervised learning?
Which of the following is NOT a statistical method used in knowledge discovery techniques?
Which method is primarily used for market basket analysis?
Which statement about decision trees and algorithms is correct?
Which technique is based on comparing new cases with stored cases using similarity measurements?
What type of clustering method is identified by a bottom-up approach?
Which of the following is a component of fuzzy inference systems?
Which of the following best describes the primary focus of supervised learning methods?
What characterizes unsupervised learning methods in data mining?
Which statement about error measurement in data mining is correct?
What is a key component of predictive modeling techniques in data mining?
What is the primary difference between supervised and unsupervised learning methods?
Which method is primarily used for knowledge discovery in data mining?
Which of the following is true regarding the 'training sample' in data mining?
What are independent variables in the context of data mining?
Why is data often perceived as homogenous even when it is not?
In predictive modeling techniques, what is the significance of identifying unknown or unexpected patterns?
What is the purpose of the Learning System in data mining?
How is the error rate in a Learning System calculated?
Which of the following best describes unsupervised learning methods?
Which of the following is a dependent variable in a supervised learning system?
What must be carefully considered to ensure data is meaningful and reliable for mining?
What might be an appropriate next step if a sample is tested and the prediction is off by 15%?
How do data mining techniques utilize empirical data?
What kind of variables are x1, x2, and x3 in the supervised training phase?
What is an important outcome of effective data mining?
What term is commonly used to refer to the subset from which further samples are selected?
Which method is NOT typically associated with supervised learning?
In predictive modeling, which parameter is NOT typically included when analyzing customer behavior?
What is a main characteristic of supervised machine learning?
Which of the following variables is least likely to be a predictor in a data mining model focused on sales?
What aspect of data mining does the term 'error' pertain to?
What is the primary focus of the initial step in the data mining process?
In the data mining process, what is typically developed during the Analysis of the Data step?
Which of the following is NOT a common pitfall when clarifying the business problem in data mining?
What should target variables in data mining ideally be?
What is a key factor in determining the success of a data mining project?
During the data provisioning step, what is the purpose of partitioning data?
In predictive modeling, which type of variables are preferred due to their lower data requirements?
What is the minimum requirement for a successful data mining model?
What is a critical aspect to consider during the evaluation and validation phase of data mining?
How is the term 'base period' defined in the context of data mining?
Study Notes
Unsupervised Learning
- Unlike supervised learning, unsupervised learning has no training or adjustment phase.
- Patterns discovered in unsupervised learning are based on the relationship and structures found in the data.
- There is no target variable in unsupervised learning, only a model, formula, or output from the learning system.
- Unsupervised learning seeks to uncover hidden structures, relationships, and patterns in data.
- Examples of patterns found in unsupervised learning include the relationship between Xn and Y, or between X1 and X2.
- Unsupervised learning relies on statistical testing to determine relationships instead of expert input.
- Unsupervised learning generally requires less time than supervised learning for analysis, all things being equal.
Knowledge Discovery Techniques
- Statistical methods: Multiple regression, logistic regression, analysis of variance, log-linear regression models, and Bayesian inference.
- Decision trees and decision rules: Classification and Regression Tree (CART) algorithms and pruning algorithms.
- Cluster analysis: Divisive algorithms, agglomerative algorithms, hierarchical clustering, partitional clustering, and incremental clustering.
- Association rules: Market basket analysis, the Apriori algorithm, sequence patterns, and social network analysis.
- Artificial neural networks: Multilayer perceptrons with back-propagation learning, radial basis function networks, and Self-Organizing Maps (SOM), also known as Kohonen networks.
- Genetic algorithms: Used to solve complex optimization problems.
- Fuzzy inference systems: Based on the theory of fuzzy sets and fuzzy logic.
- N-dimensional visualization methods: Geometric, icon-based, pixel-oriented, and hierarchical techniques.
- Case-Based Reasoning (CBR): Based on comparing new cases with stored cases using similarity measurements. This method is useful when only a few cases are available.
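The case comparison behind Case-Based Reasoning can be sketched as a nearest-case lookup. This is a minimal illustration, not a full CBR system: the case attributes, the stored cases, and the inverse-distance similarity measure are all assumptions chosen for the example.

```python
import math

# Hypothetical stored cases: (features, known outcome).
# Features are (age, purchases_last_year); attributes and values are illustrative.
STORED_CASES = [
    ((25, 2), "no_response"),
    ((40, 10), "response"),
    ((58, 12), "response"),
    ((33, 1), "no_response"),
]

def similarity(a, b):
    """Similarity in (0, 1]: 1 for identical cases, falling as Euclidean distance grows."""
    return 1.0 / (1.0 + math.dist(a, b))

def most_similar_case(new_case, stored_cases):
    """Return the stored (features, outcome) pair most similar to new_case."""
    return max(stored_cases, key=lambda case: similarity(new_case, case[0]))

features, outcome = most_similar_case((42, 9), STORED_CASES)
print(features, outcome)
```

In practice the attributes would usually be scaled to comparable ranges first, so that no single attribute dominates the distance.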
The Learning System and Error
- The learning system identifies relationships between independent variables and a dependent (target) variable.
- The output from the learning system is compared to the output of a historical sample.
- The output variable (target variable) is used to predict outcomes.
- The difference between the learning system output and the sample output is considered error or deviation.
- The learning system is adjusted until the error rate reaches an acceptable level.
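The compare-and-adjust loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the learning system is reduced to a single cut-off on one independent variable, the historical sample is invented, and the acceptable error rate of 10% is an arbitrary choice.

```python
# Hypothetical historical sample: (independent variable x, observed target y).
SAMPLE = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0),
          (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]

def predict(x, cutoff):
    """Toy learning system: output 1 when x is above the cut-off."""
    return 1 if x > cutoff else 0

def error_rate(sample, cutoff):
    """Share of cases where the system output differs from the sample output."""
    wrong = sum(1 for x, y in sample if predict(x, cutoff) != y)
    return wrong / len(sample)

def adjust(sample, candidates, acceptable=0.10):
    """Try cut-offs in order; stop at the first one with an acceptable error rate.

    If no candidate is acceptable, the best one seen is returned.
    """
    best = None
    for cutoff in candidates:
        rate = error_rate(sample, cutoff)
        if best is None or rate < best[1]:
            best = (cutoff, rate)
        if rate <= acceptable:
            break
    return best

cutoff, rate = adjust(SAMPLE, candidates=range(0, 11))
print(f"cut-off={cutoff}, error rate={rate:.0%}")
```

Real learning systems adjust many parameters at once, but the stopping rule is the same idea: keep adjusting until the deviation from the sample output is acceptable.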
Supervised Learning: The Training Phase
- Supervised learning puts the learning system through a training phase, in which variables are adjusted to find patterns in the data.
- The goal of supervised learning is to find relationships that predict the target variable.
- The target (dependent) variable is the outcome to be predicted.
- Independent variables are factors or data that may have an impact on the target variable.
Data Preparation
- Data is often assumed to be homogeneous for general purposes; however, that assumption is rarely accurate for data mining.
- Data mining requires accounting for inherent variation in data.
- Data preparation steps are necessary to ensure data is meaningful and reliable for analysis.
Data Mining/Analytics
- Data mining/analytics uses various data analysis methods to discover unknown, unexpected, interesting, and relevant patterns and relationships.
- Data mining/analytics enables making predictions using discovered patterns.
- There are two primary methods for data analysis: supervised and unsupervised.
Training Sample
- Both supervised and unsupervised learning methods require a sample of empirical (observed) data.
- The sample used for training is called a training sample.
- The training sample allows data mining/analytics to learn patterns in the data.
Independent Variables and Target Variable
- Independent variables are factors or data we know could impact the target variable we are trying to predict.
- The target variable is the dependent variable, which is the outcome we want to predict.
Data Mining Process
- The data mining process aims to solve business needs and problems.
- The first step in the data mining process is to understand business needs and identify areas for improvement.
- Examples of business needs/problems include:
  - High customer drop-out rates
  - Disappointing sales figures
  - Poor returns in specific geographic areas
  - Quality issues
  - Converting potential customers into paying customers
  - Developing an area of business with opportunities
General Stages in the Data Mining Process
- Clarification of the objective/question
- Provisioning and processing of data
- Analysis of the data
- Evaluation and validation during analysis
- Application of data mining results and learning from the experience
Business Task: Clarifying The Problem
- The goal of the problem clarification step is to understand the business goal as thoroughly as possible.
- Key areas for clarification include:
  - Target group or object
  - Production budget
  - Promotion extent and kind (number of pages, presentation, coupons, discounts)
  - Involved industries/departments
  - Goods/items involved in the promotion
  - Presentation scenario (e.g., garden party)
  - Transmitted image (e.g., aggressive pricing, brand competence or innovation)
  - Pricing structure
Business Task: Example Problem
- Example problem: Reactivate frequent buyers who haven't purchased in the last year.
- Questions to ask when defining this problem to determine target group:
  - What is "frequent"?
  - Who is a "buyer"?
  - Does the definition include buy-and-return or buy-and-not-pay cases?
  - Which goods/items are included?
  - Is there a price window or cut-off for inclusion?
  - Does the channel matter (online, in-store, etc.)?
  - Does the location of purchase matter?
  - How should a frequent buyer be classified who stopped buying several years ago but has recently purchased a few times?
Necessary Information for the Business Task
- Common specifications for the main objective:
  - Turnover activation
  - Reactivation of inactive customers
  - Cross-selling
- Clarification of the different possible applications (goals):
  - Estimating a potential target group
  - Estimating for a mailout
- Commitment to the action period and application period
- Consideration of any seasonal influences to be noted
- Consideration of any comparable actions in the past
Common Pitfalls in Defining the Business Problem
- Client has not fixed all the details in time for the initial discussion.
- Things change between the briefing and the action without the data miner being informed.
- Marketing colleagues prefer not to be seen as too precise, limiting their flexibility, which leads to inaccurate or incomplete information.
- The problem definition step is essential for adding value and determining the level of success for the project.
Key Performance Indicators (KPIs)
- Key performance indicators need to be clear and measurable.
- Examples of KPIs include:
  - Response rate
  - Cost of mailouts
  - Purchase "frequency"
- Measurable goals need to be agreed upon by all involved.
Communication and Psychology in the Data Mining Process
- Adequate, accurate, and timely communication is vital for the success of any project.
- The problem definition process can take a lot of time, but it pays off in the long run.
- The problem definition process is “decisive in adding value and determining whether they will be successful or not.”
- A bit of psychology may be needed to work effectively with clients.
Data: Provisioning & Processing the Data
- The data provisioning and processing step involves determining the required data for mining and analytics.
- Areas to consider include:
  - The analysis period
  - The basic unit of interest
  - Estimation methods
  - The variables needed
  - Data partition to generate learning/testing data
  - Data partition to generate appropriate random samples
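Partitioning data into learning and testing sets can be sketched with a seeded random shuffle. The 70/30 split, the seed, and the customer IDs below are assumptions for illustration; the source does not prescribe a ratio.

```python
import random

def partition(records, test_share=0.3, seed=42):
    """Shuffle a copy of the records and split it into learning and testing sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]

customers = list(range(1, 101))        # hypothetical customer IDs
learning, testing = partition(customers)
print(len(learning), len(testing))
```

Because every record lands in exactly one of the two sets, the testing data stays unseen while the model is being fitted.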
Analysis Time
- In deployment, there will likely be a gap between using the model and carrying out the activity.
- Example: A target group is determined for a mailout, but those people do not receive the mailout until several hours, days, or weeks after the target group is identified.
- The analysis period includes two periods:
  - Base Period: For input variables and testing
  - Target Period: For the output (target) variable, deployment of results
- There is a time gap between the Base Period (running a model) and the Target Period (using the results).
- Determine the likely gap, then include that gap in the modeling data.
- Example: Input variables (age, location, segment, purchase behavior) need to come from a time period earlier than that of the target variable (purchasing action).
- In the application period, use input variables from the current time period to determine who should receive promotional materials.
Example of Gap Application
- Objective: Christmas season mailing. The application period is December 1-31.
- The target period is typically about one year earlier to capture seasonal effects, so December 1-31 of the previous year.
- Printing, handling, and delivery of the mailout take about 4 weeks and would be finished by the end of November.
- The Base Period would end on October 31 of the previous year.
- Use input variables up to October of the previous year and test target variables from December 1-31 of the previous year.
- In the application period, use input variables from the current October to determine who should receive promotional materials in December of the current year (November is left for printing, processing, and mailing).
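The date arithmetic in this example can be sketched as below. The function name, the dictionary keys, and the calendar year are assumptions for illustration; the example itself gives only months and days. The gap is modeled as starting on November 1, so that November is left for production.

```python
from datetime import date, timedelta

def analysis_periods(application_start, application_end, gap_start):
    """Derive base and target periods by shifting the application schedule back one year.

    gap_start is the first day of the production gap (printing, handling, delivery).
    Assumes none of the dates is February 29, where replace(year=...) would fail.
    """
    one_day = timedelta(days=1)
    return {
        "target_period": (
            application_start.replace(year=application_start.year - 1),
            application_end.replace(year=application_end.year - 1),
        ),
        "base_period_end": gap_start.replace(year=gap_start.year - 1) - one_day,
        "application_input_end": gap_start - one_day,
    }

# Hypothetical year, chosen only to make the dates concrete.
periods = analysis_periods(
    application_start=date(2024, 12, 1),
    application_end=date(2024, 12, 31),
    gap_start=date(2024, 11, 1),
)
for name, value in periods.items():
    print(name, value)
```

Keeping the gap identical in the modeling data and in deployment means the model is trained under the same information delay it will face when applied.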
Gotchas with the Analysis Time
- In the application step, one or more data sets may not be available yet.
- Major components (industry, department, new products replacing those sold last year) may have changed between the analysis time (Base and Target Periods) and the application period.
Basic Unit of Interest
- The basic unit of interest can be a person, place, or thing.
- Examples include:
  - Customers, prospects
  - Company, location
  - Invoice
- Marketing is usually focused on individuals as the basic unit of interest.
- A unit (case) could be a day's worth of data (base and target periods can be simultaneous).
- A unit could be material making up a manufactured product, and the target is the quality of the product.
Target Variables
- An effective target variable may not be readily available from the data and may need to be derived.
- Examples of Target Variables include:
  - Purchase amount ($) or quantity
  - Generic categories (all cups, all cutlery) rather than specific items (pink cups)
- The target variable must be measurable, precise, robust, and relevant.
- In predictive models, less variation in the target variable is preferred.
- Binary (dichotomous) and categorical target variables therefore work best, although they require more data for statistical analysis.
- Statistical models prefer more variation, so a continuous target variable works better there and needs less data.
Input/Explanatory Variables
- The input/explanatory variables are used to inform the analysis.
- These variables are taken only from the Base Period.
- Ensure that these variables are used in the data mining process as they were at the end of the Base Period.
- It can be challenging if variables are not static but subject to change (e.g., address).
- Use these variables with caution, even if they are typically static or slow to change.
- More stable models are obtained by classifying continuous variables, that is, grouping their values into a small number of classes.
- Classifying variables such as turnover or purchase amount highlights the differences in business processes.
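Classifying a continuous variable such as purchase amount can be sketched with fixed class boundaries. The boundaries and labels below are invented for illustration; in a real project they would follow from the business process.

```python
from bisect import bisect_right

# Hypothetical class boundaries for yearly purchase amount, in currency units.
BOUNDARIES = [100, 500, 2000]
LABELS = ["low", "medium", "high", "very high"]

def classify(amount, boundaries=BOUNDARIES, labels=LABELS):
    """Map a continuous amount to a class; a value on a boundary goes to the higher class."""
    return labels[bisect_right(boundaries, amount)]

amounts = [35.0, 100.0, 480.5, 2500.0]
print([classify(a) for a in amounts])
```

Small changes in the raw amount no longer change the model input unless they cross a boundary, which is one reason classified variables tend to give more stable models.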
Modeling: Analysis of the Data
- The core of data modeling involves choosing the most effective method or model.
- A model that can be built on a shorter timeline (more efficient) is likely to serve better than one that takes longer but is more technically effective.
- Data mining tools are relatively easy to use, but the process of effective data mining and analysis is a challenge.
- There are many data mining software tools available, some of them free.
- Look for data mining software that includes sound tools for data preparation and transformation.
- Data warehouse tools can be helpful during the data preparation and transformation stages.
Evaluation & Validation: During Analysis
- Three ways to assess the quality of the calculated model:
  - Using a test sample with the same split (between target = 0 and target = 1) as the training sample (normalization).
  - Using a test sample that has the same split as the whole dataset.
  - Using a test sample with a different stratification.
- Generate several candidate models using regressions, decision trees, etc., and compare the models by applying each model to the test sample and comparing results.
- Some data mining software automates this process or provides a tool to compare models.
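The three kinds of test sample, and the comparison of candidate models on each, can be sketched as follows. Everything concrete here is an assumption for illustration: the dataset, the 50/50 training split, the sample size, and the two cut-off rules standing in for real regression or tree models. For brevity the samples are drawn from the full dataset; in practice they would exclude the training records.

```python
import random

def stratified_sample(records, size, share_of_ones, seed=0):
    """Draw `size` records so that `share_of_ones` of them have target == 1.

    Each record is a (score, target) pair with a 0/1 target.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    ones = [r for r in records if r[1] == 1]
    zeros = [r for r in records if r[1] == 0]
    n_ones = round(size * share_of_ones)
    return rng.sample(ones, n_ones) + rng.sample(zeros, size - n_ones)

def error_rate(model, sample):
    """Share of cases in the sample that the model predicts wrongly."""
    return sum(1 for score, target in sample if model(score) != target) / len(sample)

# Hypothetical dataset: 1,000 cases, 5% with target = 1 (scores 950 and above).
DATA = [(score, 1 if score >= 950 else 0) for score in range(1000)]

# One test sample per assessment option.
TEST_SAMPLES = {
    "same split as training (assumed 50/50)": stratified_sample(DATA, 60, 0.50),
    "same split as whole dataset (5%)": stratified_sample(DATA, 60, 0.05),
    "different stratification (20%)": stratified_sample(DATA, 60, 0.20),
}

# Two hypothetical candidate models.
CANDIDATES = {
    "cutoff_900": lambda score: 1 if score >= 900 else 0,
    "cutoff_940": lambda score: 1 if score >= 940 else 0,
}

for sample_name, sample in TEST_SAMPLES.items():
    for model_name, model in CANDIDATES.items():
        print(f"{sample_name}: {model_name} error = {error_rate(model, sample):.1%}")
```

The same model can show quite different error rates depending on the split of the test sample, which is why the stratification used for testing should be stated alongside the results.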
Data Mining Definition
- Data mining addresses questions about content, patterns, and future applications of data.
- Data sets can be extremely large, sometimes millions of records or transactions.
- Data volumes vary by industry, with web applications generating the largest datasets.
- Data laws and customs can vary, but data sets can often be purchased, rented, or accessed freely.
Data Mining: Population & Sample
- Data mining utilizes the scientific method.
- Entire population datasets may be considered, or only a sample subset may be available.
- For datasets under 10,000 records, it may be best to use the whole set.
- For large datasets, a sample or subset may be used, but it must be representative and unbiased.
- A random sample is often the best method to ensure representativeness (vs. directed or two-phase samples).
- Sampling is a specialized discipline, and sometimes a portion of the population may be studied, such as buying behavior around specific seasons.
Description
Explore the fundamentals of unsupervised learning and knowledge discovery techniques. This quiz covers the key concepts, methods, and applications of statistical analysis in uncovering patterns and relationships within data. Test your understanding of how unsupervised learning differs from supervised learning.