Questions and Answers
What is the main purpose of using association rules in data mining?
Which method is primarily used to predict categorical outcomes?
What type of learning involves using unlabeled data without a specified outcome?
What does the support in association rule mining represent?
How is classification different from clustering in data mining?
Which of the following is a technique used in clustering?
What does confidence indicate in association rule mining?
What is the primary assessment criterion for supervised learning models?
Study Notes
BA 3551 Review - Data Mining Methods
- Data Mining Methods are categorized into Association Rules, Cluster Analysis, Classification, and Numeric Prediction. Each approach has distinct characteristics and applications.
Association Rules
- General Idea: Investigating the co-occurrence of items, events, or variables.
- Association Rule: Example: {Milk, Diapers} → {Coke}, meaning customers who buy milk and diapers often also buy Coke.
- Support Metrics (illustrated in the sketch after this list):
- Support Count: The number of transactions containing a particular itemset.
- Support Frequency (or Percentage): The proportion of transactions containing a particular itemset.
- Confidence: The conditional probability of buying Y given that X was bought: confidence(X → Y) = support(X ∪ Y) / support(X).
- Lift: How much more often X and Y co-occur than expected under independence: lift(X → Y) = confidence(X → Y) / support(Y); lift > 1 indicates a positive association.
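As a concrete illustration, here is a minimal Python sketch of all three metrics for the rule {Milk, Diapers} → {Coke}; the transaction list is made up for illustration.

```python
# Minimal sketch: support, confidence, and lift for a rule X -> Y
# over a toy transaction list (the data is made up for illustration).
transactions = [
    {"Milk", "Diapers", "Coke"},
    {"Milk", "Diapers"},
    {"Milk", "Bread", "Coke"},
    {"Diapers", "Coke"},
    {"Milk", "Diapers", "Coke", "Bread"},
]

def support(itemset):
    """Proportion of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diapers"}, {"Coke"}
supp_xy = support(X | Y)     # support of the combined itemset
conf = supp_xy / support(X)  # P(Y | X)
lift = conf / support(Y)     # co-occurrence relative to pure chance

print(f"support={supp_xy:.2f} confidence={conf:.2f} lift={lift:.2f}")
```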
Apriori Algorithm
- Screening Algorithm: Reduces the number of candidate association rules by exploiting the Apriori property: any superset of an infrequent itemset is itself infrequent.
- Min Support and Min Confidence: Thresholds used to filter itemsets and rules.
- Filtering: Itemsets with insufficient support are pruned level by level (see the sketch below).
- Rule Generation: Rules are generated from the remaining frequent itemsets that meet the min support criterion.
- Min Confidence: Rules with insufficient confidence are discarded.
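A compact sketch of the level-wise screening idea, reusing the toy transactions from the previous sketch; the min_support value is arbitrary. Rules would then be generated from the surviving itemsets and filtered by min confidence.

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, pruning
    any candidate whose support falls below min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        supports = {c: sum(c <= t for t in transactions) / n for c in level}
        kept = {c: s for c, s in supports.items() if s >= min_support}
        frequent.update(kept)
        # Candidates for the next level: unions of surviving itemsets
        # that are exactly one item larger.
        level = {a | b for a, b in combinations(kept, 2)
                 if len(a | b) == len(a) + 1}
    return frequent

print(apriori_frequent(transactions, min_support=0.4))
```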
Cluster Analysis
- General Idea: Grouping data into clusters based on similarities.
- Distance Measures:
- Euclidean: Calculates the straight-line distance between data points.
- Manhattan: Calculates distance by summing the absolute differences along each dimension.
- Matching / Jaccard: Similarity measures for binary data (simple matching counts joint absences; Jaccard ignores them).
- Data Normalization: Necessary when attributes have values in significantly different ranges, e.g., using min-max scaling or standardization (see the sketch below).
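A small pure-Python sketch of both distance measures and min-max scaling; the age/income values are illustrative.

```python
def euclidean(p, q):
    """Straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Sum of absolute differences along each dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

def min_max_scale(values):
    """Rescale a column of values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Without scaling, the income attribute would dominate the distances.
ages = [25, 40, 60]
incomes = [30_000, 90_000, 150_000]
points = list(zip(min_max_scale(ages), min_max_scale(incomes)))
print(euclidean(points[0], points[1]), manhattan(points[0], points[1]))
```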
Hierarchical Clustering (Agglomerative)
- Linkage Methods: Different strategies for merging clusters (see the SciPy sketch after this list):
- Single Linkage: Uses the minimum distance between data points across clusters.
- Complete Linkage: Uses the maximum distance between data points across clusters.
- Average Linkage: Uses the average distance between data points across clusters.
- Centroid Linkage: Measures the distance between cluster centroids.
- Ward's Linkage: Aims to minimize the variance within clusters.
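If SciPy is available, these linkage methods map directly onto scipy.cluster.hierarchy; a minimal sketch with made-up 2-D points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# method can be 'single', 'complete', 'average', 'centroid', or 'ward',
# matching the linkage strategies listed above.
Z = linkage(X, method="ward")

# Cut the dendrogram into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```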
K-Means Clustering
- Initialization: Selects initial cluster centroids randomly.
- Assignment: Assigns data points to the nearest centroid.
- Update: Recalculates centroids based on the assigned points.
- Iteration: Repeats the assignment and update steps until convergence (see the sketch below).
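The four steps translate into a short NumPy sketch (a minimal implementation, assuming no cluster goes empty during the updates):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each centroid as the mean of its points
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Iterate until convergence: stop once centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```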
Clusters Evaluation
- Intra-similarity (cohesion): Measured by the Within Sum of Squared Errors (WSS); lower WSS indicates more cohesive clusters.
- Inter-similarity (separation): Measured by the Between Sum of Squares (BSS); higher BSS indicates better separation.
- Elbow Plot: Visualizes the relationship between the number of clusters and WSS to guide the choice of k (see the sketch below).
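WSS, and an elbow-style comparison across k, can be computed directly from the output of the k_means sketch above (BSS follows as the total sum of squares minus WSS):

```python
import numpy as np

def wss(X, labels, centroids):
    """Within Sum of Squared Errors: total squared distance of each
    point to its own centroid (lower = more cohesive clusters)."""
    return sum(((X[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))

# Elbow plot data: WSS always drops as k grows; look for the 'elbow'
# where adding clusters stops paying off.
X = np.random.default_rng(0).normal(size=(100, 2))
for k in range(1, 7):
    labels, centroids = k_means(X, k)
    print(k, round(float(wss(X, labels, centroids)), 1))
```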
Hierarchical Clustering - Pros and Cons
- Pros: Data-driven clustering, variety of solution options.
- Cons: Computationally demanding; dendrogram results can vary with the choice of linkage method.
K-Means Clustering - Pros and Cons
- Pros: Computationally less demanding compared to hierarchical clustering.
- Cons: Sensitive to the initial centroid selection; not ideal for irregularly shaped clusters, outliers, or clusters with different densities; may require multiple runs with different centroid initializations.
Predictive Analytics
- Objective: Building prediction models with a defined outcome variable.
- Data Splitting: Dividing data into training and testing sets.
- Model Training: Training an algorithm on the training set.
- Model Evaluation: Evaluating the model's performance on the held-out test set (the full workflow is sketched below).
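The split/train/evaluate workflow, sketched with scikit-learn (assuming it is installed; the built-in iris dataset and a k-NN classifier stand in for any model):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Data splitting: hold out a test set the model never sees in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Model training on the training set only.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Model evaluation on the held-out test set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```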
k-NN Classification and Prediction
- Classification: Classifies a new data point by the majority class among its k nearest neighbors. The vote can tie (e.g., with an even k, or with more than two classes), in which case a class is picked at random among the tied classes.
- Numeric Prediction: Predicts the new data point's outcome by averaging the outcomes for the k-nearest neighbors.
- Model Evaluation:
- Accuracy: Percentage of correctly classified instances overall.
- RMSE (Root Mean Square Error), MAE (Mean Absolute Error), MAPE (Mean Absolute Percent Error), and Mean Error (ME): Used for evaluating numeric prediction models (defined under Model Evaluation: Numeric Prediction below). A from-scratch k-NN sketch follows this list.
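A sketch of both k-NN modes: majority vote for classification, neighbor average for numeric prediction (toy training data; ties here fall to the first-counted class rather than a random one):

```python
from collections import Counter

def knn_predict(train, new_point, k, task="classify"):
    """train: list of (features, outcome) pairs."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # Rank training points by distance to the new point, keep the k nearest.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], new_point))[:k]
    outcomes = [outcome for _, outcome in neighbors]
    if task == "classify":
        return Counter(outcomes).most_common(1)[0][0]  # majority class
    return sum(outcomes) / k                           # neighbor average

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))  # 'A' by majority vote
```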
Decision Trees
- Classification: Splits on the attribute-value combinations that most reduce entropy; each leaf predicts its majority class.
- Regression: Splits minimize the Sum of Squared Deviations; each leaf predicts the average outcome value of its data points.
Stopping Criteria in Trees
- Decision Trees:
- End when all data points fall into one class.
- Stop further splitting when no split reduces entropy by more than a threshold.
- Regression Trees:
- End splitting when all data points have the same outcome.
- Stop when there's no useful attribute-value combination to further split.
k-NN - Pros and Cons
- Pros: Straightforward, good at capturing relationships.
- Cons: Computationally expensive at prediction time (a lazy learner: all work happens when classifying), sensitive to high dimensionality.
Trees - Pros and Cons
- Pros: Good variable selection capability, robust to outliers.
- Cons: Can be unstable; splitting on one attribute at a time can miss relations that involve several attributes jointly.
Model Evaluation: Classification
- Confusion Matrix: Table of actual vs. predicted classes from which the metrics below are computed.
- Accuracy: Percentage of correct classifications across all data points.
- Recall: Of the actual instances of a class, the percentage correctly retrieved.
- Precision: Of the instances predicted as a class, the proportion that actually belong to it.
- F1-Score: Harmonic mean of precision and recall; especially informative for unbalanced data (see the sketch below).
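Accuracy, recall, precision, and F1 can all be read off a confusion matrix; a sketch with made-up counts for a two-class problem:

```python
# Keys are (actual, predicted) pairs; the counts are made up.
conf = {("pos", "pos"): 40, ("pos", "neg"): 10,
        ("neg", "pos"): 5,  ("neg", "neg"): 45}

classes = ["pos", "neg"]
accuracy = sum(conf[(c, c)] for c in classes) / sum(conf.values())

for c in classes:
    tp = conf[(c, c)]
    recall = tp / sum(conf[(c, p)] for p in classes)     # of actual c
    precision = tp / sum(conf[(a, c)] for a in classes)  # of predicted c
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    print(c, round(recall, 2), round(precision, 2), round(f1, 2))

print("accuracy:", accuracy)
```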
Model Evaluation: Numeric Prediction
- Prediction Error: The difference between predicted and actual values.
- Mean Error (ME): Average prediction error (positive and negative errors can cancel out).
- Mean Absolute Error (MAE): Average of the absolute prediction error magnitudes.
- Mean Absolute Percent Error (MAPE): Average prediction error expressed as a percentage of the actual values.
- Root Mean Square Error (RMSE): Square root of the average squared prediction error; larger errors are penalized more. All four metrics are computed in the sketch below.
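All four metrics from a vector of predictions (the values are made up):

```python
actual = [100.0, 150.0, 200.0, 120.0]
predicted = [90.0, 160.0, 210.0, 100.0]

errors = [p - a for p, a in zip(predicted, actual)]
n = len(errors)

me = sum(errors) / n                          # Mean Error (can cancel out)
mae = sum(abs(e) for e in errors) / n         # Mean Absolute Error
mape = 100 * sum(abs(e) / a                   # Mean Absolute Percent Error
                 for e, a in zip(errors, actual)) / n
rmse = (sum(e ** 2 for e in errors) / n) ** 0.5  # penalizes large errors more

print(me, mae, round(mape, 1), round(rmse, 1))
```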
Tree Evaluation
- Entropy: The degree of uncertainty in a node, computed as −Σ pᵢ log₂ pᵢ over the class proportions pᵢ; lower entropy means more certainty.
- Sum of Squared Errors (SSE): The total squared deviation from the node mean in a regression tree; lower values indicate better prediction. Both measures are computed in the sketch below.
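Both node measures in a few lines of Python, matching the splitting criteria from the Decision Trees section above:

```python
from math import log2

def entropy(labels):
    """Uncertainty of a node's class mix; 0 means a pure node."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * log2(p) for p in probs)

def sse(values):
    """Total squared deviation from the node mean (regression trees)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

print(entropy(["yes", "yes", "no", "no"]))  # 1.0 bit: maximum uncertainty
print(entropy(["yes", "yes", "yes"]))       # 0.0: pure node
print(sse([2.0, 4.0, 6.0]))                 # 8.0
```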