Questions and Answers
What is the main purpose of using association rules in data mining?
Which method is primarily used to predict categorical outcomes?
What type of learning involves using unlabeled data without a specified outcome?
What does the support in association rule mining represent?
How is classification different from clustering in data mining?
Which of the following is a technique used in clustering?
What does confidence indicate in association rule mining?
What is the primary assessment criterion for supervised learning models?
Study Notes
BA 3551 Review - Data Mining Methods
- Data Mining Methods are categorized into Association Rules, Cluster Analysis, Classification, and Numeric Prediction. Each approach has distinct characteristics and applications.
Association Rules
- General Idea: Investigating the co-occurrence of items, events, or variables.
- Association Rule: Example: {Milk, Diapers} → {Coke}, meaning customers who buy milk and diapers often also buy Coke.
- Support Metrics (illustrated in the sketch after this list):
- Support Count: The number of transactions containing a particular itemset.
- Support Frequency (or Percentage): The proportion of transactions containing a particular itemset.
- Confidence: The conditional probability of buying Y given that X was bought: confidence(X → Y) = support(X ∪ Y) / support(X).
- Lift: How much more often X and Y co-occur than expected under independence: lift(X → Y) = confidence(X → Y) / support(Y); lift > 1 indicates a positive association.
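As a concrete illustration, here is a minimal Python sketch of all three metrics for the rule {Milk, Diapers} → {Coke}; the transaction list is made up for illustration.

```python
# Minimal sketch: support, confidence, and lift for a rule X -> Y
# over a toy transaction list (the data is made up for illustration).
transactions = [
    {"Milk", "Diapers", "Coke"},
    {"Milk", "Diapers"},
    {"Milk", "Bread", "Coke"},
    {"Diapers", "Coke"},
    {"Milk", "Diapers", "Coke", "Bread"},
]

def support(itemset):
    """Proportion of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diapers"}, {"Coke"}
supp_xy = support(X | Y)     # support of the combined itemset
conf = supp_xy / support(X)  # P(Y | X)
lift = conf / support(Y)     # co-occurrence relative to pure chance

print(f"support={supp_xy:.2f} confidence={conf:.2f} lift={lift:.2f}")
```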
Apriori Algorithm
- Screening Algorithm: Reduces the number of candidate association rules by exploiting the Apriori property: any superset of an infrequent itemset is itself infrequent.
- Min Support and Min Confidence: Thresholds used to filter itemsets and rules.
- Filtering: Itemsets with insufficient support are pruned level by level (see the sketch below).
- Rule Generation: Rules are generated from the remaining frequent itemsets that meet the min support criterion.
- Min Confidence: Rules with insufficient confidence are discarded.
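A compact sketch of the level-wise screening idea, reusing the toy transactions from the previous sketch; the min_support value is arbitrary. Rules would then be generated from the surviving itemsets and filtered by min confidence.

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, pruning
    any candidate whose support falls below min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        supports = {c: sum(c <= t for t in transactions) / n for c in level}
        kept = {c: s for c, s in supports.items() if s >= min_support}
        frequent.update(kept)
        # Candidates for the next level: unions of surviving itemsets
        # that are exactly one item larger.
        level = {a | b for a, b in combinations(kept, 2)
                 if len(a | b) == len(a) + 1}
    return frequent

print(apriori_frequent(transactions, min_support=0.4))
```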
Cluster Analysis
- General Idea: Grouping data into clusters based on similarities.
- Distance Measures:
- Euclidean: Calculates the straight-line distance between data points.
- Manhattan: Calculates distance by summing the absolute differences along each dimension.
- Matching / Jaccard: Similarity measures for binary data (simple matching counts joint absences; Jaccard ignores them).
- Data Normalization: Necessary when attributes have values in significantly different ranges, e.g., using min-max scaling or standardization (see the sketch below).
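A small pure-Python sketch of both distance measures and min-max scaling; the age/income values are illustrative.

```python
def euclidean(p, q):
    """Straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Sum of absolute differences along each dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

def min_max_scale(values):
    """Rescale a column of values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Without scaling, the income attribute would dominate the distances.
ages = [25, 40, 60]
incomes = [30_000, 90_000, 150_000]
points = list(zip(min_max_scale(ages), min_max_scale(incomes)))
print(euclidean(points[0], points[1]), manhattan(points[0], points[1]))
```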
Hierarchical Clustering (Agglomerative)
- Linkage Methods: Different strategies for merging clusters (see the SciPy sketch after this list):
- Single Linkage: Uses the minimum distance between data points across clusters.
- Complete Linkage: Uses the maximum distance between data points across clusters.
- Average Linkage: Uses the average distance between data points across clusters.
- Centroid Linkage: Measures the distance between cluster centroids.
- Ward's Linkage: Aims to minimize the variance within clusters.
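If SciPy is available, these linkage methods map directly onto scipy.cluster.hierarchy; a minimal sketch with made-up 2-D points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# method can be 'single', 'complete', 'average', 'centroid', or 'ward',
# matching the linkage strategies listed above.
Z = linkage(X, method="ward")

# Cut the dendrogram into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```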
K-Means Clustering
- Initialization: Selects initial cluster centroids randomly.
- Assignment: Assigns data points to the nearest centroid.
- Update: Recalculates centroids based on the assigned points.
- Iteration: Repeats the assignment and update steps until convergence (see the sketch below).
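The four steps translate into a short NumPy sketch (a minimal implementation, assuming no cluster goes empty during the updates):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each centroid as the mean of its points
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Iterate until convergence: stop once centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```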
Clusters Evaluation
- Intra-similarity (cohesion): Measured by the Within Sum of Squared Errors (WSS); lower WSS indicates more cohesive clusters.
- Inter-similarity (separation): Measured by the Between Sum of Squares (BSS); higher BSS indicates better separation.
- Elbow Plot: Visualizes the relationship between the number of clusters and WSS to guide the choice of k (see the sketch below).
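WSS, and an elbow-style comparison across k, can be computed directly from the output of the k_means sketch above (BSS follows as the total sum of squares minus WSS):

```python
import numpy as np

def wss(X, labels, centroids):
    """Within Sum of Squared Errors: total squared distance of each
    point to its own centroid (lower = more cohesive clusters)."""
    return sum(((X[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))

# Elbow plot data: WSS always drops as k grows; look for the 'elbow'
# where adding clusters stops paying off.
X = np.random.default_rng(0).normal(size=(100, 2))
for k in range(1, 7):
    labels, centroids = k_means(X, k)
    print(k, round(float(wss(X, labels, centroids)), 1))
```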
Hierarchical Clustering - Pros and Cons
- Pros: Data-driven clustering, variety of solution options.
- Cons: Computationally demanding; dendrogram results can vary with the choice of linkage method.
K-Means Clustering - Pros and Cons
- Pros: Computationally less demanding compared to hierarchical clustering.
- Cons: Sensitive to the initial centroid selection; not ideal for irregularly shaped clusters, outliers, or clusters with different densities; may require multiple runs with different centroid initializations.
Predictive Analytics
- Objective: Building prediction models with a defined outcome variable.
- Data Splitting: Dividing data into training and testing sets.
- Model Training: Training an algorithm on the training set.
- Model Evaluation: Evaluating the model's performance on the held-out test set (the full workflow is sketched below).
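The split/train/evaluate workflow, sketched with scikit-learn (assuming it is installed; the built-in iris dataset and a k-NN classifier stand in for any model):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Data splitting: hold out a test set the model never sees in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Model training on the training set only.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Model evaluation on the held-out test set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```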
k-NN Classification and Prediction
- Classification: Classifies a new data point by the majority class among its k nearest neighbors. The vote can tie (e.g., with an even k, or with more than two classes), in which case a class is picked at random among the tied classes.
- Numeric Prediction: Predicts the new data point's outcome by averaging the outcomes for the k-nearest neighbors.
- Model Evaluation:
- Accuracy: Percentage of correctly classified instances overall.
- RMSE (Root Mean Square Error), MAE (Mean Absolute Error), MAPE (Mean Absolute Percent Error), and Mean Error (ME): Used for evaluating numeric prediction models (defined under Model Evaluation: Numeric Prediction below). A from-scratch k-NN sketch follows this list.
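A sketch of both k-NN modes: majority vote for classification, neighbor average for numeric prediction (toy training data; ties here fall to the first-counted class rather than a random one):

```python
from collections import Counter

def knn_predict(train, new_point, k, task="classify"):
    """train: list of (features, outcome) pairs."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # Rank training points by distance to the new point, keep the k nearest.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], new_point))[:k]
    outcomes = [outcome for _, outcome in neighbors]
    if task == "classify":
        return Counter(outcomes).most_common(1)[0][0]  # majority class
    return sum(outcomes) / k                           # neighbor average

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))  # 'A' by majority vote
```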
Decision Trees
- Classification: Splits on the attribute-value combinations that most reduce entropy; each leaf predicts its majority class.
- Regression: Splits minimize the Sum of Squared Deviations; each leaf predicts the average outcome value of its data points.
Stopping Criteria in Trees
- Decision Trees:
- End when all data points fall into one class.
- Stop further splitting when no split reduces entropy by more than a threshold.
- Regression Trees:
- End splitting when all data points have the same outcome.
- Stop when there's no useful attribute-value combination to further split.
k-NN - Pros and Cons
- Pros: Straightforward, good at capturing relationships.
- Cons: Computationally expensive at prediction time (a lazy learner: all work happens when classifying), sensitive to high dimensionality.
Trees - Pros and Cons
- Pros: Good variable selection capability, robust to outliers.
- Cons: Can be unstable; splitting on one attribute at a time can miss relations that involve several attributes jointly.
Model Evaluation: Classification
- Confusion Matrix: Table of actual vs. predicted classes from which the metrics below are computed.
- Accuracy: Percentage of correct classifications across all data points.
- Recall: Of the actual instances of a class, the percentage correctly retrieved.
- Precision: Of the instances predicted as a class, the proportion that actually belong to it.
- F1-Score: Harmonic mean of precision and recall; especially informative for unbalanced data (see the sketch below).
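Accuracy, recall, precision, and F1 can all be read off a confusion matrix; a sketch with made-up counts for a two-class problem:

```python
# Keys are (actual, predicted) pairs; the counts are made up.
conf = {("pos", "pos"): 40, ("pos", "neg"): 10,
        ("neg", "pos"): 5,  ("neg", "neg"): 45}

classes = ["pos", "neg"]
accuracy = sum(conf[(c, c)] for c in classes) / sum(conf.values())

for c in classes:
    tp = conf[(c, c)]
    recall = tp / sum(conf[(c, p)] for p in classes)     # of actual c
    precision = tp / sum(conf[(a, c)] for a in classes)  # of predicted c
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    print(c, round(recall, 2), round(precision, 2), round(f1, 2))

print("accuracy:", accuracy)
```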
Model Evaluation: Numeric Prediction
- Prediction Error: The difference between predicted and actual values.
- Mean Error (ME): Average prediction error (positive and negative errors can cancel out).
- Mean Absolute Error (MAE): Average of the absolute prediction error magnitudes.
- Mean Absolute Percent Error (MAPE): Average prediction error expressed as a percentage of the actual values.
- Root Mean Square Error (RMSE): Square root of the average squared prediction error; larger errors are penalized more. All four metrics are computed in the sketch below.
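All four metrics from a vector of predictions (the values are made up):

```python
actual = [100.0, 150.0, 200.0, 120.0]
predicted = [90.0, 160.0, 210.0, 100.0]

errors = [p - a for p, a in zip(predicted, actual)]
n = len(errors)

me = sum(errors) / n                          # Mean Error (can cancel out)
mae = sum(abs(e) for e in errors) / n         # Mean Absolute Error
mape = 100 * sum(abs(e) / a                   # Mean Absolute Percent Error
                 for e, a in zip(errors, actual)) / n
rmse = (sum(e ** 2 for e in errors) / n) ** 0.5  # penalizes large errors more

print(me, mae, round(mape, 1), round(rmse, 1))
```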
Tree Evaluation
- Entropy: The degree of uncertainty in a node, computed as −Σ pᵢ log₂ pᵢ over the class proportions pᵢ; lower entropy means more certainty.
- Sum of Squared Errors (SSE): The total squared deviation from the node mean in a regression tree; lower values indicate better prediction. Both measures are computed in the sketch below.
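Both node measures in a few lines of Python, matching the splitting criteria from the Decision Trees section above:

```python
from math import log2

def entropy(labels):
    """Uncertainty of a node's class mix; 0 means a pure node."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * log2(p) for p in probs)

def sse(values):
    """Total squared deviation from the node mean (regression trees)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

print(entropy(["yes", "yes", "no", "no"]))  # 1.0 bit: maximum uncertainty
print(entropy(["yes", "yes", "yes"]))       # 0.0: pure node
print(sse([2.0, 4.0, 6.0]))                 # 8.0
```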