Untitled Quiz
8 Questions

Questions and Answers

What is the main purpose of using association rules in data mining?

  • To predict continuous values of an outcome variable
  • To classify categorical values based on labeled data
  • To investigate the co-occurrence of items or events (correct)
  • To assess clustering quality by comparing labels

Which method is primarily used to predict categorical outcomes?

  • K-NN
  • Regression Trees
  • Decision Trees (correct)
  • Cluster Analysis

What type of learning involves using unlabeled data without a specified outcome?

  • Supervised Learning
  • Predictive Learning
  • Unsupervised Learning (correct)
  • Exploratory Learning

What does the support in association rule mining represent?

  Answer: The ratio of transactions that include item X to the total number of transactions.

How is classification different from clustering in data mining?

  Answer: Classification uses labeled data, while clustering does not.

Which of the following is a technique used in clustering?

  Answer: K-means

What does confidence indicate in association rule mining?

  Answer: The likelihood of Y occurring if X is present.

What is the primary assessment criterion for supervised learning models?

  Answer: Objective performance metrics.

    Study Notes

    BA 3551 Review - Data Mining Methods

    • Data Mining Methods are categorized into Association Rules, Cluster Analysis, Classification, and Numeric Prediction. Each approach has distinct characteristics and applications.

    Association Rules

• General Idea: Investigating the co-occurrence of items, events, or variables.
• Association Rule: Example: {Milk, Diapers} → {Coke}, meaning customers who buy milk and diapers often also buy Coke.
• Support Metrics (computed in the sketch below):
  • Support Count: The number of transactions containing a particular itemset.
  • Support Frequency (or Percentage): The proportion of transactions containing a particular itemset.
• Confidence: The probability of buying Y given that X was bought.
• Lift: How much more likely X and Y are to co-occur than if they were independent; a lift above 1 indicates a positive association.
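
For concreteness, here is a minimal Python sketch of these metrics for the rule {Milk, Diapers} → {Coke}; the five transactions are made up for illustration:

```python
# Hypothetical transactions, for illustration only
transactions = [
    {"Milk", "Diapers", "Coke"},
    {"Milk", "Diapers"},
    {"Milk", "Coke"},
    {"Diapers", "Coke"},
    {"Milk", "Diapers", "Coke"},
]

def support(itemset):
    """Support frequency: proportion of transactions containing the itemset."""
    count = sum(1 for t in transactions if itemset <= t)  # support count
    return count / len(transactions)

X, Y = {"Milk", "Diapers"}, {"Coke"}
confidence = support(X | Y) / support(X)   # P(Y | X)
lift = confidence / support(Y)             # > 1: co-occur more than chance alone

print(f"support={support(X | Y):.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# support=0.40, confidence=0.67, lift=0.83
```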

    Apriori Algorithm

    • Screening Algorithm: Helps reduce the number of association rules to consider.
    • Min Support and Min Confidence: Thresholds to filter rules. 
    • Filtering: Items with insufficient support are excluded. 
    • Rule Generation: Rules are generated from the remaining item sets that meet the min support criteria.
    • Min Confidence: Filter for rules with insufficient confidence.
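
The following pure-Python sketch shows only the level-wise support-filtering idea (not a full, optimized Apriori); the transactions and the min_support value are assumptions for illustration:

```python
from itertools import combinations

transactions = [
    {"Milk", "Diapers", "Coke"},
    {"Milk", "Diapers"},
    {"Milk", "Coke"},
    {"Diapers", "Coke"},
    {"Milk", "Diapers", "Coke"},
]
min_support = 0.4

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Level 1: keep only single items with sufficient support
items = {i for t in transactions for i in t}
frequent_1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Level 2: build candidate pairs only from surviving items (Apriori pruning),
# then filter the pairs by min support as well
candidates = {a | b for a, b in combinations(frequent_1, 2)}
frequent_2 = [c for c in candidates if support(c) >= min_support]
print(frequent_2)
```

Rules would then be generated from the surviving itemsets and filtered by min confidence.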

    Cluster Analysis

• General Idea: Grouping data into clusters based on similarities.
• Distance Measures (illustrated in the sketch below):
  • Euclidean: The straight-line distance between data points.
  • Manhattan: The sum of the absolute differences along each dimension.
  • Matching / Jaccard: Similarity measures for binary data.
• Data Normalization: Necessary when attributes have values in significantly different ranges, e.g., using min-max scaling or standardization.
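
A short numpy sketch of both distances and min-max scaling, with made-up points whose attributes sit on very different scales:

```python
import numpy as np

a = np.array([1.0, 200.0])   # hypothetical data points
b = np.array([3.0, 800.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences per dimension

# Min-max scaling: rescale every attribute to [0, 1] so the second
# attribute's large range does not dominate the distance
X = np.array([[1.0, 200.0], [3.0, 800.0], [2.0, 500.0]])
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```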

    Hierarchical Clustering (Agglomerative)

• Linkage Methods: Different strategies for deciding which clusters to merge (compared in the sketch after this list):
    • Single Linkage: Uses the minimum distance between data points across clusters.
    • Complete Linkage: Uses the maximum distance between data points across clusters.
    • Average Linkage: Uses the average distance between data points across clusters.
    • Centroid Linkage: Measures the distance between cluster centroids.
    • Ward's Linkage: Aims to minimize the variance within clusters.
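
One way to compare the linkage methods is SciPy's hierarchical-clustering routines; in this sketch only the method argument changes, and the data is a toy example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 3], [8, 8], [9, 9], [5, 1]], dtype=float)  # toy data

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                     # agglomerative merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram at 2 clusters
    print(method, labels)
```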

    K-Means Clustering

    • Initialization: Selects initial cluster centroids randomly.
    • Assignment: Assigns data points to the nearest centroid.
    • Update: Recalculates centroids based on the assigned points.
• Iteration: Repeats the assignment and update steps until convergence; the sketch below implements this loop.
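
A minimal numpy implementation of that loop (it assumes Euclidean distance and that no cluster goes empty during the iterations):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # 1. random initialization
    for _ in range(n_iter):
        # 2. Assignment: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Iterate until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```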

Cluster Evaluation

• Intra-similarity (cohesion): Measured by the Within Sum of Squared Errors (WSS). Lower WSS indicates more cohesive clusters.
• Inter-similarity (separation): Measured by the Between Sum of Squares (BSS). Higher BSS means better separation.
• Elbow Plot: Visualizes the relationship between the number of clusters and WSS to guide the choice of k (see the sketch below).
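
A sketch of the elbow-plot computation using scikit-learn, whose inertia_ attribute is the WSS; BSS then follows from the total sum of squares (the data here is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))   # synthetic data
tss = ((X - X.mean(axis=0)) ** 2).sum()              # total sum of squares

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_        # cohesion: lower is better
    bss = tss - wss          # separation: higher is better
    print(k, round(wss, 1), round(bss, 1))
# Plotting k against WSS and looking for the "elbow" guides the choice of k.
```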

    Hierarchical Clustering - Pros and Cons

• Pros: Data-driven clustering; offers a variety of solutions, since the dendrogram can be cut at any level.
• Cons: Computationally demanding; dendrogram results can vary with the choice of linkage method.

    K-Means Clustering - Pros and Cons

• Pros: Computationally less demanding than hierarchical clustering.
• Cons: Sensitive to the initial centroid selection; not ideal for irregularly shaped clusters, outliers, or clusters with different densities; may require multiple runs with different centroid initializations.

    Predictive Analytics

    • Objective: Building prediction models with a defined outcome variable.
    • Data Splitting: Dividing data into training and testing sets. 
    • Model Training: Training an algorithm on the training set. 
    • Model Evaluation: Evaluating the model's performance on the test set.

    k-NN Classification and Prediction

• Classification: Classifies a new data point by the majority class among its k nearest neighbors. Ties are possible (e.g., with an even k in a two-class problem); in that case the class is picked at random.
• Numeric Prediction: Predicts the new data point's outcome by averaging the outcomes of its k nearest neighbors.
• Model Evaluation:
  • Accuracy: Percentage of correctly classified instances overall.
  • RMSE (Root Mean Square Error), MAE (Mean Absolute Error), MAPE (Mean Absolute Percent Error), and Mean Error (ME): Used for evaluating numeric prediction models. A minimal k-NN sketch follows.
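
A compact sketch of both k-NN modes, including the random tie-break; the training data and k below are illustrative assumptions:

```python
import random
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k, classify=True):
    dists = np.linalg.norm(X_train - x_new, axis=1)    # distance to every training point
    neighbor_y = [y_train[i] for i in np.argsort(dists)[:k]]
    if classify:
        counts = Counter(neighbor_y).most_common()
        tied = [cls for cls, c in counts if c == counts[0][1]]
        return random.choice(tied)                     # random selection on a tie
    return float(np.mean(neighbor_y))                  # numeric prediction: average

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
print(knn_predict(X_train, ["A", "A", "B", "B"], np.array([0.5, 0.5]), k=3))  # "A"
```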

    Decision Trees

• Classification: Splits on the attribute-value combination that minimizes entropy; each leaf predicts its majority class (see the sketch below).
• Regression: Splits minimize the Sum of Squared Deviations; each leaf predicts the average outcome value of its data points.
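
A sketch of the entropy criterion: the tree compares candidate splits by the weighted entropy of the resulting child nodes (the labels here are made up):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy of a node's class distribution; 0 means a pure node."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]     # a candidate split
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(entropy(parent), weighted)  # 1.0 -> 0.0: this split removes all uncertainty
```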

    Stopping Criteria in Trees

• Decision Trees:
  • End when all data points in a node fall into one class.
  • Stop further splitting when the decrease in entropy falls below a threshold.
• Regression Trees:
  • End splitting when all data points in a node have the same outcome.
  • Stop when there is no useful attribute-value combination to split on further (these criteria appear as hyperparameters in the sketch below).
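
In a library such as scikit-learn, these stopping criteria surface as hyperparameters; the values below are arbitrary examples, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",          # measure split quality by entropy
    min_impurity_decrease=0.01,   # stop splitting when entropy barely decreases
    min_samples_leaf=5,           # avoid leaves with too few data points
    max_depth=10,                 # hard cap on the depth of the tree
)
```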

    k-NN - Pros and Cons

• Pros: Straightforward; good at capturing local relationships in the data.
• Cons: Computationally expensive at prediction time (a lazy learner); sensitive to high dimensionality.

    Trees - Pros and Cons

• Pros: Good variable-selection capability; robust to outliers.
• Cons: Can be unstable; splitting on one attribute at a time can miss relationships that involve several attributes jointly.

    Model Evaluation: Classification

• Accuracy: Proportion of correct classifications across all data points.
• Recall: Proportion of the actual positive cases that are correctly retrieved, computed per class.
• Precision: Proportion of the predicted positive instances that are correctly classified.
• F1-Score: Harmonic mean of precision and recall; important for unbalanced data.
• Confusion Matrix: A table cross-tabulating predicted versus actual classes; its counts are the basis for the metrics above (see the sketch below).
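
These metrics all derive from confusion-matrix counts; here is a sketch for one class, with hypothetical counts:

```python
# Hypothetical confusion-matrix counts for the positive class
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + fn + tn)
recall    = tp / (tp + fn)                     # actual positives that were found
precision = tp / (tp + fp)                     # predicted positives that are correct
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, recall, precision, round(f1, 3))
```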

    Model Evaluation: Numeric Prediction

• Prediction Error: The difference between predicted and actual values.
• Mean Error (ME): Average prediction error; positive and negative errors can cancel out.
• Mean Absolute Error (MAE): Average of the absolute prediction error magnitudes.
• Mean Absolute Percent Error (MAPE): Average absolute prediction error expressed as a percentage of the actual values.
• Root Mean Square Error (RMSE): Square root of the average of squared prediction errors; larger errors are penalized more. All four are computed in the sketch below.
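
A sketch computing all four metrics on hypothetical predictions:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 150.0])   # hypothetical actual values
y_pred = np.array([110.0, 190.0, 160.0])

e    = y_pred - y_true
me   = e.mean()                            # signed; over- and under-predictions cancel
mae  = np.abs(e).mean()
mape = (np.abs(e) / np.abs(y_true)).mean() * 100
rmse = np.sqrt((e ** 2).mean())            # squaring penalizes large errors more
print(me, mae, round(mape, 2), rmse)
```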

    Tree Evaluation

• Entropy: The degree of class uncertainty in a node; lower entropy means more certainty.
• Sum of Squared Errors (SSE): The total squared deviation of a node's outcomes from the node mean; lower values indicate better predictions (see the sketch below).
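
For the regression case, a sketch showing how a split is scored by the SSE reduction it achieves (the outcome values are made up):

```python
import numpy as np

def sse(y):
    """Sum of squared deviations of a node's outcomes from the node mean."""
    y = np.asarray(y, dtype=float)
    return float(((y - y.mean()) ** 2).sum())

parent = [10, 12, 30, 32]                    # hypothetical outcomes in one node
left, right = [10, 12], [30, 32]             # a candidate split
print(sse(parent), sse(left) + sse(right))   # 404.0 vs 4.0: a large reduction
```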


    Related Documents

    BA 3551 Review PDF
