ICT 462-3 Week 2: Data Preparation Techniques
48 Questions

Questions and Answers

What is a primary goal of data preparation in data mining and machine learning?

  • Storing data in a cloud environment
  • Making raw data completely error-free
  • Increasing the size of raw data
  • Transforming data into a usable format for analysis (correct)

How does data cleaning improve the quality of data?

  • By increasing the volume of data collected
  • By identifying and removing irrelevant data
  • By aggregating data from various sources
  • By filling in missing values and resolving inconsistencies (correct)

Which algorithm predicts missing values by averaging from nearest neighbors?

  • K-Nearest Neighbors (KNN) (correct)
  • Support Vector Machine
  • Random Forest
  • Decision Tree

Which of the following is NOT considered a factor of data quality?

  • Marketing reach (correct)

What technique is used to reduce the noise in a dataset?

  • Smoothing (correct)

What is the purpose of data reduction in data preparation?

  • To simplify data analysis by decreasing the amount of data (correct)

What does median imputation primarily maintain in a dataset?

  • Central tendency (correct)

Why is it essential to ensure compatibility of data from different sources?

  • To create uniform data formats for analysis (correct)

What is the definition of an outlier in a dataset?

  • It significantly deviates from other observations (correct)

What is the primary purpose of using the KNN method for imputation?

  • To consider the similarity between data points (correct)

What method can be used to handle missing values among attributes in a dataset?

  • Filling with averages or medians (correct)

What can be the impact of not preprocessing data effectively?

  • Decreased performance of the models (correct)

What are common sources of noise in data collection?

  • Errors from measurement tools and random errors during data processing (correct)

When training a decision tree, how does it handle instances with missing values?

  • It traces the path based on other features using majority class or value (correct)

Which of the following is an outcome of poor data quality?

  • Unreliable and inaccurate results (correct)

Which of the following methods is NOT a technique used for smoothing data?

  • K-Means Clustering (correct)

What is the primary disadvantage of ignoring tuples with missing class labels?

  • It does not utilize remaining attributes in the tuple. (correct)

Why is data imputation utilized in data processing?

  • To avoid removing large amounts of data from the dataset. (correct)

Which imputation method is most appropriate for numerical data that is skewed?

  • Median Imputation (correct)

What is the key characteristic of mode imputation?

  • It uses the most frequent value to fill in missing data. (correct)

Under what circumstances is fixed value imputation particularly useful?

  • When imputing 'not answered' in survey responses. (correct)

Which imputation technique is appropriate for time-series data?

  • Next or Previous Value Imputation (correct)

What is the best practice when using the mean imputation method?

  • It is best used with numerical data that follows a normal distribution. (correct)

What is a significant drawback of using the imputation method?

  • It can introduce bias into the dataset if not done correctly. (correct)

Which technique involves testing subsets of features to find the optimal combination?

  • Wrapper methods (correct)

What is the primary purpose of data transformation in data analysis?

  • To convert data into a more interpretable format (correct)

What is the outcome of normalization (Min-Max Scaling)?

  • Rescales data to a specific range, usually between 0 and 1 (correct)

Which of the following is an example of an embedded method?

  • Random Forests (correct)

Which sampling method involves selecting entire groups rather than individual cases?

  • Cluster sample (correct)

What does standardization (z-score normalization) achieve?

  • Transforms data to a mean of 0 and a standard deviation of 1 (correct)

Which feature selection method evaluates different models to determine the best features?

  • Forward selection (correct)

Which of the following is true about chi-square tests?

  • It assesses the association between categorical variables. (correct)

What is the primary goal of dimensionality reduction?

  • To reduce the number of features while retaining essential information (correct)

Which technique is primarily used in dimensionality reduction to identify directions of maximum variance?

  • Eigenvalues and Eigenvectors (correct)

What characteristic is unique to t-SNE compared to PCA in dimensionality reduction?

  • It maintains local structure in high-dimensional data. (correct)

Which of the following techniques is NOT associated with feature selection?

  • Principal Component Analysis (PCA) (correct)

What step must be performed first in the PCA process?

  • Standardize the data (correct)

How does feature selection enhance model performance?

  • By simplifying the input space and retaining only relevant features (correct)

Which statistical method helps to identify the contribution of features to a target variable?

  • Hypothesis testing (correct)

What is one of the benefits of using dimensionality reduction in data analysis?

  • It reduces processing time and overfitting (correct)

What is a primary use of the interquartile range (IQR) in data analysis?

  • To detect values outside the range that are considered outliers. (correct)

How does the Isolation Forest algorithm identify outliers?

  • By isolating data points that require fewer splits to separate them from the rest. (correct)

Which statement about clustering in the context of noise removal is false?

  • Clustering treats outlier points as part of the main clusters. (correct)

What is the purpose of data reduction in data analysis?

  • To reduce storage costs while maintaining significant information. (correct)

When using the IQR method, what does a negative price value indicate?

  • This may be an outlier or an error needing correction. (correct)

Which clustering algorithm is commonly mentioned as a method for detecting noise?

  • K-Means (correct)

Which of the following is a correct method to calculate the lower bound for outlier detection using the IQR?

  • Lower Bound = Q1 - 1.5 * IQR (correct)

What is a common characteristic of outliers in clustered data?

  • They are not well-represented in any particular cluster. (correct)

    Study Notes

    Data Mining and Practical Machine Learning (ICT 462-3) - Week 2: Data Preparation Techniques

    • Data preparation is critical for data mining and machine learning

    • Raw data needs transformation for effective analysis and modeling

    • Data quality directly impacts model performance

    • Data preprocessing aims to improve data quality, increase model efficiency, and ensure data compatibility

    • Data quality factors include accuracy, completeness, consistency, timeliness, believability, and interpretability

    • Data preparation includes data cleaning, data integration, data transformation, and data reduction

    • Data cleaning routines "clean" the data by:

      • Filling in missing values
      • Smoothing noisy data
      • Identifying or removing outliers
      • Resolving inconsistencies
    • Techniques for handling missing values:

      • Ignore the tuple (with caution)
      • Imputation (replacing missing values):
        • Mean imputation
        • Median imputation
        • Mode imputation
        • Fixed value imputation
        • Next or Previous Value
        • Maximum or Minimum Value
        • Missing Value Prediction (using machine learning models like KNN or random forest)
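The imputation options listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the course's reference code: the function names, the use of `None` to mark a missing value, and the sample data are all assumptions made here. The KNN variant further assumes that every other column is complete and that Euclidean distance is a sensible similarity measure.

```python
import math
import statistics

def impute_constant(values, fill):
    """Fixed value imputation: replace every None with one fill value."""
    return [fill if v is None else v for v in values]

def impute_mean(values):
    observed = [v for v in values if v is not None]
    return impute_constant(values, statistics.mean(observed))

def impute_median(values):
    observed = [v for v in values if v is not None]
    return impute_constant(values, statistics.median(observed))

def impute_mode(values):
    observed = [v for v in values if v is not None]
    return impute_constant(values, statistics.mode(observed))

def impute_previous(values):
    """Previous value imputation: carry the last observed value forward."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # a leading gap stays None
    return filled

def impute_knn(rows, target, k=2):
    """Fill a missing rows[i][target] with the mean of that column over the
    k nearest complete rows; distance is computed on the other columns."""
    complete = [r for r in rows if r[target] is not None]
    result = []
    for r in rows:
        if r[target] is not None:
            result.append(list(r))
            continue

        def dist(other):
            return math.sqrt(sum((a - b) ** 2
                                 for j, (a, b) in enumerate(zip(r, other))
                                 if j != target))

        nearest = sorted(complete, key=dist)[:k]
        filled = list(r)
        filled[target] = statistics.mean(n[target] for n in nearest)
        result.append(filled)
    return result

incomes = [30, 32, None, 31, 200]   # skewed by the value 200
print(impute_mean(incomes))         # mean is pulled up by the extreme value
print(impute_median(incomes))       # median stays near the typical values
```

On the skewed `incomes` list the mean fill is 73.25 while the median fill is 31.5, which is why the notes pair median imputation with skewed data.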
    • Handling noisy data:

      • Noise is random error or variance in measured variables
      • Sources of noise include measurement errors and processing errors
      • Outliers are data points that significantly deviate from the rest of the data
      • Smoothing techniques reduce noise
        • Binning (dividing data into intervals and replacing values within bins with representative values like mean or median)
          • Bin mean method
          • Bin median method
          • Bin boundary method
        • Regression methods
        • Filtering methods (like median filtering or low-pass filtering)
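The three binning variants named above can be illustrated as follows. This is a sketch under stated assumptions: equal-frequency bins of three values each, a small made-up price list, and hypothetical function names; the notes do not prescribe a bin size.

```python
import statistics

def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_mean(bins):
    return [[statistics.mean(b)] * len(b) for b in bins]

def smooth_by_median(bins):
    return [[statistics.median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_mean(bins))        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Ties in the boundary method go to the lower boundary here; that tie-breaking rule is a choice made for this sketch.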
    • Methods for outlier detection:

      • Z-score (identifies points far from the mean)
      • IQR (interquartile range)
      • Isolation Forest (isolates data points with random splits; outliers need fewer splits to separate from the rest)
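The Z-score and IQR checks can be sketched with the standard library. The 1.5 multiplier matches the lower-bound formula in the quiz (Q1 - 1.5 * IQR); the Z-score threshold of 3, the function names, and the sample prices are assumptions for illustration. Quartiles come from `statistics.quantiles` with its default method, and other quartile conventions can give slightly different bounds.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold std devs."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_bounds(values, k=1.5):
    """Return (Q1 - k * IQR, Q3 + k * IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

prices = [-5, 10, 11, 12, 12, 13, 13, 14, 15, 16, 102]
print(iqr_bounds(prices))
print(iqr_outliers(prices))     # flags both the negative price and 102
print(zscore_outliers(prices))  # the extreme 102 inflates the std dev
```

In this sample the IQR rule flags both -5 and 102, while the Z-score rule flags only 102, because the extreme value inflates the standard deviation that the Z-score depends on.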
    • Clustering can help in noise removal by grouping similar data points

    • Several clustering algorithms can detect and eliminate noisy data points

    • Points far from cluster centroids or belonging to small, isolated clusters are considered outliers
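One way to read the rule above in code: given points that already carry cluster labels, flag a point when its cluster is very small or when it lies far from its cluster centroid. This is only a sketch; the `max_distance` and `min_cluster_size` parameters, the function names, and the sample points are hypothetical, and the clustering step that produces the labels is assumed to have run already.

```python
import math
from collections import defaultdict

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def flag_noise(points, labels, max_distance, min_cluster_size=2):
    """Return points in tiny clusters or far from their cluster centroid."""
    clusters = defaultdict(list)
    for p, label in zip(points, labels):
        clusters[label].append(p)
    centers = {label: centroid(members) for label, members in clusters.items()}

    noise = []
    for p, label in zip(points, labels):
        too_small = len(clusters[label]) < min_cluster_size
        too_far = math.dist(p, centers[label]) > max_distance
        if too_small or too_far:
            noise.append(p)
    return noise

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 9), (20, 20)]
labels = [0, 0, 0, 1, 1, 1, 0, 2]
print(flag_noise(points, labels, max_distance=3.0))
```

Note that a distant point also drags its own centroid toward itself, so in practice the threshold needs tuning per dataset.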

    • Data reduction methods:

      • Dimensionality reduction (e.g., PCA, t-SNE): Reduces features while preserving essential information
      • Feature selection: Retains relevant features, discarding irrelevant or redundant ones (e.g., statistical methods, wrapper methods, embedded methods)
      • Sampling: Represents the large data set using a small random sample
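Sampling is the simplest of these reductions to show in code. The sketch below contrasts a simple random sample of individual records with a cluster sample that picks whole groups; the function names, the fixed seed, and the toy data are assumptions for illustration.

```python
import random

def simple_random_sample(records, n, seed=0):
    """Pick n individual records without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(records, n)

def cluster_sample(groups, n_groups, seed=0):
    """Pick n_groups whole groups and keep every record inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(groups), n_groups)
    return [record for g in chosen for record in groups[g]]

records = list(range(100))
print(simple_random_sample(records, 10))

groups = {"A": [1, 2], "B": [3], "C": [4, 5, 6]}
print(cluster_sample(groups, 2))
```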
    • Data transformation methods:

      • Normalization (min-max scaling): Rescales data to a specific range (e.g., 0-1)
      • Standardization (z-score normalization): Transforms data to have a zero mean and unit variance
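Both transformations are one-line formulas: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std. A minimal sketch, assuming the population standard deviation and hypothetical function names:

```python
import statistics

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

print(min_max_scale([10, 20, 30, 40, 50]))      # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))    # mean 5, std dev 2
```

Min-max scaling is sensitive to extreme values because they set the range, whereas standardization does not bound the output at all.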
    • Data discretization: Converting numeric attributes into interval or conceptual labels (e.g., age could be '0-10', '11-20', etc.)

    • Methods for discretization:

      • Binning
      • Histogram Analysis
      • Discretization by Cluster
      • Correlation methods
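The age example from the notes ('0-10', '11-20', and so on) corresponds to binning with fixed-width intervals. A small sketch, assuming whole-number ages and a hypothetical `discretize_age` helper:

```python
def discretize_age(age, width=10):
    """Map a whole-number age to an interval label such as '11-20'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= width:
        return f"0-{width}"  # the first interval also includes 0
    lower = ((age - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"

ages = [0, 7, 10, 11, 20, 21, 35]
print([discretize_age(a) for a in ages])
# ['0-10', '0-10', '0-10', '11-20', '11-20', '21-30', '31-40']
```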
    • Encoding categorical variables: Converting categorical (non-numeric) data into a numerical form
      • One-Hot Encoding
      • Label Encoding
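The two encodings differ in what they imply about the categories: label encoding assigns one integer per category, while one-hot encoding creates one 0/1 column per category. A minimal sketch; the function names and the alphabetical ordering of categories are assumptions made here.

```python
def label_encode(values):
    """Map each category to an integer (alphabetical order of categories)."""
    categories = sorted(set(values))
    mapping = {c: i for i, c in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Return one 0/1 indicator column per category, plus the column names."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

colors = ["red", "green", "blue", "green"]
print(label_encode(colors))    # ([2, 1, 0, 1], {'blue': 0, 'green': 1, 'red': 2})
print(one_hot_encode(colors))
```

Label encoding imposes an artificial order (blue < green < red), which can mislead models that treat the codes as magnitudes; one-hot encoding avoids that at the cost of more columns.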


    Quiz Team


    Description

    Explore the essential techniques for data preparation in data mining and machine learning. This quiz covers the importance of data quality and the various methods for data cleaning, integration, transformation, and reduction. Understand how these processes impact model performance and analysis.
