ICT 462-3 Week 2: Data Preparation Techniques

Questions and Answers

What is a primary goal of data preparation in data mining and machine learning?

  • Storing data in a cloud environment
  • Making raw data completely error-free
  • Increasing the size of raw data
  • Transforming data into a usable format for analysis (correct)

How does data cleaning improve the quality of data?

  • By increasing the volume of data collected
  • By identifying and removing irrelevant data
  • By aggregating data from various sources
  • By filling in missing values and resolving inconsistencies (correct)

Which algorithm predicts missing values by averaging from nearest neighbors?

  • K-Nearest Neighbors (KNN) (correct)
  • Support Vector Machine
  • Random Forest
  • Decision Tree

Which of the following is NOT considered a factor of data quality?

Answer: Marketing reach

What technique is used to reduce the noise in a dataset?

Answer: Smoothing

What is the purpose of data reduction in data preparation?

Answer: To simplify data analysis by decreasing the amount of data

What does median imputation primarily maintain in a dataset?

Answer: Central tendency

Why is it essential to ensure compatibility of data from different sources?

Answer: To create uniform data formats for analysis

What is the definition of an outlier in a dataset?

Answer: It significantly deviates from other observations

What is the primary purpose of using the KNN method for imputation?

Answer: To consider the similarity between data points

What method can be used to handle missing values among attributes in a dataset?

Answer: Filling with averages or medians

What can be the impact of not preprocessing data effectively?

Answer: Decreased performance of the models

What are common sources of noise in data collection?

Answer: Errors from measurement tools and random errors during data processing

When training a decision tree, how does it handle instances with missing values?

Answer: It traces the path based on other features using majority class or value

Which of the following is an outcome of poor data quality?

Answer: Unreliable and inaccurate results

Which of the following methods is NOT a technique used for smoothing data?

Answer: K-Means Clustering

What is the primary disadvantage of ignoring tuples with missing class labels?

Answer: It does not utilize the remaining attributes in the tuple.

Why is data imputation utilized in data processing?

Answer: To avoid removing large amounts of data from the dataset.

Which imputation method is most appropriate for numerical data that is skewed?

Answer: Median Imputation

What is the key characteristic of mode imputation?

Answer: It uses the most frequent value to fill in missing data.

Under what circumstances is fixed value imputation particularly useful?

Answer: When imputing 'not answered' in survey responses.

Which imputation technique is appropriate for time-series data?

Answer: Next or Previous Value Imputation

What is the best practice when using the mean imputation method?

Answer: It is best used with numerical data that follows a normal distribution.

What is a significant drawback of using the imputation method?

Answer: It can introduce bias into the dataset if not done correctly.

Which technique involves testing subsets of features to find the optimal combination?

Answer: Wrapper methods

What is the primary purpose of data transformation in data analysis?

Answer: To convert data into a more interpretable format

What is the outcome of normalization (Min-Max Scaling)?

Answer: Rescales data to a specific range, usually between 0 and 1

Which of the following is an example of an embedded method?

Answer: Random Forests

Which sampling method involves selecting entire groups rather than individual cases?

Answer: Cluster sample

What does standardization (z-score normalization) achieve?

Answer: Transforms data to a mean of 0 and a standard deviation of 1

Which feature selection method evaluates different models to determine the best features?

Answer: Forward selection

Which of the following is true about chi-square tests?

Answer: It assesses the association between categorical variables.

What is the primary goal of dimensionality reduction?

Answer: To reduce the number of features while retaining essential information

Which technique is primarily used in dimensionality reduction to identify directions of maximum variance?

Answer: Eigenvalues and Eigenvectors

What characteristic is unique to t-SNE compared to PCA in dimensionality reduction?

Answer: It maintains local structure in high-dimensional data.

Which of the following techniques is NOT associated with feature selection?

Answer: Principal Component Analysis (PCA)

What step must be performed first in the PCA process?

Answer: Standardize the data

How does feature selection enhance model performance?

Answer: By simplifying the input space and retaining only relevant features

Which statistical method helps to identify the contribution of features to a target variable?

Answer: Hypothesis testing

What is one of the benefits of using dimensionality reduction in data analysis?

Answer: It reduces processing time and overfitting

What is a primary use of the interquartile range (IQR) in data analysis?

Answer: To detect values outside the range that are considered outliers.

How does the Isolation Forest algorithm identify outliers?

Answer: By isolating data points that require fewer splits to separate them from the rest.

Which statement about clustering in the context of noise removal is false?

Answer: Clustering treats outlier points as part of the main clusters.

What is the purpose of data reduction in data analysis?

Answer: To reduce storage costs while maintaining significant information.

When using the IQR method, what does a negative price value indicate?

Answer: This may be an outlier or an error needing correction.

Which clustering algorithm is commonly mentioned as a method for detecting noise?

Answer: K-Means

Which of the following is a correct method to calculate the lower bound for outlier detection using the IQR?

Answer: Lower Bound = Q1 - 1.5 * IQR

What is a common characteristic of outliers in clustered data?

Answer: They are not well-represented in any particular cluster.

Flashcards

Data Preparation

The process of transforming raw data into a format suitable for analysis and modeling.

Data Preparation: Key Steps

It involves cleaning, transforming, and integrating data to improve its quality and make it more suitable for analysis.

Data Cleaning

A crucial step in data preprocessing, as it involves identifying and correcting errors, handling missing values, and resolving inconsistencies in the data.

Handling Missing Values

It involves replacing missing values with sensible estimates based on available data or using algorithms.

Missing Values

These values can be due to errors, incomplete data collection, or other factors. They can significantly impact the accuracy of models.

Data Quality

The degree to which data is accurate, complete, consistent, timely, believable, and interpretable. It is improved by identifying and resolving inconsistencies and errors in the data.

Data Transformation

Changing the format or representation of the data to make it more suitable for analysis.

Data Integration

The process of combining data from multiple sources into a single, unified dataset.

Mean/Mode Imputation

Replacing missing values with the average (numerical) or most frequent value (categorical) in the dataset.

Machine Learning Imputation

Predicting missing values based on the values of other features using machine learning algorithms like K-Nearest Neighbors (KNN) or Random Forest.

Outlier

Data points that significantly deviate from the general trend or pattern in the dataset.

Noise

Random errors or variance in a measured variable. It can be introduced by measurement tools, data processing, or human error during data collection.

Smoothing

A technique used to reduce noise in the data by smoothing out fluctuations while preserving essential patterns.

Binning

A smoothing technique that partitions data into intervals (bins) and replaces each value with a representative value for its bin, such as the bin mean, median, or nearest boundary.

Regression Smoothing

A smoothing technique that uses regression models to fit a line or curve to the data and then replaces original values with predicted values from the model.

Ignoring Tuples with Missing Values

Discarding tuples whose class label is missing, often done in classification tasks. It is not very effective unless the tuple has several missing attribute values, and it throws away the potentially valuable information in the tuple's remaining attributes.

Data Imputation

Replacing missing values with a reasonable substitute, aiming to retain most of the dataset's information. Used to prevent data loss and avoid bias.

Mean Imputation

Replacing missing values with the average of the existing values in the column. Works best for numerical data with a normal distribution.

Median Imputation

Replacing missing values with the median of the existing values in the column. Suitable for numerical data that is skewed.

Mode Imputation

Replacing missing values with the most frequent value in the column. Useful for categorical data.

Fixed Value Imputation

Replacing missing values with a chosen fixed value. Applicable across all data types. Useful for nominal features where 'not answered' could be used as a fixed value.

Next or Previous Value Imputation

Using the value before or after the missing value for time series data. Takes advantage of the order of data.

Maximum or Minimum Value Imputation

Replacing missing values with the minimum or maximum value within a defined range. Useful when data must fit within specific boundaries.

IQR Outlier Detection

Values falling outside the bounds Lower Bound = Q1 - 1.5 * IQR and Upper Bound = Q3 + 1.5 * IQR (where IQR = Q3 - Q1) are considered outliers.

Isolation Forest

An algorithm that detects anomalies by random partitioning: data points that require fewer splits in a tree structure to be separated from the rest of the data are flagged as outliers.

Clustering for Noise Removal

Grouping similar data points (by distance or density); outliers either fall into small, isolated clusters or fit poorly within any cluster, which makes them easy to flag as noise.

Data Reduction

A process that reduces the volume or dimensionality of data while preserving as much information as possible.

Benefits of Data Reduction

It makes analysis more efficient, reduces storage costs, and allows algorithms to run faster with large datasets.

Dimensionality Reduction

Techniques that reduce the number of features (dimensions) in a dataset while preserving key information. This simplifies models, decreases overfitting, and speeds up processing.

Principal Component Analysis (PCA)

A statistical technique that transforms original features into uncorrelated components. It works by finding the directions of maximum variance in the data.

t-SNE (t-Distributed Stochastic Neighbor Embedding)

A non-linear method for visualizing high-dimensional data in 2D or 3D. It preserves local structure, keeping similar data points close together after reduction.

Feature Selection

The process of selecting the most relevant features, discarding irrelevant or redundant ones, to improve model performance.

Statistical Feature Selection Methods

Techniques that utilize statistics to determine the importance of features. Examples include correlation analysis, variance thresholding, and hypothesis testing.

Transform Data (PCA)

A step in PCA where the data is transformed into a new coordinate system defined by the principal components. This creates a new space that captures the most variance in the data.

Covariance Matrix

A measure of how much pairs of variables change together. A large absolute covariance indicates a strong linear relationship between features.

Eigenvectors and Eigenvalues

Eigenvectors give the directions of maximum variance in the data, and eigenvalues give the amount of variance along each of those directions; together they define the principal components in PCA.

Wrapper Methods

Techniques like forward selection, backward elimination, and recursive feature elimination (RFE) are used to test subsets of features to determine the best combination.

Embedded Methods

These methods, such as regularization (e.g., Lasso), automatically perform feature selection during the model training process.

Sampling

A technique that reduces a large dataset by selecting a smaller random sample that represents it.

Normalization (Min-Max Scaling)

Rescales data to fit within a specific range, usually between 0 and 1. It's used when features have different scales or ranges, ensuring no feature dominates due to its scale.

Standardization (z-score Normalization)

Transforms data to have a mean of 0 and a standard deviation of 1, assuming the data follows a normal distribution.

Study Notes

Data Mining and Practical Machine Learning (ICT 462-3) - Week 2: Data Preparation Techniques

  • Data preparation is critical for data mining and machine learning

  • Raw data needs transformation for effective analysis and modeling

  • Data quality directly impacts model performance

  • Data preprocessing aims to improve data quality, increase model efficiency, and ensure data compatibility

  • Data quality factors include accuracy, completeness, consistency, timeliness, believability, and interpretability

  • Data preparation includes data cleaning, data integration, data transformation, and data reduction

  • Data cleaning routines "clean" the data by:

    • Filling in missing values
    • Smoothing noisy data
    • Identifying or removing outliers
    • Resolving inconsistencies
  • Techniques for handling missing values:

    • Ignore the tuple (with caution)
    • Imputation (replacing missing values):
      • Mean imputation
      • Median imputation
      • Mode imputation
      • Fixed value imputation
      • Next or Previous Value
      • Maximum or Minimum Value
      • Missing Value Prediction (using machine learning models like KNN or random forest)
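
As a concrete illustration of the imputation options above, here is a minimal sketch with pandas and scikit-learn; the DataFrame, column names, and values are hypothetical.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 45, 38],
    "income": [50000, np.nan, 62000, 80000, np.nan],
    "city":   ["Colombo", "Kandy", None, "Colombo", "Galle"],
})

# Mean imputation: numerical data with a roughly normal distribution
df["age"] = df["age"].fillna(df["age"].mean())

# Median imputation: skewed numerical data; preserves central tendency
df["income"] = df["income"].fillna(df["income"].median())

# Mode imputation: categorical data; uses the most frequent value
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Missing value prediction: KNN fills a gap with the average of the
# nearest neighbours (numeric features only)
raw = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0], "y": [1.1, 2.0, 3.1, 3.9]})
filled = KNNImputer(n_neighbors=2).fit_transform(raw)
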
  • Handling noisy data:

    • Noise is random error or variance in measured variables
    • Sources of noise include measurement errors and processing errors
    • Outliers are data points that significantly deviate from the rest of the data
    • Smoothing techniques reduce noise
      • Binning (dividing data into intervals and replacing values within bins with representative values like mean or median)
        • Bin mean method
        • Bin median method
        • Bin boundary method
      • Regression methods
      • Filtering methods (like median filtering or low-pass filtering)
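
A minimal sketch of smoothing by binning with pandas; the price values are made up for illustration.

import pandas as pd

# Sorted, made-up price values
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition into three equal-frequency bins
bins = pd.qcut(prices, q=3)

# Bin mean method: replace each value with the mean of its bin
by_mean = prices.groupby(bins).transform("mean")

# Bin median method: replace each value with the median of its bin
by_median = prices.groupby(bins).transform("median")
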
  • Methods for outlier detection:

    • Z-score (identifies points far from the mean)
    • IQR (interquartile range)
    • Isolation Forest (isolates data points in a decision tree)
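
The three detection methods above, sketched with scipy and scikit-learn; the sample values, cutoffs, and contamination rate are illustrative choices rather than fixed rules.

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import IsolationForest

values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12,
                    13, 11, 12, 10, 11, 13, 12, 95])  # 95 is the outlier

# Z-score: flag points far from the mean (|z| > 3 is a common cutoff)
z = np.abs(stats.zscore(values))
z_outliers = values[z > 3]

# IQR: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Isolation Forest: points that need fewer random splits to isolate
# receive the anomaly label -1
labels = IsolationForest(contamination=0.1, random_state=0).fit_predict(
    values.to_frame())
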
  • Clustering can help in noise removal by grouping similar data points

  • Several clustering algorithms can detect and eliminate noisy data points

  • Points far from cluster centroids or belonging to small, isolated clusters are considered outliers
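
A minimal sketch of centroid-distance noise detection with K-Means in scikit-learn; the points and the distance cutoff are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points: two tight clusters plus one far-away noise point
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
              [40.0, 40.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to the centroid of its assigned cluster
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points far from every centroid as noise (the cutoff is arbitrary)
noise = X[dist > dist.mean() + 2 * dist.std()]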

  • Data reduction methods:

    • Dimensionality reduction (e.g., PCA, t-SNE): Reduces features while preserving essential information
    • Feature selection: Retains relevant features, discarding irrelevant or redundant ones (e.g., statistical methods, wrapper methods, embedded methods)
    • Sampling: Represents the large data set using a small random sample
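
A brief sketch of two of these reduction routes with scikit-learn, using the bundled iris data; the component and feature counts are arbitrary illustrative choices.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # standardize before PCA

# Dimensionality reduction: project onto the two directions of
# maximum variance (the principal components)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)  # variance retained per component

# Feature selection (wrapper method): recursive feature elimination
# keeps the two features that contribute most to the model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X_std, y)
print(rfe.support_)  # boolean mask over the original features
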
  • Data transformation methods:

    • Normalization (min-max scaling): Rescales data to a specific range (e.g., 0-1)
    • Standardization (z-score normalization): Transforms data to have a zero mean and unit variance
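
Both rescalings, sketched with scikit-learn on a made-up single-feature array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])

# Normalization: (x - min) / (max - min), rescales to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, gives zero mean and unit variance
X_zscore = StandardScaler().fit_transform(X)
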
  • Data discretization: Converting numeric attributes into interval or conceptual labels (e.g., age could be '0-10', '11-20', etc.)

  • Methods for discretization:

    • Binning
    • Histogram Analysis
    • Discretization by Cluster
    • Correlation methods
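
A minimal sketch of discretization by binning with pandas; the ages and interval labels are illustrative.

import pandas as pd

ages = pd.Series([3, 17, 25, 42, 68, 81])  # made-up ages

# Binning: map each numeric age onto a conceptual interval label
groups = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100],
                labels=["0-10", "11-20", "21-40", "41-60", "60+"])
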
  • Encoding categorical variables: Converting categorical (non-numeric) data into a numerical form
    • One-Hot Encoding
    • Label Encoding
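
Both encodings, sketched with pandas and scikit-learn on a hypothetical colour column:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Label encoding: one integer per category (implies an ordering,
# so use with care for nominal features)
df["colour_code"] = LabelEncoder().fit_transform(df["colour"])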
