Data Preprocessing Techniques

Questions and Answers

Which data preprocessing task involves consolidating data from multiple sources into a unified view?

  • Data integration (correct)
  • Data cleaning
  • Data transformation
  • Data reduction

What is the primary goal of data reduction techniques in data preprocessing?

  • To obtain a smaller representation of the data without compromising analytical results (correct)
  • To normalize data to a specific range
  • To increase the volume of data for better analysis
  • To eliminate all inconsistencies in the data

In data cleaning, which of the following methods is LEAST effective when a tuple has numerous missing values across multiple attributes?

  • Using the attribute mean to fill in missing values
  • Using the most probable value to fill in missing values
  • Using a global constant to fill in missing values
  • Ignoring the tuple (correct)

Which data transformation technique involves scaling data to fall within a small, specified range, such as 0.0 to 1.0?

Normalization

What is the primary purpose of 'binning' as a data smoothing technique?

To replace data values with a value representative of their immediate neighborhood

Which of the following is a common approach to handling missing data by replacing it with a predetermined label, but is NOT foolproof?

Using a global constant

How do data auditing tools contribute to the data cleaning process?

By analyzing the data to discover rules and relationships, and detecting data that violates them

In the context of data integration, what does the 'entity identification problem' primarily address?

Matching equivalent real-world entities across different data sources

What statistical measure is used to evaluate the correlation between two nominal attributes?

The chi-square test

Which of the following techniques is considered a 'lossy' data compression method?

Wavelet transforms

What initial step is crucial in Principal Component Analysis (PCA) to ensure equitable contribution from all attributes?

Normalizing the input data

Which method of attribute subset selection starts with a full set of attributes and iteratively removes the least significant ones?

Stepwise backward elimination

In data preprocessing, which technique involves modeling data to fit a straight line?

Linear regression

When using histograms for data reduction, what is the key characteristic of an equal-frequency histogram?

A uniform number of values in each bucket

When should stratified sampling be preferred over simple random sampling?

When representation from all strata must be ensured, particularly when some strata are rare

What is data cube aggregation used for?

To precompute summarized data for fast availability

Which of the following data transformation strategies aims to remove noise in data?

Smoothing

What type of data can be transformed for a mining application?

All data types

Which normalization method is LEAST affected by outliers?

Z-score normalization using mean absolute deviation

How does discretization transform numeric data?

By assigning interval or conceptual labels to the values

What differentiates supervised from unsupervised discretization techniques?

Whether or not class information is used

What is the bottom-up (merge) discretization strategy?

All continuous values are considered potential split-points; some are removed by merging neighborhood values to form intervals, and the process is applied recursively.

What do many implicit concept hierarchies have in common?

Automatically defined at the schema definition level

What does the development of declarative languages do for data cleaning?

All of the above

How can careful data integration help reduce and avoid inconsistencies?

All of the above

What does it mean if attributes A and B are said to be statistically correlated?

The hypothesis of independence can be rejected, based on the χ² test

What does selecting the "best" attribute at each node do for a decision tree?

It partitions the data

Which is more challenging, data reduction or cleaning?

Cleaning, due to inconsistencies and data errors

Why is data inspected regarding unique rules, consecutive rules, and null rules?

To catch discrepancies, inconsistencies, and errors

What is the term for what is known, or should be known, about our data?

Any of the above

Why should software routines be used to uncover null values?

All of the above

What do data scrubbing tools not help with?

The collection process

What must happen first when matching the attributes of data?

Special attention must be paid to the structure of the data

Variance is a special case of which measure when the two attributes are identical?

Covariance

Which cells contribute the most to the χ² value?

Those whose actual count is very different from the expected count

Who should determine the value if it cannot be determined by the computer?

An expert

What methods are part of the basic heuristic methods of attribute subset selection?

Techniques such as stepwise forward selection, in which the procedure starts with an empty set of attributes as the reduced set

What kind of stopping criterion can an implementer employ?

A threshold on the chosen measure, used to decide when to stop

What does a hierarchical pyramid algorithm do at each iteration?

It halves the data

Flashcards

What is data preprocessing?

Improving data quality and mining results by cleaning, integrating, reducing, and transforming data.

What are elements of data quality?

Accuracy, completeness, consistency, timeliness, believability, and interpretability.

What does data cleaning involve?

Filling missing values, smoothing noisy data, identifying/removing outliers, resolving inconsistencies.

How to handle missing values?

Ignoring tuples, manual filling, global constants, mean/median values, or probable values.

What is noisy data?

Random error or variance in a measured variable.

What is binning?

Smoothing sorted values by consulting neighborhood.

What is the data cleaning workflow?

Identify discrepancies, transform to correct, iterate.

What is metadata?

Data about data; includes data type, domain, acceptable values.

What is field overloading?

Squeezing new attribute definitions into unused parts of existing ones.

What are Data scrubbing tools?

Use domain knowledge, parsing, fuzzy matching to detect/correct.

What are Data auditing tools?

Analyzing data to discover rules, detect violations.

What are ETL tools?

Tools for users to specify transforms via GUI, often restricted.

What is data Integration?

The merging of data from multiple data stores.

What is entity identification?

Matching equivalent real-world entities from different sources.

What is correlation analysis?

Measures how strongly one attribute implies another.

How to test for correlation?

For nominal data, uses contingency tables & chi-square test.

What is Data Reduction?

Data set size reduction to volume that produces similar analytical results.

What are Data Reduction Strategies?

Dimensionality reduction and numerosity reduction.

What is dimensionality reduction?

Reducing random variables or attributes under consideration.

Dimensionality reduction methods?

Wavelet transforms and principal components analysis

What is numerosity reduction?

The process of replacing the original data volume by alternative, smaller forms of data representation.

What is lossless compression?

Data compression algorithms where original data can be exactly constructed.

What is discrete wavelet transform (DWT)?

Linear signal processing transforming data vectors into wavelet coefficients.

Wavelet Reduction?

Truncate/store; retain strongest coefficients.

Principal components analysis?

k orthonormal vectors (k ≤ n) that best represent the data.

What is Attribute subset selection?

Reduces data set size by removing irrelevant or redundant attributes/dimensions.

Attribute Selection Methods?

Stepwise forward selection and stepwise backward elimination.

Decision Tree Induction?

Build tree; attributes appearing in it form reduced subset.

What is Linear Regression?

Approximates data; data fits a straight line.

Log-linear models?

Approximates discrete multidimensional probability distributions.

Histograms?

Data division into disjoint subsets/buckets.

Clustering?

Partitions objects into similar clusters.

Sampling?

A small random subset chosen to represent the larger data set.

Sampling Methods?

Simple random sample w/o replacement (SRSWOR) or w/ replacement (SRSWR), cluster sample and stratified sample.

Data Cube Aggregation?

Data is aggregated at each level of a data cube.

What is Data Transformation?

Transform/consolidate for efficient mining, easier pattern understanding.

Transformation Types?

Smoothing, attribute construction, aggregation, normalization, discretization, and concept hierarchy generation.

Normalization?

Scales data into smaller range.

Discretization?

Maps values to interval/conceptual labels.

Study Notes

Data Preprocessing Overview

  • Real-world databases often contain noisy, missing, and inconsistent data due to their large size and heterogeneous sources.
  • Data quality depends on factors like accuracy, completeness, consistency, timeliness, believability, and interpretability.
  • Data preprocessing aims to improve data quality and mining efficiency.

Data Preprocessing Techniques

  • Data cleaning removes noise and corrects inconsistencies.
  • Data integration merges data from multiple sources into a coherent store.
  • Data reduction reduces data size by aggregation or feature elimination.
  • Data transformation converts data into forms appropriate for mining, e.g., scaling values into a small range such as 0.0 to 1.0

Data Cleaning Methods

  • Data cleaning addresses incomplete, noisy, and inconsistent data.
  • Data scrubbing tools use domain knowledge to detect and correct errors.
  • Data auditing tools analyze data to discover rules or relationships and to detect violations.
  • Discrepancy detection involves metadata analysis and the use of unique, consecutive, and null rules.

Handling Missing Values

  • Tuples with missing class labels can be ignored, but this is ineffective if many attributes are missing values.
  • Manual filling is time-consuming with large datasets
  • Using a global constant is simple, but can create the false impression of an interesting result.
  • Central tendency measures (mean/median) can fill missing values, with the median preferable for skewed data; see the sketch after this list.
  • Most probable values can be determined using regression or decision tree induction.
  • Note: missing data does not always imply an error
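
A minimal sketch of central-tendency filling with pandas; the DataFrame and its columns are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" is roughly symmetric, "income" is skewed.
df = pd.DataFrame({
    "age": [23, 35, np.nan, 41, 29],
    "income": [30_000, 42_000, 250_000, np.nan, 38_000],
})

# Mean for symmetric data, median for skewed data.
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```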

Data Smoothing Techniques

  • Binning smooths data by consulting value neighborhoods
  • Smoothing by bin means replaces each bin value with the bin's mean (sketched after this list).
  • Smoothing by bin medians replaces each bin value with the bin's median.
  • Smoothing by bin boundaries replaces values with the closest boundary value.
  • Regression conforms data values to a function.
  • Outlier analysis detects outliers using clustering.
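
A minimal sketch of smoothing by bin means with equal-frequency bins; the values are illustrative:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Sort the values, partition them into equal-frequency bins,
    and replace each value with the mean of its bin."""
    v = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(v, n_bins)
    return np.concatenate([np.full(len(b), b.mean()) for b in bins])

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# -> [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```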

Data Integration Challenges

  • Schema integration and object matching are tricky, requiring metadata analysis to avoid errors.
  • Ensure attribute functional dependencies and referential constraints match across systems.

Redundancy and Correlation Analysis

  • Redundancy occurs when attributes can be derived from others, or when naming inconsistencies appear.
  • Correlation analysis measures how strongly one attribute implies another.
  • Correlations between nominal attributes can be found using the χ² test; a sketch follows this list.
  • Correlations between numeric attributes can be found using the correlation coefficient.
  • Covariance measures the extent to which two attributes change together.
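
A minimal sketch of the χ² independence test on a hypothetical 2×2 contingency table of two nominal attributes, using scipy:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts for two nominal attributes,
# e.g. gender (rows) vs. preferred reading (columns).
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed)
# A small p-value rejects the independence hypothesis,
# i.e. the two attributes are correlated.
print(f"chi2 = {chi2:.1f}, p = {p:.3g}, dof = {dof}")
```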

Data Reduction Strategies

  • Data reduction techniques aim to obtain a smaller data representation that yields similar analytical results.
  • Dimensionality reduction reduces the number of attributes.
  • Numerosity reduction lowers data volume via parametric or nonparametric methods.
  • Data compression applies transformations for a reduced or "compressed" data representation.
  • Lossless data reduction means the original data can be reconstructed exactly from the compressed data, with no loss of information

Wavelet Transforms

  • The discrete wavelet transform (DWT) transforms a data vector into wavelet coefficients
  • Useful for reduction because the transformed data can be truncated
  • The strongest coefficients allow a compact, approximate version of the data
  • DWT is related to the discrete Fourier transform; it repeatedly smooths the data and extracts detail
  • Transforms data using a hierarchical pyramid algorithm that halves the data at each iteration (see the sketch below)
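
A minimal sketch of one pass of the (orthonormal) Haar wavelet pyramid, assuming the input length is a power of two; each pass halves the data into a smoothed half and a detail half:

```python
import numpy as np

def haar_step(x):
    """One level of the Haar pyramid: pairwise smoothed averages
    and detail differences, each half the length of the input."""
    x = np.asarray(x, dtype=float)
    smooth = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return smooth, detail

smooth, detail = haar_step([2, 2, 0, 2, 3, 5, 4, 4])
# For reduction, keep only the strongest coefficients and zero the rest.
```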

Principal Components Analysis (PCA)

  • PCA searches for k orthogonal vectors to represent data, projecting onto a smaller space
  • Input data is normalized
  • PCA computes k orthonormal vectors that provide a basis for the normalized input data
  • Principal components are sorted in decreasing order of significance
  • Weaker components carry little variance and can be discarded (see the sketch below)
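
A minimal PCA sketch via eigendecomposition of the covariance matrix; the input here is random, purely to make the example runnable:

```python
import numpy as np

def pca(X, k):
    """Normalize the attributes, compute the covariance matrix,
    and project onto the k most significant eigenvectors."""
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)   # normalize input
    cov = np.cov(Xn, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]       # k strongest components
    return Xn @ eigvecs[:, order]

X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced = pca(X, k=2)   # 5 attributes -> 2 principal components
```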

Attribute Subset Selection

  • Attribute subset selection can help reduce processing overhead
  • Involves removing redundant attributes or dimensions
  • Aims to find minimum attribute sets
  • Goal is to retain original distribution of data classes
  • Employs heuristic methods that explore reduced search spaces.

Attribute Subset Selection Methods

  • Stepwise forward selection starts with an empty set, adding the best remaining attribute at each step (sketched after this list).
  • Stepwise backward elimination starts with all attributes, removing the worst on each step.
  • Decision tree induction chooses the "best" attribute to partition data into classes.
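
A minimal greedy sketch of stepwise forward selection; `score` is a hypothetical evaluation function (for example, cross-validated accuracy of a model trained on the candidate subset), not an API from any particular library:

```python
def forward_selection(attributes, score, max_attrs):
    """Start with an empty reduced set; at each step add the best
    remaining attribute. `score` (higher is better) must accept a
    list of attribute names; score([]) should return a baseline."""
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < max_attrs:
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break                        # stopping criterion: no gain
        selected.append(best)
        remaining.remove(best)
    return selected
```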

Parametric Data Reduction

  • Regression approximates data with a fitted function and is most useful with numeric attributes; a sketch follows this list.
  • Log-linear models approximate discrete multidimensional probability distributions.
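
A minimal sketch of parametric reduction via a straight-line fit: only the two fitted parameters need to be stored in place of the raw points. The data is synthetic:

```python
import numpy as np

# Synthetic, roughly linear data.
x = np.arange(10, dtype=float)
y = 3.0 * x + 1.0 + np.random.default_rng(1).normal(scale=0.5, size=10)

w, b = np.polyfit(x, y, deg=1)    # least-squares fit y = wx + b
print(f"y ~ {w:.2f}x + {b:.2f}")  # store (w, b) instead of the points
```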

Nonparametric Data Reduction

  • Parametric methods store model parameters, while nonparametric methods store reduced representations of the data.
  • Histograms use binning to approximate data distributions by partitioning data into buckets of value ranges.
  • Clustering groups similar data tuples into clusters, replacing the data with cluster representations.
  • Sampling selects a random subset of the data, trading accuracy for speed (see the sketch below).
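
A minimal sketch of simple random and stratified sampling with numpy; the data and strata labels are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1000)

srswor = rng.choice(data, size=50, replace=False)  # without replacement
srswr = rng.choice(data, size=50, replace=True)    # with replacement

# Stratified sampling: draw within each stratum so that rare groups
# stay represented (here, three hypothetical strata).
strata = rng.integers(0, 3, size=data.size)
stratified = np.concatenate([
    rng.choice(data[strata == s], size=10, replace=False)
    for s in np.unique(strata)
])
```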

Data Transformation

  • Data transformation strategies include smoothing, attribute construction, aggregation, normalization, and discretization
  • Normalization scales attribute data to a smaller range
  • Discretization replaces raw values with interval or conceptual labels, which is crucial for mining.

Data Transformation; Normalization

  • Normalization prevents attributes with larger ranges from dominating
  • Min-max normalization transforms data to a range, like [0,1]
  • Z-score normalization standardizes values using the attribute's mean and standard deviation (see the sketch after this list)
  • Decimal scaling normalizes by moving the decimal point
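
A minimal numpy sketch of the three normalization methods on illustrative values:

```python
import numpy as np

v = np.array([200., 300., 400., 600., 1000.])

# Min-max normalization to [0, 1]
minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization (mean and standard deviation)
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, the smallest power of ten
# that brings every |value| below 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10**j
```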

Discretization Methods for Numeric Data

  • Binning discretizes by equal-width or equal-frequency intervals, as sketched after this list.
  • Histogram analysis partitions values into disjoint ranges called buckets.
  • Cluster analysis groups numeric attribute values into clusters.
  • Decision tree analysis selects split-points that minimize entropy for classification.
  • Correlation analysis uses measures of correlation to choose discretization intervals.
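
A minimal sketch contrasting equal-width and equal-frequency binning with pandas; the values are illustrative:

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25,
                  30, 33, 35, 36, 40, 45, 46, 52, 70])

equal_width = pd.cut(ages, bins=3)   # intervals of equal width
equal_freq = pd.qcut(ages, q=3)      # intervals with equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```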

Concept Hierarchies for Nominal Data

  • Nominal data may involve many attributes, among which experts may need to pick and choose
  • Concept hierarchies may need to be expressed explicitly at the schema level
  • Concept hierarchies may be manually specified

Hierarchical Order of Attributes

  • High-level concepts have fewer distinct values than lower-level concepts
  • A complete hierarchy will have tightly semantically coupled attributes
