Data Preprocessing Techniques

Questions and Answers

Which data preprocessing task involves consolidating data from multiple sources into a unified view?

  • Data integration (correct)
  • Data cleaning
  • Data transformation
  • Data reduction

What is the primary goal of data reduction techniques in data preprocessing?

  • To obtain a smaller representation of the data without compromising analytical results (correct)
  • To normalize data to a specific range
  • To increase the volume of data for better analysis
  • To eliminate all inconsistencies in the data

In data cleaning, which of the following methods is LEAST effective when a tuple has numerous missing values across multiple attributes?

  • Using the attribute mean to fill in missing values
  • Using the most probable value to fill in missing values
  • Using a global constant to fill in missing values
  • Ignoring the tuple (correct)

Which data transformation technique involves scaling data to fall within a small, specified range, such as 0.0 to 1.0?

Normalization

What is the primary purpose of 'binning' as a data smoothing technique?

To replace data values with a value representative of their immediate neighborhood

Which of the following is a common approach to handling missing data by replacing it with a predetermined label, but is NOT foolproof?

Using a global constant

How do data auditing tools contribute to the data cleaning process?

By analyzing the data to discover rules and relationships, and detecting data that violates them

In the context of data integration, what does the 'entity identification problem' primarily address?

Matching equivalent real-world entities across different data sources

What statistical measure is used to evaluate the correlation between two nominal attributes?

The chi-square test

Which of the following techniques is considered a 'lossy' data compression method?

Wavelet transforms

What initial step is crucial in Principal Component Analysis (PCA) to ensure equitable contribution from all attributes?

Normalizing the input data

Which method of attribute subset selection starts with a full set of attributes and iteratively removes the least significant ones?

Stepwise backward elimination

In data preprocessing, which technique involves modeling data to fit a straight line?

Linear regression

When using histograms for data reduction, what is the key characteristic of an equal-frequency histogram?

A uniform number of values in each bucket

When should stratified sampling be preferred over simple random sampling?

When representation from all strata must be ensured, particularly when some strata are rare

What is data cube aggregation used for?

To precompute summarized data for fast availability

Which of the following data transformation strategies aims to remove noise in data?

Smoothing

What type of data can be transformed for a mining application?

All data types

Which normalization method is LEAST affected by outliers?

Z-score normalization using mean absolute deviation

How does discretization transform numeric data?

By assigning interval or conceptual labels to the values

What differentiates supervised from unsupervised discretization techniques?

Whether or not class information is used

What is the bottom-up (merge) discretization strategy?

All continuous values are considered potential split-points; some are removed by merging neighborhood values to form intervals, and the process is applied recursively.

What do many implicit concept hierarchies have in common?

Automatically defined at the schema definition level

What does the development of declarative languages do for data cleaning?

All of the above

How can careful data integration help reduce and avoid inconsistencies?

All of the above

What does it mean if attributes A and B are said to be statistically correlated?

The hypothesis of independence can be rejected, based on the χ² test

What does selecting the "best" attribute at each node do for a decision tree?

It partitions the data

Which is more challenging, data reduction or cleaning?

Cleaning, due to inconsistencies and data errors

Why is data inspected regarding unique rules, consecutive rules, and null rules?

To catch discrepancies, inconsistencies, and errors

What is the term for what is known, or should be known, about our data?

Any of the above

Why should software routines be used to uncover null values?

All of the above

What do data scrubbing tools not help with?

The collection process

What must happen first when matching the attributes of data?

Special attention must be paid to the structure of the data

Variance is a special case of which measure when the two attributes are identical?

Covariance

Which cells contribute the most to the χ² value?

Those whose actual count is very different from the expected count

Who should determine the value if it cannot be determined by the computer?

An expert

What methods are part of the basic heuristic methods of attribute subset selection?

Techniques such as stepwise forward selection, in which the procedure starts with an empty set of attributes as the reduced set

What kind of stopping criterion can an implementer employ?

A threshold on the chosen measure, used to decide when to stop

What does a hierarchical pyramid algorithm do at each iteration?

It halves the data

Flashcards

What is data preprocessing?

Improving data quality and mining results by cleaning, integrating, reducing, and transforming data.

What are elements of data quality?

Accuracy, completeness, consistency, timeliness, believability, and interpretability.

What does data cleaning involve?

Filling missing values, smoothing noisy data, identifying/removing outliers, resolving inconsistencies.

How to handle missing values?

Ignoring tuples, manual filling, global constants, mean/median values, or probable values.

What is noisy data?

Random error or variance in a measured variable.

What is binning?

Smoothing sorted values by consulting neighborhood.

What is the data cleaning workflow?

Identify discrepancies, transform to correct, iterate.

What is metadata?

Data about data; includes data type, domain, acceptable values.

What is field overloading?

Squeezing new attribute definitions into unused parts of existing ones.

What are Data scrubbing tools?

Use domain knowledge, parsing, fuzzy matching to detect/correct.

What are Data auditing tools?

Analyzing data to discover rules, detect violations.

What are ETL tools?

Tools for users to specify transforms via GUI, often restricted.

What is data Integration?

The merging of data from multiple data stores.

What is entity identification?

Matching equivalent real-world entities from different sources.

What is correlation analysis?

Measures how strongly one attribute implies another.

How to test for correlation?

For nominal data, uses contingency tables & chi-square test.

What is Data Reduction?

Data set size reduction to volume that produces similar analytical results.

What are Data Reduction Strategies?

Dimensionality reduction and numerosity reduction.

What is dimensionality reduction?

Reducing random variables or attributes under consideration.

Dimensionality reduction methods?

Wavelet transforms and principal components analysis

What is numerosity reduction?

The process of replacing the original data volume by alternative, smaller forms of data representation.

What is lossless compression?

Data compression algorithms where original data can be exactly constructed.

What is discrete wavelet transform (DWT)?

Linear signal processing transforming data vectors into wavelet coefficients.

Wavelet Reduction?

Truncate/store; retain strongest coefficients.

Principal components analysis?

k orthonormal vectors (k ≤ n) that best represent the data.

What is Attribute subset selection?

Reduces data set size by removing irrelevant or redundant attributes/dimensions.

Attribute Selection Methods?

Stepwise forward selection and stepwise backward elimination.

Decision Tree Induction?

Build tree; attributes appearing in it form reduced subset.

What is Linear Regression?

Approximates data; data fits a straight line.

Log-linear models?

Approximates discrete multidimensional probability distributions.

Histograms?

Data division into disjoint subsets/buckets.

Clustering?

Partitions objects into similar clusters.

Sampling?

A small random subset chosen to represent the larger data set.

Sampling Methods?

Simple random sample w/o replacement (SRSWOR) or w/ replacement (SRSWR), cluster sample and stratified sample.

Data Cube Aggregation?

Data is aggregated at each level of a data cube.

What is Data Transformation?

Transform/consolidate for efficient mining, easier pattern understanding.

Transformation Types?

Smoothing, attribute construction, aggregation, normalization, discretization, and concept hierarchy generation.

Normalization?

Scales data into smaller range.

Discretization?

Maps values to interval/conceptual labels.

Study Notes

Data Preprocessing Overview

  • Real-world databases often contain noisy, missing, and inconsistent data due to their large size and heterogeneous sources.
  • Data quality depends on factors like accuracy, completeness, consistency, timeliness, believability, and interpretability.
  • Data preprocessing aims to improve data quality and mining efficiency.

Data Preprocessing Techniques

  • Data cleaning removes noise and corrects inconsistencies.
  • Data integration merges data from multiple sources into a coherent store.
  • Data reduction reduces data size by aggregation or feature elimination.
  • Data transformation converts data into forms appropriate for mining, e.g., scaling values into a small range such as 0.0 to 1.0

Data Cleaning Methods

  • Data cleaning addresses incomplete, noisy, and inconsistent data.
  • Data scrubbing tools use domain knowledge to detect and correct errors.
  • Data auditing tools analyze data to discover rules or relationships and to detect violations.
  • Discrepancy detection involves metadata analysis and the use of unique, consecutive, and null rules.

Handling Missing Values

  • Tuples with missing class labels can be ignored, but this is ineffective if many attributes are missing values.
  • Manual filling is time-consuming with large datasets
  • Using a global constant is simple, but can create the false impression of an interesting result.
  • Central tendency measures (mean/median) can fill missing values, with the median preferable for skewed data; see the sketch after this list.
  • Most probable values can be determined using regression or decision tree induction.
  • Note: missing data does not always imply an error
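
A minimal sketch of central-tendency filling with pandas; the DataFrame and its columns are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" is roughly symmetric, "income" is skewed.
df = pd.DataFrame({
    "age": [23, 35, np.nan, 41, 29],
    "income": [30_000, 42_000, 250_000, np.nan, 38_000],
})

# Mean for symmetric data, median for skewed data.
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```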

Data Smoothing Techniques

  • Binning smooths data by consulting value neighborhoods
  • Smoothing by bin means replaces each bin value with the bin's mean (sketched after this list).
  • Smoothing by bin medians replaces each bin value with the bin's median.
  • Smoothing by bin boundaries replaces values with the closest boundary value.
  • Regression conforms data values to a function.
  • Outlier analysis detects outliers using clustering.
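
A minimal sketch of smoothing by bin means with equal-frequency bins; the values are illustrative:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Sort the values, partition them into equal-frequency bins,
    and replace each value with the mean of its bin."""
    v = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(v, n_bins)
    return np.concatenate([np.full(len(b), b.mean()) for b in bins])

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# -> [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```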

Data Integration Challenges

  • Schema integration and object matching are tricky, requiring metadata analysis to avoid errors.
  • Ensure attribute functional dependencies and referential constraints match across systems.

Redundancy and Correlation Analysis

  • Redundancy occurs when attributes can be derived from others, or when naming inconsistencies appear.
  • Correlation analysis measures how strongly one attribute implies another.
  • Correlations between nominal attributes can be found using the χ² test; a sketch follows this list.
  • Correlations between numeric attributes can be found using the correlation coefficient.
  • Covariance measures the extent to which two attributes change together.
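
A minimal sketch of the χ² independence test on a hypothetical 2×2 contingency table of two nominal attributes, using scipy:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts for two nominal attributes,
# e.g. gender (rows) vs. preferred reading (columns).
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed)
# A small p-value rejects the independence hypothesis,
# i.e. the two attributes are correlated.
print(f"chi2 = {chi2:.1f}, p = {p:.3g}, dof = {dof}")
```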

Data Reduction Strategies

  • Data reduction techniques aim to obtain a smaller data representation that yields similar analytical results.
  • Dimensionality reduction reduces the number of attributes.
  • Numerosity reduction lowers data volume via parametric or nonparametric methods.
  • Data compression applies transformations for a reduced or "compressed" data representation.
  • Lossless data reduction means the original data can be reconstructed exactly from the compressed data, with no loss of information

Wavelet Transforms

  • The discrete wavelet transform (DWT) transforms a data vector into wavelet coefficients
  • Useful for reduction because the transformed data can be truncated
  • The strongest coefficients allow a compact, approximate version of the data
  • DWT is related to the discrete Fourier transform; it repeatedly smooths the data and extracts detail
  • Transforms data using a hierarchical pyramid algorithm that halves the data at each iteration (see the sketch below)
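
A minimal sketch of one pass of the (orthonormal) Haar wavelet pyramid, assuming the input length is a power of two; each pass halves the data into a smoothed half and a detail half:

```python
import numpy as np

def haar_step(x):
    """One level of the Haar pyramid: pairwise smoothed averages
    and detail differences, each half the length of the input."""
    x = np.asarray(x, dtype=float)
    smooth = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return smooth, detail

smooth, detail = haar_step([2, 2, 0, 2, 3, 5, 4, 4])
# For reduction, keep only the strongest coefficients and zero the rest.
```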

Principal Components Analysis (PCA)

  • PCA searches for k orthogonal vectors to represent data, projecting onto a smaller space
  • Input data is normalized
  • PCA computes k orthonormal vectors that provide a basis for the normalized input data
  • Principal components are sorted in decreasing order of significance
  • Weaker components carry little variance and can be discarded (see the sketch below)
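
A minimal PCA sketch via eigendecomposition of the covariance matrix; the input here is random, purely to make the example runnable:

```python
import numpy as np

def pca(X, k):
    """Normalize the attributes, compute the covariance matrix,
    and project onto the k most significant eigenvectors."""
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)   # normalize input
    cov = np.cov(Xn, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]       # k strongest components
    return Xn @ eigvecs[:, order]

X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced = pca(X, k=2)   # 5 attributes -> 2 principal components
```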

Attribute Subset Selection

  • Attribute subset selection can help reduce processing overhead
  • Involves removing redundant attributes or dimensions
  • Aims to find minimum attribute sets
  • Goal is to retain original distribution of data classes
  • Employs heuristic methods that explore reduced search spaces.

Attribute Subset Selection Methods

  • Stepwise forward selection starts with an empty set, adding the best remaining attribute at each step (sketched after this list).
  • Stepwise backward elimination starts with all attributes, removing the worst on each step.
  • Decision tree induction chooses the "best" attribute to partition data into classes.
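
A minimal greedy sketch of stepwise forward selection; `score` is a hypothetical evaluation function (for example, cross-validated accuracy of a model trained on the candidate subset), not an API from any particular library:

```python
def forward_selection(attributes, score, max_attrs):
    """Start with an empty reduced set; at each step add the best
    remaining attribute. `score` (higher is better) must accept a
    list of attribute names; score([]) should return a baseline."""
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < max_attrs:
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break                        # stopping criterion: no gain
        selected.append(best)
        remaining.remove(best)
    return selected
```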

Parametric Data Reduction

  • Regression approximates data with a fitted function and is most useful with numeric attributes; a sketch follows this list.
  • Log-linear models approximate discrete multidimensional probability distributions.
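
A minimal sketch of parametric reduction via a straight-line fit: only the two fitted parameters need to be stored in place of the raw points. The data is synthetic:

```python
import numpy as np

# Synthetic, roughly linear data.
x = np.arange(10, dtype=float)
y = 3.0 * x + 1.0 + np.random.default_rng(1).normal(scale=0.5, size=10)

w, b = np.polyfit(x, y, deg=1)    # least-squares fit y = wx + b
print(f"y ~ {w:.2f}x + {b:.2f}")  # store (w, b) instead of the points
```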

Nonparametric Data Reduction

  • Parametric methods store model parameters, while nonparametric methods store reduced representations of the data.
  • Histograms use binning to approximate data distributions by partitioning data into buckets of value ranges.
  • Clustering groups similar data tuples into clusters, replacing the data with cluster representations.
  • Sampling selects a random subset of the data, trading accuracy for speed (see the sketch below).
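
A minimal sketch of simple random and stratified sampling with numpy; the data and strata labels are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1000)

srswor = rng.choice(data, size=50, replace=False)  # without replacement
srswr = rng.choice(data, size=50, replace=True)    # with replacement

# Stratified sampling: draw within each stratum so that rare groups
# stay represented (here, three hypothetical strata).
strata = rng.integers(0, 3, size=data.size)
stratified = np.concatenate([
    rng.choice(data[strata == s], size=10, replace=False)
    for s in np.unique(strata)
])
```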

Data Transformation

  • Data transformation strategies include smoothing, attribute construction, aggregation, normalization, and discretization
  • Normalization scales attribute data to a smaller range
  • Discretization replaces raw values with interval or conceptual labels, which is crucial for mining.

Data Transformation; Normalization

  • Normalization prevents attributes with larger ranges from dominating
  • Min-max normalization transforms data to a range, like [0,1]
  • Z-score normalization standardizes values using the attribute's mean and standard deviation (see the sketch after this list)
  • Decimal scaling normalizes by moving the decimal point
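
A minimal numpy sketch of the three normalization methods on illustrative values:

```python
import numpy as np

v = np.array([200., 300., 400., 600., 1000.])

# Min-max normalization to [0, 1]
minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization (mean and standard deviation)
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, the smallest power of ten
# that brings every |value| below 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10**j
```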

Discretization Methods for Numeric Data

  • Binning discretizes by equal-width or equal-frequency intervals, as sketched after this list.
  • Histogram analysis partitions values into disjoint ranges called buckets.
  • Cluster analysis groups numeric attribute values into clusters.
  • Decision tree analysis selects split-points that minimize entropy for classification.
  • Correlation analysis uses measures of correlation to choose discretization intervals.
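
A minimal sketch contrasting equal-width and equal-frequency binning with pandas; the values are illustrative:

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25,
                  30, 33, 35, 36, 40, 45, 46, 52, 70])

equal_width = pd.cut(ages, bins=3)   # intervals of equal width
equal_freq = pd.qcut(ages, q=3)      # intervals with equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```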

Concept Hierarchies for Nominal Data

  • Nominal data may involve many attributes, among which experts may need to pick and choose
  • Concept hierarchies may need to be expressed explicitly at the schema level
  • Concept hierarchies may be manually specified

Hierarchical Order of Attributes

  • High-level concepts have fewer distinct values than lower-level concepts
  • A complete hierarchy will have tightly semantically coupled attributes
