Data Preprocessing: Why and How

Why is data preprocessing important?

No quality data, no quality mining results.

What are the reasons for data in the real world being considered 'dirty'?

Incomplete, noisy, and inconsistent.

What are the dimensions of the multi-dimensional view of data quality?

Accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility.

What comprises the majority of the work in a data mining application?

Data preparation, cleaning, and transformation.

Give an example of inconsistent data.

Age=42 but Birthday=03/07/1997: the recorded age does not match the birth date.

What may cause incorrect or misleading statistics in data mining?

Duplicate or missing data.

What are the 4 major tasks in data preprocessing according to Ahmed Sultan Al-Hegami?

Data cleaning, data integration, data reduction, data transformation

What are the tasks involved in data cleaning according to Ahmed Sultan Al-Hegami?

Fill in missing values, identify outliers and smooth out noisy data, correct inconsistent data, resolve redundancy caused by data integration

Why is data cleaning considered the number one problem in data warehousing according to Ahmed Sultan Al-Hegami?

Because real-world data is incomplete, noisy, and inconsistent, and no quality data means no quality mining results; cleaning the data therefore consumes the largest share of the warehousing effort.

What are the reasons for missing data as mentioned by Ahmed Sultan Al-Hegami?

Equipment malfunction, inconsistency with other recorded data, data not entered due to misunderstanding, certain data not considered important at the time of entry, not registering history or changes of the data

How does Ahmed Sultan Al-Hegami suggest handling missing data?

Ignore the tuple, fill in missing values manually (which is tedious and infeasible), fill in automatically with a global constant, the attribute mean, or the most probable value

What is noise in the context of data preprocessing according to Ahmed Sultan Al-Hegami?

Noise is random error or variance in a measured variable

What are the methods suggested by Ahmed Sultan Al-Hegami for handling noisy data?

Binning method, clustering, combined computer and human inspection

What is the purpose of the binning method for data smoothing as explained by Ahmed Sultan Al-Hegami?

To reduce the effect of noise by smoothing sorted values within their bins, for example by replacing them with bin means, bin medians, or bin boundaries.

How does the binning method work for data smoothing according to Ahmed Sultan Al-Hegami?

First sort data and partition into (equi-depth) bins, then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Study Notes

Importance of Data Preprocessing

  • Data preprocessing is crucial because real-world data is often "dirty" due to various reasons such as inconsistent data, missing values, and noisy data.

Dimensions of Data Quality

  • The multi-dimensional view of data quality consists of several dimensions, including accuracy, completeness, consistency, and timeliness.

Data Mining Application

  • The majority of work in a data mining application involves data preprocessing.

Inconsistent Data

  • An example of inconsistent data is a record with Age=42 and Birthday=03/07/1997, where the stored age does not match the birth date.
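
A minimal sketch of how such a mismatch can be flagged automatically, assuming the records sit in a pandas DataFrame with hypothetical column names age and birthday:

```python
import pandas as pd

# Hypothetical single-record table; the column names are illustrative only.
df = pd.DataFrame({"age": [42], "birthday": ["03/07/1997"]})

# Recompute age from the birth date and flag rows where it disagrees with
# the stored age (a one-year tolerance allows for upcoming birthdays).
born = pd.to_datetime(df["birthday"], format="%m/%d/%Y")
reference = pd.Timestamp("2024-01-01")  # assumed "today" for reproducibility
derived_age = (reference - born).dt.days // 365
df["inconsistent"] = (df["age"] - derived_age).abs() > 1
print(df)
```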

Incorrect Statistics

  • Duplicate or missing data may produce incorrect or misleading statistics in data mining.
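
As a quick sanity check before computing any statistics, duplicates and gaps can be counted up front; this is a sketch using pandas on toy data, not part of the original notes:

```python
import pandas as pd
import numpy as np

# Toy records with one duplicated row and one missing value (illustrative only).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "income": [52000, 61000, 61000, np.nan],
})

print("duplicate rows:", df.duplicated().sum())  # exact duplicate records
print("missing values per column:")
print(df.isna().sum())

# The naive mean and the de-duplicated mean differ, which is how duplicates
# (and silently skipped missing values) skew summary statistics.
print("naive mean:", df["income"].mean())
print("de-duplicated mean:", df.drop_duplicates()["income"].mean())
```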

Data Preprocessing Tasks

  • According to Ahmed Sultan Al-Hegami, the 4 major tasks in data preprocessing are data cleaning, data integration, data reduction, and data transformation.

Data Cleaning

  • Data cleaning involves handling missing values, handling noisy data, and handling inconsistent data.

Importance of Data Cleaning

  • Data cleaning is considered the number one problem in data warehousing because real-world data is incomplete, noisy, and inconsistent, and without quality data there are no quality mining results.

Reasons for Missing Data

  • According to Ahmed Sultan Al-Hegami, reasons for missing data include equipment malfunction, inconsistency with other recorded data, data not entered due to misunderstanding, data not considered important at the time of entry, and failure to register the history or changes of the data.

Handling Missing Data

  • Ahmed Sultan Al-Hegami suggests handling missing data by ignoring the tuple, filling in values manually (tedious and often infeasible), or filling in values automatically with a global constant, the attribute mean, or the most probable value.
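
A minimal sketch of these options in pandas, on toy data; the column mode is used here as a simple stand-in for the "most probable value" (the notes also allow model-based inference):

```python
import pandas as pd
import numpy as np

# Toy data with missing entries (illustrative only).
df = pd.DataFrame({
    "age":    [23, np.nan, 45, 31, np.nan],
    "branch": ["A", "B", None, "B", "B"],
})

# 1. Ignore the tuple: drop rows that contain any missing value.
dropped = df.dropna()

# 2. Fill in automatically with a global constant.
constant_filled = df.fillna({"age": -1, "branch": "unknown"})

# 3. Fill a numeric attribute with its mean.
mean_filled = df.assign(age=df["age"].fillna(df["age"].mean()))

# 4. Fill with the most probable value, approximated by the column mode.
mode_filled = df.assign(branch=df["branch"].fillna(df["branch"].mode()[0]))

print(dropped, constant_filled, mean_filled, mode_filled, sep="\n\n")
```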

Noise in Data Preprocessing

  • Noise in the context of data preprocessing refers to random errors or variances in the data.

Handling Noisy Data

  • Ahmed Sultan Al-Hegami suggests handling noisy data with the binning method, clustering (values that fall outside clusters can be treated as outliers), and combined computer and human inspection.
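
A minimal sketch of the clustering idea, assuming scikit-learn is available: DBSCAN marks points that fall in no dense cluster with the label -1, and those points become candidates for smoothing, correction, or human review.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Mostly well-clustered 1-D measurements plus two stray values (illustrative).
values = np.array([4.0, 4.2, 4.1, 3.9, 9.8, 10.1, 10.0, 25.0, -3.0]).reshape(-1, 1)

# Points DBSCAN cannot assign to any dense cluster receive the label -1;
# these are the candidate outliers to inspect by hand.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(values)
outliers = values[labels == -1].ravel()
print("cluster labels:", labels)
print("candidate outliers:", outliers)
```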

Binning Method

  • The purpose of the binning method for data smoothing is to reduce the effect of noise in the data by grouping values into ranges or bins.

Binning Method Operation

  • The binning method works by first sorting the data and partitioning it into (equi-depth) bins, then replacing each value with its bin's mean, median, or nearest bin boundary.
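
A minimal sketch of equi-depth binning and smoothing in pandas (the price values are only a sample; applying qcut to ranks keeps the bins equally populated even when values repeat):

```python
import pandas as pd

# Sorted sample prices (illustrative only).
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted values into equi-depth bins (equal count per bin).
bins = pd.qcut(prices.rank(method="first"), q=3, labels=False)

# Smooth by bin means: each value is replaced by the mean of its bin.
by_means = prices.groupby(bins).transform("mean")

# Smooth by bin boundaries: each value is replaced by the closer of its
# bin's minimum or maximum.
lo = prices.groupby(bins).transform("min")
hi = prices.groupby(bins).transform("max")
by_boundaries = lo.where((prices - lo) <= (hi - prices), hi)

print(pd.DataFrame({"price": prices, "bin": bins,
                    "by_means": by_means, "by_boundaries": by_boundaries}))
```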

Learn about the importance of data preprocessing, including data cleaning, integration, transformation, reduction, and discretization. Understand why preprocessing is necessary due to incomplete, noisy, and inconsistent real-world data.
