Module 2: Data Preprocessing 1: Improving Data Quality

Questions and Answers

Which of the following is NOT a dimension of data quality that preprocessing aims to improve?

  • Accessibility, referring to how easily different users can get access to required data. (correct)
  • Completeness, ensuring that no information is missing from the dataset.
  • Accuracy, ensuring the data is correct and free from errors.
  • Believability, reflecting the degree to which users trust the integrity of the data.

A dataset contains customer addresses, but several entries have incomplete street numbers. This is an example of which data quality issue?

  • Inconsistency
  • Incompleteness (correct)
  • Inaccuracy
  • Timeliness

Imagine that a value of -10 has been entered for the salary attribute. This is an example of:

  • Complete data
  • Intentional data
  • Noisy data (correct)
  • Consistent data

In a customer database, one table lists customer names as 'Robert' while another lists the same customer as 'Bob.' This situation exemplifies:

  • Inconsistency (correct)

Which task involves filling in missing values, smoothing noisy data, identifying outliers, and resolving inconsistencies within a dataset?

  • Data Cleaning (correct)

What is the primary goal of data transformation in the context of data preprocessing?

  • Converting data into appropriate forms to improve mining results. (correct)

What is the main purpose of data reduction techniques in data preprocessing?

  • To obtain a smaller representation of the dataset while preserving its integrity. (correct)

Which major task in data preprocessing involves merging data from various sources, such as different databases or files?

  • Data Integration (correct)

In handling missing values, when is it LEAST effective to simply ignore the sample with the missing value?

  • When the percentage of ignored samples becomes excessively high. (correct)

Which of the following methods is suitable for automatically filling in missing values in a dataset?

  • Using a global constant such as 'unknown'. (correct)

What is the purpose of binning techniques in handling noisy data?

  • To smooth data by averaging or using median values within bins. (correct)

Which of the following approaches can be used to handle noisy data by fitting data into regression functions?

  • Regression. (correct)

What is 'data scrubbing' in the context of data discrepancy detection?

  • Using simple domain knowledge to detect errors and make corrections. (correct)

What is the role of ETL tools in data migration and integration?

  • Allowing users to specify transformations through a graphical user interface. (correct)

What potential problem arises specifically during data integration when merging data from multiple sources?

  • Entity identification (correct)

Why might attribute values differ for the same real-world entity when integrating data from multiple sources?

  • Different attribute representations, scales, or units (correct)

What is a key consideration in handling redundancy during data integration?

  • Object identification and derivable data. (correct)

In the context of data integration, what does 'derivable data' refer to?

  • Data that can be inferred or calculated from other attributes in the dataset (correct)

Which statistical method can be used to detect redundancies between nominal attributes in data integration?

  • Chi-square test (correct)

How does correlation analysis help in handling redundancy during data integration?

  • By measuring how strongly one attribute implies another (correct)

In a contingency table for the chi-square test, what do the rows and columns typically represent?

  • Distinct values of the two attributes being analyzed (correct)

In the context of the chi-square test for nominal data, what does it mean if the test rejects the hypothesis?

  • It suggests a significant association between attributes A and B. (correct)

What does the covariance between two numeric attributes indicate?

  • The degree of linear relationship, positive or negative, between the attributes. (correct)

What does a covariance of zero between two random variables typically imply?

  • No linear relationship between the variables. (correct)

How is the chi-square value computed when performing the chi-square test?

  • Calculated using observed and expected frequencies in a contingency table. (correct)

What is the purpose of calculating degrees of freedom in the chi-square test?

  • To use the chi-square distribution table to determine statistical significance. (correct)

What does WEKA provide for data preprocessing?

  • A variety of tools for transforming datasets, without writing program code. (correct)

In the context of covariance, what does a positive covariance value between two stocks suggest?

  • The two stocks tend to rise or fall together. (correct)

What is the purpose of Weka's Explorer interface in the context of data mining?

  • To visually explore datasets and apply preprocessing techniques. (correct)

Data integration combines data from multiple sources into a coherent store. What task is involved in this process?

  • Schema Integration (correct)

Which of the following routines works to 'clean' the data through filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies?

  • Data cleaning. (correct)

What is the purpose of Data Transformation?

  • Convert the data into appropriate forms for better mining results. (correct)

What is the purpose of Data Reduction?

  • Get a reduced representation of the data set. (correct)

What should you use for nominal data when handling redundancy in data integration?

  • We use the X² (chi-square) test. (correct)

What can you use for numeric attributes when handling redundancy in data integration?

  • Correlation coefficient and covariance. (correct)

What do data cleaning routines work to do?

  • Clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. (correct)

What is noisy data?

  • Random error in data (correct)

What is the purpose of clustering when dealing with noisy data?

  • To detect and remove outliers. (correct)

What does Data discrepancy detection typically involve?

  • All of the above. (correct)

What is Object identification in handling redundancy in data integration?

  • The same attribute or object may have different names in different databases. (correct)

What is the meaning of independence regarding covariance?

  • Cov(A,B) = 0 (correct)

According to the content, what tasks does WEKA provide?

  • Ability to preprocess a dataset and train a model (correct)

Flashcards

Data Accuracy

Ensuring data is correct; incorrect attribute values may stem from faulty instruments or human error.

Data Completeness

Full information is available. Incomplete data may occur because values are unavailable.

Data Consistency

Data from all sources is consistent. Different user assessments or time zones can cause inconsistency.

Data Timeliness

Whether the data are up to date; time differences and delays in data syncing reduce timeliness.

Data Believability

Reflects how much the data are trusted by users.

Data Interpretability

Reflects how easily the data are understood.

Data Cleaning

Cleaning data by filling missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

Data Integration

Integrating data from multiple sources (databases, data cubes, or files).

Data Reduction

Obtains a reduced representation of the data set that is much smaller in volume but produces the same mining results

Data Transformation

Convert the data into appropriate forms for better mining results.

Incomplete Data

Lacking attribute values or containing only aggregate data.

Noisy Data

Data containing noise, errors, or outliers.

Inconsistent Data

Data containing discrepancies in codes or names.

Noise in Data

Random error in data.

Binning for Noisy Data

First sort data and partition into (equal-frequency) bins, then smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Regression for Noisy Data

Smooth by fitting the data into regression functions.

Clustering

Detect and remove outliers in data.

Combined Inspection

Detect suspicious values and check by a human.

Using Metadata

Use metadata (e.g., domain, range, dependency, distribution) to find anomalies.

Data scrubbing

Use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections.

Data Auditing

Analyzing data to discover rules and relationships in order to detect violators.

Data migration tools

Allow transformations to be specified.

ETL tools

Allow users to specify transformations through a graphical user interface.

Data Integration

Combines data from multiple sources into a coherent data store.

Schema Integration

Integrate metadata from different sources.

Entity identification

Identify real world entities from multiple data sources.

Object Identification

The same attribute or object may have different names in different databases.

Derivable Data

One attribute may be a derived attribute in another table

Correlation Analysis

Analysis of the correlation relationship between two attributes, A and B.

Contingency Table

A table in which each combination of distinct values of the two attributes has its own cell of observed counts, from which the X² value can be computed.

Chi-squared test

A measure of how strongly one attribute implies the other.

Covariance

Similar to correlation; measures how two numeric attributes vary together.

Positive Covariance

Both attributes tend to be larger than their expected values when Cov(A,B) > 0.

Negative Covariance

If A is larger than its expected value, B is likely to be smaller than its expected value.

Independence

If A and B are independent, Cov(A,B) = 0 (but the converse is not true).

WEKA Tool

Open-source tool that provides implementations of machine learning algorithms.


Study Notes

  • The key learning outcomes for the week are to recognize major tasks in data preprocessing and to perform data cleaning and integration.

Data Quality and Preprocessing

  • Preprocessing improves data quality, especially for accuracy, which refers to the correctness of data. Incorrect attribute values can arise from faulty instruments or human error.
  • Improves completeness, meaning that full information is available.
  • Improves consistency: data is the same across all sources.
  • Improves timeliness: reduces time differences and delays in data updates.
  • Improves believability: how much the data are trusted by users.
  • Improves interpretability: how easily the data are understood.

Major Tasks in Data Preprocessing

  • Data cleaning fills in missing values, smooths noisy data, identifies/removes outliers, and resolves inconsistencies.
  • Data integration integrates data from multiple sources like databases, data cubes, or files.
  • Data reduction obtains a smaller representation of the dataset that still produces the same results.
  • Data transformation converts data into appropriate forms to improve mining results.

Data Cleaning

  • Real-world data is often dirty, with potentially incorrect information due to instrument faults or human/computer errors.
  • Incomplete data lacks attribute values or contains only aggregate data. For example, Occupation="" represents missing data.
  • Noisy data contains noise, errors, or outliers, like Salary="-10".
  • Inconsistent data contains discrepancies in codes or names, e.g., Age="42" whereas Birthday="03/07/2010". Simple rule checks can flag such records (see the sketch after this list).
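
A minimal sketch of rule-based checks that flag the three kinds of dirty records above; the DataFrame, column names, and rules are illustrative assumptions, not from the lesson:

```python
import pandas as pd

# Illustrative records exhibiting the dirty-data patterns above.
df = pd.DataFrame({
    "occupation": ["engineer", ""],   # "" marks missing data (incomplete)
    "salary": [52000.0, -10.0],       # -10 is noise (an impossible value)
    "age": [35, 42],
    "birth_year": [1990, 2010],       # age 42 vs. a 2010 birthday (inconsistent)
})

incomplete = df["occupation"] == ""
noisy = df["salary"] < 0
inconsistent = (2025 - df["birth_year"]) != df["age"]  # assumes current year 2025

print(df[incomplete | noisy | inconsistent])  # flags the second record
```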

Incomplete Data

  • Data is not always available.
  • Missing data can occur because of:
    • Equipment malfunction
    • Deletion due to inconsistency with other recorded data
    • Data entry errors or misunderstanding
    • Data considered irrelevant at the time of entry
    • No record of the history or changes of the data

Handling Missing Values

  • Ignoring the sample is not effective when the percentage of samples with missing values is too high.
  • Filling in manually is tedious and often infeasible.
  • Filling in automatically (see the sketch after this list) can use:
    • A global constant, like "unknown"
    • The attribute mean, for all samples or for samples of the same class
    • The most probable value, determined with statistical models
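
A minimal sketch of the automatic-fill strategies using pandas; the DataFrame and column names are illustrative assumptions:

```python
import pandas as pd

# Hypothetical records with missing values.
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],
    "salary": [52000.0, 48000.0, None, 61000.0],
})

# Global constant for a nominal attribute.
df["occupation"] = df["occupation"].fillna("unknown")

# Attribute mean for a numeric attribute.
df["salary"] = df["salary"].fillna(df["salary"].mean())

print(df)
```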

Noisy Data

  • Noise is random error in data.
  • Incorrect attribute values arise from faulty instruments, data entry/transmission problems, technology limitations, and naming convention inconsistencies.
  • Other data problems which require cleaning include incomplete data, duplicate records, and inconsistent data.

Handling Noisy Data

  • Binning first sorts the data and partitions it into (equal-frequency) bins, then smooths by bin means, bin medians, or bin boundaries (see the sketch after this list).
  • Regression smooths the data by fitting it into regression functions.
  • Clustering detects and removes outliers.
  • Combined computer and human inspection deals with possible outliers by detecting suspicious values and checking manually.
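
A minimal sketch of equal-frequency binning with smoothing by bin means; the data values and bin count are illustrative assumptions:

```python
import numpy as np

# First sort the illustrative data (e.g., prices).
data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

# Partition into equal-frequency bins.
bins = np.array_split(data, 3)

# Smooth by bin means: every value is replaced by its bin's mean.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```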

Data Cleaning as a Process

  • Data discrepancy detection involves:
    • Utilizing metadata (domain, dependency, distribution)
    • Checking field overloading, uniqueness/consecutive/null rules.
    • Data scrubbing using domain knowledge
    • Using data auditing for outlier detection via rules and relationship analysis.
  • Data migration tools allow transformations to be specified; ETL tools let users specify transformations through a graphical user interface.
  • The cleaning process works best when iterative and interactive, using tools such as Potter's Wheel.

Data Integration

  • Data integration combines data from multiple sources into a coherent store.
  • Schema integration: integrating metadata from different sources, e.g., matching A.cust-id with B.cust-#.
  • Entity identification involves identifying real-world entities across sources.
  • Data value conflicts must be detected and resolved: attribute values for the same entity may differ across sources, possibly due to different scales such as metric vs. British units (see the sketch after this list).
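
A minimal sketch of schema integration and unit-conflict resolution with pandas; the tables, keys, and columns are illustrative assumptions:

```python
import pandas as pd

# Source A uses "cust_id" and metric units; source B uses "cust_no" and pounds.
a = pd.DataFrame({"cust_id": [1, 2], "height_cm": [170.0, 182.0]})
b = pd.DataFrame({"cust_no": [1, 2], "weight_lb": [154.0, 176.0]})

# Schema integration: map B's key onto A's naming convention.
b = b.rename(columns={"cust_no": "cust_id"})

# Resolve the unit conflict: convert pounds to kilograms.
b["weight_kg"] = b["weight_lb"] * 0.45359237

merged = a.merge(b[["cust_id", "weight_kg"]], on="cust_id")
print(merged)
```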

Handling Redundancy in Data Integration

  • Redundancy often occurs when data is integrated from multiple sources.
  • Object identification issues and derivable data can contribute to it.
  • Integration should reduce redundancies to improve mining speed and quality.

Redundancy Detection

  • Correlation analysis can be used for redundancy detection.
  • Use a chi-square (X²) test for nominal (categorical) attributes to assess their correlation, focusing on the differences between observed and expected counts to identify dependencies.
  • Apply correlation coefficients and covariance to numeric attributes to determine how values vary together, i.e., the strength and direction of their relationship (see the sketch after this list).
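
A minimal sketch of correlation-based redundancy detection for two numeric attributes; the data are illustrative assumptions:

```python
import numpy as np

# Two numeric attributes where one is derivable from the other,
# e.g., the same temperature stored in Celsius and Fahrenheit.
celsius = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
fahrenheit = 1.8 * celsius + 32

# Pearson correlation coefficient; |r| close to 1 signals redundancy.
r = np.corrcoef(celsius, fahrenheit)[0, 1]
print(f"r = {r:.3f}")  # r = 1.000, so one attribute is redundant
```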

X² Correlation Test

  • The test utilizes a contingency table that lists data tuples described by distinct values of attributes A and B.
  • X² = Σ (observed − expected)² / expected
  • The statistic tests the hypothesis that A and B are independent; the test has (r − 1) × (c − 1) degrees of freedom, and the computed value is compared against the chi-square distribution at a chosen significance level.

Correlation Example

  • In a surveyed group (n = 1500), 250 men prefer fiction and 50 prefer non-fiction, while 200 women prefer fiction and 1000 prefer non-fiction.
  • For 1 degree of freedom, the X² value needed to reject the independence hypothesis is 10.828.
  • The computed value X² = 507.93 far exceeds this, indicating a strong correlation between gender and preferred reading type (see the sketch after this list).
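
A minimal sketch reproducing this X² computation with scipy, assuming the contingency table from the example:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: men, women; columns: fiction, non-fiction (counts from the example).
observed = np.array([[250, 50],
                     [200, 1000]])

# correction=False disables Yates' continuity correction so the result
# matches the plain X² formula used in the lesson.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"X2 = {chi2:.2f}, dof = {dof}")  # X2 = 507.94 (the lesson rounds to 507.93), dof = 1
```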

Covariance

  • The covariance and correlation coefficient measure the extent to which two numeric attributes change together.
  • Cov(A,B) = Σ (a_i − mean(A))(b_i − mean(B)) / n, where n is the number of tuples.
  • A positive covariance means that if A is larger than its expected value, B tends to be larger than its expected value as well; a negative covariance means B tends to be smaller.
  • If A and B are independent, Cov(A,B) = 0, but the converse is not true.

Covariance Example

  • Two stocks (A, B) have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
  • The mean of A is 4; the mean of B is 9.6.
  • Cov(A,B) = ((2−4)(5−9.6) + (3−4)(8−9.6) + (5−4)(10−9.6) + (4−4)(11−9.6) + (6−4)(14−9.6)) / 5 = 20/5 = 4.
  • Since Cov(A,B) = 4 > 0, the two stocks tend to rise together (see the sketch after this list).
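
A minimal sketch checking this with numpy; note that np.cov defaults to the sample covariance (dividing by n − 1), so bias=True is needed to match the lesson's population formula:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # stock A prices
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B prices

# Population covariance (divide by n), matching the lesson's formula.
cov_ab = np.cov(a, b, bias=True)[0, 1]
print(cov_ab)  # 4.0
```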

Weka Tool

  • WEKA provides many algorithms, datasets, and tools for transforming datasets, with the ability to preprocess, train, and analyze without writing program code.
  • The complete Data Mining with WEKA course can be viewed at: https://www.youtube.com/user/WekaMOOC
