30 Questions
Data cleaning routines aim to introduce noise into the data.
False
Faulty data can be caused by errors in data transmission.
True
Disguised missing data refers to users purposely submitting incorrect data values for optional fields.
False
Data preprocessing involves tasks such as data cleaning, data integration, data reduction, and data transformation.
True
Limited buffer size is not a possible technology limitation affecting data quality.
False
Interpretability refers to how well the data can be understood.
True
Data preprocessing involves major tasks such as data cleaning, data integration, data reduction, and data summarization.
False
Data cleaning aims to add noise and introduce inconsistencies to the data.
False
Data integration involves merging data from a single source into a coherent data store.
False
Data reduction does not involve reducing data size by aggregating, eliminating redundant features, or clustering.
False
Data transformation can improve the accuracy and efficiency of mining algorithms involving time measurements.
False
Measures for data quality include accuracy, completeness, consistency, timeliness, and readability.
False
In bin means smoothing, each value in a bin is replaced by the median value of the bin.
False
Simple linear regression involves finding the 'best' line to fit multiple attributes (or variables).
False
In multiple linear regression, the model describes how the dependent variable is related to only one independent variable.
False
Outliers may be detected by clustering where similar values are organized into 'clusters'.
True
Data discrepancies can be caused by respondents not wanting to share information about themselves.
True
Data decay refers to the accurate use of data codes.
False
Discretization involves mapping the entire set of values of a given attribute to a new set of replacement values.
True
Stratified sampling draws samples from each partition with no consideration for the proportion of the data in each partition.
False
Normalization aims to expand the range of attribute values.
False
Data compression in data mining is aimed at expanding the representation of the original data.
False
Attribute construction is a method used in data transformation.
True
Simple random sampling may perform poorly with skewed data.
True
Data integration involves removing redundancies and detecting inconsistencies.
True
Data reduction includes dimensionality reduction and data expansion.
False
Normalization is a step in data transformation and data discretization processes.
True
Wavelet analysis is related to data quality enhancement in data warehouse environments.
False
Declarative data cleaning involves developing algorithms for data compression.
False
Feature extraction is a key concept in exploratory data mining.
True
Study Notes
Data Preprocessing: An Overview
- Data preprocessing is an essential step in data mining: it improves data quality and prepares the data for analysis.
- Major tasks in data preprocessing include data cleaning, data integration, data reduction, data transformation, and data discretization.
Data Quality
- Data quality refers to the degree to which data satisfies the requirements of the intended use.
- Measures of data quality include:
  - Accuracy: whether the recorded values are correct
  - Completeness: whether all required values are recorded
  - Consistency: whether the data conforms to rules and constraints
  - Timeliness: whether the data is up to date when it is needed
  - Believability: how much users trust the data to be correct
  - Interpretability: how easily the data can be understood
Reasons for Faulty Data
- Faulty data can occur due to:
  - Faulty data collection instruments
  - Human or computer errors during data entry
  - Users purposely submitting incorrect values for mandatory fields (disguised missing data)
  - Errors in data transmission
  - Technology limitations, such as limited buffer size
Data Cleaning
- Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in data.
- Techniques used in data cleaning include:
  - Filling in missing values
  - Smoothing noisy data
  - Identifying or removing outliers
  - Resolving inconsistencies
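As one illustration of filling in missing values, the sketch below replaces each missing entry with the mean of the observed values for that attribute. The function name `fill_missing_with_mean`, the use of `None` to mark a missing value, and the sample `ages` list are assumptions made for this example; mean imputation is only one of several possible strategies.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: the attribute has no observed values")
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical attribute with two missing entries.
ages = [25, None, 35, 30, None]
filled_ages = fill_missing_with_mean(ages)
print(filled_ages)
```

The mean is computed only from the values that are present, so the missing entries do not distort it.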
Handling Noisy Data
- Techniques used to handle noisy data include:
  - Binning (for example, smoothing by bin means)
  - Regression analysis
  - Outlier analysis
  - Clustering
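Two of the smoothing ideas covered in the quiz can be sketched in a few lines: smoothing by bin means (each value in a bin is replaced by the bin's mean) and simple linear regression (finding the best line to fit two attributes). The function names, the choice of equal-frequency bins, and the sample data are assumptions made for this example, not something the notes prescribe.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins of `bin_size`,
    and replace every value with the mean of its bin."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to two attributes."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Illustrative data: nine sorted prices split into three bins of three values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]

# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

Note that `smooth_by_bin_means` returns the smoothed values in sorted order rather than in the original input order. To smooth with regression, each noisy `y` would be replaced by the fitted value `slope * x + intercept`.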
Data Integration
- Data integration involves merging data from multiple sources into a coherent data store.
- Challenges in data integration include:
  - Entity identification problem
  - Removing redundancies
  - Detecting inconsistencies
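The three challenges above can be seen together in a small sketch that merges two sources. Everything here is hypothetical: the two sources, their key names (`cust_id` and `customer_number`, assumed to identify the same entity), and the `integrate` function are invented for illustration.

```python
# Two hypothetical sources describing the same customers under different key names.
crm_rows = [
    {"cust_id": 1, "city": "Boston"},
    {"cust_id": 2, "city": "Denver"},
]
billing_rows = [
    {"customer_number": 1, "city": "Boston", "balance": 120.0},
    {"customer_number": 2, "city": "Austin", "balance": 75.5},
]


def integrate(crm, billing):
    """Join the two sources on the customer key and report conflicting values."""
    # Entity identification: treat cust_id and customer_number as the same key.
    billing_by_key = {row["customer_number"]: row for row in billing}
    merged, conflicts = [], []
    for row in crm:
        key = row["cust_id"]
        other = billing_by_key.get(key, {})
        # Redundancy removal: the shared "city" attribute is stored only once.
        merged.append({
            "cust_id": key,
            "city": row["city"],
            "balance": other.get("balance"),
        })
        # Inconsistency detection: record disagreements instead of hiding them.
        if "city" in other and other["city"] != row["city"]:
            conflicts.append((key, row["city"], other["city"]))
    return merged, conflicts


merged, conflicts = integrate(crm_rows, billing_rows)
print(merged)
print(conflicts)
```

The sketch keeps the value from the first source and only reports the conflict; deciding which source to trust is a separate step that depends on the data.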
Data Reduction
- Data reduction produces a smaller representation of the data that still yields the same, or nearly the same, analytical results.
- Techniques used in data reduction include:
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
Data Transformation and Data Discretization
- Data transformation involves applying a function to the data to transform it into a more suitable form.
- Data discretization involves dividing the range of a continuous attribute into intervals.
- Techniques used in data transformation and discretization include:
  - Smoothing
  - Attribute construction
  - Aggregation
  - Normalization
  - Discretization
Data Transformation
- Data transformation involves mapping the entire set of values of a given attribute to a new set of replacement values.
- Methods used in data transformation include:
  - Smoothing
  - Attribute construction
  - Aggregation
  - Normalization
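Normalization is the easiest of these methods to show concretely, since it maps every value of an attribute to a replacement value in a smaller, specified range. The sketch below shows two common variants, min-max and z-score normalization; the notes do not name a specific variant, so both the choice of variants and the function names are assumptions for this example.

```python
from statistics import mean, pstdev


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are equal; z-score scaling is undefined")
    return [(v - mu) / sigma for v in values]


# Illustrative attribute values.
raw = [10, 20, 30, 50]
scaled = min_max_normalize(raw)
standardized = z_score_normalize(raw)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
print(standardized)
```

Min-max scaling preserves the relative spacing of the values, while z-score scaling is less sensitive to where the minimum and maximum happen to fall.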
Discretization
- Discretization involves dividing the range of a continuous attribute into intervals.
- Interval labels can then be used to replace actual data values.
- Discretization reduces the number of distinct values, which shrinks the data and can make mined patterns easier to interpret.
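A minimal sketch of this idea, assuming equal-width intervals (one of several ways to choose the interval boundaries): the range of the attribute is split into `n_bins` intervals of the same width, and each value is replaced by the label of the interval it falls into. The function name and the sample ages are hypothetical.

```python
def equal_width_labels(values, n_bins):
    """Replace each value with the label of its equal-width interval."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; cannot form intervals")
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        left = lo + index * width
        right = left + width
        # Only the last interval is closed on the right.
        closing = "]" if index == n_bins - 1 else ")"
        labels.append(f"[{left:g}, {right:g}{closing}")
    return labels


# Illustrative continuous attribute.
ages = [5, 17, 23, 38, 41, 65]
age_labels = equal_width_labels(ages, n_bins=3)
print(age_labels)
# ['[5, 25)', '[5, 25)', '[5, 25)', '[25, 45)', '[25, 45)', '[45, 65]']
```

Six distinct ages collapse into three interval labels, which is the size reduction the notes describe.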
Sampling
- Sampling involves selecting a representative subset of the data to reduce the size of the data.
- Types of sampling include:
  - Simple random sampling
  - Sampling without replacement
  - Sampling with replacement
  - Stratified sampling
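The sampling types above can be sketched with Python's standard `random` module. The helper names, the fixed seed, and the skewed customer data (90 "young" records and 10 "senior" records) are assumptions made for this example. The stratified version draws the same fraction from each partition, so the sample keeps roughly the proportions of the full data set.

```python
import random
from collections import defaultdict


def sample_without_replacement(data, n, rng):
    """Simple random sample: each record can be drawn at most once."""
    return rng.sample(data, n)


def sample_with_replacement(data, n, rng):
    """Simple random sample: a drawn record is put back and may be drawn again."""
    return [rng.choice(data) for _ in range(n)]


def stratified_sample(data, key, fraction, rng):
    """Draw the same fraction from every stratum (partition) defined by `key`."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Take at least one record so that small strata are still represented.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical skewed data: (record id, age group).
customers = [(i, "young") for i in range(90)] + [(i, "senior") for i in range(90, 100)]

wor = sample_without_replacement(customers, 10, rng)
wr = sample_with_replacement(customers, 10, rng)
strat = stratified_sample(customers, key=lambda c: c[1], fraction=0.1, rng=rng)

print(sum(1 for c in strat if c[1] == "senior"))  # 1
```

With skewed data like this, a simple random sample of 10 records can easily contain no "senior" records at all, which is why simple random sampling may perform poorly in that case while the stratified sample always includes them.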
Test your understanding of smoothing by bin means, where each value in a bin is replaced by the mean value of the bin, and of simple linear regression, which involves finding the best line to fit two attributes. This quiz covers concepts related to handling noisy data and regression analysis.