Questions and Answers
Which of the following is NOT a dimension of data quality that preprocessing aims to improve?
- Accessibility, referring to how easily different users can get access to required data. (correct)
- Completeness, ensuring that no information is missing from the dataset.
- Accuracy, ensuring the data is correct and free from errors.
- Believability, reflecting the degree to which users trust the integrity of the data.
A dataset contains customer addresses, but several entries have incomplete street numbers. This is an example of which data quality issue?
- Inconsistency
- Incompleteness (correct)
- Inaccuracy
- Timeliness
Imagine that a value of -10 has been entered for the salary attribute. This is an example of:
- Complete data
- Intentional data
- Noisy data (correct)
- Consistent data
In a customer database, one table lists customer names as 'Robert' while another lists the same customer as 'Bob.' This situation exemplifies:
Which task involves filling in missing values, smoothing noisy data, identifying outliers, and resolving inconsistencies within a dataset?
What is the primary goal of data transformation in the context of data preprocessing?
What is the main purpose of data reduction techniques in data preprocessing?
Which major task in data preprocessing involves merging data from various sources, such as different databases or files?
In handling missing values, when is it LEAST effective to simply ignore the sample with the missing value?
Which of the following methods is suitable for automatically filling in missing values in a dataset?
What is the purpose of binning techniques in handling noisy data?
Which of the following approaches can be used to handle noisy data by fitting data into regression functions?
What is 'data scrubbing' in the context of data discrepancy detection?
What is the role of ETL tools in data migration and integration?
What potential problem arises specifically during data integration when merging data from multiple sources?
Why might attribute values differ for the same real-world entity when integrating data from multiple sources?
What is a key consideration in handling redundancy during data integration?
In the context of data integration, what does 'derivable data' refer to?
Which statistical method can be used to detect redundancies between nominal attributes in data integration?
How does correlation analysis help in handling redundancy during data integration?
In a contingency table for the chi-square test, what do the rows and columns typically represent?
In the context of the chi-square test for nominal data, what does it mean if the test rejects the hypothesis?
What does the covariance between two numeric attributes indicate?
What does a covariance of zero between two random variables typically imply?
How is the chi-square value computed when performing the chi-square test?
What is the purpose of calculating degrees of freedom in the chi-square test?
What does WEKA provide for data preprocessing?
In the context of covariance, what does a positive covariance value between two stocks suggest?
What is the purpose of Weka's Explorer interface in the context of data mining?
Data integration combines data from multiple sources into a coherent store. What task is involved in this process?
Which of the following routines works to 'clean' the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies?
What is the purpose of data transformation?
What is the purpose of data reduction?
What should you use for nominal data when handling redundancy in data integration?
What can you use for numeric attributes when handling redundancy in data integration?
What do data cleaning routines work to do?
What is noisy data?
What is the purpose of clustering when dealing with noisy data?
What does data discrepancy detection typically involve?
What is object identification in handling redundancy in data integration?
What is the meaning of independence regarding covariance?
According to the content, what tasks does WEKA provide?
Flashcards
Data Accuracy
Ensuring data is correct; incorrect attribute values may stem from faulty instruments or human error.
Data Completeness
Full information is available. Incomplete data may occur because values are unavailable.
Data Consistency
Data from all sources is consistent. Different user assessments or time zones can cause inconsistency.
Data Timeliness
Data is up to date; time differences and delays reduce timeliness.
Data Believability
Reflects how much the data is trusted by users.
Data Interpretability
Reflects how easily the data are understood.
Data Cleaning
Fills in missing values, smooths noisy data, identifies or removes outliers, and resolves inconsistencies.
Data Integration
Combines data from multiple sources, such as databases, data cubes, or files, into a coherent store.
Data Reduction
Obtains a smaller representation of the dataset that still produces the same (or nearly the same) analytical results.
Data Transformation
Converts data into forms appropriate for mining, improving mining results.
Incomplete Data
Data lacking attribute values or containing only aggregate data, e.g., Occupation="".
Noisy Data
Data containing noise, errors, or outliers, e.g., Salary="-10".
Inconsistent Data
Data containing discrepancies in codes or names, e.g., Age="42" whereas Birthday="03/07/2010".
Noise in Data
Random error in the data.
Binning for Noisy Data
Sort the data, partition it into equal-frequency bins, and smooth each bin, e.g., by its mean.
Regression for Noisy Data
Smooth the data by fitting it into regression functions.
Clustering
Detect and remove outliers by grouping similar values.
Combined Inspection
The computer detects suspicious values; a human checks them manually.
Using Metadata
Detecting discrepancies using knowledge about the data, such as domain, dependency, and distribution.
Data scrubbing
Using domain knowledge to detect and correct discrepancies.
Data Auditing
Analyzing data to discover rules and relationships and detect violators (outliers).
Data migration tools
Tools that allow simple data transformations to be specified.
ETL tools
Extraction/Transformation/Loading tools that let users specify transformations, typically through a graphical interface.
Schema Integration
Integrating metadata from different sources, e.g., matching A.cust-id with B.cust-#.
Entity identification
Identifying real-world entities across multiple data sources.
Object Identification
Recognizing that the same attribute or object may appear under different names in different databases; a source of redundancy.
Derivable Data
Data whose values can be computed from other attributes; another source of redundancy.
Correlation Analysis
Detects redundant attributes: use the chi-square test for nominal data and correlation coefficients or covariance for numeric data.
Contingency Table
A table whose rows and columns list the distinct values of two attributes A and B, with counts of data tuples for each combination.
Chi-squared test
Tests the hypothesis that two nominal attributes are independent; rejecting the hypothesis means the attributes are correlated.
Covariance
Measures the extent to which two numeric attributes change together.
Positive Covariance
If A is larger than its expected value, B tends to be larger than its expected value as well.
Negative Covariance
If A is larger than its expected value, B tends to be smaller than its expected value.
Independence
If A and B are independent, their covariance is 0; the converse is not true.
WEKA Tool
A tool providing algorithms, datasets, and facilities to preprocess, train, and analyze data without writing program code.
Study Notes
- The key learning outcomes for the week are to recognize major tasks in data preprocessing and to perform data cleaning and integration.
Data Quality and Preprocessing
- Preprocessing improves data quality, especially for accuracy, which refers to the correctness of data. Incorrect attribute values can arise from faulty instruments or human error.
- Improves completeness, meaning that full information is available.
- Improves consistency: data is the same across all sources.
- Improves timeliness: data is up to date; time differences and delays reduce it.
- Improves believability: reflects how much the data is trusted by users.
- Improves interpretability: reflects how easily the data are understood.
Major Tasks in Data Preprocessing
- Data cleaning fills in missing values, smooths noisy data, identifies/removes outliers, and resolves inconsistencies.
- Data integration integrates data from multiple sources like databases, data cubes, or files.
- Data reduction obtains a smaller representation of the dataset that still produces the same results.
- Data transformation converts data into appropriate forms to improve mining results.
Data Cleaning
- Real-world data is often dirty, with potentially incorrect information due to instrument faults, or human/computer errors.
- Incomplete data lacks attribute values and may contain only aggregate data, e.g., Occupation="" represents missing data.
- Noisy data contains noise, errors, or outliers, e.g., Salary="-10".
- Inconsistent data contains discrepancies in codes or names, e.g., Age="42" whereas Birthday="03/07/2010" (a simple detection sketch follows this list).
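A minimal Python sketch of how such dirty records might be flagged; the records, field names, and thresholds below are illustrative assumptions, not from the lecture:

```python
from datetime import date

# Toy records; field names and values are made up for illustration.
records = [
    {"name": "Ann", "occupation": "", "salary": 52000, "age": 42, "birthday": date(2010, 3, 7)},
    {"name": "Bob", "occupation": "clerk", "salary": -10, "age": 35, "birthday": date(1990, 1, 15)},
]

today = date(2025, 1, 1)  # fixed reference date, for reproducibility

for r in records:
    if r["occupation"] == "":            # incomplete: missing attribute value
        print(r["name"], "-> incomplete: occupation is missing")
    if r["salary"] < 0:                  # noisy: impossible value
        print(r["name"], "-> noisy: salary =", r["salary"])
    years = (today - r["birthday"]).days // 365
    if abs(years - r["age"]) > 1:        # inconsistent: age disagrees with birthday
        print(r["name"], "-> inconsistent: age", r["age"], "vs birthday", r["birthday"])
```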
Incomplete Data
- Data is not always available.
- Missing data can occur because of:
- Equipment malfunction
- Inconsistencies with other recorded data
- Data entry errors
- Data irrelevance (the value was not considered important at entry time)
- Changes to the data not being registered (no history kept)
Handling Missing Values
- Ignore the sample: simple, but not effective when the percentage of missing values is high.
- Fill in manually: tedious and often infeasible.
- Fill in automatically (see the sketch below), using:
- A global constant, such as "unknown"
- The attribute mean, over all samples or within the same class
- The most probable value, estimated with statistical models
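A minimal pandas sketch of the three automatic fill strategies; the DataFrame and column names are assumptions for illustration:

```python
import pandas as pd

# Toy data: NaN marks the missing salary values.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "salary": [50000, None, 42000, None, 46000],
})

# Strategy 1: a global constant (a sentinel such as -1, or "unknown" for nominal data).
filled_const = df["salary"].fillna(-1)

# Strategy 2: the attribute mean over all samples.
filled_mean = df["salary"].fillna(df["salary"].mean())

# Strategy 3: the attribute mean within the same class.
filled_class_mean = df.groupby("class")["salary"].transform(lambda s: s.fillna(s.mean()))

print(filled_class_mean)  # class A gap -> 50000.0, class B gap -> 44000.0
```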
Noisy Data
- Noise is random error in data.
- Incorrect attribute values arise from faulty instruments, data entry/transmission problems, technology limitations, and naming convention inconsistencies.
- Other data problems which require cleaning include incomplete data, duplicate records, and inconsistent data.
Handling Noisy Data
- Binning first sorts the data and partitions it into equal-frequency bins, then smooths each bin, e.g., by bin means, medians, or boundaries (illustrated after this list).
- Regression smooths the data by fitting it into regression functions.
- Clustering detects and removes outliers.
- Combined computer and human inspection deals with possible outliers by detecting suspicious values and checking manually.
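A small sketch of equal-frequency binning with smoothing by bin means; the data values are an assumed textbook-style example, not from the lecture:

```python
# Equal-frequency binning, then smoothing each bin by its mean.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # assumed example values (e.g., prices)
n_bins = 3

sorted_data = sorted(data)
size = len(sorted_data) // n_bins  # assumes the data divides evenly into n_bins

for i in range(n_bins):
    bin_vals = sorted_data[i * size:(i + 1) * size]
    mean = sum(bin_vals) / len(bin_vals)
    print(f"bin {i + 1}: {bin_vals} -> smoothed: {[round(mean, 1)] * len(bin_vals)}")
```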
Data Cleaning as a Process
- Data discrepancy detection involves:
- Utilizing metadata (domain, dependency, distribution)
- Checking for field overloading and enforcing uniqueness, consecutive, and null rules (see the sketch after this list).
- Data scrubbing using domain knowledge
- Using data auditing for outlier detection via rules and relationship analysis.
- Data migration tools allow simple transformations, while ETL (Extraction/Transformation/Loading) tools let users specify transformations through a graphical interface.
- The two processes (discrepancy detection and transformation) are best integrated iteratively and interactively, e.g., with Potter's Wheel.
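A minimal sketch of the uniqueness and null rules mentioned above; the columns and rule choices are illustrative assumptions:

```python
# Toy columns; values are made up for illustration.
ids = [101, 102, 102, 104]            # should satisfy a uniqueness rule
zips = ["90210", None, "10001", ""]   # should satisfy a null rule (no blanks/None)

# Uniqueness rule: every identifier appears exactly once.
dupes = {x for x in ids if ids.count(x) > 1}
if dupes:
    print("uniqueness rule violated by:", dupes)        # {102}

# Null rule: decide how missing values are represented, then flag them.
missing_rows = [i for i, z in enumerate(zips) if z in (None, "")]
if missing_rows:
    print("null rule violated at rows:", missing_rows)  # [1, 3]
```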
Data Integration
- Combines data from multiple sources into a coherent store via data integration.
- Schema integration, e.g., matching A.cust-id with B.cust-#, requires integrating metadata from the various sources.
- Entity identification involves identifying real-world entities across sources.
- Detecting and resolving data value conflicts: attribute values for the same real-world entity may differ across sources, possibly due to different scales, like metric vs. British units (see the sketch below).
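A minimal pandas sketch of schema integration and unit-conflict resolution; the table layouts, column names, and weights are assumptions for illustration:

```python
import pandas as pd

# Source A keys customers by cust-id and stores weight in kilograms (metric).
a = pd.DataFrame({"cust_id": [1, 2], "weight_kg": [70.0, 82.5]})

# Source B keys the same customers by cust-# and stores weight in pounds (British units).
b = pd.DataFrame({"cust_num": [1, 2], "weight_lb": [154.3, 181.9]})

# Schema integration: map B's key onto A's, then convert units before merging.
b = b.rename(columns={"cust_num": "cust_id"})
b["weight_kg_from_b"] = b["weight_lb"] / 2.20462  # pounds -> kilograms
merged = a.merge(b[["cust_id", "weight_kg_from_b"]], on="cust_id")

# Conflict detection: flag entities whose values still disagree after conversion.
merged["conflict"] = (merged["weight_kg"] - merged["weight_kg_from_b"]).abs() > 0.5
print(merged)
```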
Handling Redundancy in Data Integration
- Redundancy often occurs when data is integrated from multiple sources.
- Object identification (the same attribute or object under different names) and derivable data (attributes computable from others) both contribute to it.
- Integration should reduce redundancies to improve mining speed and quality.
Redundancy Detection
- Can use correlation analysis for redundancy detection
- Use a chi-square test (X²) for nominal (categorical) attributes to assess their correlation, focusing on the differences between actual and expected counts to identify dependencies.
- Apply correlation coefficients and covariance to numeric attributes to determine how values vary across attributes, understanding the strength and direction of their relationships.
X² Correlation Test
- The test utilizes a contingency table that lists data tuples described by distinct values of attributes A and B.
- X² = Σ (observed − expected)² / expected, summed over all cells of the contingency table.
- The statistic tests the hypothesis that A and B are independent, with (r − 1) × (c − 1) degrees of freedom, evaluated at a chosen significance level.
Correlation Example
- In a group surveyed (n=1500), data reveals that 250 men prefer fiction, 50 prefer non-fiction, and 200 women prefer fiction, versus 1000 for non-fiction.
- For 1 degree of freedom, the X² value needed to reject the independence hypothesis at the 0.001 significance level is 10.828.
- The computed value X² = 507.93 far exceeds this, indicating that gender and preferred reading type are correlated (recomputed in the sketch below).
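The statistic can be recomputed from the survey counts in a few lines of pure Python (no libraries assumed):

```python
# Contingency table from the survey: rows = preferred reading, columns = gender.
observed = [[250, 200],   # fiction:     men, women
            [50, 1000]]   # non-fiction: men, women

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)  # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_tot[i] * col_tot[j] / n   # expected count under independence
        chi2 += (obs - exp) ** 2 / exp

print(round(chi2, 2))  # ~507.94 (507.93 in the notes), far above the 10.828 threshold
```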
Covariance
- The covariance and correlation coefficient measure the extent to which two numeric attributes change together.
- Covariance: Cov(A, B) = (1/n) Σ (aᵢ − mean(A)) (bᵢ − mean(B)), where n is the number of tuples.
- A positive covariance means that when A is larger than its expected value, B tends to be larger than its expected value too; a negative covariance means B tends to be smaller.
- If A and B are independent, Cov(A, B) = 0, but the converse is not true: zero covariance does not imply independence.
Covariance Example
- Two stocks (A, B) have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
- The mean of A is 4; the mean of B is 9.6.
- Cov(A, B) = 4.
- Since the covariance is greater than 0, the two stocks tend to rise together (verified in the sketch below).
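The example can be verified directly from the covariance formula (a short pure-Python check):

```python
# Weekly values for the two stocks from the example above.
a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
n = len(a)

mean_a = sum(a) / n  # 4.0
mean_b = sum(b) / n  # 9.6

cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
print(cov)  # 4.0 -> positive covariance: the stocks tend to rise together
```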
Weka Tool
- WEKA provides many algorithms, datasets, tools for transforming datasets, and the ability to preprocess, train, and analyze data without writing program code.
- The complete 'Data Mining with WEKA' course can be viewed at https://www.youtube.com/user/WekaMOOC