Questions and Answers
What is the consequence of trusting poor quality data?
- It becomes more expensive
- It is difficult (correct)
- It becomes more accurate
- It becomes irrelevant
According to A.T. Kearney, 10%-20% of operative costs are due to poor data quality.
False (B)
What is the reason for low accuracy of numerical attributes?
Noisy measurements, limited precision, wrong measurements, transposition of digits (when entered manually)
According to the Data Warehouse Institute, industry and administration in the US lose $______________ annually due to poor data quality.
What is the result of incorrect prices in inventory retail databases?
According to SAS study, 80% of German companies trust their own data.
Match the following concepts with their definitions:
Approximately ____________ of all hospital records contain errors.
What is a method of predicting and assigning values using statistical or machine learning methods?
Actual observed values are used in Inserting predicted values method.
What is the purpose of determining the quality of the data in data understanding?
Visualization techniques can be used to find ______________________ in data.
Match the following steps in data understanding with their descriptions:
Data understanding is only necessary for confirming expected dependencies between attributes.
What is an outlier in data analysis?
Incomplete records are an example of completeness violation.
What is one of the causes of outliers in data analysis?
Unbalanced data occurs when the data set is biased extremely to one type of _________________.
Match the following data quality issues with their descriptions:
What should be done with outliers that come from erroneous data?
Even if the outliers are correct, they should always be included in the analysis.
Data quality problems can cause issues with the ________________ of the data.
What is an effect of a single extremely large outlier on the mean value?
Deletion of all observations with missing values is a recommended method for handling missing values.
What are some common reasons for missing values in a dataset?
The ______________ plot can be used to visualize the distribution of a single variable and detect outliers.
Match the following methods with their descriptions for handling outliers:
What is a disadvantage of deleting all observations with missing values?
Outliers can be detected using Cluster analysis techniques.
What is a method for replacing missing values in a dataset?
Study Notes
Data Quality
- Poor quality data can be useless or even dangerous
- Trusting poor quality data is difficult
- Effects of data quality problems:
  - Incorrect prices in inventory retail databases: $2.5 billion in costs for consumers, 80% of barcode-scan-errors to the disadvantage of consumers
  - IRS 1992: almost 100,000 tax refunds not deliverable
  - 50-80% of computerized criminal records in the U.S. were found to be inaccurate, incomplete, or ambiguous
  - US-Postal Service: up to 7,000 undeliverable mailings due to incorrect addresses
Cost of Dirty Data
- A.T. Kearney: 25-40% of operative costs due to poor data quality
- Data Warehouse Institute: industry and administration in the US lose $600 billion annually
- SAS study: only 18% of German companies trust their own data
- 80% of all hospital records contain errors
Aspects of Data Quality
- Syntactic accuracy: violated when an entry is not in the domain (e.g. "fmale" in gender)
- Semantic accuracy: violated when an entry is in the domain but not correct (e.g. "John Smith" recorded as female)
- Completeness: violated when an entry is missing critical values
- Timeliness: is the available data up to date?
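The aspects above can be turned into simple per-record checks. The following is a minimal sketch, not a complete validator: the gender domain, the list of required fields, the `updated` field, and the one-year staleness limit are all assumptions made for illustration.

```python
from datetime import date

# Assumed domain and required fields; the notes only give "fmale" as a bad entry.
GENDER_DOMAIN = {"female", "male"}
REQUIRED_FIELDS = ("name", "gender", "updated")


def check_record(record, today, max_age_days=365):
    """Return a list of data quality issues found in one record (a dict)."""
    issues = []
    # Completeness: a critical value is missing or empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"completeness: missing {field}")
    # Syntactic accuracy: the entry is not in the attribute's domain.
    gender = record.get("gender")
    if gender not in (None, "") and gender not in GENDER_DOMAIN:
        issues.append(f"syntactic accuracy: {gender!r} not in domain")
    # Timeliness: the record was last updated too long ago.
    updated = record.get("updated")
    if isinstance(updated, date) and (today - updated).days > max_age_days:
        issues.append("timeliness: record is stale")
    return issues


today = date(2024, 6, 1)
typo = {"name": "John Smith", "gender": "fmale", "updated": date(2024, 1, 1)}
stale = {"name": "", "gender": "male", "updated": date(2020, 1, 1)}
wrong = {"name": "John Smith", "gender": "female", "updated": date(2024, 1, 1)}

typo_issues = check_record(typo, today)
stale_issues = check_record(stale, today)
wrong_issues = check_record(wrong, today)  # passes every check
```

Note that the last record passes all checks even if it is factually wrong: semantic accuracy cannot be verified from the domain alone and usually needs a trusted reference source or cross-checks against other attributes.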
Typical Data Quality Problems
- End of line not recognized
- Footer/Preamble issues
- Separation of fields (comma, semicolon, tab)
- Incorrect values
- Missing values
- Incorrect format
- Incorrect title
- Superfluous characters
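Field separation problems often show up when a file uses a different separator than the reader expects. The sketch below uses a deliberately simple heuristic (an assumption, not a standard method): pick the candidate separator that occurs equally often, and at least once, on every non-empty line. It ignores quoted fields, so it is only a first check.

```python
import csv
import io

CANDIDATES = [",", ";", "\t"]  # separators named in the notes


def guess_delimiter(text):
    """Return the candidate that appears the same non-zero number of times on every line."""
    lines = [line for line in text.splitlines() if line.strip()]
    for delim in CANDIDATES:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1 and counts != {0}:
            return delim
    return None  # no consistent separator found


# Made-up sample: semicolon-separated fields with decimal commas in the prices.
sample = "name;price\napple;1,20\npear;0,95\n"
delimiter = guess_delimiter(sample)
rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
```

Reading this sample with a comma as separator would split the prices instead of the fields, which is the kind of error the list above refers to.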
Outliers
- Definition: a value or data object that is far away or very different from all or most of the other data
- Causes: data quality problems, exceptional or unusual situations/data objects
- Handling outliers:
  - Deletion: delete observations with outliers
  - Replacement: replace outliers with other values (mean, etc.)
  - Sub-sampling: analyze outliers separately from the rest
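The three handling options can be sketched on a toy list of values. The data and the `is_outlier` rule are made up for illustration; in practice the criterion depends on the application. The example also shows how strongly a single extreme value shifts the mean compared with the median.

```python
from statistics import mean, median

values = [10, 12, 11, 13, 12, 500]  # made-up data; 500 plays the outlier


def is_outlier(x):
    # Hypothetical rule for this toy data only.
    return x > 100


regular = [x for x in values if not is_outlier(x)]
outliers = [x for x in values if is_outlier(x)]

# Deletion: drop the observations that contain outliers.
deleted = regular

# Replacement: substitute each outlier with the mean of the remaining values.
fill = mean(regular)
replaced = [fill if is_outlier(x) else x for x in values]

# Sub-sampling: keep both groups and analyze them separately.
groups = {"regular": regular, "outliers": outliers}

# One extreme value pulls the mean far upward; the median barely moves.
print(mean(values), median(values))
print(mean(deleted), median(deleted))
```

Which option is appropriate depends on the cause: outliers from erroneous data are candidates for deletion or replacement, while correct but unusual values may deserve a separate analysis.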
Visualizing the Distribution
- Easy way to find outliers: visualize the distribution of variables
- Boxplot or Histogram for one variable
- Scatter plot for two variables
- PCA or MDS plots for outliers in multidimensional data
- Cluster analysis techniques: outliers are points that cannot be assigned to any cluster
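A boxplot marks points beyond its whiskers as potential outliers. A common convention, assumed here and not stated in the notes, places the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below computes that rule numerically with the standard library instead of drawing the plot.

```python
from statistics import quantiles


def boxplot_outliers(data, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed whisker convention)."""
    q1, _, q3 = quantiles(data, n=4)  # quartiles, default "exclusive" method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


measurements = [10, 12, 11, 13, 12, 11, 14, 12, 95]  # made-up data
flagged = boxplot_outliers(measurements)
```

Different quartile definitions give slightly different fences, so results near the boundary can vary between tools.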
Treatment of Missing Values
- Deletion: delete all observations with missing values
- Replacement:
  - Replace with summary statistics (mean, mode, median)
  - Replace with predicted values using statistical or machine learning methods
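The options above can be compared on a small made-up data set. In this sketch `None` marks a missing value, and the "predicted value" variant uses a least-squares line fitted on the complete rows; that model choice is an assumption for illustration, and any statistical or machine learning method could take its place.

```python
from statistics import mean, median, mode

# Toy rows of (x, y); None marks a missing y value.
rows = [(1, 2.0), (2, 4.0), (3, None), (4, 8.0)]
complete = [(x, y) for x, y in rows if y is not None]

# Deletion: keep only complete observations (the data set shrinks).
deleted = complete

# Replacement with a summary statistic of the observed values.
observed = [y for _, y in complete]
mean_filled = [(x, mean(observed) if y is None else y) for x, y in rows]
median_filled = [(x, median(observed) if y is None else y) for x, y in rows]

# For categorical attributes the mode is the usual summary statistic.
colors = ["red", None, "blue", "red"]
most_common = mode([c for c in colors if c is not None])
colors_filled = [most_common if c is None else c for c in colors]


def fit_line(points):
    """Least-squares fit of y = a + b*x on complete (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


# Replacement with a predicted value from the fitted line.
a, b = fit_line(complete)
predicted_filled = [(x, a + b * x if y is None else y) for x, y in rows]
```

Deletion is the simplest option but discards the information in the remaining attributes of each dropped row; summary statistics keep the row but ignore relationships between attributes, which the predicted value tries to use.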
Checklist for Data Understanding
- Determine the quality of the data
- Find outliers using visualization techniques
- Detect and examine missing values
- Discover new or confirm expected dependencies or correlations between attributes
- Check specific application-dependent assumptions
- Compare statistics with expected behavior
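Two of the checklist items lend themselves to a short sketch: confirming an expected dependency between attributes and comparing values with expected behavior. The attribute names, data, and plausible age range below are made up for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def out_of_range(values, low, high):
    """Return the values that violate an application-specific range assumption."""
    return [v for v in values if not low <= v <= high]


height_cm = [160, 170, 180, 190]
weight_kg = [55, 65, 75, 85]
ages = [34, 51, -2, 47, 230]

r = pearson(height_cm, weight_kg)      # expected: strong positive dependency
suspicious = out_of_range(ages, 0, 120)  # assumed plausible range for human age
```

A correlation that contradicts expectations, or values outside the assumed range, is a prompt to examine the data more closely before modeling.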
Description
This quiz explores the consequences of poor data quality, including financial losses, incorrect tax refunds, and inaccurate criminal records.