Questions and Answers
What is the consequence of trusting poor quality data?
According to A.T. Kearney, 10-20% of operating costs are due to poor data quality.
False
What is the reason for low accuracy of numerical attributes?
Noisy measurements, limited precision, wrong measurements, transposition of digits (when entered manually)
According to the Data Warehouse Institute, industry and administration in the US lose $______________ annually due to poor data quality.
What is the result of incorrect prices in inventory retail databases?
According to a SAS study, 80% of German companies trust their own data.
Match the following concepts with their definitions:
Approximately ____________ of all hospital records contain errors.
What is a method of predicting and assigning values using statistical or machine learning methods?
Actual observed values are used in the method of inserting predicted values.
What is the purpose of determining the quality of the data in data understanding?
Visualization techniques can be used to find ______________________ in data.
Match the following steps in data understanding with their descriptions:
Data understanding is only necessary for confirming expected dependencies between attributes.
What is an outlier in data analysis?
Incomplete records are an example of a completeness violation.
What is one of the causes of outliers in data analysis?
Unbalanced data occurs when the data set is extremely biased toward one type of _________________.
Match the following data quality issues with their descriptions:
What should be done with outliers that come from erroneous data?
Even if the outliers are correct, they should always be included in the analysis.
Data quality problems can cause issues with the ________________ of the data.
What is an effect of a single extremely large outlier on the mean value?
Deletion of all observations with missing values is a recommended method for handling missing values.
What are some common reasons for missing values in a dataset?
The ______________ plot can be used to visualize the distribution of a single variable and detect outliers.
Match the following methods with their descriptions for handling outliers:
What is a disadvantage of deleting all observations with missing values?
Outliers can be detected using cluster analysis techniques.
What is a method for replacing missing values in a dataset?
Study Notes
Data Quality
- Poor quality data can be useless or even dangerous
- Trusting poor quality data is difficult
- Effects of data quality problems:
  - Incorrect prices in retail inventory databases: $2.5 billion in costs for consumers; 80% of barcode scan errors are to the consumers' disadvantage
  - IRS, 1992: almost 100,000 tax refunds were undeliverable
  - 50-80% of computerized criminal records in the U.S. were found to be inaccurate, incomplete, or ambiguous
  - US Postal Service: up to 7,000 undeliverable mailings due to incorrect addresses
Cost of Dirty Data
- A.T. Kearney: 25-40% of operating costs are due to poor data quality
- Data Warehouse Institute: industry and administration in the US lose $600 billion annually
- SAS study: only 18% of German companies trust their own data
- 80% of all hospital records contain errors
Aspects of Data Quality
- Syntactic accuracy: entry is not in the domain (e.g. "fmale" in gender)
- Semantic accuracy: entry is in the domain but not correct (e.g. "John Smith" as female)
- Completeness: an entry is missing critical values
- Timeliness: is the available data up to date?
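The aspects above can be turned into simple automated checks. The following is a minimal sketch, not a definitive validator: the field names, the gender domain, and the one-year freshness limit are all assumptions made for illustration. Semantic accuracy is left out on purpose, because an in-domain but wrong value (such as "John Smith" recorded as female) cannot be detected without an external reference.

```python
from datetime import date

# Hypothetical domain for the "gender" attribute (an assumption for this sketch).
GENDER_DOMAIN = {"female", "male"}

def check_record(record, today, max_age_days=365):
    """Return a list of data quality issues found in one record."""
    issues = []
    # Completeness: a required field is absent or empty.
    for field in ("name", "gender", "updated"):
        if not record.get(field):
            issues.append(f"completeness: '{field}' is missing")
    # Syntactic accuracy: the value is not in the attribute's domain.
    gender = record.get("gender")
    if gender and gender not in GENDER_DOMAIN:
        issues.append(f"syntactic accuracy: '{gender}' is not a valid gender")
    # Timeliness: the record has not been updated recently enough.
    updated = record.get("updated")
    if updated and (today - updated).days > max_age_days:
        issues.append("timeliness: record is out of date")
    return issues

# Made-up records, one per kind of violation.
today = date(2024, 1, 1)
records = [
    {"name": "John Smith", "gender": "fmale", "updated": date(2023, 6, 1)},
    {"name": "Jane Doe", "gender": "female", "updated": date(2020, 1, 1)},
    {"name": "", "gender": "male", "updated": date(2023, 12, 1)},
]
report = [check_record(r, today) for r in records]
for record, issues in zip(records, report):
    print(record["name"] or "<no name>", "->", issues)
```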
Typical Data Quality Problems
- End of line not recognized
- Footer/Preamble issues
- Separation of fields (comma, semicolon, tab)
- Incorrect values
- Missing values
- Incorrect format
- Incorrect title
- Superfluous characters
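Several of these problems show up as soon as a file is loaded. As a rough sketch, Python's standard `csv.Sniffer` can guess the field separator, and a small cleaning pass can strip superfluous whitespace and flag missing values. The sample text and column names below are made up for illustration.

```python
import csv
import io

# Made-up raw file content: semicolon-separated, with stray spaces and an empty field.
raw = "id;name;price\n1; apple ;0.50\n2;pear;\n"

# Guess the separator among the usual candidates (comma, semicolon, tab).
dialect = csv.Sniffer().sniff(raw, delimiters=",;\t")
rows = list(csv.DictReader(io.StringIO(raw), dialect=dialect))

# Strip superfluous whitespace and represent empty fields as None.
cleaned = [{key: (value.strip() or None) for key, value in row.items()} for row in rows]

# List (row index, column) pairs where a value is missing.
missing = [(i, key) for i, row in enumerate(cleaned)
           for key, value in row.items() if value is None]

print("separator:", repr(dialect.delimiter))
print("missing values at:", missing)
```

Sniffing is a heuristic, so the guessed separator should be confirmed on a sample of rows before the whole file is trusted.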
Outliers
- Definition: a value or data object that is far away or very different from all or most of the other data
- Causes: data quality problems, exceptional or unusual situations/data objects
- Handling outliers:
  - Deletion: delete observations with outliers
  - Replacement: replace outliers with other values (mean, etc.)
  - Sub-sampling: analyze outliers separately from the rest
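The three handling strategies can be sketched on a small made-up sample. The rule used to flag outliers here (a fixed upper limit of 100) is a hypothetical, domain-specific assumption, and the median of the remaining values is used as the replacement value; the notes only say "mean, etc.". The sketch also shows how strongly a single extreme value pulls the mean.

```python
from statistics import mean, median

values = [12, 15, 14, 13, 16, 15, 250]  # 250 is assumed to be an erroneous entry

def is_outlier(x, upper=100):
    """Hypothetical rule: anything above a domain-specific limit is an outlier."""
    return x > upper

regular = [x for x in values if not is_outlier(x)]
outliers = [x for x in values if is_outlier(x)]

# Deletion: drop the observations flagged as outliers.
deleted = regular

# Replacement: substitute each outlier with the median of the regular values.
fill = median(regular)
replaced = [fill if is_outlier(x) else x for x in values]

# Sub-sampling: keep both groups and analyze them separately.
groups = {"regular": regular, "outliers": outliers}

# Effect of the single outlier on the mean.
mean_before = mean(values)
mean_after = mean(deleted)
print(f"mean with outlier: {mean_before:.2f}, without: {mean_after:.2f}")
```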
Visualizing the Distribution
- Easy way to find outliers: visualize the distribution of variables
- Boxplot or histogram for one variable
- Scatter plot for two variables
- PCA or MDS plots for outliers in multidimensional data
- Cluster analysis techniques: outliers are points that cannot be assigned to any cluster
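A boxplot typically marks points beyond its whiskers as outliers. The numeric rule behind that can be sketched as follows, assuming the common 1.5 × IQR convention for the whiskers, which the notes do not state explicitly. The data is made up, and exact quartile values depend on the interpolation method used.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]  # made-up sample with one extreme value
lower, upper, outliers = iqr_outliers(data)
print(f"fences: [{lower}, {upper}], outliers: {outliers}")
```

This covers only the single-variable case; scatter plots, PCA or MDS projections, and cluster analysis are still needed for outliers that only stand out in combination with other attributes.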
Treatment of Missing Values
- Deletion: delete all observations with missing values
- Replacement:
  - Replace with summary statistics (mean, mode, median)
  - Replace with predicted values using statistical or machine learning methods
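The options above can be compared on a tiny made-up table. In this sketch the "predicted value" comes from a simple least-squares line fitted on the complete rows; that is just one possible statistical method, chosen here for brevity, and the column names are assumptions.

```python
from statistics import mean, median, mode

# Made-up observations; None marks a missing value.
rows = [
    {"size": 1, "price": 2.0, "color": "red"},
    {"size": 2, "price": 4.0, "color": "blue"},
    {"size": 3, "price": 6.0, "color": "red"},
    {"size": 4, "price": 8.0, "color": None},
    {"size": 5, "price": None, "color": "red"},
]

# Deletion: keep only complete observations (this shrinks the data set).
complete = [r for r in rows if None not in r.values()]

# Replacement with summary statistics of the observed values.
prices = [r["price"] for r in rows if r["price"] is not None]
colors = [r["color"] for r in rows if r["color"] is not None]
price_mean, price_median, color_mode = mean(prices), median(prices), mode(colors)

# Replacement with a predicted value: least-squares line price = a + b * size,
# fitted on the rows where the price is observed.
known = [(r["size"], r["price"]) for r in rows if r["price"] is not None]
mx = mean(x for x, _ in known)
my = mean(y for _, y in known)
b = sum((x - mx) * (y - my) for x, y in known) / sum((x - mx) ** 2 for x, _ in known)
a = my - b * mx

# Fill missing prices with the prediction and missing colors with the mode.
imputed = [
    {**r,
     "price": r["price"] if r["price"] is not None else a + b * r["size"],
     "color": r["color"] if r["color"] is not None else color_mode}
    for r in rows
]
print(len(complete), "complete rows out of", len(rows))
print("mean fill:", price_mean, "predicted fill:", imputed[4]["price"])
```

In this sample, deletion discards two of five rows, and the mean fill ignores the visible trend between size and price, which the predicted value follows.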
Checklist for Data Understanding
- Determine the quality of the data
- Find outliers using visualization techniques
- Detect and examine missing values
- Discover new or confirm expected dependencies or correlations between attributes
- Check specific application-dependent assumptions
- Compare statistics with expected behavior
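Parts of this checklist can be scripted. The sketch below counts missing values, compares observed values with an expected range, and computes a Pearson correlation between two attributes. The attribute names, the expected age range of 0 to 120, and the data are all assumptions for illustration; a linear correlation coefficient is only one way to look for dependencies.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def out_of_range(values, low, high):
    """Observed values that violate an expected [low, high] range."""
    return [v for v in values if v is not None and not low <= v <= high]

def count_missing(values):
    """Number of missing (None) entries."""
    return sum(v is None for v in values)

# Made-up attributes.
ages = [34, 51, -3, None, 47]
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

summary = {
    "missing_ages": count_missing(ages),
    "implausible_ages": out_of_range(ages, 0, 120),
    "corr_xy": pearson(x, y),
}
print(summary)
```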
Description
This quiz explores the consequences of poor data quality, including financial losses, incorrect tax refunds, and inaccurate criminal records.