Data Quality Issues

Questions and Answers

What is the consequence of trusting poor quality data?

  • It becomes more expensive
  • It is difficult (correct)
  • It becomes more accurate
  • It becomes irrelevant

According to A.T. Kearney, 10%-20% of operative costs are due to poor data quality.

False (the reported figure is 25-40%)

What is the reason for low accuracy of numerical attributes?

Noisy measurements, limited precision, wrong measurements, transposition of digits (when entered manually)

According to the Data Warehouse Institute, industry and administration in the US lose $______________ annually due to poor data quality.

600 billion USD

What is the result of incorrect prices in inventory retail databases?

Costs of $2.5 billion for consumers

According to a SAS study, 80% of German companies trust their own data.

False (only 18% trust their own data)

Match the following concepts with their definitions:

  • Syntactic accuracy = Entry is not in the domain
  • Semantic accuracy = Entry is in the domain but not correct

Approximately ____________ of all hospital records contain errors.

80%

What is a method of predicting and assigning values using statistical or machine learning methods?

Inserting predicted values

The "inserting predicted values" method uses actual observed values.

False (the values are predicted using statistical or machine learning methods)

What is the purpose of determining the quality of the data in data understanding?

To check syntactic accuracy

Visualization techniques can be used to find ______________________ in data.

outliers

Match the following steps in data understanding with their descriptions:

  • Determine the quality of the data = Check syntactic accuracy
  • Find outliers = Using visualization techniques
  • Detect and examine missing values = Check for hidden default values
  • Compare statistics with expected behavior = Check for normal distribution

Data understanding is only necessary for confirming expected dependencies between attributes.

False

What is an outlier in data analysis?

A value or data object that is far away or very different from all or most of the other data

Incomplete records are an example of completeness violation.

True

What is one of the causes of outliers in data analysis?

Data quality problems (erroneous data coming from wrong measurements or typing mistakes)

Unbalanced data occurs when the data set is extremely biased toward one type of _________________.

records

Match the following data quality issues with their descriptions:

  • Timeliness = Up-to-date data
  • Completeness = Missing critical values
  • Unbalanced data = Data set is biased to one type of records

What should be done with outliers that come from erroneous data?

Exclude them from the analysis

Even if the outliers are correct, they should always be included in the analysis.

False

Data quality problems can cause issues with the ________________ of the data.

completeness

What is an effect of a single extremely large outlier on the mean value?

It leads to completely misleading values for the mean value

Deletion of all observations with missing values is a recommended method for handling missing values.

False

What are some common reasons for missing values in a dataset?

Broken sensors, refusal to answer a question, irrelevant attribute for the corresponding object

The ______________ plot can be used to visualize the distribution of a single variable and detect outliers.

Boxplot or Histogram

Match the following methods with their descriptions for handling outliers:

  • Deletion = Delete observations with outliers
  • Replacement = Replace observations with other values (mean, etc.)
  • Sub-sampling = Analyze outliers separately from the rest

What is a disadvantage of deleting all observations with missing values?

It reduces the sample size and makes the model less valid

Outliers can be detected using Cluster analysis techniques.

True

What is a method for replacing missing values in a dataset?

Replace with summary statistics (mean, mode, median)

Study Notes

Data Quality

  • Poor quality data can be useless or even dangerous
  • Trusting poor quality data is difficult
  • Effects of data quality problems:
    • Incorrect prices in inventory retail databases: $2.5 billion in costs for consumers; 80% of barcode scan errors are to the disadvantage of consumers
    • IRS 1992: almost 100,000 tax refunds not deliverable
    • 50-80% of computerized criminal records in the U.S. were found to be inaccurate, incomplete, or ambiguous
    • US Postal Service: up to 7,000 undeliverable mailings due to incorrect addresses

Cost of Dirty Data

  • A.T. Kearney: 25-40% of operative costs are due to poor data quality
  • Data Warehouse Institute: industry and administration in the US lose $600 billion annually
  • SAS study: only 18% of German companies trust their own data
  • 80% of all hospital records contain errors

Aspects of Data Quality

  • Syntactic accuracy: violated when an entry is not in the domain (e.g. "fmale" in gender)
  • Semantic accuracy: violated when an entry is in the domain but not correct (e.g. "female" recorded for John Smith)
  • Completeness: violated when an entry is missing critical values
  • Timeliness: is the available data up to date?
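
The aspects above can be turned into simple per-record checks. The following is a minimal sketch, not a definitive implementation: the field names (gender, last_updated), the domain set, and the 365-day freshness limit are all assumptions made for illustration. Semantic accuracy is deliberately left out, because an in-domain but wrong entry cannot be detected without an external source of truth.

```python
from datetime import date

# Hypothetical domain for the gender attribute (an assumption for illustration).
GENDER_DOMAIN = {"female", "male"}


def check_record(record, today, max_age_days=365):
    """Return a list of data quality issues found in one record."""
    issues = []

    gender = record.get("gender")
    if gender is None or gender == "":
        # Completeness: a critical value is missing.
        issues.append("completeness: gender is missing")
    elif gender not in GENDER_DOMAIN:
        # Syntactic accuracy: the entry is not in the domain.
        issues.append(f"syntactic accuracy: {gender!r} is not in the domain")

    updated = record.get("last_updated")
    if updated is None:
        issues.append("completeness: last_updated is missing")
    elif (today - updated).days > max_age_days:
        # Timeliness: the record is not up to date (assumed threshold).
        issues.append("timeliness: record is older than max_age_days")

    return issues


records = [
    {"name": "John Smith", "gender": "fmale", "last_updated": date(2024, 1, 10)},
    # In the domain but wrong: a semantic error that this check cannot see.
    {"name": "John Smith", "gender": "female", "last_updated": date(2020, 1, 10)},
    {"name": "Jane Doe", "gender": "", "last_updated": date(2024, 3, 1)},
]
today = date(2024, 6, 1)
report = [check_record(r, today) for r in records]
```

Here the first record is flagged for syntactic accuracy, the second only for timeliness, and the third for completeness.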

Typical Data Quality Problems

  • End of line not recognized
  • Footer/Preamble issues
  • Separation of fields (comma, semicolon, tab)
  • Incorrect values
  • Missing values
  • Incorrect format
  • Incorrect title
  • Superfluous characters
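
Several of these problems show up when reading delimited text files. As a rough sketch of the field separation problem only, the helper below (guess_delimiter is a hypothetical name, not a library function) picks the candidate separator that splits every non-empty line into the same number of fields. The sample text and the candidate list are assumptions.

```python
import csv
import io


def guess_delimiter(text, candidates=(",", ";", "\t")):
    """Pick the candidate that splits all non-empty lines into the same
    number of fields (more than one). Return None if no candidate fits."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    best, best_fields = None, 1
    for delim in candidates:
        counts = {len(row) for row in csv.reader(lines, delimiter=delim)}
        if len(counts) == 1:
            n = counts.pop()
            if n > best_fields:
                best, best_fields = delim, n
    return best


# Semicolon-separated sample with decimal commas and Windows line endings.
raw = "id;name;price\r\n1;apple;0,99\r\n2;pear;1,20\r\n"
delimiter = guess_delimiter(raw)

# newline="" leaves line-ending handling to the csv module.
rows = list(csv.reader(io.StringIO(raw, newline=""), delimiter=delimiter))
```

With this sample the comma is rejected because it yields an inconsistent field count, so the semicolon is chosen. Real files may need more robust handling (quoted fields, preambles, footers).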

Outliers

  • Definition: a value or data object that is far away or very different from all or most of the other data
  • Causes: data quality problems, exceptional or unusual situations/data objects
  • Handling outliers:
    • Deletion: delete observations with outliers
    • Replacement: replace outliers with other values (mean, etc.)
    • Sub-sampling: analyze outliers separately from the rest
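
A small sketch of the three handling options, which also shows why a single extreme value makes the mean misleading. The data and the plausibility limit of 100 used to flag the outlier are assumptions for illustration; in practice the flag would come from a detection step.

```python
from statistics import mean, median

values = [21, 23, 22, 24, 20, 950]  # 950 is a hypothetical typing mistake
UPPER_LIMIT = 100  # assumed plausibility limit for this attribute

# The single outlier pulls the mean far away; the median barely moves.
mean_with_outlier = mean(values)
median_with_outlier = median(values)

is_outlier = [v > UPPER_LIMIT for v in values]
regular = [v for v, out in zip(values, is_outlier) if not out]
outliers = [v for v, out in zip(values, is_outlier) if out]

# Deletion: drop the observations flagged as outliers.
deleted = regular

# Replacement: substitute the mean of the remaining values.
fill_value = mean(regular)
replaced = [fill_value if out else v for v, out in zip(values, is_outlier)]

# Sub-sampling: keep both groups and analyze them separately.
subsamples = {"regular": regular, "outliers": outliers}
```

Which option is appropriate depends on the cause: erroneous values are candidates for deletion or replacement, while correct but exceptional values may deserve separate analysis.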

Visualizing the Distribution

  • Easy way to find outliers: visualize the distribution of variables
  • Boxplot or Histogram for one variable
  • Scatter plot for two variables
  • PCA or MDS plots for outliers in multidimensional data
  • Cluster analysis techniques: outliers are points that cannot be assigned to any cluster
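
For a single variable, the outliers a boxplot makes visible can also be computed directly. The sketch below uses the common 1.5 × IQR whisker convention, which is an assumption here since the notes do not name a specific rule; the data is made up. Quartile definitions differ between tools, so fences may vary slightly from what a plotting library draws.

```python
from statistics import quantiles

data = [12, 14, 15, 15, 16, 17, 18, 18, 19, 21, 60]

# Quartiles using the default "exclusive" method of statistics.quantiles.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
low = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr

outliers = [x for x in data if x < low or x > high]
```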

Treatment of Missing Values

  • Deletion: delete all observations with missing values
  • Replacement:
    • Replace with summary statistics (mean, mode, median)
    • Replace with predicted values using statistical or machine learning methods
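
The options above can be sketched with the standard library. Missing values are represented as None, and the data is invented. For predicted values, a least-squares line fitted on the complete rows stands in for "statistical or machine learning methods"; it is one simple choice among many, not the method prescribed by the notes.

```python
from statistics import mean, median, mode


def fill(values, replacement):
    """Replace every missing entry (None) with the given value."""
    return [replacement if v is None else v for v in values]


ages = [34, None, 29, 41, None, 29]
observed = [a for a in ages if a is not None]

# Deletion: keep only observations without missing values.
deleted = observed

# Replacement with summary statistics.
by_mean = fill(ages, mean(observed))
by_median = fill(ages, median(observed))
by_mode = fill(ages, mode(observed))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Replacement with predicted values: y is missing in one row.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, None), (4.0, 8.0)]
complete = [(x, y) for x, y in pairs if y is not None]
slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])
predicted = [(x, y if y is not None else slope * x + intercept) for x, y in pairs]
```

Deletion shrinks the sample from six observations to four here, which illustrates why it is not recommended as a general method.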

Checklist for Data Understanding

  • Determine the quality of the data
  • Find outliers using visualization techniques
  • Detect and examine missing values
  • Discover new or confirm expected dependencies or correlations between attributes
  • Check specific application-dependent assumptions
  • Compare statistics with expected behavior
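
One part of examining missing values is looking for hidden default values, such as a placeholder number a sensor writes when it fails. The heuristic below is only a sketch under stated assumptions: it flags any value that makes up at least 30% of a numeric attribute (the threshold, the function name, and the data are all hypothetical). It is meant for attributes where exact repeats are unusual, and flagged values still need manual review.

```python
from collections import Counter


def suspicious_defaults(values, min_share=0.3):
    """Return values whose share of all entries is at least min_share."""
    counts = Counter(values)
    total = len(values)
    return sorted(v for v, c in counts.items() if c / total >= min_share)


temperatures = [21.5, -999, 22.1, -999, 20.8, 23.0, -999, 21.9]
flagged = suspicious_defaults(temperatures)

# After review, treat the flagged placeholder as a missing value.
cleaned = [None if v in flagged else v for v in temperatures]
missing_count = sum(v is None for v in cleaned)
```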

Related Documents

Big data part 5.pdf

Description

This quiz explores the consequences of poor data quality, including financial losses, incorrect tax refunds, and inaccurate criminal records.
