Data Quality Issues

Questions and Answers

What is the consequence of trusting poor quality data?

  • It becomes more expensive
  • It is difficult (correct)
  • It becomes more accurate
  • It becomes irrelevant

    According to A.T. Kearney, 10%-20% of operative costs are due to poor data quality.

    False

    What is the reason for low accuracy of numerical attributes?

    Noisy measurements, limited precision, wrong measurements, transposition of digits (when entered manually)

    According to the Data Warehouse Institute, industry and administration in the US lose $______________ annually due to poor data quality.

    600 billion USD

    What is the result of incorrect prices in inventory retail databases?

    Costs of $2.5 billion for consumers

    According to a SAS study, 80% of German companies trust their own data.

    False

    Match the following concepts with their definitions:

    • Syntactic accuracy = Entry is not in the domain
    • Semantic accuracy = Entry is in the domain but not correct

    Approximately ____________ of all hospital records contain errors.

    80%

    What is a method of predicting and assigning values using statistical or machine learning methods?

    Inserting predicted values

    Actual observed values are used in the "inserting predicted values" method.

    False

    What is the purpose of determining the quality of the data in data understanding?

    To check syntactic accuracy

    Visualization techniques can be used to find ______________________ in data.

    outliers

    Match the following steps in data understanding with their descriptions:

    • Determine the quality of the data = Check syntactic accuracy
    • Find outliers = Using visualization techniques
    • Detect and examine missing values = Check for hidden default values
    • Compare statistics with expected behavior = Check for normal distribution

    Data understanding is only necessary for confirming expected dependencies between attributes.

    False

    What is an outlier in data analysis?

    A value or data object that is far away or very different from all or most of the other data

    Incomplete records are an example of completeness violation.

    True

    What is one of the causes of outliers in data analysis?

    Data quality problems (erroneous data coming from wrong measurements or typing mistakes)

    Unbalanced data occurs when the data set is extremely biased toward one type of _________________.

    records

    Match the following data quality issues with their descriptions:

    • Timeliness = Up-to-date data
    • Completeness = Missing critical values
    • Unbalanced data = Data set is biased to one type of records

    What should be done with outliers that come from erroneous data?

    Exclude them from the analysis

    Even if the outliers are correct, they should always be included in the analysis.

    False

    Data quality problems can cause issues with the ________________ of the data.

    completeness

    What is an effect of a single extremely large outlier on the mean value?

    It leads to completely misleading values for the mean value

    Deletion of all observations with missing values is a recommended method for handling missing values.

    False

    What are some common reasons for missing values in a dataset?

    Broken sensors, refusal to answer a question, irrelevant attribute for the corresponding object

    The ______________ plot can be used to visualize the distribution of a single variable and detect outliers.

    Boxplot or histogram

    Match the following methods with their descriptions for handling outliers:

    • Deletion = Delete observations with outliers
    • Replacement = Replace observations with other values (mean, etc.)
    • Sub-sampling = Analyze outliers separately from the rest

    What is a disadvantage of deleting all observations with missing values?

    It reduces the sample size and makes the model less valid

    Outliers can be detected using cluster analysis techniques.

    True

    What is a method for replacing missing values in a dataset?

    Replace with summary statistics (mean, mode, median)

    Study Notes

    Data Quality

    • Poor quality data can be useless or even dangerous
    • Trusting poor quality data is difficult
    • Effects of data quality problems:
      • Incorrect prices in inventory retail databases: $2.5 billion in costs for consumers; 80% of barcode scan errors are to the disadvantage of consumers
      • IRS 1992: almost 100,000 tax refunds not deliverable
      • 50-80% of computerized criminal records in the U.S. were found to be inaccurate, incomplete, or ambiguous
      • US Postal Service: up to 7,000 undeliverable mailings due to incorrect addresses

    Cost of Dirty Data

    • A.T. Kearney: 25-40% of operative costs due to poor data quality
    • Data Warehouse Institute: industry and administration in the US lose $600 billion annually
    • SAS study: only 18% of German companies trust their own data
    • 80% of all hospital records contain errors

    Aspects of Data Quality

    • Syntactic accuracy: entry is not in the domain (e.g. "fmale" in gender)
    • Semantic accuracy: entry is in the domain but not correct (e.g. "John Smith" as female)
    • Completeness: an entry is missing critical values
    • Timeliness: is the available data up to date?
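
A syntactic accuracy check can be sketched as a simple domain lookup. The following is a minimal illustration, not a prescribed method: the function name `find_syntactic_errors`, the domain set, and the sample records are all hypothetical.

```python
def find_syntactic_errors(values, domain):
    """Return (index, value) pairs whose value lies outside the allowed domain."""
    return [(i, v) for i, v in enumerate(values) if v not in domain]


# Hypothetical domain and records, for illustration only.
GENDER_DOMAIN = {"female", "male"}
records = ["female", "fmale", "male", "female"]

errors = find_syntactic_errors(records, GENDER_DOMAIN)
print(errors)  # [(1, 'fmale')]
```

A check like this only catches syntactic problems. A semantic error such as "John Smith" recorded as female passes it, because "female" is a valid entry in the domain.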

    Typical Data Quality Problems

    • End of line not recognized
    • Footer/Preamble issues
    • Separation of fields (comma, semicolon, tab)
    • Incorrect values
    • Missing values
    • Incorrect format
    • Incorrect title
    • Superfluous characters
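
Two of the problems listed above, field separation and superfluous characters, can be illustrated with a small sketch. It assumes a naive heuristic (pick the separator that splits every line into the same number of fields) and hypothetical helper names and sample data. It does not handle separators inside quoted fields.

```python
def guess_delimiter(lines, candidates=(",", ";", "\t")):
    """Pick the candidate that splits every line into the same number (>1) of fields."""
    best, best_fields = None, 1
    for delim in candidates:
        counts = {len(line.split(delim)) for line in lines}
        if len(counts) == 1:  # every line has the same field count
            n = counts.pop()
            if n > best_fields:
                best, best_fields = delim, n
    return best


def clean_rows(lines, delim):
    """Split rows and strip surrounding whitespace and double quotes from each field."""
    return [[field.strip().strip('"') for field in line.split(delim)]
            for line in lines]


# Hypothetical semicolon-separated input with decimal commas and stray quotes.
raw = ['id;name;price', '1; "apple" ;0,50', '2;"pear";0,75']
delim = guess_delimiter(raw)
rows = clean_rows(raw, delim)
print(repr(delim), rows)
```

In this sample the comma is rejected because it appears only as a decimal mark and so produces inconsistent field counts, which is one way a wrong separator choice shows up in practice.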

    Outliers

    • Definition: a value or data object that is far away or very different from all or most of the other data
    • Causes: data quality problems, exceptional or unusual situations/data objects
    • Handling outliers:
      • Deletion: delete observations with outliers
      • Replacement: replace outliers with other values (mean, etc.)
      • Sub-sampling: analyze outliers separately from the rest
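
The three handling strategies can be sketched as follows. This is an illustration under stated assumptions: the helper names, the income values, and the fixed threshold used to flag outliers are hypothetical, and the replacement value chosen here is the mean of the non-outlying values.

```python
from statistics import mean


def split_outliers(values, is_outlier):
    """Sub-sampling: return (regular values, outliers) as two separate lists."""
    regular = [v for v in values if not is_outlier(v)]
    outliers = [v for v in values if is_outlier(v)]
    return regular, outliers


def replace_outliers(values, is_outlier):
    """Replacement: substitute each outlier with the mean of the regular values."""
    regular, _ = split_outliers(values, is_outlier)
    fill = mean(regular)
    return [fill if is_outlier(v) else v for v in values]


# Hypothetical data with one extreme value and a hypothetical threshold.
incomes = [30, 32, 35, 31, 34, 5000]
too_large = lambda v: v > 1000

kept, separate = split_outliers(incomes, too_large)   # deletion keeps only `kept`
replaced = replace_outliers(incomes, too_large)

print(mean(incomes))  # mean including the outlier
print(mean(kept))     # mean after deleting the outlier
```

Comparing the two printed means shows how a single extremely large value can dominate the mean, which is why erroneous outliers are usually excluded before computing it.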

    Visualizing the Distribution

    • Easy way to find outliers: visualize the distribution of variables
    • Boxplot or Histogram for one variable
    • Scatter plot for two variables
    • PCA or MDS plots for outliers in multidimensional data
    • Cluster analysis techniques: outliers are points that cannot be assigned to any cluster
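
For a single variable, the idea behind a boxplot can also be applied numerically. The sketch below assumes the common whisker convention of 1.5 times the interquartile range, which the notes do not specify, and uses hypothetical data. It relies on `statistics.quantiles` from the Python standard library (Python 3.8+).

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one suspicious value.
measurements = [12, 14, 15, 15, 16, 17, 18, 19, 95]
flagged = iqr_outliers(measurements)
print(flagged)  # [95]
```

A flagged value is only a candidate outlier. Whether it is an error or a genuine exceptional case still has to be decided from the context of the data.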

    Treatment of Missing Values

    • Deletion: delete all observations with missing values
    • Replacement:
      • Replace with summary statistics (mean, mode, median)
      • Replace with predicted values using statistical or machine learning methods
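
Deletion and replacement with summary statistics can be sketched as below. The record layout, the use of `None` as the missing-value marker, and the helper names are assumptions made for this example. Replacement with predicted values would follow the same pattern, with a fitted model supplying the fill value instead of a summary statistic.

```python
from statistics import mean, mode


def drop_missing(rows):
    """Deletion: keep only rows without any missing (None) value."""
    return [row for row in rows if None not in row.values()]


def impute(rows, column, statistic=mean):
    """Replacement: fill missing entries of `column` with a summary statistic."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = statistic(observed)
    return [{**row, column: fill if row[column] is None else row[column]}
            for row in rows]


# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "city": "Berlin"},
    {"age": None, "city": "Berlin"},
    {"age": 35, "city": None},
    {"age": 30, "city": "Hamburg"},
]

complete = drop_missing(rows)                              # 2 of 4 rows remain
filled = impute(impute(rows, "age", mean), "city", mode)   # all 4 rows remain
print(len(complete), filled)
```

The example shows the trade-off in miniature: deletion halves the sample, while replacement keeps every record at the cost of inserting values that were never observed.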

    Checklist for Data Understanding

    • Determine the quality of the data
    • Find outliers using visualization techniques
    • Detect and examine missing values
    • Discover new or confirm expected dependencies or correlations between attributes
    • Check specific application-dependent assumptions
    • Compare statistics with expected behavior
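
Parts of this checklist can be supported by a quick per-attribute profile. The sketch below is one possible starting point, not a complete procedure: the function name, the choice of missing-value markers, and the sample readings are hypothetical. A very frequent value that also sits far from the rest (here -999) can hint at a hidden default standing in for a missing value.

```python
from collections import Counter


def profile_column(values, missing=(None, "")):
    """Summarize one attribute: size, missing count, most frequent value, numeric range."""
    present = [v for v in values if v not in missing]
    if present:
        top_value, top_count = Counter(present).most_common(1)[0]
    else:
        top_value, top_count = None, 0
    numeric = [v for v in present if isinstance(v, (int, float))]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "top_value": top_value,
        "top_share": top_count / len(present) if present else 0.0,
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }


# Hypothetical sensor readings; -999 may be a hidden default for "no reading".
readings = [21.5, -999, 22.0, None, -999, -999, 21.8, 22.3]
summary = profile_column(readings)
print(summary)
```

The resulting numbers are meant to be compared with what is expected for the attribute, for example a plausible temperature range, before any modeling starts.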

    Related Documents

    Big data part 5.pdf

    Description

    This quiz explores the consequences of poor data quality, including financial losses, undeliverable tax refunds, and inaccurate criminal records.
