Data Quality and Governance Overview
48 Questions

Questions and Answers

What is a limitation of the iterative Bayesian algorithm, Accu?

  • It accounts for varying truth probabilities.
  • It allows for multiple true values for each data item.
  • It assumes sources are independent. (correct)
  • It provides a dynamic estimation of P(V is true for i).
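
For intuition, here is a minimal sketch of Accu-style vote counting under the source-independence assumption noted above. The source names, the accuracy values, and the domain-size parameter `n_false` are invented for illustration.

```python
# Minimal sketch of Accu-style voting, assuming independent sources with
# known accuracies A_s. All names and numbers here are illustrative.
import math

def accu_vote_counts(claims, accuracy, n_false=10):
    """claims: {source: value}; accuracy: {source: A_s in (0, 1)}.
    Returns a vote count per candidate value; higher means more likely true.
    n_false is the assumed number of wrong values in the domain."""
    scores = {}
    for source, value in claims.items():
        a = accuracy[source]
        # Each source's vote is weighted by ln(n_false * A_s / (1 - A_s)).
        scores[value] = scores.get(value, 0.0) + math.log(n_false * a / (1 - a))
    return scores

votes = accu_vote_counts({"s1": "Rome", "s2": "Rome", "s3": "Milan"},
                         {"s1": 0.9, "s2": 0.6, "s3": 0.7})
print(max(votes, key=votes.get))  # -> "Rome"
```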

Which type of conflict arises from using outdated information in data fusion?

  • Copying behaviors
  • Out of date information (correct)
  • Inconsistent interpretations of semantics
  • Incorrect calculations

What characterizes multi-truth problems in data fusion?

  • Different values can contribute partial truths. (correct)
  • Only one true value exists for each data object.
  • Values are either entirely correct or entirely incorrect.
  • There are no conflicting sources.

Which category of data fusion algorithms is inspired by measuring web page authority?

  • Web link based (correct)

What phase follows the value clustering in the data fusion-STORM process?

  • Authority-based Bayesian inference (correct)

What does the Bayesian based category of data fusion algorithms primarily rely on?

  • Bayesian inference (correct)

What does the term 'authored sources' refer to in the context of data fusion-STORM?

  • Sources that have been referenced by many others. (correct)

What is a necessary component of data fusion in the context of big data?

  • Addressing conflicts that arise from data. (correct)

What is the definition of an inclusion dependency (IND)?

  • R[A1, .., An] is a subset of S[B1, .., Bn], where R and S are two different relations. (correct)

Which dimension does NOT pertain to schema quality?

  • Redundancy (correct)

What characterizes a partial inclusion dependency?

  • Only a certain percentage of tuples satisfy the dependency. (correct)

In terms of trust and credibility of web data, what does data trustworthiness depend on?

  • Data provenance and the average trustworthiness of the provider. (correct)

What is the main issue with a non-probability sampling approach?

  • It does not provide a known probability of selection for each unit. (correct)

Which dimension of schema quality measures the clarity of representation?

  • Readability (correct)

What is a characteristic of n-ary inclusion dependencies?

  • They involve multiple attributes from relations R and S. (correct)

What does the trustworthiness of a data value indicate?

  • The probability that the value is correct. (correct)

What is the first step in the Analytic Hierarchy Process (AHP)?

  • Development of a goal hierarchy (correct)

Which of the following is NOT a root cause of data quality problems?

  • Standardized data formats (correct)

What does data profiling primarily aim to achieve?

  • Determine the metadata of a dataset (correct)

Which of the following describes a data-based approach to improving data quality?

  • Identifying and correcting errors in data values (correct)

In the context of data integration, data profiling supports which of the following?

  • Understanding data content before integration (correct)

What does the consistency check in the AHP process ensure?

  • Logical coherence of paired comparisons (correct)

Which activity is part of the data profiling steps?

  • User selects data to be profiled (correct)

What common issue arises from having multiple data sources?

  • Different values for the same information (correct)

What is the primary goal of data cleaning?

  • To improve data quality by eliminating errors and inconsistencies (correct)

Which of the following is NOT a step in the data cleaning process?

  • Data visualization (correct)

What does normalization involve in the context of data cleaning?

  • Remapping data into a common format (correct)

Which task requires understanding the meaning or semantics of the data?

  • Semantic data transformations (correct)

What type of tasks do syntactic data transformations NOT require?

  • External knowledge or reference data (correct)

In the context of transformation tools, what does 'proactive transformation' mean?

  • The tool suggests potential transformations automatically (correct)

Which of the following best describes 'discretization' in data cleaning?

  • Simplifying numerical values into buckets or ranges (correct)
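
As a quick illustration of discretization, the sketch below buckets numeric ages into ranges with pandas; the bin edges and labels are arbitrary choices for the example.

```python
# Illustrative sketch: discretizing numeric values into buckets with pandas.
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 68, 90])
buckets = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                 labels=["minor", "young adult", "adult", "senior"])
print(buckets.tolist())
```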

Which interaction model requires the user to provide input-output examples?

  • Transformation by example (correct)

What is the primary goal of error correction/imputation?

  • To ensure all edits are satisfied while changing the fewest fields possible. (correct)

Which method involves replacing missing values using logical relations between variables?

  • Regression imputation (correct)
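
A hedged sketch of regression imputation: fit a simple linear model on the observed rows, then predict the missing entry from the fitted relation. The data are invented.

```python
# Sketch of regression imputation: predict a missing field from a
# correlated field using a simple linear model. Values are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # observed predictor
y = np.array([2.1, 3.9, 6.2, np.nan])        # target with one missing value

mask = ~np.isnan(y)
model = LinearRegression().fit(X[mask], y[mask])
y[~mask] = model.predict(X[~mask])           # impute from the fitted relation
print(y)
```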

What is a characteristic of truncated data?

  • Sample data is drawn from a subset of the population. (correct)

Which statement best describes outlier detection techniques?

  • Outlier techniques can lose effectiveness with high-dimensional datasets. (correct)

What is a common method to detect missing values?

  • Analyzing data distribution and comparing to expected values. (correct)

Which of the following best defines an outlier?

  • A value that is unusually large or small compared to other values. (correct)
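
For illustration, a minimal z-score outlier check on synthetic data; the 3-sigma cutoff is a common convention rather than a rule.

```python
# Minimal z-score outlier check: flag values far from the mean in
# standard-deviation units. Flagged values should be inspected, not
# auto-deleted (they may be genuine rare events).
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(10, 0.5, 100), 42.0)   # 100 inliers plus one glitch
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])   # -> [42.]
```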

What does mean imputation do?

  • Replaces missing values with the mean of the dataset. (correct)
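
A short sketch of mean imputation with pandas, on toy data:

```python
# Fill missing entries with the mean computed from the observed values.
import pandas as pd

s = pd.Series([4.0, None, 6.0, 8.0])
print(s.fillna(s.mean()))   # missing value replaced by 6.0
```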

What should you do after identifying an outlier?

  • Confirm whether it represents legitimate behavior or a data glitch. (correct)

What is the key characteristic of streaming data?

  • It evolves over time and may decrease in value. (correct)

What is the potential consequence of poor data quality in processes?

  • Higher operational costs due to process failures. (correct)

Which of the following describes a local check in data quality processes?

  • Checks can be executed in parallel during both reading and updating. (correct)

Which dimension is NOT considered a quality dimension for data streams?

  • Ease of use (correct)

Which operator is intended to increase completeness in data streams?

  • Data generator (correct)

What effect does the sampling operator have on data streams?

  • Estimation accuracy depends on the size of the sample. (correct)
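
A small sketch of a stream sampling operator using reservoir sampling, which keeps a fixed-size uniform sample of an unbounded stream; a larger reservoir k improves the accuracy of downstream estimates, matching the point above.

```python
# Reservoir sampling (Algorithm R): maintain a uniform random sample of
# size k over a stream of unknown length.
import random

def reservoir_sample(stream, k, seed=0):
    rnd = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rnd.randint(0, i)   # replace with decreasing probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), k=5))
```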

What is the primary function of aggregation in data merging?

  • To compress multiple incoming data points into one output value. (correct)
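
For intuition, a minimal tumbling-window aggregation that compresses every few incoming readings into one output value; the window size and data are arbitrary.

```python
# Tumbling-window mean: every `size` readings are compressed to one value.
def tumbling_mean(stream, size):
    window = []
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield sum(window) / size
            window.clear()

print(list(tumbling_mean([1, 2, 3, 4, 5, 6], size=3)))  # [2.0, 5.0]
```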

What characterizes preliminary checks in data quality processes?

  • They are executed only once before the actual process begins. (correct)

Flashcards

Data Cleaning

The process of correcting errors, inconsistencies, and discrepancies in data to improve its quality.

Syntactic Data Transformation

Transforming data from one format to another without needing external knowledge.

Declarative Transformation

Specifying transformations directly using a language or tool.

Transformation by Example

Providing examples of input and output to automatically learn and apply transformations.

Proactive Transformation

Tools automatically suggest possible transformations based on data analysis.

Semantic Data Transformation

Transforming data by understanding its meaning and using external sources.

Data Type Conversion

Converting data from one type to another, e.g., text to numbers.

Data Normalization

Mapping diverse data formats into a consistent format.

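As an example of normalization, the sketch below remaps heterogeneous date strings into a single ISO format; the list of accepted input formats is an assumption made for the example.

```python
# Normalize mixed date formats into ISO 8601 strings.
from datetime import datetime

FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%b %d, %Y"]   # assumed input formats

def normalize_date(raw):
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None   # unparseable: flag for manual review

print([normalize_date(d) for d in ["31/12/2024", "2024-12-31", "Dec 31, 2024"]])
```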

Data Rule Consistency

The process of verifying a set of data rules to ensure they are consistent and non-redundant.

Error Localization

The process of using verified data rules to identify and correct errors in data.

Error Correction/Imputation

The task of changing data to satisfy verified data rules while minimizing changes to the data.

Outlier

A value that is unusually large or small compared to other values in the same data set.

Genuine Outlier

An outlier that reflects a genuine rare event in the data, rather than a recording error.

Data Glitch Outlier

An outlier caused by a mistake in data recording or entry.

Population Outlier

An outlier that comes from a different population than the rest of the data.

Distributional Outlier

An outlier that is detected by analyzing the distribution of data points.

Error e

The minimum fraction of tuples that must be removed from a relation for a functional dependency (FD) to hold. This value helps determine the severity of FD violations.

Threshold ϵ

A threshold value used to decide if an FD violation is significant. If the error rate 'e' is above the threshold ϵ, then the FD violation is considered problematic.

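A minimal sketch of estimating the error e for an FD X → Y and comparing it against a threshold ϵ. It uses minimum-removal counting (within each X-group, tuples not agreeing with the majority Y-value must be removed); the relation and threshold are invented.

```python
# Estimate the FD error e for X -> Y as the minimum fraction of tuples
# to remove so the dependency holds, then test it against epsilon.
from collections import Counter

def fd_error(tuples, x_idx, y_idx):
    groups = {}
    for t in tuples:
        groups.setdefault(t[x_idx], Counter())[t[y_idx]] += 1
    # Within each X-group, keep the most frequent Y; the rest violate the FD.
    violations = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return violations / len(tuples)

rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]
e = fd_error(rows, 0, 1)
print(e, e <= 0.3)   # 0.25, so the FD holds approximately at epsilon = 0.3
```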

Inclusion Dependency (IND)

A data constraint requiring that every value of a given set of attributes in one relation also appears as a value of a corresponding set of attributes in the same or a different relation, i.e., R[A1, ..., An] ⊆ S[B1, ..., Bn].

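For illustration, a small check of a (possibly partial) inclusion dependency R[A] ⊆ S[B]: the fraction of R's distinct A-values that also occur among S's B-values. The table contents are made up.

```python
# Coverage of an inclusion dependency: 1.0 means a full IND holds,
# lower values indicate a partial IND.
def ind_coverage(r_values, s_values):
    r, s = set(r_values), set(s_values)
    return len(r & s) / len(r)

orders_customer_ids = [1, 2, 2, 3, 9]
customer_ids = [1, 2, 3, 4, 5]
print(ind_coverage(orders_customer_ids, customer_ids))  # 0.75 -> partial IND
```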

Accuracy

In a data model, the accuracy of a conceptual schema refers to how well it captures the real-world requirements. It involves using the correct model constructs and representing the information accurately.

Completeness

In data modeling, completeness measures how well a conceptual schema encompasses all the necessary elements to fulfill the defined system requirements. It ensures all required information and concepts are included.

Pertinence

In data modeling, pertinence measures the extent to which the conceptual schema avoids unnecessary or irrelevant information, keeping the representation lean and efficient.

Minimality

A schema is minimal if it represents every part of the requirements only once, avoiding redundancies and ensuring a streamlined data model.

Readability

A schema is readable when the meaning of the information is clear for its intended use. The schema should be easily understood by anyone who needs to work with it.

Data quality block placement

A method for suggesting where to place quality checks within a process.

Local checks (parallel)

Checks run in parallel with the data processing, typically used for reading data.

Local checks (sequential)

Checks run sequentially, one after the other, usually for updating data.

Preliminary check

A single quality check that happens before any other process step. Execution is delayed if low data quality is detected.

Parallel checks

Checks placed in parallel with all other process tasks. Errors are reported only at the end. No delay in processing.

Data stream

An infinite sequence of data elements that arrives continuously.

Data quality dimensions for streams

Metrics used to assess data quality in a stream. They include accuracy, confidence, completeness, timeliness, and data volume.

Data stream operators

Operators that modify data streams, including data generators, data reducers, and data mergers.

Analytic Hierarchy Process (AHP)

A method for decision-making that uses pairwise comparisons of criteria and alternatives to determine their relative importance. It relies on eigenvectors to calculate the weights of each element in a hierarchy.

What are the steps of AHP?

A structured approach to analyze and understand the quality of data. It typically involves four key steps: defining the goal hierarchy, making pairwise comparisons, evaluating consistency, and aggregating the results.

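A sketch of the AHP weighting and consistency steps: derive priority weights from a pairwise comparison matrix via its principal eigenvector and compute the consistency index CI = (λmax − n) / (n − 1). The comparison matrix is invented for the example.

```python
# AHP: priority weights from the principal eigenvector of a pairwise
# comparison matrix, plus the consistency index.
import numpy as np

A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])   # pairwise importance of three criteria

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
w = np.abs(eigvecs[:, k].real)
w /= w.sum()                      # normalized priority weights
ci = (eigvals[k].real - len(A)) / (len(A) - 1)
print(w.round(3), round(ci, 4))   # weights and consistency index
```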

Data Profiling

A process of examining and understanding data characteristics to reveal underlying patterns and insights. It provides information about data structure, distribution, relationships, and potential issues.

What is Data Profiling used for?

A set of activities and processes used to determine the metadata associated with a dataset. It can include various descriptive information such as data types, value ranges, missing values, and relationships between different fields.

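For illustration, a minimal single-column profile that collects basic metadata (inferred type, null count, distinct count, value range); the column and data are made up.

```python
# Basic single-column data profile with pandas.
import pandas as pd

df = pd.DataFrame({"age": [25, 31, None, 40, 31]})
col = df["age"]
profile = {
    "dtype": str(col.dtype),
    "nulls": int(col.isna().sum()),
    "distinct": int(col.nunique()),
    "min": col.min(),
    "max": col.max(),
}
print(profile)
```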

What is one common root cause of data quality problems?

Multiple data sources may provide contradictory information about the same subject, leading to inconsistency and confusion.

What is another common root cause of data quality problems?

Human judgment and subjectivity can introduce errors into data, especially in situations where data collection relies on personal interpretations or estimations.

What is another common root cause of data quality problems?

Constraints on computing resources, such as limited processing power or storage space, can hinder data quality by limiting the ability to perform comprehensive data validation and analysis.

Data Fusion

The process of combining data from multiple sources, resolving inconsistencies and determining the most accurate value for each data item.

Rule-based Data Fusion

A type of data fusion algorithm that uses rules, like majority voting, to decide the most likely value.

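A minimal sketch of rule-based fusion by majority voting over toy claims:

```python
# For one data item, pick the value reported by the most sources
# (ties broken arbitrarily by most_common).
from collections import Counter

claims = {"s1": "Paris", "s2": "Paris", "s3": "Lyon"}
value, support = Counter(claims.values()).most_common(1)[0]
print(value, support)   # Paris 2
```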

Web Link-based Data Fusion

A data fusion algorithm inspired by the way web page authority is measured, using links between sources to determine reliability.

Bayesian-based Data Fusion

A data fusion algorithm based on Bayesian inference, using probabilities to determine the likelihood of values given sources.

Single Truth Data Fusion

A situation where there is only one correct value for each data item, and different sources may contain opposing values.

Multi-Truth Data Fusion

A situation where the true value is composed of multiple values, with different sources contributing partial truths.

Data Fusion-STORM Algorithm

A data fusion algorithm that attempts to identify and remove noise or inaccuracies from the data before applying the fusion process.

Authored Sources

Sources that have been copied by many other sources, indicating their potential influence and reliability.

Study Notes

Data and Information Quality Recap

  • Data quality is the ability of a data collection to meet user requirements. From an information system perspective, there should be no contradiction between the real world view and the database view.
  • Causes of poor data quality include historical changes in data importance, data usage variations, corporate mergers, and external data enrichment.
  • Factors impacting data quality include data volatility, process, and technology.
  • Data governance is the practice of organizing and implementing policies, procedures, and standards to maximize data accessibility and interoperability for business objectives.
  • Data governance defines roles, responsibilities, and processes for data asset accountability. It's essential for master data management, business intelligence, and data quality control.
  • Key components of data governance include master data management, data quality, security, metadata, and integration.
  • Data quality dimensions include accuracy, completeness, consistency, timeliness, and others.
  • Accuracy is the nearness of a data value to its true representation. It has syntactic accuracy (closeness to definition domain elements) and semantic accuracy (closeness to real-world representation).
  • Completeness refers to how well a data collection represents the real-world objects it describes.
  • Consistency is maintained when semantic rules hold across data items.
  • Timeliness refers to data availability for a task. Data age is one component.
  • Schema quality dimensions include accuracy, completeness, and pertinence.
  • Functional dependency discovery is discussed, including partitioning, search, and pruning in the TANE algorithm, and how TANE is made approximate.
  • Trust and credibility of web data require examining trustworthiness based on provenance and data similarity.
  • Sampling for quality assurance is important when a complete census is not feasible.
  • Data quality interpretation helps assess the quality of results.
  • Data quality improvement is a process spanning multiple tasks: data profiling, data cleaning, data transformation, normalization, missing-value handling, outlier detection, and duplicate detection.
  • Machine learning (ML) can be used for data quality tasks such as data imputation, active learning, or deep learning.
  • Business processes (BPs) and data quality are discussed including modeling, data quality checks, and data quality costs.
  • Big data challenges, including volume, variety, velocity, and veracity, are crucial to data quality.
  • Data quality improvement limitations include source dependency (constrained resources and varied arrival rates) and inherent limitations (infinite data, evaluation needs, and transient data).
  • Data integration and other big data topics are addressed, such as schema alignment, probabilistic schema mappings, and record linkage.
  • Processing models such as MapReduce and discussions of data provenance are vital topics.
  • Truth discovery, dealing with conflict resolution and different computation methods, is addressed.
  • Data fusion, incorporating merging, cleansing, and reconciliation techniques, is detailed. Specific types of data fusion, like iterative Bayesian algorithms, are described.

Description

This quiz covers the essential aspects of data and information quality, including definitions, causes of poor quality, and the role of data governance. Learn about the factors affecting data quality and the key components necessary for effective governance. Test your understanding of how data quality impacts business objectives.
