Data Quality and Governance Overview

Questions and Answers

What is a limitation of the iterative Bayesian algorithm, Accu?

  • It accounts for varying truth probabilities.
  • It allows for multiple true values for each data item.
  • It assumes sources are independent. (correct)
  • It provides a dynamic estimation of P(V is true for i).
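
Since the limitation above comes from the independence assumption, here is a minimal sketch of how an Accu-style iterative Bayesian vote count can be implemented; the data items, the initial source accuracy of 0.8, and the assumed number of false values per item are illustrative, not the lesson's exact formulation.

```python
# Sketch of Accu-style iterative truth discovery (assumes independent sources).
import math
from collections import defaultdict

# claims[data_item][source] = value reported by that source (illustrative data)
claims = {
    "capital_of_x": {"s1": "A", "s2": "A", "s3": "B"},
    "population_y": {"s1": "1M", "s2": "2M", "s3": "1M"},
}
accuracy = {"s1": 0.8, "s2": 0.8, "s3": 0.8}   # initial source accuracies (assumed prior)
n_false = 10                                   # assumed number of false values per item

for _ in range(5):
    # Step 1: estimate P(v is true) per item by summing votes of independent sources.
    prob = {}
    for item, votes in claims.items():
        score = defaultdict(float)
        for source, value in votes.items():
            a = accuracy[source]
            score[value] += math.log(n_false * a / (1 - a))   # Accu-style vote count
        z = sum(math.exp(s) for s in score.values())
        prob[item] = {v: math.exp(s) / z for v, s in score.items()}

    # Step 2: re-estimate each source's accuracy from the values it claimed.
    for source in accuracy:
        ps = [prob[item][v] for item, votes in claims.items()
              for s, v in votes.items() if s == source]
        accuracy[source] = min(0.99, sum(ps) / len(ps))       # clamp away from 1.0

print(prob)       # estimated P(v is true) for each data item
print(accuracy)   # learned source accuracies
```

Because votes from different sources are simply summed, a copier that repeats another source is weighted like an independent witness, which is exactly the limitation named in the question.
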
Which type of conflict arises from using outdated information in data fusion?

  • Copying behaviors
  • Out of date information (correct)
  • Inconsistent interpretations of semantics
  • Incorrect calculations

What characterizes multi-truth problems in data fusion?

  • Different values can contribute partial truths. (correct)
  • Only one true value exists for each data object.
  • Values are either entirely correct or entirely incorrect.
  • There are no conflicting sources.

Which category of data fusion algorithms is inspired by measuring web page authority?

Web link based

    What phase follows the value clustering in the data fusion-STORM process?

    Authority-based Bayesian inference

    What does the Bayesian based category of data fusion algorithms primarily rely on?

    Bayesian inference

    What does the term 'authored sources' refer to in the context of data fusion-STORM?

    Sources that have been referenced by many others.

    What is a necessary component of data fusion in the context of big data?

    Addressing conflicts that arise from data.

    What is the definition of an inclusion dependency (IND)?

    R[A1, .., An] is a subset of S[B1, .., Bn], where R and S are two different relations.

    Which dimension does NOT pertain to schema quality?

    Redundancy

    What characterizes a partial inclusion dependency?

    Only a certain percentage of tuples satisfy the dependency.
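
The two inclusion-dependency questions above can be made concrete with a small pandas check; the relations and column names below are illustrative.

```python
# Sketch: checking an inclusion dependency R[A1] ⊆ S[B1], exactly and partially.
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2, 5]})   # R[A1]
customers = pd.DataFrame({"id": [1, 2, 3, 4]})         # S[B1]

satisfied = orders["customer_id"].isin(customers["id"])
print("exact IND holds:", satisfied.all())               # False: value 5 has no match
print(f"partial IND coverage: {satisfied.mean():.0%}")   # 75% of tuples satisfy it
```
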

    In terms of trust and credibility of web data, what does data trustworthiness depend on?

    Data provenance and the average trustworthiness of the provider.

    What is the main issue with a non-probability sampling approach?

    It does not provide a known probability of selection for each unit.

    Which dimension of schema quality measures the clarity of representation?

    Readability

    What is a characteristic of n-ary inclusion dependencies?

    They involve multiple attributes from relations R and S.

    What does the trustworthiness of a data value indicate?

    The probability that the value is correct.

    What is the first step in the Analytic Hierarchy Process (AHP)?

    Development of a goal hierarchy

    Which of the following is NOT a root cause of data quality problems?

    Standardized data formats

    What does data profiling primarily aim to achieve?

    Determine the metadata of a dataset
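
As a tiny illustration of profiling as metadata discovery, the sketch below derives a few typical per-column statistics; the sample dataset is made up.

```python
# Sketch: column-level profiling that reports simple metadata about a dataset.
import pandas as pd

df = pd.DataFrame({"age": [34, 41, None, 29],
                   "city": ["Rome", "Milan", "Rome", None]})

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),      # inferred data type per column
    "null_ratio": df.isna().mean(),      # share of missing values
    "distinct_values": df.nunique(),     # cardinality of each column
})
print(profile)
```
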

    Which of the following describes a data-based approach to improving data quality?

    Identifying and correcting errors in data values

    In the context of data integration, data profiling supports which of the following?

    Understanding data content before integration

    What does the consistency check in the AHP process ensure?

    Logical coherence of paired comparisons
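
Tying together the AHP questions above (goal hierarchy, pairwise comparisons, consistency check), the sketch below builds a reciprocal comparison matrix for three criteria and computes Saaty's consistency ratio; the criteria and judgments are made up.

```python
# Sketch: AHP pairwise comparisons with a consistency check (CR < 0.1 is acceptable).
import numpy as np

# Reciprocal pairwise-comparison matrix over 3 criteria (illustrative judgments).
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
n = A.shape[0]

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()                    # priority vector (criteria weights)

lambda_max = eigvals[k].real
ci = (lambda_max - n) / (n - 1)             # consistency index
ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]         # Saaty's random index for n criteria
cr = ci / ri                                # consistency ratio

print("weights:", weights.round(3), "CR:", round(cr, 3))
```
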

    Which activity is part of the data profiling steps?

    User selects data to be profiled

    What common issue arises from having multiple data sources?

    Different values for the same information

    What is the primary goal of data cleaning?

    To improve data quality by eliminating errors and inconsistencies

    Which of the following is NOT a step in the data cleaning process?

    Data visualization

    What does normalization involve in the context of data cleaning?

    Remapping data into a common format
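
As a small illustration of normalization as remapping into a common format, the sketch below rewrites phone numbers in mixed notations into one canonical form; the input strings and the "+39" prefix are illustrative assumptions.

```python
# Sketch: normalization, remapping heterogeneous phone-number strings
# into a single canonical representation.
import re

raw_phones = ["(02) 1234-5678", "02 12345678", "+39 02 1234 5678"]

def normalize_phone(p: str) -> str:
    digits = re.sub(r"\D", "", p)          # keep digits only
    digits = digits.removeprefix("39")     # drop the country code if already present
    return "+39" + digits                  # re-emit in one common format

print([normalize_phone(p) for p in raw_phones])
# all three inputs map to '+390212345678'
```
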

    Which task requires understanding the meaning or semantics of the data?

    Semantic data transformations

    What do syntactic data transformations NOT require?

    External knowledge or reference data

    In the context of transformation tools, what does 'proactive transformation' mean?

    The tool suggests potential transformations automatically

    Which of the following best describes 'discretization' in data cleaning?

    Simplifying numerical values into buckets or ranges
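
A minimal illustration of discretization; the bucket edges and labels are arbitrary choices.

```python
# Sketch: discretization, simplifying numeric values into labeled buckets.
import pandas as pd

ages = pd.Series([5, 17, 23, 41, 68, 90])
buckets = pd.cut(ages, bins=[0, 18, 35, 65, 120],
                 labels=["minor", "young adult", "adult", "senior"])
print(buckets.tolist())   # ['minor', 'minor', 'young adult', 'adult', 'senior', 'senior']
```
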

    Which interaction model requires the user to provide input-output examples?

    Transformation by example

    What is the primary goal of error correction/imputation?

    To ensure all edits are satisfied while changing the fewest fields possible.

    Which method involves replacing missing values using logical relations between variables?

    Regression imputation
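
A minimal sketch of regression imputation, fitting a simple linear relation between two columns and using it to fill the missing entry; the toy data and the linear model are illustrative assumptions.

```python
# Sketch: regression imputation using the relation between height and weight.
import numpy as np
import pandas as pd

df = pd.DataFrame({"height_cm": [160, 170, 180, 175, 165],
                   "weight_kg": [55, 68, 80, np.nan, 62]})

known = df.dropna()
slope, intercept = np.polyfit(known["height_cm"], known["weight_kg"], deg=1)

missing = df["weight_kg"].isna()
df.loc[missing, "weight_kg"] = slope * df.loc[missing, "height_cm"] + intercept
print(df)   # the missing weight is predicted from the fitted line
```
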

    What is a characteristic of truncated data?

    Sample data is drawn from a subset of the population.

    Which statement best describes outlier detection techniques?

    Outlier detection techniques can lose effectiveness with high-dimensional datasets.

    What is a common method to detect missing values?

    Analyzing data distribution and comparing to expected values.

    Which of the following best defines an outlier?

    A value that is unusually large or small compared to the others.
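
To make the outlier questions concrete, the sketch below flags values outside the usual 1.5×IQR fences; the readings and the fence multiplier are illustrative, and a flagged value still has to be confirmed as a glitch or legitimate behavior.

```python
# Sketch: IQR-based outlier detection on a small batch of readings.
import numpy as np

values = np.array([10.2, 9.8, 10.1, 10.4, 9.9, 55.0, 10.0])
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < low) | (values > high)]
print(outliers)   # [55.] is unusually large compared to the others
```
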

    What does mean imputation do?

    Replaces missing values with the mean of the dataset.
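
Mean imputation itself is a one-line operation; the series below is illustrative, and note that the technique shrinks the variance of the imputed column.

```python
# Sketch: mean imputation, replacing missing values with the observed mean.
import numpy as np
import pandas as pd

s = pd.Series([12.0, np.nan, 15.0, 11.0, np.nan])
print(s.fillna(s.mean()))   # NaNs are replaced by (12 + 15 + 11) / 3 ≈ 12.67
```
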

    What should you do after identifying an outlier?

    Confirm whether it represents legitimate behavior or a data glitch.

    What is a key characteristic of streaming data?

    It evolves over time and may decrease in value.

    What is the potential consequence of poor data quality in processes?

    Higher operational costs due to process failures.

    Which of the following describes a local check in data quality processes?

    Checks can be executed in parallel during both reading and updating.

    Which dimension is NOT considered a quality dimension for data streams?

    Ease of use

    Which operator is intended to increase completeness in data streams?

    Data generator

    What effect does the sampling operator have on data streams?

    The accuracy of its estimates depends on the size of the sample.

    What is the primary function of aggregation in data merging?

    To compress multiple incoming data points into one output value.
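
Two of the stream operators above can be sketched in a few lines: reservoir sampling keeps a fixed-size uniform sample (so the accuracy of any estimate grows with the reservoir size k), while a windowed mean compresses several incoming readings into one output value. The functions and data are illustrative.

```python
# Sketch: a sampling operator and an aggregation operator for data streams.
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = x
    return sample

def window_mean(stream, size):
    """Aggregate: compress every `size` consecutive readings into one mean value."""
    window = []
    for x in stream:
        window.append(x)
        if len(window) == size:
            yield sum(window) / size
            window = []

readings = [20.1, 20.3, 19.8, 21.0, 20.6, 20.2, 19.9, 20.4]
print(reservoir_sample(readings, k=4))
print(list(window_mean(readings, size=4)))   # [20.3, 20.275]
```
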

    What characterizes preliminary checks in data quality processes?

    They are executed only once before the actual process begins.

    Study Notes

    Data and Information Quality Recap

    • Data quality is the ability of a data collection to meet user requirements. From an information system perspective, there should be no contradiction between the real world view and the database view.
    • Causes of poor data quality include historical changes in data importance, data usage variations, corporate mergers, and external data enrichment.
    • Factors impacting data quality include data volatility, process, and technology.
    • Data governance is the practice of organizing and implementing policies, procedures, and standards to maximize data accessibility and interoperability for business objectives.
    • Data governance defines roles, responsibilities, and processes for data asset accountability. It's essential for master data management, business intelligence, and data quality control.
    • Key components of data governance include master data management, data quality, security, metadata, and integration.
    • Data quality dimensions include accuracy, completeness, consistency, timeliness, and others.
    • Accuracy is the nearness of a data value to its true representation. It has syntactic accuracy (closeness to definition domain elements) and semantic accuracy (closeness to the real-world representation).
    • Completeness refers to how well a data collection represents the real-world objects it describes (a small metric sketch for accuracy and completeness follows this list).
    • Consistency is maintained when semantic rules apply across data items.
    • Timeliness refers to data availability for a task. Data age is one component.
    • Schema quality dimensions include accuracy, completeness, and pertinence.
    • Functional dependencies and related concepts, such as partitioning, searching, pruning, and making TANE approximate, are discussed.
    • Trust and credibility of web data require examining trustworthiness based on provenance and data similarity.
    • Sampling for quality assurance is important when a complete census is not feasible.
    • Data quality interpretation helps assess the quality of results.
    • Data quality improvement is a multi-task process that includes data profiling, data cleaning, data transformation, normalization, missing-value handling, outlier detection, and duplicate detection.
    • Machine learning (ML) techniques, including active learning and deep learning, can be used for data quality tasks such as data imputation.
    • Business processes (BPs) and data quality are discussed including modeling, data quality checks, and data quality costs.
    • Big data challenges, including volume, variety, velocity, and veracity, are crucial to data quality.
    • Data quality improvement limitations include source dependency (constrained resources and varied arrival rates) and inherent limitations (infinite data, evaluation needs, and transient data).
    • Data integration and other big data topics are addressed, such as schema alignment, probabilistic schema mappings, and record linkages.
    • Mapping techniques, such as those based on MapReduce, and data provenance are also vital topics.
    • Truth discovery, dealing with conflict resolution and different computation methods, is addressed.
    • Data fusion, incorporating merging, cleansing, and reconciliation techniques, is detailed. Specific types of data fusion algorithms, such as iterative Bayesian approaches, are described.
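
As referenced in the accuracy and completeness items above, here is a minimal sketch of column-level measures for those two dimensions; the column, its definition domain, and the data are illustrative.

```python
# Sketch: completeness (share of non-missing values) and syntactic accuracy
# (share of values inside the definition domain) for one column.
import pandas as pd

domain = {"M", "F", "X"}                                  # definition domain
col = pd.Series(["M", "F", "female", None, "X", "M"])

completeness = col.notna().mean()                         # 5 of 6 values present
syntactic_accuracy = col.dropna().isin(domain).mean()     # 4 of 5 values in the domain

print(f"completeness = {completeness:.2f}, syntactic accuracy = {syntactic_accuracy:.2f}")
```
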


    Description

    This quiz covers the essential aspects of data and information quality, including definitions, causes of poor quality, and the role of data governance. Learn about the factors affecting data quality and the key components necessary for effective governance. Test your understanding of how data quality impacts business objectives.
