Data and Information Quality Recap
Politecnico di Milano 1863
2025
Francesco Spangaro
Data and Information Quality recap
Francesco Spangaro
05 January 2025
Academic Year 2024 - 2025

Contents
1 Data Quality
  1.1 Data quality definitions
  1.2 Causes of poor data quality
  1.3 Factors that impact data quality
2 Data Governance
  2.1 Data governance definitions
  2.2 Data governance main components
  2.3 Roles and responsibilities
  2.4 Goals of data governance
  2.5 Data quality management
  2.6 Data quality dimensions
  2.7 Some concepts on data quality dimensions
  2.8 Most used objective dimensions
    2.8.1 Accuracy
      2.8.1.1 Syntactic accuracy
      2.8.1.2 Semantic accuracy
    2.8.2 Completeness
    2.8.3 Consistency
      2.8.3.1 Integrity constraints
    2.8.4 Timeliness
    2.8.5 Other dimensions
  2.9 Schema quality dimensions
3 Functional dependencies
  3.1 Tane method
    3.1.1 Partitioning
    3.1.2 Search
    3.1.3 Pruning
  3.2 Making Tane approximate
4 Inclusion dependency (IND)
5 Schema quality dimensions
6 Trust and credibility over web data
7 Sampling for quality assurance
8 Data quality interpretation
9 IQ Score
10 Scaling methods
11 User weighting
12 Root causes of DQ problems
13 DQ improvement strategies
  13.1 Data profiling
    13.1.1 Use cases
    13.1.2 Data Profiling tasks
      13.1.2.1 Single column analysis
      13.1.2.2 Dependency discovery
      13.1.2.3 Relaxed dependencies
    13.1.3 Next generation profiling
  13.2 Data cleaning
    13.2.1 Data cleaning - Tasks
      13.2.1.1 Data transformation and normalization
      13.2.1.2 Localize and correct inconsistencies
      13.2.1.3 Missing values
      13.2.1.4 Outlier detection
      13.2.1.5 Duplicate detection
      13.2.1.6 String-based distance functions
      13.2.1.7 Data Deduplication
    13.2.2 Conflict handling strategies
      13.2.2.1 Data fusion answers
      13.2.2.2 Rule-based data cleaning
14 Data Quality for Machine Learning
  14.1 Phases and tasks
  14.2 Data preparations
    14.2.1 Data transformation tasks
    14.2.2 Basic cleaning operations
15 Machine learning and Data quality
16 Machine learning for Data quality
  16.1 Data imputation with KNN
  16.2 Active learning in data deduplication
  16.3 Deep learning in data deduplication
17 Process based data improvement
  17.1 BP and Data
  17.2 Modeling data quality in BP
  17.3 Data quality blocks
  17.4 DQ costs
  17.5 Data stream
  17.6 Sensors' Data
    17.6.1 Data quality aware strategy
      17.6.1.1 Window size definition
      17.6.1.2 Interestingness
      17.6.1.3 Technological issues with nodes in Wireless Sensor Networks (WSN)
18 Big data
  18.1 Four V of big data
19 Data Quality and big data
20 Data quality improvement limitations
21 Big Data integration
22 MAPREDUCE
23 Provenance
  23.1 Provenance and Data Quality
24 Truth discovery
  24.1 Truth computation methods
25 Data fusion
  25.1 Iterative Bayesian algorithm: Accu
  25.2 Data fusion and truth
  25.3 Data fusion-STORM
  25.4 Data fusion-word embeddings

1 Data Quality

1.1 Data quality definitions
Traditional: the ability of a data collection to meet user requirements.
Information System POV: no contradictions between the real-world user view and the view derived from the database.

1.2 Causes of poor data quality
Historical changes: the importance of data may change over time.
Data usage: data relevance depends on the process in which the data is used.
Corporate mergers: difficulties introduced by data integration.
Data enrichment: it might be dangerous to enrich internal data with data taken from external sources.

1.3 Factors that impact data quality
Data
– Volatility – Measures – Validation – Modeling – Distribution
Processes
– Data management process – Process design
Technology
– Data pipeline – Redundancy
People
– Leadership – Training – Motivation

2 Data Governance
In every data-driven company, there is an organization that constantly defines policies and data processing rules and checks the quality of data.

2.1 Data governance definitions
Data governance is the practice of organizing and implementing policies, procedures, and standards that maximize data access and interoperability for the business mission.
Data governance defines roles, responsibilities, and processes to ensure accountability and ownership of data assets throughout the enterprise.
Data governance is absolutely a mandatory requirement for success if an organization wants to achieve master data management, build business intelligence, improve data quality, or manage documents.

2.2 Data governance main components
Master data management: data that provides the context for transaction data; the goal is to create a single copy of crucial data subjects
Data quality
Security: authentication, secure transmission, data access and use processes, security management procedures
Metadata
Integration

2.3 Roles and responsibilities
Steering Committee
– Responsible for the overall governance strategy
– Championing the work of data stewards
– Holding the governance organization accountable to timelines and outcomes
Data Owners
– Generally members of the Steering Committee
– Responsible for ensuring that information within a specific data domain is governed across systems and lines of business
– Ensuring the accuracy of information across the enterprise
– Directing data quality activities
– Working with other data owners to resolve data issues
– Second-level review for issues identified by Data Stewards
Data Stewards
– Responsible for identifying data issues and working with other Data Stewards to resolve them
– Acting as a member of the Data Steward council
– Proposing, discussing and voting on data policies and committee activities
– Reporting to the data owner and other stakeholders within a data domain
– Working cross-functionally across lines of business to ensure their domain's data is managed and understood

2.4 Goals of data governance
Minimize risk
Establish internal rules for data usage
Implement compliance requirements
Improve internal and external communication
Increase the value of data
Reduce costs
In practice:
Data Governance builds the data structures that will give the business the information it needs when it needs it
Data Governance acts as a central planning center to coordinate data design across organizations: it builds the enterprise-wide blueprint that guides all IT development towards data interoperability

2.5 Data quality management
Performed in 4 cyclic phases:
Define suitable data quality dimensions
Measure the data quality level along those dimensions
Analyze data quality problems and root causes
Improve the design with improvement actions

2.6 Data quality dimensions
The real world is composed of objects that are described by attributes and monitored through their states. Data stored in the information system represent the state of information in a certain time instant.
Correct representation: a real-world system is properly represented if there exists a correspondence between the real world and the information system.
[Figure 1: (a) Correct Representation (b) Incompleteness (c) Insignificance (d) Ambiguity]

2.7 Some concepts on data quality dimensions
Each dimension captures a specific aspect included under the general umbrella of data quality
Quality dimensions can refer to
– the extension of data (data values)
– the intension (their schema)
– and so on (more than 179 dimensions have been proposed)
2.8 Most used objective dimensions
Accuracy: extent to which data are correct, reliable and certified
Completeness: degree to which a data collection describes the corresponding set of real-world objects
Consistency: satisfaction of semantic rules defined over a set of data items
Timeliness: extent to which data are sufficiently up to date for a task

2.8.1 Accuracy
Defined as the closeness between a data value v and a data value v', considered the correct representation of the real-life phenomenon that v aims to represent. We have two types of accuracy:

2.8.1.1 Syntactic accuracy
Defined as the closeness of a value v to the elements of the corresponding definition domain D. It is measured in two ways:
Exact matching: 0 if v ∉ D, 1 if v ∈ D
Similarity-based: the accuracy value lies in the interval [0, 1] and is evaluated by comparison functions, like the edit distance.

2.8.1.2 Semantic accuracy
Defined as the closeness between a data value v and the true data value v'. It is better measured with an exact matching. For example, we might find the name of a student in the column "Teachers": the name might be written correctly, but it is not accurate.
Additional knowledge is often needed; we should look for the same data in different data sources
Requires the solution of the object identification problem, i.e. understanding whether two tuples refer to the same real-world entity. The main issues we can face while solving this problem are:
– Identification: tuples may not have unique identifiers; we may solve this by performing an appropriate key matching.
– Decision strategy: once a match is found, we have to decide whether the matching tuples are the same or not.
Accuracy may be aggregated, e.g. relation accuracy = ratio between accurate values and the total number of values
Accuracy is related to duplication

2.8.2 Completeness
The completeness of a table characterizes the extent to which the table represents the corresponding real world
Characterized with regard to
– Presence/absence and meaning of NULL values
– Validity of one of the two assumptions: Open world (OWA) or Closed world (CWA)
CWA: only the values actually present in the table are true
OWA: we can neither state the truth nor the falsity of facts not represented in the table
Interesting cases are when we have no null values under the OWA and when we have null values under the CWA

2.8.3 Consistency
Consistency captures the violation of semantic rules defined over a set of data items
Semantic rule types:
– Data edits
– Business rules
– Integrity constraints

2.8.3.1 Integrity constraints
Intrarelation constraints: can regard a single attribute or multiple attributes of a relation, e.g.
– Age is included between 0 and 120
– If workingYears is less than a given value, then salary is less than 25k/year
Interrelation constraints: involve attributes of more than one relation, e.g.
– The year of the Movies relation must be equal to the year of the Oscar Awards relation
A minimal check of such constraints is sketched below.
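As an illustration (not part of the original notes), the following Python sketch checks the two kinds of constraints above on a toy table; the column names and the threshold on workingYears are hypothetical.

# Minimal sketch of intra- and inter-relation consistency checks (hypothetical schema).
employees = [
    {"id": 1, "age": 34, "workingYears": 2, "salary": 21000},
    {"id": 2, "age": 150, "workingYears": 10, "salary": 50000},   # violates the age range
]
movies = [{"title": "Movie A", "year": 2001}]
oscars = [{"title": "Movie A", "year": 2002}]                     # violates the inter-relation rule

def intra_violations(rows):
    """Return ids violating the intra-relation constraints."""
    bad = []
    for r in rows:
        if not (0 <= r["age"] <= 120):
            bad.append((r["id"], "age out of range"))
        if r["workingYears"] < 3 and r["salary"] >= 25000:        # assumed threshold
            bad.append((r["id"], "salary inconsistent with workingYears"))
    return bad

def inter_violations(movies, oscars):
    """Return titles whose year differs between the two relations."""
    years = {m["title"]: m["year"] for m in movies}
    return [o["title"] for o in oscars if years.get(o["title"]) not in (None, o["year"])]

print(intra_violations(employees))
print(inter_violations(movies, oscars))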
Three main dependency types among integrity constraints:
– Key dependency: holds in a relational instance r for a subset K of attributes if no two rows of r have the same K-value
– Inclusion dependency: the foreign key constraint is one of them; referencing columns in one relation must be contained in the primary key columns of the referenced relation
– Functional dependency: X -> Y, where X and Y are two non-empty sets of attributes in r; r satisfies the functional dependency if for every tuple pair: if t1.X = t2.X then t1.Y = t2.Y
Dependency discovery issues:
– Search time: it can be exponential -> we need ways to decrease it
– Dependencies should be valid: it is necessary to test the quality of the results

2.8.4 Timeliness
Has two components:
Currency or Age: a measure of how old the information is, based on how long ago it was recorded.
Currency = Age + (DeliveryTime - InputTime)
– DeliveryTime: time the information product is delivered to the customer
– InputTime: time the data unit is obtained
Volatility: a measure of the instability of the information, i.e. the frequency of change of the value of an entity attribute

2.8.5 Other dimensions
Accessibility: measures the ability of the user to access the data given his own culture, physical status and available technologies
Redundancy: minimality, compactness and conciseness refer to the capability of representing the aspects of the reality of interest with the minimal use of informative resources
Readability: ease of understanding and fruition of information by users
Usefulness: related to the advantage the user gains from the use of information

2.9 Schema quality dimensions
Accuracy:
– Correctness with respect to the model concerns the correct use of the constructs of the model in representing requirements
– Correctness with respect to requirements concerns the correct representation of the requirements in terms of the model constructs
Completeness: the extent to which a conceptual schema includes all the conceptual elements necessary to meet some specified requirements
Pertinence: how many unnecessary conceptual elements are included in the conceptual schema
Minimality: a schema is minimal if every part of the requirements is represented only once in the schema. Obtained through avoidance of redundancies and through normalization
Readability: a schema is readable whenever it represents the meaning of the reality represented by the schema in a clear way for its intended use
Trustworthiness: since data might come from different sources, it is often not possible to define a unique Data Quality model. Trustworthiness represents the confidence factor we give to data

3 Functional dependencies
A functional dependency X -> A asserts that all pairs of records with the same values in the attribute combination X must have the same values in attribute A
Types, given a rule X -> Y:
– Trivial: the attributes in Y are a subset of the attributes in X
– Non-trivial: at least one attribute in Y does not appear in X
– Completely non-trivial: the attributes in X and Y are disjoint
– Minimal functional dependency: Y does not depend on any subset of X
Our goal is, given a relation R, to find all minimal, completely non-trivial functional dependencies. A naive check for a single candidate dependency is sketched below.
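As a minimal sketch (not from the notes), the following Python function tests whether a single candidate FD X -> A holds in a list of tuples by grouping on X and checking that each group carries exactly one A-value; the attribute names are hypothetical.

# Naive check of a functional dependency X -> A over a list of dict records.
from collections import defaultdict

def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds in rows (lhs is a tuple of attribute names)."""
    seen = defaultdict(set)
    for r in rows:
        key = tuple(r[a] for a in lhs)
        seen[key].add(r[rhs])
    return all(len(values) == 1 for values in seen.values())

rows = [
    {"zip": "20133", "city": "Milano", "street": "Via A"},
    {"zip": "20133", "city": "Milano", "street": "Via B"},
    {"zip": "00100", "city": "Roma",   "street": "Via C"},
]
print(fd_holds(rows, ("zip",), "city"))    # True: zip -> city holds here
print(fd_holds(rows, ("zip",), "street"))  # False: same zip, different streets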
Methods:
– Column-based: Tane or FD-Mine
– Row-based
– Other

3.1 Tane method
Two key ideas:
Reduce tuple sets through partitioning; this is done according to attribute values and level-wise, increasing the size of the attribute set
Reduce column combinations through pruning -> reasoning over FDs

3.1.1 Partitioning
TabID | A | B | C
1     | 1 | a | d
2     | 1 | A | f
3     | 2 | A | d
4     | 2 | A | d
5     | 2 | b | f
Partitions of attributes:
πA = {{1, 2}, {3, 4, 5}}
πB = {{1}, {2, 3, 4}, {5}}
πC = {{1, 3, 4}, {2, 5}}
A functional dependency X -> A holds iff πX refines πA

3.1.2 Search
Bottom-up search: candidate FDs with a single attribute on the left-hand side are tested at the first level, those with two attributes on the left-hand side at the second level, and so on.

3.1.3 Pruning
Performed in two cases:
Every time we find a key
Every time a superkey is detected

3.2 Making Tane approximate
The definition is based on the minimum number of tuples to be removed from R for X -> A to hold in R.
Problem: given a relation R and a threshold ε, find all minimal non-trivial FDs X -> A such that e(X -> A) ≤ ε.
Steps:
– Define the error e: the fraction of tuples causing the FD violation
– Specify the threshold ε

4 Inclusion dependency (IND)
Typically involves more than one relation.
Definition: let D be a relation schema and let I be an instance of D. R[A1,..,An] denotes the projection of I on attributes A1,..,An of relation R. An IND is R[A1,..,An] ⊆ S[B1,..,Bn], where R and S are (possibly identical) relations of D. The projections on R and S must have the same number of attributes. The values of R are called "dependent values", while the values of S are called "referenced values".
Types:
– Unary: INDs on single attributes: R[Ax] ⊆ S[Bx]
– N-ary: INDs on multiple attributes: R[A1,..,An] ⊆ S[B1,..,Bn]
– Partial: the IND R[A] ⊆ S[B] is satisfied for X% of all of R's tuples
– Approximate: the IND R[A] ⊆ S[B] is satisfied with probability P
A minimal check of a unary and a partial IND is sketched below.
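The following Python sketch (not part of the notes; relation and attribute names are hypothetical) checks a unary IND R[A] ⊆ S[B] and reports the satisfied fraction used by partial INDs.

# Check a unary inclusion dependency R[A] ⊆ S[B] on two lists of dict records.
def ind_satisfied_fraction(r_rows, a, s_rows, b):
    """Fraction of R tuples whose A-value occurs among the B-values of S."""
    referenced = {row[b] for row in s_rows}
    if not r_rows:
        return 1.0
    ok = sum(1 for row in r_rows if row[a] in referenced)
    return ok / len(r_rows)

orders = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 9}]   # dependent values
customers = [{"id": 1}, {"id": 2}, {"id": 3}]                           # referenced values

frac = ind_satisfied_fraction(orders, "customer_id", customers, "id")
print(frac == 1.0)        # exact IND: False here (customer 9 has no referenced value)
print(frac >= 0.6)        # partial IND at the 60% level: True (2 of 3 tuples match)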
5 Schema quality dimensions
Accuracy: correctness with regard to the model; concerns the correct use of the constructs of the model in representing requirements
Completeness: measures the extent to which a conceptual schema includes all the conceptual elements necessary to meet some specified requirements
Pertinence: measures how many unnecessary conceptual elements are included in the conceptual schema
Minimality: a schema is minimal if every part of the requirements is represented only once in the schema
Readability: a schema is readable whenever it represents the meaning of the reality represented by the schema in a clear way for its intended use

6 Trust and credibility over web data
Data trustworthiness is based on data provenance.
The trust model is based on data similarity, data conflict, path similarity and data deduction.
The trustworthiness of a data item is the probability that its value is correct.
The trustworthiness of a source provider is the average trustworthiness of the data it provides.

7 Sampling for quality assurance
It is not always feasible to perform a census of the entire database.
There are two ways of sampling:
– Non-probability approach: it is not possible to know the probability with which each unit is drawn from the population; there is no way to evaluate the reliability of the results
– Probability sampling: each unit is drawn from the population with a known probability
Important steps:
– set the objective of the sampling
– define unit and population
– define the degree of precision (# errors) and the reliability level required
Methods:
– Simple random: choose a random sample of the required size
– Systematic: the first row is randomly chosen, then every k-th row from then on is included
– Stratified random: create subgroups with uniform quality distributions in them, then take random samples from each
– Cluster: the population is divided into clusters based on specific criteria, then a random sample from each cluster is inspected

8 Data quality interpretation
Information Quality (IQ) scores are useful for
Result annotation:
– Assessment of an aggregate quality index
– Definition of thresholds
– Quality representations
Source selection:
– find a qualitative ordering of the sources to decide which ones to query
– choose a combination of one or more sources if needed to answer a query

9 IQ Score
Given a data source with a scaled and weighted quality vector IQ(Si), the IQ score is the value r(IQ(Si)), where r returns a value in [0, 1]

10 Scaling methods
Bring all the scores into non-dimensional scores within [0, 1] in order to make them comparable
Normalization
TOPSIS: based on the Euclidean distance to a virtual ideal and a virtual negative-ideal data source

11 User weighting
Criteria are not equally important! Requirements might change over time on the basis of the context
Definition of a weighting vector (w1, ..., wn)
Some methods:
– Direct assignment: the user specifies weights with the rule w1 + ... + wn = 1
– Pair comparison: the user specifies, for each pair of criteria, how much one criterion is more important than the other. n(n-1)/2 comparisons.
– Analytic Hierarchy Process (AHP) method: based on eigenvectors and composed of four steps:
∗ Development of a goal hierarchy
∗ Comparison of goals by pairs
∗ Consistency check of the comparisons
∗ Aggregation of the comparisons
A minimal scaling-and-weighting example is sketched below.
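As an illustration (the dimension values and weights below are made up, and min-max scaling is used rather than TOPSIS), this sketch scales raw dimension scores of several sources into [0, 1] and combines them with user-assigned weights into a single IQ score per source.

import numpy as np

# Rows: sources S1..S3; columns: accuracy, completeness, delay (raw, on different scales).
raw = np.array([
    [0.92, 870.0, 12.0],
    [0.85, 910.0, 30.0],
    [0.99, 640.0,  5.0],
])
higher_is_better = np.array([True, True, False])   # the third criterion is a delay: lower is better
weights = np.array([0.5, 0.3, 0.2])                # direct assignment, sums to 1

# Min-max scaling column by column, flipping criteria where lower is better.
lo, hi = raw.min(axis=0), raw.max(axis=0)
scaled = (raw - lo) / (hi - lo)
scaled[:, ~higher_is_better] = 1.0 - scaled[:, ~higher_is_better]

iq_scores = scaled @ weights                        # weighted aggregate in [0, 1]
print(np.round(iq_scores, 3))                       # one IQ score per source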
12 Root causes of DQ problems
Multiple data sources: different values for the same information
Subjective judgment in data production
Limited computing resources
Security/Accessibility trade-off: easy access may conflict with security requirements
Coded data across disciplines
Volume of data: difficulty in accessing the needed information in a reasonable time
Changing data needs
Distributed information systems without proper integration mechanisms

13 DQ improvement strategies
Data-based approaches: focus on data values and aim to identify and correct errors without considering the process and the context in which they will be used
Process-based actions: activated when an error occurs; they aim to discover and eliminate the error's root cause

13.1 Data profiling
It is the set of activities and processes designed to determine the metadata of a given dataset
Helps understanding and preparing data for subsequent cleaning, integration and analysis
Steps:
– The user selects the data to be profiled and the metadata to be generated
– The tool computes metadata using SQL and/or specialized algorithms
– Results are usually displayed in tabs, tables, charts or other graphical visualizations
– The discovered metadata are applied to concrete use cases

13.1.1 Use cases
Data exploration: when datasets arrive at an organization and accumulate in data lakes, experts need an understanding of their content. Manual data exploration can and should be supported with data profiling techniques
Data integration: datasets that need to be integrated are unfamiliar and the integration expert wants to explore the datasets first. Apart from exploring individual sources, data profiling can also reveal how and how well two datasets can be integrated
Data quality/Data cleansing: profiling results can be used to reveal data errors
Big data analytics: big data is data that cannot be managed with traditional techniques, which underscores the importance of data profiling

13.1.2 Data Profiling tasks

13.1.2.1 Single column analysis
Cardinalities: numbers that summarize simple metadata: number of rows, attributes, null values (completeness), distinct values (uniqueness)
Value distributions
– Summarize the distribution of values within a column
– Common representation -> histogram
– The extremes of a numeric column support the identification of outliers
– The constancy of a column is defined as the frequency of its most frequent value divided by the total number of values
Data types: necessity to label a column with its type
Patterns: identify any frequently occurring pattern of values
Semantic domain: more difficult than finding the data type or pattern
Approximate statistics
– Uniform sample: sampling can be used to build approximate distributions and histograms. Not all statistics will be reliable
– Sketch: a small summary; sketches typically use a combination of counters and hashing to transform data values into a smaller domain
(A minimal single-column profile is sketched at the end of Section 13.1.2.)

13.1.2.2 Dependency discovery
Constraints such as keys, foreign keys, and functional dependencies are often constraints of a schema and are known at design time. However, many datasets do not come with their dependencies explicitly declared, which motivates dependency discovery.
Unique column combination (UCC): a set of attributes that contains no duplicates in a relational instance. Every UCC indicates a syntactically valid key.
Functional dependencies: written as X -> A; they assert that all pairs of records with the same values in the attribute combination X must also have the same values in attribute A
Inclusion dependency: an inclusion dependency Ri[X] ⊆ Rj[Y] over the relational schemata Ri and Rj states that all values in X occur in Y

13.1.2.3 Relaxed dependencies
Relaxing the extent of a dependency
Partial dependency
– it may hold on some subset of tuples but be violated by others
– a partial dependency holds if the error is lower than a threshold
– useful if data are expected to contain errors
Conditional dependency (CFD)
– is a partial dependency that explicitly specifies conditions that restrict its scope
– conditions are typically sets of pattern tuples that summarize the satisfying tuples (called "tableau")
– to assess how often a CFD holds, confidence has been defined as the minimum number of tuples that must be removed to make the CFD hold
Relaxing attribute comparisons
– two values may satisfy a dependency if they are similar but not necessarily equal
– Metric dependencies: relax the comparison method to tolerate formatting differences
– Neighborhood dependencies: a closeness function is defined for the two attributes in the rule
– Differential dependencies: use a differential function for each attribute in the form of a constraint
– Matching dependencies
Order dependency: generalized to sequential dependencies.
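The following sketch (illustrative only; the column values are made up) computes the single-column metadata mentioned above: null count, distinct values, constancy, and extremes.

from collections import Counter

def profile_column(values):
    """Return basic single-column profiling metadata for a list of values (None = null)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    most_common = counts.most_common(1)[0][1] if counts else 0
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),            # completeness indicator
        "distinct": len(counts),                          # uniqueness indicator
        "constancy": most_common / len(values) if values else 0.0,
        "min": min(non_null) if non_null else None,       # extremes help spot outliers
        "max": max(non_null) if non_null else None,
    }

ages = [34, 35, 34, None, 34, 120, 34]
print(profile_column(ages))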
13.1.3 Next generation profiling
With multiple sources we will have three different overlapping problems: schematic, data and topical overlap.
Schematic overlap:
– Schema matching: automatically determine cross-schema value correspondences between attributes
– Cross-schema dependencies
Data overlap
– Detect multiple (different) representations of the same real-world entity
– Profiling can help intra-source (cleanliness) and inter-source (data fit)
Topic overlap
– What is a dataset about?
– Topical clustering for source selection

13.2 Data cleaning
It is a data-based approach.
Definition: data cleaning is the process of identifying and eliminating inconsistencies, discrepancies and errors in data, in order to improve quality.
Steps of data cleaning: Standardization/Normalization -> Error correction -> Duplicate detection

13.2.1 Data cleaning - Tasks

13.2.1.1 Data transformation and normalization
Data type conversion: varchar -> int
Normalization: mapping data into a common format (01-Mar-25 -> 01/03/25)
Discretization of numerical values
Domain-specific transformations (St. -> Street)
We have two types of tasks:
Syntactic data transformations: aim to transform a table from one syntactic format to another. No requirement of external knowledge or reference data. We have three major components/dimensions:
– Language:
∗ The transformation language limits the space of possible transformations
∗ Defines the set of operations allowed
∗ A language needs to be expressive enough
– Authoring:
∗ tools must allow an easy and effective authoring of transformation programs
∗ three different interaction models:
· Declarative transformation: the user specifies the transformation directly
· Transformation by example: the user gives a few input-output examples; the tool then uses them to automatically transform the data into the format it finds most plausible
· Proactive transformation: automatically suggest potential transformations without requiring user input. Some tools suggest the appropriate transformation by considering a suitability score
– Execution: tools usually offer assistance to users in selecting the correct transformation
Semantic data transformations: involve understanding the meaning/semantics or the typical use of data. They usually require external sources.
A minimal syntactic normalization sketch follows.
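As a small illustration of syntactic transformations (the mappings below are assumptions, not rules from the notes), this sketch normalizes a date format and expands a domain-specific abbreviation.

from datetime import datetime

ABBREVIATIONS = {"St.": "Street", "Ave.": "Avenue"}   # assumed domain-specific mapping

def normalize_date(value):
    """Map '01-Mar-25' style dates to the common 'DD/MM/YY' format."""
    return datetime.strptime(value, "%d-%b-%y").strftime("%d/%m/%y")

def normalize_address(value):
    """Expand known abbreviations token by token."""
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in value.split())

print(normalize_date("01-Mar-25"))          # -> 01/03/25
print(normalize_address("22 Main St."))     # -> 22 Main Street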
13.2.1.2 Localize and correct inconsistencies
Data must verify a set of properties (edits)
When such rules are collected, it is crucial that they are proven to be consistent and non-redundant
Once we have a valid set of edits we can perform error localization
The activity of using edits to correct data is called error correction/imputation; its main goals are:
– All the edits should be satisfied by changing the fewest fields possible (minimum change principle)
– When imputation is necessary, it is desirable to maintain the marginal and joint frequency distributions of values

13.2.1.3 Missing values
Missing information occurs on different levels
– Instance level: values, tuples, relation fragments, ...
– Schema level: attributes
Detecting missing values:
– Basic analysis: # of null values, # of duplicates, mean, frequency. Another basic analysis is to compare values with the expected ones
– Detection by analyzing the data distribution
Truncated and censored data
– Truncation: sample data are drawn from a subset
– Censoring: values in a certain range are all transformed into a single value
– Both can be detected with histograms and frequency distributions
Handling of missing values
– Drop
– Imputing
Deterministic imputation: imputes a missing value by using logical relations between variables
Mean imputation
Regression imputation: replace missing values with values predicted by a regression
Hot deck imputation: take a value from a random similar record
(A mean and hot-deck imputation sketch follows.)
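The sketch below (illustrative; toy data and field names) shows mean imputation for a numeric column and a simple hot-deck imputation that copies the value from a random record sharing the same group.

import random

records = [
    {"city": "Milano", "income": 30000},
    {"city": "Milano", "income": None},
    {"city": "Roma",   "income": 25000},
    {"city": "Roma",   "income": 27000},
]

def mean_impute(rows, field):
    """Replace missing values of `field` with the column mean."""
    observed = [r[field] for r in rows if r[field] is not None]
    mean = sum(observed) / len(observed)
    for r in rows:
        if r[field] is None:
            r[field] = mean

def hot_deck_impute(rows, field, group):
    """Replace missing values of `field` with the value of a random donor in the same group."""
    for r in rows:
        if r[field] is None:
            donors = [d[field] for d in rows if d[group] == r[group] and d[field] is not None]
            if donors:
                r[field] = random.choice(donors)

hot_deck_impute(records, "income", "city")   # donor-based: the Milano record gets 30000
mean_impute(records, "income")               # fallback for rows whose group has no donor
print(records)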
13.2.1.4 Outlier detection
Definition: a value unusually larger or smaller in relation to the other values in a set
The assumptions and limitations of each method are essential for choosing the right tool for a given application domain
Outlier techniques lose their effectiveness when the number of dimensions of a dataset is large
Possible origins of an outlier:
– incorrectly observed, recorded or entered in the dataset
– it comes from a different population
– it is correct, but represents a rare event
Detection methods:
– Control charts
– Distributional outliers
– Time series outliers
Analysis: after an outlier is identified, we have to decide whether it represents an abnormal but legitimate behavior or a data glitch
Statistic-based detection methods:
– Assumption: normal data points appear in high-probability regions, outliers appear in low-probability ones
– Hypothesis-testing methods: they calculate a test statistic based on the observed data points
– Fitting-distribution methods: they aim to fit a probability density function based on the observed data (low probability -> outliers)
– Pros:
∗ Statistical techniques provide a score, rather than making a binary 0/1 decision
∗ statistical techniques do not need training data to be labeled
– Cons:
∗ The assumption of an underlying distribution does not hold for high-dimensionality datasets
∗ Even when the assumption holds, fitting it is not a straightforward task
Distance-based detection methods:
– Assumption: a normal data point should be close to the others
– Global distance: they consider the distance between the selected data point and all the others
– Local distance: the distance is evaluated between a data point and its neighboring points
– Pros:
∗ There is no need to make any assumption regarding the data distribution
∗ The technique is adaptable to different data types
– Cons:
∗ Fails if the outliers have enough close data points
∗ High computational complexity
∗ The performance depends on the distance measure, which is hard to find
Model-based detection methods:
– Assumptions:
∗ A classifier can be trained to distinguish normal and anomalous data points
∗ Learn a model from a set of labeled data
∗ Can be multi-class or single-class
– Pros:
∗ Powerful algorithms that can distinguish different classes
∗ They have a fast testing phase
– Cons:
∗ They need to rely on the availability of accurate labels

13.2.1.5 Duplicate detection
Definition: duplicate detection is the discovery of multiple representations of the same real-world object
Duplicate detection techniques are also called
– Record linkage
– Object identification
– Entity resolution
– Merge/purge
The process:
– Preprocessing:
∗ Standardization
∗ Replacement of alternative spellings
∗ Conversion of upper/lower cases
∗ Schema reconciliation
– Search space reduction:
∗ Blocking: the file is partitioned in blocks, limiting the comparisons to records in the same block
∗ Sorted neighborhood: sorting the file and then moving a window of fixed size over it, comparing only records in the window
∗ Pruning: remove from the search space all records that cannot match each other, without comparing them
– Distance-based comparison functions:
∗ String-based: edit distance, soundex code, Jaro
∗ Item-based: Jaccard distance, TF
– Comparison and decision:
∗ Probabilistic techniques
∗ Empirical techniques
∗ Knowledge-based techniques

13.2.1.6 String-based distance functions
Edit distance/Levenshtein distance:
– Minimum number of edits needed to get from one word to another
– We have 3 edit operations: substitution, deletion and insertion
– Similarity score: 1 - editdist(m, n) / max(|m|, |n|)
Jaro/Jaro-Winkler:
– c = number of common characters between the two strings (limited to half of the length of the longer string)
– t = number of transpositions of common characters out of order
– S = (1/3) * (c/|m| + c/|n| + (c - t)/c)
American soundex:
– Defined for problems with names that can be spelt in different ways
– Phonetically oriented algorithm

13.2.1.7 Data Deduplication
Probabilistic techniques: Fellegi and Sunter theory
– Supervised technique that needs a training dataset for labeling records as matching or not
– We have 2 or 3 outcome sets: M is the matching set, U is the mismatching set and P is the set of possible matches
– The model is based on the m- (matching) probability and the u- (mismatching) probability
Empirical techniques:
– Sorted neighborhood (SNM):
∗ Create a key for each record by extracting relevant fields
∗ Sort records using this key
∗ Merge data by using a fixed-size window: this limits the number of comparisons needed
∗ Problems:
· We have to define the optimal size in order to maximize the accuracy and minimize the cost
· We need to select the correct key
– Multi-pass approach:
∗ Several runs of the SNM method with different keys and very small windows
∗ Each run produces a set of possible matches
∗ The result is the union of all pairs found in the independent runs
Knowledge-based techniques:
– Choice Maker:
∗ Based on rules called clues
∗ Clues can be either dependent on or independent of the domain
∗ Clues can be used offline to exploit the importance of the various clues in order to produce as many examples as possible
∗ Clues can be used at runtime to compute the matching probability
– Intelliclean:
∗ exploits rules as an evolution of the previously proposed distance functions
∗ Duplicate identification rules: specify conditions according to which two tuples can be classified as duplicates
∗ Merge-purge rules: specify how duplicate records are to be handled
∗ A certainty factor (CF) is applied to each duplicate identification rule. The aggregation of all these factors represents how well we have merged. It is also possible to define a threshold to reject poor-quality merges.
A sorted-neighborhood sketch follows.
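A minimal sorted-neighborhood sketch (illustrative; the key construction and the similarity threshold are assumptions) follows: build a key, sort by it, and compare only records inside a sliding window.

from difflib import SequenceMatcher

people = [
    {"id": 1, "name": "John Smith",  "city": "Milano"},
    {"id": 2, "name": "Jon Smith",   "city": "Milano"},
    {"id": 3, "name": "Mary Rossi",  "city": "Roma"},
    {"id": 4, "name": "Maria Rossi", "city": "Roma"},
]

def key(rec):
    # Assumed blocking key: first 3 letters of the surname plus the city.
    return rec["name"].split()[-1][:3].lower() + rec["city"].lower()

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold

def sorted_neighborhood(records, window=2):
    """Return candidate duplicate pairs found inside the sliding window."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if similar(rec, other):
                pairs.append((rec["id"], other["id"]))
    return pairs

print(sorted_neighborhood(people))   # candidate pairs: (3, 4) and (1, 2)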
13.2.2 Conflict handling strategies
Conflict ignorance:
– Ignore conflicts between multiple records that refer to the same entity
– We have two options:
∗ Escalate
∗ Consider all possible resolutions: build all possible solutions while annotating them with their respective likelihood, or use a probabilistic database
Conflict avoidance: a simple rule takes a unique decision based on the data instance or on metadata
Conflict resolution: solve the conflicts by picking the value from the already present values (deciding) or by choosing a value that does not necessarily exist among the present values (mediating); a small deciding sketch appears at the end of this subsection. We have advanced techniques for conflict resolution:
– Source accuracy: probability that a value is a true value
– Source dependency: sources are not independent (a source may be a copy of another)
– Source freshness: how quickly a value change is captured by a data source
Human intervention is sometimes necessary.

13.2.2.1 Data fusion answers
Result of a query to an integrated system:
Complete
Concise
Consistent
Complete and Consistent answers additionally fulfill a key constraint on some real-world ID

13.2.2.2 Rule-based data cleaning
A set of data quality rules has been specified for a schema R by an expert or by an automatic rule-discovery tool. The process is done in two steps:
1. (a) Violation detection:
– Violation points state that some values are correct while others are not; it is difficult to recognize them exactly
– Sometimes rules detect violations after the data have been processed. With error propagation techniques, violations should be traced back to identify errors and their causes
– There is a high risk of high computational costs
(b) Holistic error detection:
– Pinpoint which values are more likely to be wrong by compiling all violations
– A value involved in multiple violations is more likely to be wrong
2. Error repair: the process of finding another database instance that conforms to the rules.
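The following sketch (illustrative; the source accuracies are made-up numbers) resolves a value conflict by deciding: each candidate value gets the summed accuracy of the sources that provide it, and the highest-scoring value wins.

# Deciding strategy: pick the existing value backed by the most accurate sources.
claims = {                      # value claimed for one data item, per source
    "S1": "Rome",
    "S2": "Rome",
    "S3": "Milan",
}
accuracy = {"S1": 0.6, "S2": 0.7, "S3": 0.9}    # assumed source accuracies

scores = {}
for source, value in claims.items():
    scores[value] = scores.get(value, 0.0) + accuracy[source]

resolved = max(scores, key=scores.get)
print(scores)       # {'Rome': 1.3, 'Milan': 0.9}
print(resolved)     # 'Rome'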
14 Data Quality for Machine Learning

14.1 Phases and tasks

14.2 Data preparations
Raw data cannot be used directly; they must be prepared to be used as training/test data or input data.
Machine learning algorithms require data to be numerical
Linear or nonlinear relationships can be extracted

14.2.1 Data transformation tasks
Data conversion: ignore ID-like fields, as they are unique for every record
Discretization:
– convert a numeric variable to an ordinal value
– divide the range of continuous attributes into intervals
Normalization: prevents attributes with large ranges from outweighing attributes with small ranges
Missing values:
– Always present
– The modeler should be conscious of how missing values are automatically replaced by the integrated tools
– We can handle them in several ways

14.2.2 Basic cleaning operations
Identify and remove column variables that only have a single value
Identify and consider column variables with few unique values (we cannot assume they are useless)
Identify and remove rows that contain duplicates
Outlier detection in high-dimensional data
– The only solution is dimensionality reduction; many traditional techniques lose their effectiveness when the number of dimensions is too high
– Techniques that use all available dimensions to derive new feature spaces are random projections and Principal Component Analysis (PCA)
– Techniques that discover outliers using a subset of dimensions are the subspace outlier detection techniques and the contextual outlier detection techniques
Unbalanced datasets
– Classes have unequal frequency
– Methods to deal with it: resample, collect more data from the minority class, change algorithm

15 Machine learning and Data quality
Data preparation is very important. Formalized and fine-grained annotation of data is still considered costly to produce; consequently, a significant amount of workflow processing still deals with metadata wrangling, format transformations and identifier mapping. Obtaining high-quality data is difficult. Data augmentation is used in order to create synthetic data.

16 Machine learning for Data quality

16.1 Data imputation with KNN
Use a model to predict missing values
A new sample is imputed by finding the samples in the training set closest to it
Configuration of this method often involves selecting the distance measure and the number of contributing neighbors
(A small KNN-imputation sketch is given at the end of this section.)

16.2 Active learning in data deduplication
Train a binary classifier to predict duplicates
Needs a large training set
Solicit user feedback for unlabeled record pairs -> high utility in the training process

16.3 Deep learning in data deduplication
Attribute embedding
Attribute similarity representation
Classification
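The following numpy sketch (illustrative, not the exact method used in the course; k and the toy matrix are assumptions) imputes a missing entry by averaging the corresponding feature of the k nearest complete samples, using Euclidean distance on the observed features.

import numpy as np

def knn_impute(X, k=2):
    """Impute NaNs column-wise from the k nearest rows without missing values."""
    X = X.astype(float)
    complete = X[~np.isnan(X).any(axis=1)]            # donor rows
    for row in X:
        missing = np.isnan(row)
        if not missing.any():
            continue
        # Distance computed on the observed features only.
        dists = np.linalg.norm(complete[:, ~missing] - row[~missing], axis=1)
        nearest = complete[np.argsort(dists)[:k]]
        row[missing] = nearest[:, missing].mean(axis=0)
    return X

X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, np.nan],
    [8.0, 9.0, 10.0],
    [1.2, 1.9, 3.2],
])
print(knn_impute(X, k=2))   # the NaN is filled with the mean third value of the two closest rows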
17 Process based data improvement
A Business Process (BP) is a sequence of activities carried out starting from an input and aimed at the realization of an output
A business process focuses on activities and on the exchange of information they generate
BPs facilitate collaboration between people and other corporate resources
Poor data quality negatively affects the efficiency and effectiveness of business processes:
Wrong outputs
Different courses of action
Wrong analyses
Failures
Delays and timeouts

17.1 BP and Data
Each activity depends not only on the data sources but also on the outcome of previous activities
Quality cannot be evaluated independently for each activity in the process
Typical causes:
– Input data: wrong, missing, access to external sources, received messages
– Work-arounds
– Temporal aspects: untimely information, delayed recording of information

17.2 Modeling data quality in BP
Information may be treated as a product (IP) and the steps involved in creating it as a set of manufacturing processes
The IP manager visualizes the most important phases in the manufacturing of an IP and identifies the critical phases that affect its quality
Data quality blocks are used to represent the checks for data quality on those data items that are essential in producing a defect-free IP. A block is associated with a list of quality checks

17.3 Data quality blocks
Methodology to suggest to the process designer where to insert blocks
Local checks: parallel for reading, sequential for updating
Preliminary check: only one quality block before the process. Execution is delayed if low data quality is detected
Parallel check: only one block in parallel to everything. No delay, but errors are notified only at the end of the process

17.4 DQ costs
Poor data quality costs a lot:
Process failure costs
Information scrap and rework costs
Lost and missed opportunity costs

17.5 Data stream
A stream is an infinite sequence of elements
Goal: compute a function of a stream
Characteristics:
– Data arrive continuously and we should keep up with the data rate to prevent loss of information
– Streaming data evolve over time and their value decreases with the passage of time. Recent streaming data are sufficient for many applications
– Noisy, corrupted

17.6 Sensors' Data
Their quality is restricted by limited precision and sensor failures
To meet resource constraints, data stream processing introduces additional noise and reduces quality
We can handle this by using an optimistic approach that relies on the sensors, or a data quality aware strategy

17.6.1 Data quality aware strategy
We have a list of quality dimensions for data streams:
Accuracy
Confidence
Completeness
Timeliness
Data volume: amount of raw data used to compute the results of a data stream subquery
We also have a list of operators:
Data generator operators
– Data items are inserted in the stream based on existing sensors' data
– They increase completeness and affect accuracy, confidence and data volume
Data reducing operators
– Selection: does not impact accuracy, completeness and volume; it only impacts confidence
– Sampling: the accuracy of the estimated value depends on the size of the sample
Data merging operators
– Aggregation: compresses incoming data into one output value. The DQ dimensions of the output are calculated as the average of the dimensions of the incoming tuples (a small sketch follows this list)
Data modifying operators
– No effect on data volume nor completeness
– Join operators:
∗ Join of synchronic streams: a timestamp-based join has no impact on the DQ information
∗ Timestamp join of asynchronic streams: no impact on the DQ information
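As an illustration (the tuple structure and window content are assumptions), this sketch aggregates a window of sensor tuples into one output value and attaches DQ metadata computed as the average of the incoming tuples' dimensions.

# Aggregate a window of (value, dq) tuples; output DQ = average of the input DQ dimensions.
window = [
    {"value": 21.0, "dq": {"accuracy": 0.9, "confidence": 0.8, "completeness": 1.0}},
    {"value": 22.5, "dq": {"accuracy": 0.7, "confidence": 0.9, "completeness": 1.0}},
    {"value": 20.5, "dq": {"accuracy": 0.8, "confidence": 0.7, "completeness": 0.5}},
]

def aggregate(window):
    n = len(window)
    out_value = sum(t["value"] for t in window) / n          # the aggregation itself (mean)
    dims = window[0]["dq"].keys()
    out_dq = {d: sum(t["dq"][d] for t in window) / n for d in dims}
    return {"value": out_value, "dq": out_dq}

print(aggregate(window))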
17.6.1.1 Window size definition
Small windows result in highly granular quality information; disadvantage: higher data overhead
The wider the window, the lower the overhead but the coarser the quality information
We try to apply an approach based on interestingness

17.6.1.2 Interestingness
Definition: it is not a data quality dimension, but it characterizes the data stream in the context of a specific application scenario
For stream parts of high interest, wider DQ windows are sufficient
We can also adapt the window size during processing, based on different interestingness thresholds

17.6.1.3 Technological issues with nodes in Wireless Sensor Networks (WSN)
Memory
– Small data buffers
– Small algorithm footprint
Power
– Limited battery life
– Transmission is the most power-consuming function
Preserve data quality

18 Big data
Definition: big data is a collection of data that needs to be continuously analyzed to extract business value, but these analyses raise technical challenges and involve a technological and organizational change
Types of data sources:
– Human-sourced information (social networks)
– Process-mediated sources (institutions and the private sector)
– Machine-generated sources (sensors and computer systems)
Types of big data:
– Conversation text data
– Photo and video
– Audio files
– Sensor data
– IoT data
– Web customer data
– Traditional customer data

18.1 Four V of big data
Volume: how to manage it?
– NoSQL databases
– MapReduce
Variety:
– Data come from different sources
– Managing heterogeneous information requires integration
Velocity: these are not data to query on demand but dynamic data to analyze in real time
Veracity:
– Data are useful only if they can be trusted
– If data are not reliable, the analysis is not reliable
– Veracity divides into how accurate or truthful a data set may be and how trustworthy the data source, type and processing are

19 Data Quality and big data
We should check if, or how much, a source is accurate, up to date and trustworthy
Big data tried to overcome data quality with quantity
High-quality data are the precondition for guaranteeing the quality of the results of big data analysis
Challenges:
– Variety: complex data structures and difficult data integration
– Volume: profiling and assessment are difficult to execute in a reasonable amount of time
– Velocity: data are updated continuously and timeliness is very short. If we do not analyze the data quickly enough, information becomes stale and invalid
– Veracity: DQ standards have been proposed for traditional sources, not for Big Data. Big data contain errors and noise. Often the context of data is known only by the source, which does not share this information.
20 Data quality improvement limitations
Source-dependent limitations:
– Resource constraints: computational, communication and memory limitations
– Source heterogeneity: differences in structure lead to unusual behavior
– Scalability: data and algorithms have to be distributed
Inherent limitations:
– Variety of arrival rates
– Infinite data: evaluation must be done online without interruptions
– Transient data: data expire and lose credibility
– Distributed data points: integrate and extract correlations on the collected data

21 Big Data integration
Schema alignment:
– Mediated schema: the set of schema terms on which queries are posed
– Probabilistic mediated schema: a set of mediated schemas, each with a probability indicating the likelihood that the schema correctly describes the domain of the sources
– Probabilistic schema mapping: captures the uncertainty on mappings between schemas
Record linkage:
– It provides satisfying answers when data are traditional records. They need to be well-structured information with clearly identified metadata. It works best when the database we are taking data from is a relational database.
– Because of this we are passing from record linkage to object linkage, which ranges from images to completely unstructured data
– Object linkage must face the challenges of huge size and time variance of data, on top of poor schema information
– Blocking techniques are necessary to reduce the number of comparisons

22 MAPREDUCE
A technique for data deduplication; it is the combination of blocking and parallel processing
Relies on data partitioning and distribution
4 phases:
1. Map: mapper nodes extract <key, value> pairs by analyzing the data blocks for which they are responsible
2. Shuffle: intermediate records are moved from mappers to reducers
3. Reduce: reducers receive the pre-elaborated data and combine them to compute the final output
4. Sort: data are ordered before the presentation of the results
Disadvantages:
– Disjoint data partitioning: the map output is partitioned on the basis of its key value, which is not suitable for sliding-window approaches
– Load balancing: partitions can have different sizes depending on the key values
– Memory bottlenecks: entities in the same block are passed to a single reduce call. The reduce task implies that data are processed row by row and all the entities in the same block are compared to each other.
A minimal map/reduce-style blocking sketch follows.
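The sketch below (a single-process simulation, not a distributed job; the blocking key and similarity threshold are assumptions) mimics the map and reduce phases for deduplication: map emits (blocking key, record) pairs, shuffle groups them by key, and reduce compares records only within the same block.

from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "John Smith"},
    {"id": 2, "name": "Jon Smith"},
    {"id": 3, "name": "Mary Rossi"},
]

def map_phase(record):
    # Map: emit a (blocking key, record) pair; assumed key = initial of the surname.
    yield record["name"].split()[-1][0].lower(), record

def shuffle(mapped):
    # Shuffle: group the intermediate pairs by key, as the framework would.
    groups = defaultdict(list)
    for key, rec in mapped:
        groups[key].append(rec)
    return groups

def reduce_phase(group):
    # Reduce: compare only the entities that ended up in the same block.
    for a, b in combinations(group, 2):
        if SequenceMatcher(None, a["name"], b["name"]).ratio() >= 0.9:
            yield (a["id"], b["id"])

mapped = [pair for rec in records for pair in map_phase(rec)]
for key, group in shuffle(mapped).items():
    print(key, list(reduce_phase(group)))   # duplicates found per block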
23 Provenance
Definition: provenance is a record of metadata that describes the entities and activities involved in producing, delivering or otherwise influencing a given object
Main usages:
– Understanding where data come from
– Ownership and rights over a resource
– Making judgments about a resource to determine whether to trust it or not
– Verifying that the process used to obtain a result complies with requirements
Why do we need provenance?
– Data come from various diverse sources
– Quality is varying
– Scopes are different
Three types of provenance:
1. Agent-centered: what people or organizations were involved in generating or manipulating a source
2. Object-centered: tracing the origins of an entity or of portions of it
3. Process-centered: capture the activities and steps taken to generate a resource

23.1 Provenance and Data Quality
Important quality dimensions for provenance are:
Data trustworthiness
– Authenticity
– Reliability
Dimensions of believability
– Trustworthiness of the source, which also involves related artifacts and actors
– Reasonableness of data
∗ Possibility: extent to which a data value is possible
∗ Consistency

24 Truth discovery
Three different approaches:
Content-based
Source reputation-based
Evidence-based
In any case, the first step is information extraction, which consists of three phases:
1. Text segmentation: aims to divide a text into linguistically meaningful units, using tokenization (the practice of identifying tokens so that matches occur despite differences in the character sequences of the words; it includes stemming and lemmatization). The first step performed is language identification, then tokenization and then POS tagging (the practice of marking each word in a text with a label corresponding to its part of speech in its grammatical context).
2. Normalization: aims to reduce linguistic noise and name variance in order to eliminate multiple representations of the same identity. Two steps:
(a) Identification of orthographic errors
(b) Correction of errors and transformation of abbreviations
3. Semantic analysis:
Named entity recognition: identifies and classifies some types of information elements. Given concrete types of semantics, the goal is to locate the elements in the text that fit the semantics
Mention detection: language-dependent step of marking potential co-references in a text. Followed by co-reference resolution, which links the detected mentions into groups referring to the same entity.

24.1 Truth computation methods
Simple majority voting: choose the value provided by the most sources
Agreement-based methods: two groups of methods:
– The first group refers to approaches from web link analysis and trust metrics. They generally consist in computing the relative importance of a source
– The second group relies on iterative voting algorithms: they iteratively compute source trustworthiness as a function of the values' confidence, and the confidence score of a value as a function of its sources' trustworthiness
Maximum a posteriori (MAP) estimation-based methods: these approaches differ from agreement-based ones mainly in the modeling of the source's trustworthiness. MAP estimation captures it as a latent variable to estimate
Bayesian inference-based methods: DEPEN is the first Bayesian truth detection model that takes into consideration the copying relationships between sources. It is based on the intuition that sharing the same errors is unlikely if the sources are independent. DEPEN penalizes the vote count of a source if it is detected to be a copy of another one.

25 Data fusion
Data fusion is necessary in the big data area, but also in reliable contexts
Types of conflicts:
– Mistyping
– Incorrect calculations
– Out-of-date information
– Copying behaviors
– Inconsistent interpretations of semantics
Definition: given a set of data items D and a set of sources S, each source providing values for a subset of the data items in D, data fusion decides the true value for each data item in D. (A small voting-based sketch follows.)
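The following sketch (illustrative; the claims are toy data) implements the iterative voting idea from Section 24.1: a value's confidence is derived from the trustworthiness of the sources claiming it, and a source's trustworthiness is the average confidence of the values it provides.

# Iterative voting: alternate between value confidence and source trustworthiness.
claims = {                                  # source -> {data item: claimed value}
    "S1": {"capital": "Rome",  "currency": "EUR"},
    "S2": {"capital": "Rome",  "currency": "EUR"},
    "S3": {"capital": "Milan", "currency": "EUR"},
}

trust = {s: 0.5 for s in claims}            # uniform initial trustworthiness

for _ in range(10):                         # a few iterations are enough on toy data
    # Value confidence = sum of the trust of the supporting sources, normalized per item.
    confidence = {}
    for source, items in claims.items():
        for item, value in items.items():
            confidence[(item, value)] = confidence.get((item, value), 0.0) + trust[source]
    top = {item: max(v for (i, _), v in confidence.items() if i == item)
           for item in {i for i, _ in confidence}}
    confidence = {k: v / top[k[0]] for k, v in confidence.items()}
    # Source trustworthiness = average confidence of the values it claims.
    trust = {s: sum(confidence[(i, v)] for i, v in items.items()) / len(items)
             for s, items in claims.items()}

truths = {}
for (item, value), conf in confidence.items():
    if conf > truths.get(item, (None, -1.0))[1]:
        truths[item] = (value, conf)
print({item: value for item, (value, _) in truths.items()})   # {'capital': 'Rome', 'currency': 'EUR'}
print(trust)                                                   # S3 ends up less trusted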
Categories of data fusion algorithms:
– Rule-based: early approaches to data fusion, such as majority voting
– Web-link based: inspired by measures of web page authority
– Information-retrieval based: inspired by similarity measures
– Bayesian based: based on Bayesian inference

25.1 Iterative Bayesian algorithm: Accu
With this algorithm we can express the probability P(v is true for item i | values provided for i) in terms of the source accuracy A(s)
An estimate of A(s) is computed as the average truth probability of the values provided by s
Limitations:
– P(v' is true for i) is assumed uniform
– Sources are assumed independent
– Data extraction might introduce noise
– Only one true value for each i

25.2 Data fusion and truth
We can have two classes of problems:
1. Single truth:
Only one true value for each data object
Different values oppose each other
Values and sources are either correct or not
2. Multi-truth:
The truth is composed of a set of values
Different values can provide a partial truth
Incomplete correct claims are not wrong claims

25.3 Data fusion-STORM
We define as authoritative the sources that have been copied by many sources
Phases:
1. Values provided by the sources for the objects
2. Value clustering, including cleaning and reconciliation
3. Value clusters provided by the sources for the objects
4. Authority-based Bayesian inference
5. Sets of true values for the objects
Intuition: when source administrators decide to copy data, they choose the sources that they perceive as most trustworthy
Assumptions:
– No mutual copying within the same domain
– For each pair of sources, either they are independent or one is a copier

25.4 Data fusion-word embeddings
Words with similar semantic meanings tend to have vectors that are close together
Vector differences between words in embeddings have been shown to represent relationships between words
(A small cosine-similarity sketch follows.)
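As a closing illustration (the 3-dimensional vectors are made up; real embeddings have hundreds of dimensions), cosine similarity is the usual way to check that words with similar meanings have close vectors.

import numpy as np

# Toy word vectors; in practice these come from a trained embedding model.
vectors = {
    "street": np.array([0.9, 0.1, 0.3]),
    "road":   np.array([0.8, 0.2, 0.35]),
    "banana": np.array([0.1, 0.9, 0.7]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cosine(vectors["street"], vectors["road"]), 3))    # close to 1: similar meaning
print(round(cosine(vectors["street"], vectors["banana"]), 3))  # much lower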