Data Quality (A Database Approach) PDF

Document Details

BrighterPersonification6672

Uploaded by BrighterPersonification6672

University of Malta

Prof. Joseph Vella

Tags

data quality database systems computer information systems data management

Summary

This document is a presentation or lecture notes on data quality from a database approach. It covers various aspects of data quality, including definitions, characteristics, and practical examples. The document also analyses data quality issues and solutions for improving income, efficiency, and image of organizations.

Full Transcript

Data Quality (A database approach) DATA QUALITY (IN COMPUTER INFORMATION SYSTEMS) PROF. JOSEPH G VELLA, CIS, FICT, UM 1 DATA QUALITY...

Data Quality (A database approach) DATA QUALITY (IN COMPUTER INFORMATION SYSTEMS) PROF. JOSEPH G VELLA, CIS, FICT, UM 1 DATA QUALITY  An airline’s loyal customer was charged US$ 674,000 for converting his 674,000 air miles to a holiday flight!  The Enron scandal discovery was caused by erroneous data (including intentionally misleading data) being provided to share holders.  The loss of the Mars Climate Orbiter (1999) due to misreading measure units – NASA specified SI to all, but Lockheed used Imperial units.  Recent reports/insights by Gartner & IBM specifically indicate:  “The average financial impact of poor data quality on organizations is US$9,700,00 per year.”  In the US alone, businesses lose $3.1 trillion annually due to poor data quality. It is well known that data quality issues leads to a decrease in income, efficiency to do business, and image. The idiom “rubbish in, rubbish out” is a popular expression. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 2 Prof. Joseph Vella ©, [email protected] 1 Data Quality (A database approach) REALLY PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 3 LESSON PLAN o CONTEXT AND BASIC DEFINITIONS o ENFORCING DATA QUALITY THROUGH RULES o MAPPING VALUES TO AN ACCEPTABLE CLASSIFICATION o CAN WE “SUMMARISE” VOLUMINOUS DATA AND YET EXTRACT GOOD QUALITY RESULTS? LESSON RESOURCES ON VLE ✓ FOLLOW UP READING (ONE OR TWO OF REFERENCES LISTED) ✓ ENSURE YOU RE-WRITE QUERIES AND CODE AS INSTRUCTED PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 4 Prof. Joseph Vella ©, [email protected] 2 Data Quality (A database approach) DATA QUALITY – DEFINITIONS AND CHARACTERISTICS We say our data has a high quality if it is "fit for [its] intended uses in operations, decision making and planning“ WIKIPEDIA.ORG/DATA_QUALITY citing the work of TC Redman “The Field Guide” (2008)  Characteristics of data quality include:  Completeness of data instance;  Conformity within the attribute’s range;  Consistency between an instance attributes;  Referential integrity between records;  Accuracy of record when compared to current reality.  Origin or onset of bad data includes:  Misreads and Missing data;  Wrong import (movement of data from one application to another);  Application errors (in design and runt time checks). PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 5 WHAT CAN WE DO? (ESPECIALLY IN DATA MANAGEMENT)  Specification of structures through rules is useful  Structural rules through constraints  Primary key, referential, not null, check ! Don’t go overboard with these during design  Transitional rules (mapping a state into the next)  Constraints and assertions at start and end of a transaction d !! Hairy stuff ve ol  Subject/domain-oriented rules and heuristics: nv  Statistical methods for sampling and searching for outliers; ei or  Run local procedures/function with known results (i.e. if results are not what is expected we have an issue!). M PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 6 Prof. Joseph Vella ©, [email protected] 3 Data Quality (A database approach) ENFORCING DATA QUALITY – WITH RULES  Through structural rules specification:  Primary keys  All pairs of values/records must have distinct values.  Referential constraints  Every reference must be satisfied by one, and only one, instance.  Check constraints  Discount on a product could not be greater than its selling price  Evaluation of these rules  Denial queries – there is no data that breaks a specified rule!  “Clever” implementations are required as denial queries computation can become killer queries! PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 7 ENFORCING DATA QUALITY – WITH RULES – PRIMARY KEY  This is a very simple table of data, called dept, with only 4 rows of data. Attribute deptno is the primary key.  Therefore, any addition or amendment of rows needs to respect this primary key constraint. Integrity Constraint ∀𝒙∀𝐲∀𝒛∀𝒙′∀𝒚′∀𝒛′ 𝒅𝒆𝒑𝒕 𝒙, 𝒚, 𝒛 ∧ 𝒅𝒆𝒑𝒕 𝒙′ , 𝒚′ , 𝒛′ ∧ 𝒙 = 𝒙′ ՜ 𝒚 = 𝒚′ ∧ 𝒛 = 𝒛′ Denial query in SQL This is expensive! If dept table size is n blocks then (n+(n*n)) blocks are read. Basically, we are comparing each combination. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 8 Prof. Joseph Vella ©, [email protected] 4 Data Quality (A database approach) ENFORCING DATA QUALITY EFFICIENTLY  OPTIMISATION – AS IN LEAST WORK POSSIBLE  What to optimise for? Reads of blocks from disk.  Use of deltas/incremental evaluation: Rather than compare all when a change occurs: Start with the premise that the current state is compliant with constraint and check the changed record only. That is, does the new record primary key, i.e. deptno value, already occurs in dept table? Optimised Denial query in This is much less expensive! SQL If deptno size is n blocks then (n) blocks are read. Basically we are searching all records. Actually, with a B-tree index, with 10 million records in dept we need only to read 5 blocks to confirm or otherwise. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 9 TO DO ON YOUR OWN: o For referential, unique, not null and check (as per example indicated): o Specify these constraints; o Write the denial query (in SQL) of each of these; o How does it perform? Can we reduce its cots of upkeep? PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 10 Prof. Joseph Vella ©, [email protected] 5 Data Quality (A database approach) ENFORCING DATA QUALITY – WITH SHAKY RULES  Colours; what could be a nicer gift?  Agreeing on, or recalling a colour name, is another matter!?  We can go with synonyms, generalisations, but no ambiguities. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 11 GET THE DATA  We can address some of these issues.  Let us adopt the RGB colour model (there others of course);  RGB (Red Green Blue) is a vector of three integers with a range of 0 to 255.  Many applications offer a colour picker (even this PowerPoint) where you point on a screen pixel and it gives you its RGB value.  Next we need the data – list of colour names and their RGB values!  Wikipedia to our rescue – EN.WIKIPEDIA.ORG/WIKI/LISTS_OF_COLORS  955 instances listed (from the possible 2553 that is 16.5 million) – 2020 visited  Download, encode (e.g. in an ASCII table) and read into a database. Vide data, scripts and queries on VLE. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 12 Prof. Joseph Vella ©, [email protected] 6 Data Quality (A database approach) RETRIEVE THE COLOUR NAME BY KNN ON RGB VALUES K-NN for 2 arguments (x1, x2), k=3 and target is the green dot  The idea is to get a pixel’s RGB value and then retrieve colours that are close to it (it is unlikely to be an exact overlap):  This is the classic K-Nearest Neighbour (K-NN) query  We need two components:  Compare records to our RGB vector;  Calculate a distance (input to database instances). PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 13 COMPARE RECORDS TO OUR RGB VECTOR PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 14 Prof. Joseph Vella ©, [email protected] 7 Data Quality (A database approach) CALCULATE A DISTANCE We use the Euclidean distance for 3d. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 15 QUICK RUN! RGB(137,242,81) PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 16 Prof. Joseph Vella ©, [email protected] 8 Data Quality (A database approach) MORE TO DO ON YOUR OWN: Example MBBs of 3D points o Search for a colour by textual name; e.g. ‘RED’. o For a subset of colours calculate the most bounding box (MBB) of their RGB vectors. o For the colours that have a ‘RED’ in their textual name calculate their MBB. Consequently return all colours names within that MBB. o How many of these have ‘RED’ in their textual name? PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 17 WHAT IF WE HAD TOO MUCH GOOD DATA! Data reduction techniques can help out.  How?  By attributes;  By data volume.  When?  Undertaking complicated processing that is very computationally heavy.  Why?  To get quicker results; but maintain good quality results even if we throw away a high portion of it. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 18 Prof. Joseph Vella ©, [email protected] 9 Data Quality (A database approach) CASE STUDY  We have a few hundreds of specimen, each has its outer shape reduced to a long sequence of points (from 400 to 900 points).  We are given a measure that compares two shapes and output how similar the two are: Vide paper: Arkin, E.M., Chew, L.P., Huttenlocher, D.P., Kedem, K., and Mitchell, J.S.B. "An efficiently computable metric for comparing polygonal shapes." IEEE Transactions on Pattern Analysis and Machine Intelligence.Vol. 13, Number 3, 1991, pp. 209-16  We compared each specimen to all others.  An “oracle” choose, quite arbitrarily, a very small subset of these for us (i.e. 37 specimens).  We are asked to visualise this subset of specimens and their similarity measure. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 19 DIATOMS IN LIFE PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 20 Prof. Joseph Vella ©, [email protected] 10 Data Quality (A database approach) CASE STUDY (CONTINUED) – DATA DISCOVERY We should have n(n-1)/2 records. Check Specimen 1 Specimen 2 Similarity with n=37, that is 666. Confirmed! 426 126 0.549167  Get a sample of the data … 171 126 0.513564 … 171 426 0.235817 PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 21 CASE STUDY (CONTINUED) – FOCUS ON THE MEASURE  Is self similarity equal to 0?  Actually we need to make a specific run (i.e. measure a sample self similarity)  Also check other features e.g. rotation invariance and mirroring.  Is similarity between two shapes symmetric?  A run for testing the symmetry would put our mind at rest.  Is similarity between three shapes obeying the triangle rule? A A’s AB B B’s BC C C’s AB AC index index Index + BC 126 4 0.55 426 20 0.24 171 7 0.79 > 0.51 … 126 4 0.55 426 20 0.56 301 14 1.11 > 0.42 PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 22 Prof. Joseph Vella ©, [email protected] 11 Data Quality (A database approach) WE KNOW A BIT ABOUT THE DATA … NOW WE VISUALISE  We have 36 specimens and 666 similarities measures.  We could use a “force” field analogue … two specimens that are close (smaller measure) are drawn close together but two specimens that have a lesser similarity are repelled off.  We intend to use the specimens as pins and spring (adjusted to the similarity measure) in between them.  This is called force directed graph drawing.  Read technical details in the reference: di Battista, Giuseppe; Peter Eades; Roberto Tamassia; Ioannis G. Tollis (1999), Graph Drawing: Algorithms for the Visualization of Graphs, Prentice Hall, ISBN 978-0-13- 301615-4 Vide video on page: WWW.YWORKS.COM/PAGES/FORCE-DIRECTED-GRAPH-LAYOUT PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 23 VISUALISATION PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 25 Prof. Joseph Vella ©, [email protected] 12 Data Quality (A database approach) VISUALISE – A SECOND TRY  We need data reduction …  Clearly we need to get rid of that cocoon of lines.  We have shown that the triangle rule holds with our similarity measure!  Basically, we reduce edges AB and BC and keep AC so long as B is connected to some other specimen.  This is the classic Minimum Spanning Tree Graph reduction.  This graph is a subset of the edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is minimized. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 26 COMPUTING THE MINIMUM SPANNING TREE  Prim's (also known as Jarník's) algorithm is a greedy algorithm that finds a minimum spanning tree for a weighted undirected graph.  The algorithm operates by building this tree one vertex at a time, from an arbitrary starting vertex, at each step adding the cheapest possible connection from the tree to another vertex.  Works for connected graphs, and have weighted edges but un-directed.  The time complexity of Prim's algorithm depends on the data structures used for the graph and for ordering the edges by weight, which can be done using a priority queue.The following table shows the typical choices: PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 27 Prof. Joseph Vella ©, [email protected] 13 Data Quality (A database approach) AHA … WE CAN DEVELOP ON THIS! Get data (i.e. build graph with vertices and edges); Apply the Minimum Spanning Tree to reduce edges but keep all vertices; Apply the force directed graph drawing. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 28 WHAT’S NEXT  Data quality upkeep during transaction processing …  Where are you expected to use this:  Design, implementation and optimisation of computer information systems  Data integration and consolidation (as in projects with Data Ware housing, Data mining, & data science)  What’s Next!  Watch out for data quality whilst undergoing database change of state – Transaction Processing …  In our optional Data Mining unit we consider the following:  What if some data is outside its scope? – outliers.  What if we have missing data and we would like to fill it up with replacements? PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 29 Prof. Joseph Vella ©, [email protected] 14 Data Quality (A database approach) MISSING DATA PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 30 MISSING VALUES  Sometimes missing data needs to be addressed (not sifted out of the selection process).  Functionality to allow for missing data to be imputed, using pre-defined imputation methods, includes:  Mean Imputation;  Regression Imputation;  Last Observation Carried Forward;  Multiple Imputation. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 31 Prof. Joseph Vella ©, [email protected] 15 Data Quality (A database approach)  6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples)  Popular strategies to statistically impute missing values in a dataset  Will Badr, Jan 5, 2019  https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples- 6022d9ca0779  Missing Values in Matlab  https://uk.mathworks.com/help/matlab/data_analysis/missing-data-in-matlab.html  In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation".  https://en.wikipedia.org/wiki/Imputation_(statistics) PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 32 THE DATA (SCOTT.EMP) PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 33 Prof. Joseph Vella ©, [email protected] 16 Data Quality (A database approach) SIMPLE EXAMPLE – SCOTT DATABASE – NULLS IN THE COMM ATTRIBUTE PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 34 SIMPLE EXAMPLE – SCOTT DATABASE – NULLS IN THE COMM ATTRIBUTE PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 35 Prof. Joseph Vella ©, [email protected] 17 Data Quality (A database approach) THANKS PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 36 SPATIAL DATA QUALITY CLEANING A WALKING TRACK PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 37 Prof. Joseph Vella ©, [email protected] 18 Data Quality (A database approach) DATA COLLECTION –VALLETTA ROADS AND POI Not complete nor accurate!?  Use some basic tool to edit some:  Points (POI – points of interest)  Road Network  Area PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 38 DATA COLLECTION – SUNDAY WALK IN VALLETTA!? PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 39 Prof. Joseph Vella ©, [email protected] 19 Data Quality (A database approach) REQUIREMENTS PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 40 DATA DEFINITION AND UPLOAD (FROM GPX FILES) create table spatial_data.vt_road( vtr serial primary key, create table spatial_data.vt_poi( vtr_desc character varying(99) not null, vtp serial primary key, vtr_other character varying(99), vtp_desc character varying(99) not null, vtr_geom geometry(linestring,4326) not null); vtp_geom geometry(point,4326) not null); CREATE INDEX idx_vtr_road_ls ON CREATE INDEX idx_vtp_poi_pt ON spatial_data.vt_road USING gist(vtr_geom); spatial_data.vt_poi USING gist(vtp_geom); -- export from qgis the road segments gpx file create table spatial_data.vt_walk( -- into temp table road_etl in this schema vtw serial primary key, -- move from etl to road vtw_desc character varying(99), insert into spatial_data.vt_road(vtr_geom,vtr_desc) vtw_geom geometry(linestring,4326) not null); select ST_LineMerge(ST_Force2D(cast(geom as CREATE INDEX idx_vtr_walk_ls ON geometry))), name spatial_data.vt_walk USING gist(vtw_geom); from spatial_data.vt_etl_road; PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 41 Prof. Joseph Vella ©, [email protected] 20 Data Quality (A database approach) SOME QUERIES. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 42 SOME MORE QUERIES. PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 43 Prof. Joseph Vella ©, [email protected] 21 Data Quality (A database approach) CREATING BUFFERS PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 44 CREATING BUFFERS (CONTINUED) CREATE OR REPLACE VIEW spatial_data.vw_road_buffer AS SELECT vtr, vtr_desc, ST_Buffer(vtr_geom,.0001, 'endcap=round join=round') ::geometry(polygon,4326) AS rdbuffer FROM spatial_data.vt_road; PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 45 Prof. Joseph Vella ©, [email protected] 22 Data Quality (A database approach) USING THE RAMER-DOUGLAS-PEUCKER ALGORITHM (ST_SIMPILFY) PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 46 USING THE RAMER-DOUGLAS-PEUCKER ALGORITHM (ST_SIMPILFY) LINESTRING(14.5109707 35.8956818,14.5117593  32 pts, 35.8964597,14.5110995 tolerance = 0.000066 35.8974157,14.5133579 35.8987758,14.5130146 35.8995233,14.5086694 35.8977807)  28 pts, tolerance = 0.0001  6 pts, tolerance = 0.0006 PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 47 Prof. Joseph Vella ©, [email protected] 23 Data Quality (A database approach) TAKING ADVANTAGE OF THE GRID LAYOUT OF VALLETTA PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 48 FOR EACH POINT ON TRACK FIND ITS CLOSEST MAGNET PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 49 Prof. Joseph Vella ©, [email protected] 24 Data Quality (A database approach) CONVERT MAGNET POINTS TO NEW TRACK PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 50 EVALUATION Highlight areas still lacking quality!  We need to take care:  Start and End  Outliers  Routing might help too!  What other details could help? PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 51 Prof. Joseph Vella ©, [email protected] 25 Data Quality (A database approach) PROF. JOSEPH G VELLA - DATA QUALITY IN COMPUTER INFORMATION SYSTEMS 52 Prof. Joseph Vella ©, [email protected] 26

Use Quizgecko on...
Browser
Browser