Chapter 13: Data Quality PDF

Summary

This chapter introduces data quality management, explaining its importance and the methods involved. It provides an overview of key components, such as data architecture, modeling, governance, and operations, and explains how to improve the quality of data in an organization.

Full Transcript


Chapter 13: Data Quality

1. Introduction

Effective data management involves a set of complex, interrelated processes that enable an organization to use its data to achieve strategic goals. Data management includes the ability to design data for applications, store and access it securely, share it appropriately, learn from it, and ensure it meets business needs. An assumption underlying assertions about the value of data is that the data itself is reliable and trustworthy; in other words, that it is of high quality.

However, many factors can undermine that assumption by contributing to poor quality data: lack of understanding about the design, inconsistent development processes, incomplete documentation, a lack of standards, or a lack of governance. Many organizations fail to define what makes data fit for purpose.

All data management disciplines contribute to the quality of data, and high quality data that supports the organization should be the goal of all data management disciplines. Because uninformed decisions or actions by anyone who interacts with data can result in poor quality data, producing high quality data requires cross-functional commitment and coordination. Organizations and teams should be aware of this and should plan for high quality data by executing processes and projects in ways that account for risk related to unexpected or unacceptable conditions in the data.

Because no organization has perfect business processes, perfect technical processes, or perfect data management practices, all organizations experience problems related to the quality of their data. Organizations that formally manage the quality of data have fewer problems than those that leave data quality to chance.

Formal data quality management is similar to continuous quality management for other products. It includes managing data through its lifecycle by setting standards, building quality into the processes that create, transform, and store data, and measuring data against standards. Managing data to this level usually requires a Data Quality program team. The Data Quality program team is responsible for engaging both business and technical data management professionals and driving the work of applying quality management techniques to data to ensure that data is fit for consumption for a variety of purposes. The team will likely be involved with a series of projects through which they can establish processes and best practices while addressing high priority data issues.

Because managing the quality of data involves managing the data lifecycle, a Data Quality program will also have operational responsibilities related to data usage, for example, reporting on data quality levels and engaging in the analysis, quantification, and prioritization of data issues. The team is also responsible for working with those who need data to do their jobs to ensure the data meets their needs, and for working with those who create, update, or delete data in the course of their jobs to ensure they are properly handling the data. Data quality depends on all who interact with the data, not just data management professionals.

As is the case with Data Governance and with data management as a whole, Data Quality Management is a program, not a project. It will include both project and maintenance work, along with a commitment to communications and training.
Most importantly, the long-term success of a data quality improvement program depends on getting an organization to change its culture and adopt a quality mindset. As stated in The Data Manifesto: fundamental, lasting change requires committed leadership and involvement from people at all levels in an organization. People who use data to do their jobs (which in most organizations is a very large percentage of employees) need to drive change. One of the most critical changes to focus on is how their organizations manage and improve the quality of their data. (For the full text of The Data Manifesto, see http://bit.ly/2sQhcy7.)

Figure 91 Context Diagram: Data Quality

Definition: The planning, implementation, and control of activities that apply quality management techniques to data, in order to assure it is fit for consumption and meets the needs of data consumers.

Goals:
1. Develop a governed approach to make data fit for purpose based on data requirements.
2. Define standards, requirements, and specifications for data quality controls as part of the data lifecycle.
3. Define and implement processes to measure, monitor, and report on data quality levels.
4. Identify and advocate for opportunities to improve the quality of data, through process and system improvements.

Inputs: Data Policies and Standards; Data Quality Expectations; Business Requirements; Business Rules; Data Requirements; Business Metadata; Technical Metadata; Data Sources and Data Stores; Data Lineage

Activities: Define High Quality Data (P); Define a Data Quality Strategy (P); Define Scope of Initial Assessment (P); Identify Critical Data; Identify Existing Rules and Patterns; Perform Initial Data Quality Assessment (P); Identify and Prioritize Issues; Perform Root Cause Analysis of Issues; Identify and Prioritize Improvements; Prioritize Actions Based on Business Impact; Develop Preventative and Corrective Actions; Confirm Planned Actions; Develop and Deploy Data Quality Operations (D); Develop Data Quality Operational Procedures; Correct Data Quality Defects; Measure and Monitor Data Quality; Report on Data Quality Levels and Findings

Deliverables: Data Quality Strategy and Framework; Data Quality Program Organization; Analyses from Data Profiling; Recommendations Based on Root Cause Analysis of Issues; DQM Procedures; Data Quality Reports; Data Quality Governance Reports; Data Quality Service Level Agreements; DQ Policies and Guidelines

Suppliers: Business Management; Subject Matter Experts; Data Architects; Data Modelers; System Specialists; Data Stewards

Participants: Business Process Analysts; CDO; Data Quality Analysts; Data Stewards; Data Owners; Data Analysts; Database Administrators; Data Professionals; DQ Managers; IT Operations; Data Integration Architects

Consumers: Business Data Consumers; Data Stewards; Data Professionals; IT Professionals; Knowledge Workers; Data Governance Bodies; Partner Organizations; Centers of Excellence; Compliance Team

Techniques: Spot-Checking Using Multiple Subsets; Tags and Notes to Mark Data Issues; Root Cause Analysis; Statistical Process Control

Tools: Profiling Engines, Query Tools; Data Quality Rule Templates; Quality Check and Audit Code Modules

Metrics: Governance and Conformance Metrics; Data Quality Measurement Results; Improvement Trends; Issue Management Metrics

(P) Planning, (C) Control, (D) Development, (O) Operations
1.1 Business Drivers

The business drivers for establishing a formal Data Quality Management program include:

- Increasing the value of organizational data and the opportunities to use it
- Reducing risks and costs associated with poor quality data
- Improving organizational efficiency and productivity
- Protecting and enhancing the organization's reputation

Organizations that want to get value out of their data recognize that high quality data is more valuable than low quality data. Poor quality data is risk-laden (see Chapter 1). It can damage an organization's reputation, resulting in fines, lost revenue, lost customers, and negative media exposure. Regulatory requirements often demand high quality data. In addition, many direct costs are associated with poor quality data. For example:

- Inability to invoice correctly
- Increased customer service calls and decreased ability to resolve them
- Revenue loss due to missed business opportunities
- Delay of integration during mergers and acquisitions
- Increased exposure to fraud
- Loss due to bad business decisions driven by bad data
- Loss of business due to lack of good credit standing

Still, high quality data is not an end in itself. It is a means to organizational success. Trustworthy data not only mitigates risk and reduces costs, but also improves efficiency. Employees can answer questions more quickly and consistently when they are working with reliable data. They spend less time trying to figure out if the data is right and more time using the data to gain insight, make decisions, and serve customers.

1.2 Goals and Principles

Data Quality programs focus on these general goals:

- Developing a governed approach to make data fit for purpose based on data requirements
- Defining standards and specifications for data quality controls as part of the data lifecycle
- Defining and implementing processes to measure, monitor, and report on data quality levels
- Identifying and advocating for opportunities to improve the quality of data, through changes to processes and systems, and engaging in activities that measurably improve the quality of data based on data consumer requirements

Data Quality programs should be guided by the following principles:

- Criticality: A Data Quality program should focus on the data most critical to the enterprise and its customers. Priorities for improvement should be based on the criticality of the data and on the level of risk if data is not correct.
- Lifecycle management: The quality of data should be managed across the data lifecycle, from creation or procurement through disposal. This includes managing data as it moves within and between systems (i.e., each link in the data chain should ensure data output is of high quality).
- Prevention: The focus of a Data Quality program should be on preventing data errors and conditions that reduce the usability of data; it should not be focused on simply correcting records.
- Root cause remediation: Improving the quality of data goes beyond correcting errors. Problems with the quality of data should be understood and addressed at their root causes, rather than just their symptoms. Because these causes are often related to process or system design, improving data quality often requires changes to processes and the systems that support them.
- Governance: Data Governance activities must support the development of high quality data, and Data Quality program activities must support and sustain a governed data environment.
- Standards-driven: All stakeholders in the data lifecycle have data quality requirements. To the degree possible, these requirements should be defined in the form of measurable standards and expectations against which the quality of data can be measured.
- Objective measurement and transparency: Data quality levels need to be measured objectively and consistently. Measurements and measurement methodology should be shared with stakeholders, since they are the arbiters of quality.
- Embedded in business processes: Business process owners are responsible for the quality of data produced through their processes. They must enforce data quality standards in their processes.
- Systematically enforced: System owners must systematically enforce data quality requirements.
- Connected to service levels: Data quality reporting and issues management should be incorporated into Service Level Agreements (SLAs).

1.3 Essential Concepts

1.3.1 Data Quality

The term data quality refers both to the characteristics associated with high quality data and to the processes used to measure or improve the quality of data. These dual usages can be confusing, so it helps to separate them and clarify what constitutes high quality data. (In the DAMA-DMBOK2, we have tried to avoid using the words data quality without clarifying their context, for example, by referring to high quality data or low quality data, and to data quality work efforts or data quality activities.)

Data is of high quality to the degree that it meets the expectations and needs of data consumers; that is, if the data is fit for the purposes to which they want to apply it. It is of low quality if it is not fit for those purposes. Data quality is thus dependent on context and on the needs of the data consumer.

One of the challenges in managing the quality of data is that expectations related to quality are not always known. Customers may not articulate them. Often, the people managing data do not even ask about these requirements. However, if data is to be reliable and trustworthy, then data management professionals need to understand these requirements and how to measure them. This needs to be an ongoing discussion, as requirements change over time as business needs and external forces evolve.

1.3.2 Critical Data

Most organizations have a lot of data, not all of which is of equal importance. One principle of Data Quality Management is to focus improvement efforts on data that is most important to the organization and its customers. Doing so gives the program scope and focus and enables it to make a direct, measurable impact on business needs.

While specific drivers for criticality will differ by industry, there are common characteristics across organizations. Data can be assessed based on whether it is required by:

- Regulatory reporting
- Financial reporting
- Business policy
- Ongoing operations
- Business strategy, especially efforts at competitive differentiation

Master Data is critical by definition. Data sets or individual data elements can be assessed for criticality based on the processes that consume them, the nature of the reports they appear in, or the financial, regulatory, or reputational risk to the organization if something were to go wrong with the data. (See Jugulum, 2014, Chapters 6 and 7, for an approach to rationalizing critical data.)
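To make the idea of a criticality assessment concrete, the sketch below scores data elements by counting how many of the drivers listed above apply to them. This is an illustrative approach only; the driver names, example data elements, and the threshold are assumptions, not something prescribed by the DMBOK.

```python
# Illustrative only: a naive criticality score that counts how many business
# drivers apply to each data element. Names and threshold are hypothetical.

DRIVERS = [
    "regulatory_reporting",
    "financial_reporting",
    "business_policy",
    "ongoing_operations",
    "business_strategy",
]

data_elements = {
    "customer_tax_id": {"regulatory_reporting", "financial_reporting", "ongoing_operations"},
    "customer_email": {"ongoing_operations", "business_strategy"},
    "marketing_segment": {"business_strategy"},
}

def criticality_score(applicable_drivers: set) -> int:
    """Count how many recognized drivers apply to a data element."""
    return sum(1 for d in DRIVERS if d in applicable_drivers)

CRITICAL_THRESHOLD = 2  # assumed cut-off for "critical"
for name, drivers in data_elements.items():
    score = criticality_score(drivers)
    print(f"{name}: score={score}, critical={score >= CRITICAL_THRESHOLD}")
```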
1.3.3 Data Quality Dimensions

A Data Quality dimension is a measurable feature or characteristic of data. The term dimension is used to make the connection to dimensions in the measurement of physical objects (e.g., length, width, height). Data quality dimensions provide a vocabulary for defining data quality requirements. From there, they can be used to define results of initial data quality assessment as well as ongoing measurement.

In order to measure the quality of data, an organization needs to establish characteristics that are both important to business processes (worth measuring) and measurable. Dimensions provide a basis for measurable rules, which themselves should be directly connected to potential risks in critical processes.

For example, if the data in the customer email address field is incomplete, then we will not be able to send product information to our customers via email, and we will lose potential sales. Therefore, we will measure the percentage of customers for whom we have usable email addresses, and we will improve our processes until we have a usable email address for at least 98% of our customers.
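The completeness target described above can be checked with a small amount of code. The sketch below is a minimal example: it assumes customer records are available as Python dictionaries with an email field and treats a non-empty, syntactically plausible address as "usable." A real program would apply the organization's own definition of usability.

```python
import re

# Minimal completeness check for the email example above. The regex and the
# notion of "usable" are simplifying assumptions for illustration.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_usable_email(value) -> bool:
    """A value counts as usable if it is a non-empty, plausibly formatted address."""
    return isinstance(value, str) and bool(EMAIL_PATTERN.match(value.strip()))

def email_completeness(customers) -> float:
    """Return the percentage of customers with a usable email address."""
    if not customers:
        return 0.0
    usable = sum(1 for c in customers if is_usable_email(c.get("email")))
    return 100.0 * usable / len(customers)

customers = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "not-an-address"},
]

rate = email_completeness(customers)
print(f"Usable email addresses: {rate:.1f}% (target: 98%)")
print("Meets target" if rate >= 98.0 else "Below target: investigate and improve the capture process")
```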
Many leading thinkers in data quality have published sets of dimensions. (In addition to the examples detailed here and numerous academic papers on this topic, see Loshin, 2001; Olson, 2003; McGilvray, 2008; and Sebastian-Coleman, 2013 for detailed discussions on data quality dimensions. See Myers, 2013 for a comparison of dimensions.) The three most influential are described here because they provide insight into how to think about what it means to have high quality data, as well as into how data quality can be measured.

The Strong-Wang framework (1996) describes dimensions across four general categories of data quality:

- Intrinsic DQ: Accuracy; Objectivity; Believability; Reputation
- Contextual DQ: Value-added; Relevancy; Timeliness; Completeness; Appropriate amount of data
- Representational DQ: Interpretability; Ease of understanding; Representational consistency; Concise representation
- Accessibility DQ: Accessibility; Access security

In Data Quality for the Information Age (1996), Thomas Redman formulated a set of data quality dimensions rooted in data structure. (Redman expanded and revised his set of dimensions in Data Quality: The Field Guide, 2001.) Redman defines a data item as a value from the domain of an attribute within an entity. Dimensions can be associated with any of the component pieces of data: the model (entities and attributes) as well as the values. Redman includes the dimension of representation, which he defines as a set of rules for recording data items. Within these three general categories (data model, data values, representation), he describes more than two dozen dimensions. They include the following:

Data Model:
- Content: Relevance of data; the ability to obtain the values; clarity of definitions
- Level of detail: Attribute granularity; precision of attribute domains
- Composition: Naturalness (the idea that each attribute should have a simple counterpart in the real world and that each attribute should bear on a single fact about the entity); Identifiability (each entity should be distinguishable from every other entity); Homogeneity; Minimum necessary redundancy
- Consistency: Semantic consistency of the components of the model; structural consistency of attributes across entity types
- Reaction to change: Robustness; Flexibility

Data Values:
- Accuracy
- Completeness
- Currency
- Consistency

Representation:
- Appropriateness
- Interpretability
- Portability
- Format precision
- Format flexibility
- Ability to represent null values
- Efficient use of storage
- Physical instances of data being in accord with their formats

Redman recognizes that consistency of entities, values, and representation can be understood in terms of constraints. Different types of consistency are subject to different kinds of constraints.

In Improving Data Warehouse and Business Information Quality (1999), Larry English presents a comprehensive set of dimensions divided into two broad categories: inherent and pragmatic. (English expanded and revised his dimensions in Information Quality Applied, 2009.) Inherent characteristics are independent of data use. Pragmatic characteristics are associated with data presentation and are dynamic; their value (quality) can change depending on the uses of data.

- Inherent quality characteristics: Definitional conformance; Completeness of values; Validity or business rule conformance; Accuracy to a surrogate source; Accuracy to reality; Precision; Non-duplication; Equivalence of redundant or distributed data; Concurrency of redundant or distributed data
- Pragmatic quality characteristics: Accessibility; Timeliness; Contextual clarity; Usability; Derivation integrity; Rightness or fact completeness

In 2013, DAMA UK produced a white paper describing six core dimensions of data quality:

- Completeness: The proportion of data stored against the potential for 100%.
- Uniqueness: No entity instance (thing) will be recorded more than once, based upon how that thing is identified.
- Timeliness: The degree to which data represent reality from the required point in time.
- Validity: Data is valid if it conforms to the syntax (format, type, range) of its definition.
- Accuracy: The degree to which data correctly describes the object or event being described.
- Consistency: The absence of difference when comparing two or more representations of a thing against a definition.

The DAMA UK white paper also describes other characteristics that have an impact on quality:

- Usability: Is the data understandable, simple, relevant, accessible, maintainable, and at the right level of precision?
- Timing issues (beyond timeliness itself): Is it stable yet responsive to legitimate change requests?
- Flexibility: Is the data comparable and compatible with other data? Does it have useful groupings and classifications? Can it be repurposed? Is it easy to manipulate?
- Confidence: Are Data Governance, Data Protection, and Data Security processes in place? What is the reputation of the data, and is it verified or verifiable?
- Value: Is there a good cost / benefit case for the data? Is it being optimally used? Does it endanger safety or privacy, or the legal responsibilities of the enterprise? Does it support or contradict the corporate image or the corporate message?

While there is not a single, agreed-to set of data quality dimensions, these formulations contain common ideas. Dimensions include some characteristics that can be measured objectively (completeness, validity, format conformity) and others that depend heavily on context or on subjective interpretation (usability, reliability, reputation). Whatever names are used, dimensions focus on whether there is enough data (completeness), whether it is right (accuracy, validity), how well it fits together (consistency, integrity, uniqueness), whether it is up-to-date (timeliness), and whether it is accessible, usable, and secure. Table 29 contains definitions of a set of data quality dimensions about which there is general agreement, and describes approaches to measuring them.

Table 29 Common Dimensions of Data Quality

Accuracy: Accuracy refers to the degree to which data correctly represents real-life entities. Accuracy is difficult to measure, unless an organization can reproduce data collection or manually confirm the accuracy of records. Most measures of accuracy rely on comparison to a data source that has been verified as accurate, such as a system of record or data from a reliable source (e.g., Dun and Bradstreet Reference Data).

Completeness: Completeness refers to whether all required data is present. Completeness can be measured at the data set, record, or column level. Does the data set contain all the records expected? Are records populated correctly? (Records with different statuses may have different expectations for completeness.) Are columns/attributes populated to the level expected? (Some columns are mandatory; optional columns are populated only under specific conditions.) Assign completeness rules to a data set with varying levels of constraint: mandatory attributes that require a value, data elements with conditional and optional values, and inapplicable attribute values. Data set level measurements may require comparison to a source of record or may be based on historical levels of population.

Consistency: Consistency can refer to ensuring that data values are consistently represented within a data set and between data sets, and consistently associated across data sets. It can also refer to the size and composition of data sets between systems or across time. Consistency may be defined between one set of attribute values and another attribute set within the same record (record-level consistency), between one set of attribute values and another attribute set in different records (cross-record consistency), or between one set of attribute values and the same attribute set within the same record at different points in time (temporal consistency). Consistency can also be used to refer to consistency of format. Take care not to confuse consistency with accuracy or correctness. Characteristics that are expected to be consistent within and across data sets can be used as the basis for standardizing data. Data Standardization refers to the conditioning of input data to ensure that data meets rules for content and format. Standardizing data enables more effective matching and facilitates consistent output.
Encapsulate consistency constraints as a set of rules that specify consistent relationships between values of attributes, either across a record or message, or along all values of a single attribute (such as a range or list of valid values). For example, one might expect that the number of transactions each day does not exceed 105% of the running average number of transactions for the previous 30 days.

Integrity: Data Integrity (or Coherence) includes ideas associated with completeness, accuracy, and consistency. In data, integrity usually refers to either referential integrity (consistency between data objects via a reference key contained in both objects) or internal consistency within a data set, such that there are no holes or missing parts. Data sets without integrity are seen as corrupted or as having data loss. Data sets lacking referential integrity contain invalid (orphan) reference keys or identical rows, which may negatively affect aggregation functions. The level of orphan records can be measured as a raw count or as a percentage of the data set.

Reasonability: Reasonability asks whether a data pattern meets expectations, for example, whether a distribution of sales across a geographic area makes sense based on what is known about the customers in that area. Measurement of reasonability can take different forms. For example, reasonability may be based on comparison to benchmark data, or to past instances of a similar data set (e.g., sales from the previous quarter). Some ideas about reasonability may be perceived as subjective. If this is the case, work with data consumers to articulate the basis of their expectations of data to formulate objective comparisons. Once benchmark measurements of reasonability are established, these can be used to objectively compare new instances of the same data set in order to detect change. (See Section 4.5.)

Timeliness: The concept of data Timeliness refers to several characteristics of data. Measures of timeliness need to be understood in terms of expected volatility: how frequently data is likely to change and for what reasons. Data currency is the measure of whether data values are the most up-to-date version of the information. Relatively static data, for example some Reference Data values like country codes, may remain current for a long period. Volatile data remains current for a short period. Some data, for example stock prices on financial web pages, will often be shown with an as-of time, so that data consumers understand the risk that the data has changed since it was recorded. During the day, while the markets are open, such data will be updated frequently. Once markets close, the data will remain unchanged, but will still be current, since the market itself is inactive. Latency measures the time between when the data was created and when it was made available for use. For example, overnight batch processing can give a latency of 1 day at 8am for data entered into the system during the prior day, but only one hour for data generated during the batch processing. (See Chapter 8.)

Uniqueness / Deduplication: Uniqueness states that no entity exists more than once within the data set. Asserting uniqueness of the entities within a data set implies that a key value relates to each unique entity, and only that specific entity, within the data set. Measure uniqueness by testing against key structure. (See Chapter 5.)

Validity: Validity refers to whether data values are consistent with a defined domain of values. A domain of values may be a defined set of valid values (such as in a reference table), a range of values, or a value that can be determined via rules. The data type, format, and precision of expected values must be accounted for in defining the domain. Data may also only be valid for a specific length of time, for example data that is generated from RFID (radio frequency ID) tags or some scientific data sets. Validate data by comparing it to domain constraints. Keep in mind that data may be valid (i.e., it may meet domain requirements) and still not be accurate or correctly associated with particular records.
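As a concrete illustration of the measurable rules described in the table above, the sketch below implements the 105%-of-running-average consistency check and a simple key-based uniqueness measure. It is a minimal sketch: the sample data, window length, and threshold are modeled on the examples in the text and are not taken from any specific tool.

```python
from collections import Counter

def within_running_average(daily_counts, todays_count, window=30, threshold=1.05):
    """Consistency/reasonability check: today's transaction count should not
    exceed `threshold` times the running average of the previous `window` days."""
    history = daily_counts[-window:]
    if not history:
        return True  # no history yet; nothing to compare against
    running_avg = sum(history) / len(history)
    return todays_count <= threshold * running_avg

def uniqueness_violations(records, key_field):
    """Uniqueness check: return key values that appear more than once."""
    counts = Counter(r[key_field] for r in records)
    return {key: n for key, n in counts.items() if n > 1}

# Example usage with made-up numbers.
previous_days = [980, 1010, 1005, 995, 1020] * 6   # 30 days of counts
print(within_running_average(previous_days, todays_count=1250))  # False: exceeds 105%

customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 1}]
print(uniqueness_violations(customers, "customer_id"))  # {1: 2}
```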
Figure 92 aligns data quality dimensions and concepts associated with those dimensions. The arrows indicate significant overlaps between concepts and also demonstrate that there is not agreement on a specific set.

Figure 92 Relationship Between Data Quality Dimensions (adapted from Myers, 2013; used with permission)

1.3.4 Data Quality and Metadata

Metadata is critical to managing the quality of data. The quality of data is based on how well it meets the requirements of data consumers. Metadata defines what the data represents. Having a robust process by which data is defined supports the ability of an organization to formalize and document the standards and requirements by which the quality of data can be measured. Data quality is about meeting expectations. Metadata is a primary means of clarifying expectations.

Well-managed Metadata can also support the effort to improve the quality of data. A Metadata repository can house results of data quality measurements so that these are shared across the organization and the Data Quality team can work toward consensus about priorities and drivers for improvement. (See Chapter 12.)
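As a small illustration of how measurement results might be captured for sharing in a repository, the sketch below defines a simple record structure for one data quality measurement. The field names and the choice of JSON serialization are assumptions made for illustration, not a prescribed repository design.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json

@dataclass
class DataQualityMeasurement:
    """An illustrative record of one data quality measurement result."""
    data_set: str          # which data set was measured
    data_element: str      # which column/attribute
    dimension: str         # e.g., "Completeness", "Validity"
    rule: str              # human-readable statement of the rule applied
    threshold: float       # acceptable level, as a percentage
    result: float          # measured level, as a percentage
    measured_on: str       # ISO date of the measurement

    @property
    def meets_threshold(self) -> bool:
        return self.result >= self.threshold

m = DataQualityMeasurement(
    data_set="CRM.Customer",
    data_element="email",
    dimension="Completeness",
    rule="Percentage of customers with a usable email address",
    threshold=98.0,
    result=96.4,
    measured_on=date.today().isoformat(),
)
print(json.dumps(asdict(m), indent=2))
print("meets threshold:", m.meets_threshold)
```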
1.3.5 Data Quality ISO Standard

ISO 8000, the international standard for data quality, is being developed to enable the exchange of complex data in an application-neutral form. The ability to store, maintain, transfer, process, and present data to support business processes in a timely and cost-effective manner requires both an understanding of the characteristics of the data that determine its quality and an ability to measure that quality. ISO 8000 defines characteristics that can be tested by any organization in the data supply chain to objectively determine conformance of the data to ISO 8000 (http://bit.ly/2ttdiZJ).

The first published part of ISO 8000 (part 110, published in 2008) focused on the syntax, semantic encoding, and conformance to the data specification of Master Data. Other parts projected for the standard include part 100 (Introduction), part 120 (Provenance), part 130 (Accuracy), and part 140 (Completeness) (http://bit.ly/2sANGdi).

ISO defines quality data as data that meets stated requirements (http://bit.ly/2rV1oWC). The data quality standard is related to the idea that data should be able to be separated from a software application. Data that can only be used or read using a specific licensed software application is subject to the terms of the software license. An organization may not be able to use data it created unless that data can be detached from the software that was used to create it. Meeting stated requirements also requires that those requirements be defined in a clear, unambiguous manner.

ISO 8000 is supported through ISO 22745, a standard for defining and exchanging Master Data. ISO 22745 defines how data requirement statements should be constructed, provides examples in XML, and defines a format for the exchange of encoded data (http://bit.ly/2rUZyoz). ISO 22745 creates portable data by labeling the data using an ISO 22745 compliant Open Technical Dictionary, such as the ECCMA Open Technical Dictionary (eOTD).

The intention of ISO 8000 is to help organizations define what is and is not quality data, enable them to ask for quality data using standard conventions, and verify that they have received quality data using those same standards. When standards are followed, requirements can be confirmed through a computer program.

ISO 8000 Part 61, Information and data quality management process reference model, is under development (http://bit.ly/2sVik3Q). This standard will describe the structure and organization of data quality management, including:

- Data Quality Planning
- Data Quality Control
- Data Quality Assurance
- Data Quality Improvement
1.3.6 Data Quality Improvement Lifecycle

Most approaches to improving data quality are based on the techniques of quality improvement in the manufacture of physical products. (See Wang, 1998; English, 1999; Redman, 2001; Loshin, 2001; and McGilvray, 2008. See Pierce, 2004 for an overview of literature related to the concept of data as a product.) In this paradigm, data is understood as the product of a set of processes. At its simplest, a process is defined as a series of steps that turns inputs into outputs. A process that creates data may consist of one step (data collection) or many steps: data collection, integration into a data warehouse, aggregation in a data mart, etc. At any step, data can be negatively affected. It can be collected incorrectly, dropped or duplicated between systems, aligned or aggregated incorrectly, etc.

Improving data quality requires the ability to assess the relationship between inputs and outputs, in order to ensure that inputs meet the requirements of the process and that outputs conform to expectations. Since outputs from one process become inputs to other processes, requirements must be defined along the whole data chain.

A general approach to data quality improvement, shown in Figure 93, is a version of the Shewhart / Deming cycle. Based on the scientific method, the Shewhart / Deming cycle is a problem-solving model known as plan-do-check-act, in which improvement comes through a defined set of steps. (See American Society for Quality: http://bit.ly/1lelyBK. Plan-Do-Check-Act was originated by Walter Shewhart and popularized by W. Edwards Deming.) The condition of data must be measured against standards and, if it does not meet standards, root cause(s) of the discrepancy from standards must be identified and remediated. Root causes may be found in any of the steps of the process, technical or non-technical. Once remediated, data should be monitored to ensure that it continues to meet requirements.

Figure 93 The Shewhart Chart

For a given data set, a Data Quality Management cycle begins by identifying the data that does not meet data consumers' expectations, as well as data issues that are obstacles to the achievement of business objectives. Data needs to be assessed against key dimensions of quality and known business requirements. Root causes of issues will need to be identified so that stakeholders can understand the costs of remediation and the risks of not remediating the issues. This work is often done in conjunction with Data Stewards and other stakeholders.

In the Plan stage, the Data Quality team assesses the scope, impact, and priority of known issues, and evaluates alternatives to address them. This plan should be based on a solid foundation of analysis of the root causes of issues. From knowledge of the causes and the impact of the issues, cost / benefit can be understood, priority can be determined, and a basic plan can be formulated to address them.

In the Do stage, the DQ team leads efforts to address the root causes of issues and plan for ongoing monitoring of data. For root causes that are based on non-technical processes, the DQ team can work with process owners to implement changes. For root causes that require technical changes, the DQ team should work with technical teams to ensure that requirements are implemented correctly and that technical changes do not introduce errors.

The Check stage involves actively monitoring the quality of data as measured against requirements. As long as data meets defined thresholds for quality, additional actions are not required. The processes will be considered under control and meeting business requirements. However, if the data falls below acceptable quality thresholds, then additional action must be taken to bring it up to acceptable levels.

The Act stage is for activities to address and resolve emerging data quality issues. The cycle restarts as the causes of issues are assessed and solutions proposed. Continuous improvement is achieved by starting a new cycle. New cycles begin as:

- Existing measurements fall below thresholds
- New data sets come under investigation
- New data quality requirements emerge for existing data sets
- Business rules, standards, or expectations change

The cost of getting data right the first time is cheaper than the costs of getting data wrong and fixing it later. Building quality into the data management processes from the beginning costs less than retrofitting it. Maintaining high quality data throughout the data lifecycle is less risky than trying to improve quality in an existing process, and it creates a far lower impact on the organization. Establishing criteria for data quality at the beginning of a process or system build is one sign of a mature Data Management Organization. Doing so takes governance and discipline, as well as cross-functional collaboration.
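The Check stage described above is essentially a comparison of measured quality levels against agreed thresholds. The sketch below shows one minimal way such a check might be automated; the measurement names and threshold values are hypothetical examples, not standard metrics.

```python
# Illustrative Check-stage logic: compare measured data quality levels against
# agreed thresholds and flag anything that needs an Act-stage response.
# Measurement names and thresholds below are hypothetical.

thresholds = {
    "customer.email.completeness": 98.0,   # percent
    "orders.order_id.uniqueness": 100.0,
    "address.state_code.validity": 99.5,
}

latest_measurements = {
    "customer.email.completeness": 96.4,
    "orders.order_id.uniqueness": 100.0,
    "address.state_code.validity": 99.7,
}

def check_stage(measurements, required):
    """Return the measurements that fall below their thresholds."""
    return {
        name: (value, required[name])
        for name, value in measurements.items()
        if name in required and value < required[name]
    }

for name, (value, needed) in check_stage(latest_measurements, thresholds).items():
    print(f"ACT NEEDED: {name} measured {value}%, required {needed}%")
```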
1.3.7 Data Quality Business Rule Types

Business rules describe how business should operate internally in order to be successful and compliant with the outside world. Data Quality Business Rules describe how data should exist in order to be useful and usable within an organization. These rules can be aligned with dimensions of quality and used to describe data quality requirements. For example, a business rule that all state code fields must comply with the US State Abbreviations can be enforced by data entry pick lists and data integration lookups. The level of valid or invalid records can then be measured. Business rules are commonly implemented in software, or by using document templates for data entry. Some common simple business rule types are listed below; a short code sketch illustrating several of them follows the list.

- Definitional conformance: Confirm that the same understanding of data definitions is implemented and used properly in processes across the organization. Confirmation includes algorithmic agreement on calculated fields, including any time or local constraints, and rollup and status interdependence rules.
- Value presence and record completeness: Rules defining the conditions under which missing values are acceptable or unacceptable.
- Format compliance: One or more patterns specify values assigned to a data element, such as standards for formatting telephone numbers.
- Value domain membership: A data element's assigned value must be included in those enumerated in a defined data value domain, such as 2-Character United States Postal Codes for a STATE field.
- Range conformance: A data element's assigned value must be within a defined numeric, lexicographic, or time range, such as greater than 0 and less than 100 for a numeric range.
- Mapping conformance: The value assigned to a data element must correspond to one selected from a value domain that maps to other equivalent corresponding value domain(s). The STATE data domain again provides a good example, since State values may be represented using different value domains (USPS Postal codes, FIPS 2-digit codes, full names), and these types of rules verify that the mappings between representations are consistent.
- Consistency rules: Conditional assertions that refer to maintaining a relationship between two (or more) attributes based on the actual values of those attributes, for example, address validation where postal codes correspond to particular States or Provinces.
- Accuracy verification: Compare a data value against a corresponding value in a system of record or other verified source (e.g., marketing data purchased from a vendor) to verify that the values match.
- Uniqueness verification: Rules that specify which entities must have a unique representation and whether one and only one record exists for each represented real world object.
- Timeliness validation: Rules that indicate the characteristics associated with expectations for accessibility and availability of data.
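The sketch below illustrates three of the simple rule types above (format compliance, value domain membership, and range conformance) as functions that return True when a value passes. The specific pattern, state-code domain, and numeric range are illustrative assumptions; real rules would come from the organization's data standards.

```python
import re

# Illustrative implementations of three simple rule types from the list above.
# The phone pattern, the state-code domain, and the 0-100 range are example
# choices, not organizational standards.

PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")           # format compliance
US_STATE_CODES = {"AL", "AK", "AZ", "CA", "NY", "TX"}         # value domain (abridged)

def format_compliant_phone(value: str) -> bool:
    """Format compliance: phone numbers must match NNN-NNN-NNNN."""
    return bool(PHONE_PATTERN.match(value))

def in_state_domain(value: str) -> bool:
    """Value domain membership: STATE must be one of the enumerated codes."""
    return value in US_STATE_CODES

def in_percentage_range(value: float) -> bool:
    """Range conformance: value must be greater than 0 and less than 100."""
    return 0 < value < 100

record = {"phone": "415-555-0123", "state": "ZZ", "discount_pct": 15}
results = {
    "phone format": format_compliant_phone(record["phone"]),
    "state domain": in_state_domain(record["state"]),
    "discount range": in_percentage_range(record["discount_pct"]),
}
print(results)  # {'phone format': True, 'state domain': False, 'discount range': True}
```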
Other types of rules may involve aggregating functions applied to sets of data instances (see Section 4.5). Examples of aggregation checks include:

- Validate reasonableness of the number of records in a file. This requires keeping statistics over time to generate trends.
- Validate reasonableness of an average amount calculated from a set of transactions. This requires establishing thresholds for comparison, and may be based on statistics over time.
- Validate the expected variance in the count of transactions over a specified timeframe. This requires keeping statistics over time and using them to establish thresholds.

1.3.8 Common Causes of Data Quality Issues

Data quality issues can emerge at any point in the data lifecycle, from creation to disposal. When investigating root causes, analysts should look for potential culprits, like problems with data entry, data processing, system design, and manual intervention in automated processes. Many issues will have multiple causes and contributing factors (especially if people have created ways to work around them). These causes of issues also imply ways to prevent issues: through improvement to interface design, testing of data quality rules as part of processing, a focus on data quality within system design, and strict controls on manual intervention in automated processes.

1.3.8.1 Issues Caused by Lack of Leadership

Many people assume that most data quality issues are caused by data entry errors. A more sophisticated understanding recognizes that gaps in or poor execution of business and technical processes cause many more problems than mis-keying. However, common sense and research indicate that many data quality problems are caused by a lack of organizational commitment to high quality data, which itself stems from a lack of leadership, in the form of both governance and management.

Every organization has information and data assets that are of value to its operations. Indeed, the operations of every organization depend on the ability to share information. Despite this, few organizations manage these assets with rigor. Within most organizations, data disparity (differences in data structure, format, and use of values) is a larger problem than just simple errors; it can be a major obstacle to the integration of data. One of the reasons data stewardship programs focus on defining terms and consolidating the language around data is because that is the starting point for getting to more consistent data.

Many governance and information asset programs are driven solely by compliance, rather than by the potential value to be derived from data as an asset. A lack of recognition on the part of leadership means a lack of commitment within an organization to managing data as an asset, including managing its quality (Evans and Price, 2012). (See Figure 94.)

Barriers to effective management of data quality include (list adapted from The Data Manifesto, https://dataleaders.org/):

- Lack of awareness on the part of leadership and staff
- Lack of business governance
- Lack of leadership and management
- Difficulty in justification of improvements
- Inappropriate or ineffective instruments to measure value

These barriers have negative effects on customer experience, productivity, morale, organizational effectiveness, revenue, and competitive advantage. They increase costs of running the organization and introduce risks as well. (See Chapter 11.)

Figure 94 Barriers to Managing Information as a Business Asset (diagram developed by Danette McGilvray, James Price, and Tom Redman; © 2017 dataleaders.org, used with permission; based on research by Dr. Nina Evans and James Price, www.dataleaders.org)
1.3.8.2 Issues Caused by Data Entry Processes

- Data entry interface issues: Poorly designed data entry interfaces can contribute to data quality issues. If a data entry interface does not have edits or controls to prevent incorrect data from being put in the system, data processors are likely to take shortcuts, such as skipping non-mandatory fields and failing to update defaulted fields.
- List entry placement: Even simple features of data entry interfaces, such as the order of values within a drop-down list, can contribute to data entry errors.
- Field overloading: Some organizations re-use fields over time for different business purposes rather than making changes to the data model and user interface. This practice results in inconsistent and confusing population of the fields.
- Training issues: Lack of process knowledge can lead to incorrect data entry, even if controls and edits are in place. If data processors are not aware of the impact of incorrect data, or if they are incented for speed rather than accuracy, they are likely to make choices based on drivers other than the quality of the data.
- Changes to business processes: Business processes change over time, and with these changes new business rules and data quality requirements are introduced. However, business rule changes are not always incorporated into systems in a timely manner or comprehensively. Data errors will result if an interface is not upgraded to accommodate new or changed requirements. In addition, data is likely to be impacted unless changes to business rules are propagated throughout the entire system.
- Inconsistent business process execution: Data created through processes that are executed inconsistently is likely to be inconsistent. Inconsistent execution may be due to training or documentation issues as well as to changing requirements.

1.3.8.3 Issues Caused by Data Processing Functions

- Incorrect assumptions about data sources: Production issues can occur due to errors or changes, inadequate or obsolete system documentation, or inadequate knowledge transfer (for example, when SMEs leave without documenting their knowledge). System consolidation activities, such as those associated with mergers and acquisitions, are often based on limited knowledge about the relationship between systems. When multiple source systems and data feeds need to be integrated, there is always a risk that details will be missed, especially with varying levels of source knowledge available and tight timelines.
- Stale business rules: Over time, business rules change. They should be periodically reviewed and updated. If there is automated measurement of rules, the technical process for measuring rules should also be updated. If it is not updated, issues may not be identified or false positives will be produced (or both).
- Changed data structures: Source systems may change structures without informing downstream consumers (both human and system) or without providing sufficient time to account for the changes. This can result in invalid values or other conditions that prevent data movement and loading, or in more subtle changes that may not be detected immediately.
1.3.8.4 Issues Caused by System Design

- Failure to enforce referential integrity: Referential integrity is necessary to ensure high quality data at an application or system level. If referential integrity is not enforced, or if validation is switched off (for example, to improve response times), various data quality issues can arise:
  - Duplicate data that breaks uniqueness rules
  - Orphan rows, which can be included in some reports and excluded from others, leading to multiple values for the same calculation
  - Inability to upgrade due to restored or changed referential integrity requirements
  - Inaccurate data due to missing data being assigned default values
- Failure to enforce uniqueness constraints: Multiple copies of data instances within a table or file expected to contain unique instances. If there are insufficient checks for uniqueness of instances, or if the unique constraints are turned off in the database to improve performance, data aggregation results can be overstated.
- Coding inaccuracies and gaps: If the data mapping or layout is incorrect, or the rules for processing the data are not accurate, the data processed will have data quality issues, ranging from incorrect calculations to data being assigned to or linked to improper fields, keys, or relationships.
- Data model inaccuracies: If assumptions within the data model are not supported by the actual data, there will be data quality issues ranging from data loss due to field lengths being exceeded by the actual data, to data being assigned to improper IDs or keys.
- Field overloading: Re-use of fields over time for different purposes, rather than changing the data model or code, can result in confusing sets of values, unclear meaning, and potentially structural problems, like incorrectly assigned keys.
- Temporal data mismatches: In the absence of a consolidated data dictionary, multiple systems could implement disparate date formats or timings, which in turn lead to data mismatch and data loss when data synchronization takes place between different source systems.
- Weak Master Data Management: Immature Master Data Management can lead to choosing unreliable sources for data, which can cause data quality issues that are very difficult to find until the assumption that the data source is accurate is disproved.
- Data duplication: Unnecessary data duplication is often a result of poor data management. There are two main types of undesirable duplication issues:
  - Single Source, Multiple Local Instances: For example, instances of the same customer in multiple (similar or identical) tables in the same database. Knowing which instance is the most accurate for use can be difficult without system-specific knowledge.
  - Multiple Sources, Single Instance: Data instances with multiple authoritative sources or systems of record. For example, single customer instances coming from multiple point-of-sale systems. When processing this data for use, there can be duplicate temporary storage areas. Merge rules determine which source has priority over others when processing into permanent production data areas.
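Several of the system design issues above (broken referential integrity, duplicate rows) can be detected with simple checks once the data is extracted. The sketch below measures the level of orphan records between a child and a parent data set, echoing the Integrity dimension described earlier; the table and column names are hypothetical.

```python
# Illustrative referential-integrity check: find child records whose foreign
# key has no match in the parent data set, and report the orphan rate.
# Table and column names are hypothetical.

customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": "A1", "customer_id": 1},
    {"order_id": "A2", "customer_id": 4},   # orphan: customer 4 does not exist
    {"order_id": "A3", "customer_id": 2},
]

def orphan_records(child_rows, parent_rows, fk, pk):
    """Return child rows whose foreign key value is missing from the parent keys."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]

orphans = orphan_records(orders, customers, fk="customer_id", pk="customer_id")
rate = 100.0 * len(orphans) / len(orders)
print(f"Orphan records: {len(orphans)} ({rate:.1f}% of the data set)")
```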
1.3.8.5 Issues Caused by Fixing Issues

Manual data patches are changes made directly on the data in the database, not through the business rules in the application interfaces or processing. These are scripts or manual commands, generally created in a hurry. Like any untested code, they have a high risk of causing further errors through unintended consequences, by changing more data than required, or by not propagating the patch to all historical data affected by the original issue. Most such patches also change the data in place, rather than preserving the prior state and adding corrected rows.

These changes are generally NOT undo-able without a complete restore from backup, as there is only the database log to show the changes. Therefore, these shortcuts are strongly discouraged: they are opportunities for security breaches and for business disruption that lasts longer than a proper correction would cause. All changes should go through a governed change management process.

1.3.9 Data Profiling

Data Profiling is a form of data analysis used to inspect data and assess quality. Data profiling uses statistical techniques to discover the true structure, content, and quality of a collection of data (Olson, 2003). A profiling engine produces statistics that analysts can use to identify patterns in data content and structure. For example:

- Counts of nulls: Identifies whether nulls exist and allows for inspection of whether they are allowable or not
- Max/Min value: Identifies outliers, like negatives
- Max/Min length: Identifies outliers or invalids for fields with specific length requirements
- Frequency distribution of values for individual columns: Enables assessment of reasonability (e.g., distribution of country codes for transactions, inspection of frequently or infrequently occurring values, as well as the percentage of the records populated with defaulted values)
- Data type and format: Identifies the level of non-conformance to format requirements, as well as identification of unexpected formats (e.g., number of decimals, embedded spaces, sample values)

Profiling also includes cross-column analysis, which can identify overlapping or duplicate columns and expose embedded value dependencies. Inter-table analysis explores overlapping value sets and helps identify foreign key relationships. Most data profiling tools allow for drilling down into the analyzed data for further investigation.

Results from the profiling engine must be assessed by an analyst to determine whether data conforms to rules and other requirements. A good analyst can use profiling results to confirm known relationships and uncover hidden characteristics and patterns within and between data sets, including business rules and validity constraints. Profiling is usually used as part of data discovery for projects (especially data integration projects; see Chapter 8) or to assess the current state of data that is targeted for improvement. Results of data profiling can be used to identify opportunities to improve the quality of both data and Metadata (Olson, 2003; Maydanchik, 2007).

While profiling is an effective way to understand data, it is just a first step to data quality improvement. It enables organizations to identify potential problems. Solving problems requires other forms of analysis, including business process analysis, analysis of data lineage, and deeper data analysis that can help isolate root causes of problems.
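A very small profiling pass can already surface the statistics listed above. The sketch below computes null counts, min/max values and lengths, and a frequency distribution for each column of a list of records; it is a minimal illustration, not a substitute for a profiling engine.

```python
from collections import Counter

def profile_column(rows, column):
    """Produce basic profiling statistics for one column of a record set."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    lengths = [len(str(v)) for v in non_null]
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min_value": min(non_null) if non_null else None,
        "max_value": max(non_null) if non_null else None,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "top_values": Counter(non_null).most_common(3),
    }

rows = [
    {"country_code": "US", "amount": 120.0},
    {"country_code": "US", "amount": -5.0},   # negative amount: possible outlier
    {"country_code": "", "amount": 300.0},    # missing country code
    {"country_code": "DE", "amount": 42.5},
]

for col in ("country_code", "amount"):
    print(col, profile_column(rows, col))
```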
Some enhancements are gained by integrating data sets internal to an organization. External data can also be purchased to enhance organizational data (see Chapter 10). Examples of data enhancement include: Time/Date stamps: One way to improve data is to document the time and date that data items are created, modified, or retired, which can help to track historical data events. If issues are detected with the data, timestamps can be very valuable in root cause analysis, because they enable analysts to isolate the timeframe of the issue. Audit data: Auditing can document data lineage, which is important for historical tracking as well as validation. Reference vocabularies: Business specific terminology, ontologies, and glossaries enhance understanding and control while bringing customized business context. Contextual information: Adding context such as location, environment, or access methods and tagging data for review and analysis. Geographic information: Geographic information can be enhanced through address standardization and geocoding, which includes regional coding, municipality, neighborhood mapping, latitude / longitude pairs, or other kinds of location-based data. Demographic information: Customer data can be enhanced through demographic information, such as age, marital status, gender, income, or ethnic coding. Business entity data can be associated with annual revenue, number of employees, size of occupied space, etc. Order 11611 by LEXIE MAY on August 25, 2017 472 D MBO K 2 Psychographic information: Data used to segment the target populations by specific behaviors, habits, or preferences, such as product and brand preferences, organization memberships, leisure activities, commuting transportation style, shopping time preferences, etc. Valuation information: Use this kind of enhancement for asset valuation, inventory, and sale. 1.3.10.3 Data Parsing and Formatting Data Parsing is the process of analyzing data using pre-determined rules to define its content or value. Data parsing enables the data analyst to define sets of patterns that feed into a rule engine used to distinguish between valid and invalid data values. Matching specific pattern(s) triggers actions. Data parsing assigns characteristics to the data values appearing in a data instance, and those characteristics help in determining potential sources for added benefits. For example, if a data value is identified as the name of a business rather than the name of a person. Use the same approach for any situation in which data values organize into semantic hierarchies such as sub-parts, parts, and assemblies. Many data quality issues involve situations where variation in data values representing similar concepts introduces ambiguity. Extract and rearrange the separate components (commonly referred to extracted and rearranged into a standard representation to create a valid pattern. When an invalid pattern is recognized, the application may attempt to transform the invalid value into one that meets the rules. Perform standardization by mapping data from some source pattern into a corresponding target representation. For example, consider the different ways telephone numbers expected to conform to a numbering plan are formatted. While some have digits, some have alphabetic characters, and all use different special characters for separation. People can recognize each one as a telephone number. 
However, to determine if these numbers are accurate (perhaps by comparing them to a master customer directory), or to investigate whether duplicate numbers exist when there should be only one for each supplier, the values must be parsed into their component segments (area code, exchange, and line number) and then transformed into a standard format. Another good example is a customer name, since names may be represented in thousands of different forms. A good standardization tool will be able to parse the different components of a customer name, such as given name, middle name, family name, initials, titles, generational designations, and then rearrange those components into a canonical representation that other data services will be able to manipulate. The human ability to recognize familiar patterns contributes to an ability to characterize variant data values belonging to the same abstract class of values; people recognize different types of telephone numbers because they conform to frequently used patterns. An analyst describes the format patterns that all represent a data object, such as Person Name, Product Description, and so on. A data quality tool parses data values that conform to any of those patterns, and even transforms them into a single, standardized form that will simplify the assessment, similarity analysis, and r
