Data Integration Lecture Notes PDF
Document Details
Menoufia National University
Amira Abdelatey
Summary
This document presents lecture notes on data integration, covering topics such as data preprocessing, data manipulation techniques (joins), data warehousing, and ETL processes. The notes also discuss issues related to data heterogeneity and cleaning.
Full Transcript
Data Integration – Chapter 3
Assoc. Prof. Amira Abdelatey

Chapter 3: Data Preprocessing
Data Preprocessing: An Overview – Data Quality – Major Tasks in Data Preprocessing
Data Cleaning
Data Integration – Manipulating Data – Heterogeneity Problems – ETL
Data Reduction
Data Transformation and Data Discretization

Data Integration
Data integration: combining data from multiple sources into a unified view
– To improve data quality
– To enrich data with additional information
– To allow reliable data analytics
Integrating in-house data together within a data warehouse is relatively straightforward (with common attributes and structures across schemas).

Manipulating Data – Joining
Joining tables:
– Extract and simultaneously process data from more than one table.

Manipulating Data – Inner Join
By default, a joining query performs an inner join, which includes only matching rows in the results.

Manipulating Data – Full Outer Join
A full outer join includes all rows from both tables.

Manipulating Data – Left Join
A left join includes all rows from the left table.
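The three join types above can be illustrated with a small sketch. The lecture discusses joins in general (typically as SQL queries); the snippet below is a minimal Python/pandas equivalent, and the tables `customers` and `orders` are hypothetical examples, not from the lecture material.

```python
# A minimal sketch of inner, full outer and left joins using pandas.
# The `customers` and `orders` tables are illustrative assumptions.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Amira", "Omar", "Sara"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 4],   # customer 4 has no matching customer row
    "amount": [250.0, 99.9, 40.0],
})

# Inner join: only rows whose customer_id appears in both tables.
inner = customers.merge(orders, on="customer_id", how="inner")

# Full outer join: all rows from both tables; unmatched sides become NaN.
outer = customers.merge(orders, on="customer_id", how="outer")

# Left join: all rows from the left table (customers), matched where possible.
left = customers.merge(orders, on="customer_id", how="left")

print(inner, outer, left, sep="\n\n")
```

The three `how=` options correspond to SQL's INNER JOIN, FULL OUTER JOIN and LEFT JOIN, respectively.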
Difficulties of Integrating Data
The main problem is the heterogeneity among the data sources.
Source Type Heterogeneity
– The systems storing the data can be different.

Heterogeneity Problems
1. Schema Heterogeneity
– The structure of the tables storing the data can be different (even if they store the same data).
2. Data Type Heterogeneity
– Storing the same data (and values) but with different data types
– E.g., storing the phone number as a string or as a number
– E.g., storing the name as fixed length or variable length
3. Value Heterogeneity
– The same logical values stored in different ways
– E.g., 'Prof', 'Prof.', 'Professor'
– E.g., 'Right', 'R', '1' … 'Left', 'L', '-1'
4. Entity Identification
– E.g., Bill Clinton = William Clinton

Data Warehouse
A data warehouse is a system used for reporting and data analysis.
Integrating data from one or more different sources creates a central repository of data: a data warehouse.

ETL Process
(Diagram: data flows from the sources through ETL into the target.)

Pre-ETL Tasks
Data Profiling:
– Know your data.
– Find out how many rows have missing or invalid values, or what the distribution of values in a specific column is.
This knowledge will help to specify rules (data standards / quality checks) in order to cleanse the data and to keep really bad data out of the repository. Doing data profiling before designing the ETL process helps you design a system that is correct, robust and has a clear structure.

ETL
Data is moved from source to target databases.
Focus: preparing the data for reporting / analysis.
ETL = Extract -> Transform -> Load
– Extract: get the data from the source(s) as efficiently as possible
– Transform: perform calculations / map data / clean data

Dirty Data
– Absence of data / missing data
– Cryptic data
– Contradicting data
– Non-unique identifiers
– Data integration problems

Data Cleaning in Integration

Data Cleaning: Parsing / Combining
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
– Example: a full name stored in one column
Combining locates and identifies individual data elements in the source files and then combines these data elements in the target files.
– Example: an event date stored in different columns as day, month and year

Data Cleaning: Correcting
Correct parsed individual data components using sophisticated data algorithms and secondary data sources.
Correct data according to data rules.
– Example: converting the combined date into a standard date format

Data Cleaning: Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom data rules.

Data Cleaning: Matching
Searching and matching records within and across the parsed, combined, corrected and standardized data, based on predefined data rules, to eliminate duplications.

Data Staging
Data staging is the process of preparing and organizing data before it moves to its final destination, including cleaning and transformation.
– Used as an interim step between data extract and load
– Data in the staging file is transformed and then loaded into the warehouse
– There is no end-user access to the staging file

Data Transformation
Transforms data in accordance with the data rules and standards that have been established.
Examples include: format changes, splitting up fields, replacement of codes, derived values, and aggregates.
(A minimal sketch of these cleaning and transformation steps follows the ETL Tools list below.)

ETL Tools
A few popular commercial and freeware (open-source) ETL tools:
Freeware / open-source ETL tools:
– Pentaho Data Integration (PDI) - Kettle
– Talend Integrator Suite
– CloverETL
– Jasper ETL
Commercial ETL tools:
– IBM Infosphere DataStage
– Informatica PowerCenter
– Oracle Warehouse Builder (OWB)
– Oracle Data Integrator (ODI)
– SAS ETL Studio
– Business Objects Data Integrator (BODI)
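As referenced in the Data Transformation section above, here is a minimal Python/pandas sketch of the cleaning steps the notes describe: profiling, parsing a full name, combining day/month/year columns into one standard date, standardizing equivalent values, and matching to remove duplicates. The column names, sample rows and the title mapping are illustrative assumptions, not part of the lecture material.

```python
# A rough sketch of the ETL cleaning steps described in the notes, using pandas.
# All column names, sample data and mapping rules below are illustrative assumptions.
import pandas as pd

# --- Extract: a small, hypothetical source table -------------------------
raw = pd.DataFrame({
    "full_name": ["Prof. Amira Abdelatey", "Amira Abdelatey", "Omar Ali"],
    "title":     ["Prof", "Professor", None],
    "day":       [3, 3, 17],
    "month":     [9, 9, 1],
    "year":      [2023, 2023, 2024],
    "phone":     ["0101234567", "0101234567", "0109876543"],
})

# --- Pre-ETL profiling: missing values and value distributions -----------
print(raw.isna().sum())             # count of missing values per column
print(raw["title"].value_counts())  # distribution of values in one column

# --- Parsing: split the single full_name column into individual elements -
raw[["first_name", "last_name"]] = (
    raw["full_name"].str.replace("Prof. ", "", regex=False)
                    .str.split(" ", n=1, expand=True)
)

# --- Combining + correcting: build one standard date from day/month/year -
raw["event_date"] = pd.to_datetime(raw[["year", "month", "day"]])

# --- Standardizing: map equivalent values to one preferred form ----------
title_map = {"Prof": "Professor", "Prof.": "Professor", "Professor": "Professor"}
raw["title"] = raw["title"].map(title_map)

# --- Matching: drop duplicate records based on a predefined rule ---------
clean = raw.drop_duplicates(subset=["first_name", "last_name", "phone"])

# --- Load: in a real ETL job the result would be written to the warehouse
# (e.g. via a database connector); here we just display the cleaned frame.
print(clean)
```

In a staging-based design, the `raw` frame would sit in the staging area, the transformation steps would run there, and only `clean` would be loaded into the warehouse, matching the staging workflow described in the notes.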