SQL Final Exam PDF
Document Details
Uploaded by Deleted User
Tags
Summary
This document provides an overview of data integration techniques which are important for SQL. The techniques explained cover data consolidation, federation, and propagation. It also includes an explanation of ETL (extract, transform, load) processes. Data warehousing and data marts are also discussed.
Full Transcript
Data Integration Integration is the process of consolidating data from different data sources to support business needs into a view 3 techniques that form the building blocks of any integration approach Data Consolidation Data Federation Data Propagation Data Consolidation Usually...
Data Integration Integration is the process of consolidating data from different data sources to support business needs into a view 3 techniques that form the building blocks of any integration approach Data Consolidation Data Federation Data Propagation Data Consolidation Usually done for Data Warehousing using Extract Transform Load techniques Data warehousing is the process of creating and using a data warehouse. A data warehouse combines data from different sources to help with reports, queries, and making decisions. Warehousing is to consolidate data in one database that is optimized for reads Read only databases Not flexible (has a predefined schema) Requires constant batch processes to refresh the warehouse Data Federation Data Federation is a method of combining data from different sources into a virtual database. It is an approach to data integration that provides a single source of data for front end applications Consolidation on demand (from constituent database(s)) Consolidation on demand refers to the process of combining data from various databases into a single, unified database when needed Provides virtual views to consolidated data Flexible (no schema defined in advance) Only copies or transforms data when needed Can affect the performance of the transactional OLTP databases Enterprise Information Integration (EII) is a common term used to apply to data federation Data Propagation Data propagation is the process of copying data within a case hierarchy. This helps share information among cases, saving time and providing relevant data to caseworkers. It's not just limited to subcases but can also occur when creating new, related cases. Essentially, data propagation happens when a new case is created Data is replicated across the databases Can be synchronous (data is consistent) or asynchronous Near real time or batch professing Possible near real time data availability Too many duplicates Enterprise Application Integration (EAI) and Enterprise Data Replication techniques are used for data propagation ETL Extract Transform Load is a process that used to integrate data or move data from one source to another Extract Process of getting data from the data sources usually the data will be extracted to a temporary repository called STAGING DATABASE or TEMPORARY REPOSITORY IT IS A COMMON PRACTICE TO FIRST LOAD THE DATA INTO THESE temporary repository Static extract is when the whole data is copied Incremental extract is when the deltas (mean new or modified records) the changes are copied Cleanse Extracted Data Integrating difference sources of data exposes quality issues in the original data sets Extract process should have processes that expose and reject erroneous data Extract process must not fix the errors. Instead it will send an error report to the owners of the data source to fix the data at the source Transform Record level transformation is a process of changing the structure of the source tables through: Selection: using WHERE clause to restrict rows in the select statement Joining: projecting columns from more than one tables Normalization Aggregation: converting data from more detailed to summary statistics on the data Transform - Field Level Transformation Is concerned with changing the data in the fields (columns) that include Data type changes Use functions and algorithms to change data from source to destination Load Insert the transformed data to its final destination (usually a data warehouse or a data mart) Build indexes if required Data Warehouses and Data marts Databases that are used to analyze and summarize data and report on the business entities Characteristics Subject Oriented - focused on business entities Integrated - comes from multiple systems and need to fit one structure Time-variant - show change over time Read only - cannot be updated Need for data warehouse We often need to integrate data from multiple systems in order to be able to analyze a complex business problem Separating reporting databases from operational databases (OLTP) enhances the performance DATA WAREHOUSE VS DATA MART Data mart is a subset of Data warehouse Data mart use dimensional models Usually centered around one subject (eg sales, purchasing , inventory) Have fewer data sources Fact table has quantitative values Dimension tables has descriptive values Grain of the fact table The minimum unit of reporting should be an intersection of all the dimensions