ETL Process PDF
Document Details
Uploaded by DelectableLearning8178
null
Tags
Summary
This document provides an overview of the ETL process, covering aspects such as data extraction, transformation, and loading. It also delves into various data cleaning techniques, and different types of data operations. The content discusses the ETL process, in relation to business intelligence and database administration in the context of data warehousing.
Full Transcript
Business Intelligence and Database 1 Administration A data warehouse is typically built with a three-layer architecture: 1. Warehouse server (data server); 2. OLAP server (such as HOLAP/MOLAP or ROLAP);...
Business Intelligence and Database 1 Administration A data warehouse is typically built with a three-layer architecture: 1. Warehouse server (data server); 2. OLAP server (such as HOLAP/MOLAP or ROLAP); Administration Business Intelligence and Database 3. Customer server: tool for executing requests; tool for data analysis. 2 Incomplete data “Unavailable” at time of collection Time difference between acquisition and analysis Human problems, software, hardware Administration Business Intelligence and Database Noisy or incorrect data Error in collection instruments Human error or entry system Transmission error Buffer overflow Inconsistent data Different data sources 3 Transgression of a functional dependency rule Duplicate data (synonyms, polyses,...) Poor data quality -> poor results Quality of decisions is based on data quality Administration Business Intelligence and Database ex. generate duplicate or missing data, inaccurate or misleading statistics Extraction, cleaning and treatment represent most of the data warehouse construction work Data warehouse must be able to integrate constant quality data 4 Extract Administration Business Intelligence and Database Transform Filter ETL is a back Sort room process Homogenize that isn’t Clean.. visible to end- user. 5 Load Extraction is to collect data from multiple targeted sources as SQL or NoSQL databases, cloud platforms or XML files Extraction is the most complicated task in the ETL Administration Business Intelligence and Database process, because many sources are in a way that lacks the quality or quantity required (unsatisfactorily), and determining the eligibility for extraction is not an easy process. Extraction needs a lot of work during the research phase, because before doing anything you should understand 6 your data correctly, and it’s a continuous process The data has to be extracted normally not only once, but several times in a periodic manner to supply all changed data to the warehouse and keep it up-to-date. Administration Business Intelligence and Database This extracted data is often stored temporarily in a staging area in a database to confirm data integrity and to apply any necessary business rules There are two types of extraction: logical extraction and physical extraction, each of them has other types inside it 7 There are two kinds of logical extraction: 1. Full extraction: data from the source is extracted completely, So extracted data reflects all the data currently available on the source system. Full extraction also used Administration Business Intelligence and Database when the system can’t identify which data is updated, in this situation, We get a full copy of the latest extraction, then start identifying changes 2. Incremental extraction: keeps track of updated data in source systems since the last successful extraction To extract and load only new or changed parts not the whole data like Full extraction the extraction tool should recognize new 8 or updated data using time of adding or updating. Business Intelligence and Database 9 Administration There are two kinds of physical extraction: 1. Online extraction: Here, data is extracted directly from source systems, the Extraction process connects directly to source systems and there’s no need for any external file, So Administration Business Intelligence and Database that we called it online. 2. Offline extraction: Data isn’t extracted directly from source systems, first, it’s copied to an external file, then our extraction process connects to that external file and starts processing. So when you want to start transforming your data, you can fetch records from the external file instead of 10 access the source directly. Business Intelligence and Database 11 Administration The second stage of the ETL process is transformation, which is when the data is transformed to meet the schema and requirements of the destination. Data transformation refers to converting the structure or format of a Administration Business Intelligence and Database data set to match that of the target system. This involves data mapping, linking data from two or more sources, data conversion, and data cleansing. 12 Generally, it takes more than one conversion to convert data into a consistent format, such as join, lookup, filter, expressions, and aggregation. Administration Business Intelligence and Database The transformation work in ETL takes place in a specialized engine, and it often involves using staging tables (staging Area) to temporarily hold data as it is being transformed and ultimately loaded to its destination. Example: unify date format (MM/DD/YY) to (DD/MM/YY) 13 Very important step that guarantees the consistency and reliability of the data Basic Transformations Remove duplicated data. Mapping Null values to 0 or mean or median of the column. Format conversion, like: convert ID from integer to string. Establish key relationships between tables. Administration Business Intelligence and Database Advanced Transformations Splitting columns into multiple columns. Joining data from multiple sources together. Filter rows/columns. Derive new columns from existing columns. Aggregate data from multiple sources. Data validation. 14 It involves placing the data into the target system, typically a cloud data warehouse, where it is ready to be analyzed by BI tools or data analytics tools. Administration Business Intelligence and Database It’s the last step of the ETL process. In this process we need to load a huge amount of data in short periods, So we have to optimize the performance of this process. If the load process failed, We have to configure a recovery mechanism that should restart from the 15 point of failure without any integrity problems. There are three types of load: 1. Initial load: All data from source system will be loaded one- time at the target system We take the whole data set from the staging area and load it into its data warehouse. Administration Business Intelligence and Database 2. Incremental load: we apply all the ongoing changes from the source systems into the data warehouse periodically Only updated records and new ones are loaded. ETL system should check for new and updated rows. 3. Full refresh: Easier than incremental load but more time- 16 consuming. Delete all data available in the target system, then load full data again from the source system Business Intelligence and Database 17 Administration Business Intelligence and Database 18 Administration Business Intelligence and Database 19 Administration A semantic data model is a conceptual model that describes the meaning of the data elements it contains. Organizations often have their own terms for things, sometimes with synonyms, or even different meanings for the same term. Administration Business Intelligence and Database Example: an inventory database might track a piece of equipment with an asset ID and a serial number, but a sales database might refer to the serial number as the asset ID. There is no simple way to relate these values without a model that describes the relationship. 20 Semantic modeling provides a level of abstraction over the database schema, so that users don't need to know the underlying data structures. Makes it easier for end users to query data without Administration Business Intelligence and Database performing aggregates and joins over the underlying schema. Usually columns are renamed to more user-friendly names, so that the context and meaning of the data are more obvious. 21 Semantic models include: Hiding tables, columns and relationships that are not relevant to the business like surrogate keys and management columns. Administration Business Intelligence and Database Renaming tables and columns if necessary (although your data warehouse, if modeled properly, should already be using the correct naming standard). Adding hierarchies to Dimensions. Adding calculated measures to Facts.. 22 There are two primary types of semantic models: 1. Tabular.: Uses relational modeling constructs (model, tables, columns). Code and script use OLAP metadata. Administration Business Intelligence and Database 2. Multidimensional: Uses traditional OLAP modeling constructs (cubes, dimensions, measures). 23 Aggregation behaviors are set so that reporting tools display them properly. Business logic and calculations are defined. Administration Business Intelligence and Database Time-oriented calculations are included. Data is often integrated from multiple sources For these reasons, the semantic layer is placed over a data warehouse 24 Semantic models can help business users abstract relationship complexities and make it easier to analyze data quickly 25 An organization has data stored in a large database. It wants to make this data available to business users and customers to create their own reports and do some analysis. 1. One option is just to give those users direct access to the database. However, there are several drawbacks to doing this: Managing security and controlling access Administration Business Intelligence and Database design of the database hard for a user to understand Users would need to know which tables to query how those tables should be know a query language like SQL.. 2. Another option is to encapsulate all of the information that users need into a semantic model: The semantic model can be more easily queried by users with a reporting tool of their choice The data provided by the semantic model is pulled from a data warehouse, ensuring that all users see a single version of the truth 26 The semantic model also provides friendly table and column names, relationships between tables, descriptions, calculations, and row-level security. OLTP systems is constantly updated through transactions flowing in from various sources, OLAP data stores are typically refreshed at a much slower intervals, depending on business needs This means OLAP systems are better suited for strategic business decisions, rather than immediate responses to changes. Administration Business Intelligence and Database Unlike traditional, normalized, relational tables found in OLTP systems, OLAP data models tend to be multidimensional. This makes it difficult or impossible to directly map to entity-relationship or object-oriented models, where each attribute is mapped to one column. Instead, OLAP systems typically use a star or snowflake schema in place of traditional normalization. 27 Often, the three ETL phases are run in parallel to save time. Example: while data is being extracted, a transformation process could be working on data already received and Administration Business Intelligence and Database prepare it for loading, and a loading process can begin working on the prepared data, rather than waiting for the entire extraction process to complete. 28 ELT is an abbreviation for Extract, Load, and Transform. It is a process that extracts data from a source system to a target system and then transforms the data. Unlike ETL, where data transformation takes place in a staging Administration Business Intelligence and Database area before being loaded into the target system, ELT extracts the raw data directly to the target system and transform it there We can work with various types of data by getting it directly from source to destination without caring of the schema that will handle this data, and after loading it we can start thinking in transformations that needed to handle this data and 29 integrate it together data lakes adopt ELT. Business Intelligence and Database 30 Administration Pros: data lakes adopt ELT and cloud data warehouses and it’s one of the factors that lead to implementing ELT. There is better performance and data integrity as it works with high-end data devices such as Hadoop cluster and Administration Business Intelligence and Database cloud. ELT needs less time and resources as the data is transformed and loaded in parallel. The volume of data can be very large. Cons: There are limited tools and expertise available that fully support ELT operations. 31 ETL ELT A highly supported A different approach from process and used widely traditional ETL so not all Availability of tools and for more than 15 years, tools or solutions support experts there’s a lot of readily this approach natively expertise, developers, and a large community. Administration Business Intelligence and Database Data passed through the Data is loaded directly multi-staged process ( from source to target loading to the staging area system, then and make Transformations take transformations) before place with the needed Order of the Processes being loaded to the data data for analytics or any warehouse the loading other purpose loading time from source to time is better than ETL destination is longer than ELT. 32 ETL ELT works with the data A different approach from warehouse (commonly), traditional ETL so not all Data Warehousing and it also supports cloud tools or solutions support Support data warehouses, but it this approach natively works only with Administration Business Intelligence and Database structured-based schema and OLAP Working with a small to Working with a huge Data Size medium amount of data amount of data, and any that required complex type of data structured, transformations. unstructured and semi- structured. 33 Administration Business Intelligence and Database 34 Source: https://www.businessintelligencesoftware.co/fr/bi-tools-market-share-2017.html ETL DWH OLAP Reporting Data Mining Octopus MySql Mondrian Birt Weka Administration Business Intelligence and Database Kettle Postgresql Palo Open Report R-Project CloverETL Greenplum/Biz Jasper Report Orange Talend gres JFreeReport Xelopes Integrated Tool Pentaho (Kettle, Mondrian, JFreeReport, Weka) SpagoBI 35 36 Business Intelligence and Database 37 Administration Product Country Oranges Product Pears Spain PK id_prodct Apples Germany label Administration Business Intelligence and Database category Purchase France PK id_purchase FK id_client id_product Apple sales client January April in Germany quantity PK id_client in April February name Time 38 address Online analytical processing (OLAP) is a technology that organizes large business databases and supports complex analysis. It can be used to perform complex analytical queries without negatively affecting transactional systems The databases that a business uses to store all its transactions and Administration Business Intelligence and Database records are called online transaction processing (OLTP) databases. These databases usually have records that are entered one at a time. They contain a great deal of information that is valuable to the organization. Were not designed for analysis. OLAP systems were designed to help extract business intelligence information from the data in a highly performant 39 way. OLAP databases are optimized for heavy read, low write workloads. Business Intelligence and Database 40 Administration Relational OLAP: Data stored in a relational database OLAP engine allows to simulate the behavior of multidimensional DBMS Administration Business Intelligence and Database Easier and cheaper to implement Less efficient during the calculation phases Examples of ROLAP engines: Mondrian 41 Multi dimensional OLAP: Use a “pure” multidimensional system that manages native multidimensional structures (cubes) Direct access to data in the cube Administration Business Intelligence and Database Harder to implement Often proprietary formats Designed exclusively for multidimensional analysis Examples of MOLAP engines: Microsoft Analysis Services Hyperion 42 OLAP Hybrid: Fact tables and dimension tables stored in relational DBMS (basic data) aggregate data stored in cubes Administration Business Intelligence and Database Hybrid solution between MOLAP and ROLAP Good compromise in terms of cost and performance 43 Multidimensional data modelling to facilitate the analysis of a quantity in different dimensions: Time Geographical location … Administration Business Intelligence and Database Calculations are performed when loading or updating the cub 44 Operation affecting the structure: Rotate: show another side of the cube 05 06 07 05 06 07 Administration Business Intelligence and Database Eggs 221 263 139 Id 101 120 52 Meat 275 257 116 Loc 395 400 203 45 Operation affecting the structure: Slicing: consists in working on only one slice of the cube. One of the dimensions is then reduced to a single value 06 Administration Business Intelligence and Database 05 06 07 Eggs Id 220 265 284 Eggs Id 265 Loc 225 245 240 Loc 245 Meat Id 163 152 145 Meat Id 152 Loc 187 174 184 Loc 174 46 Operation affecting the structure: Dicing (extracting a data block): work only under a sub-cube 05 06 07 Eggs Id 220 265 284 05 06 07 Loc 225 245 240 Eggs Id 220 265 284 Meat Id 163 152 145 Loc 225 245 240 Loc 187 174 184 47 Operation affecting the granularity: Roll-up drilling: “zoom out” Obtain a higher level of granularity Administration Business Intelligence and Database Use of aggregation functions Drill-down: “zoom in” Obtain a lower level of granularity More detailed data 48 Roll up 05 06 07 Time Roll up Food 496 520 255 Dimension 05-07 05 06 07 Administration Business Intelligence and Database Fruit 623 Fruit 221 263 139 Meat 648 Meat 275 257 116 05 06 07 Drill down Apple 20 19 22 … … … … Beef 40 43 48 49 Product Dimension Drill down A calculation/query language to express queries for online analytical processing - OLAP, in a database management system. Language to define, use and retrieve data from Administration Business Intelligence and Database multidimensional objects Performs the operations described above Equivalent SQL for OLAP world an SQL extension to query data stored in a multidimensional structure. As for multidimensional databases they relate to OLAP cubes intended for reporting and analysis. Origin: Microsoft 50 Provide axis/attributes (time, payment) of a company in the years 2004 and 2005 crossed by the type of payment SELECT {([Time].), ([Time].)} ON COLUMNS, Administration Business Intelligence and Database {[Pay].[Pay Type].Members} ON ROWS FROM RH Cube Dimensions, Dimensions, Analysis axes axis d’analyse WHERE ([Measures].[Count]) 2004 2005 Hours 3396 4015 51 Days 3678 2056 Business Intelligence and Database 52 Administration Business Intelligence and Database 53 Administration Business Intelligence and Database 54 Administration Business Intelligence and Database 55 Administration Administration Business Intelligence and Database Calendar.Years have been selected on the rows, and Staff.ServiceRole on the columns. The measure is the calculated measure Cleaning Hours: Cleaning Hours := CALCULATE( SUM ( [Hours] ), dim_Task[ServiceTypeName] = “Cleaning”) At the intersection of the rows and columns is a cell. 56 The calculated measure is calculated for each cell. Take Year 2013, and service role “Field Coordinator” in the pivot table. The cell at the intersection of these two attributes shows 1.26. The cell value is calculated in the following way: 1. The current member of the Service Role attribute is set to “Field Coordinator” filtering the Time Record Fact rows result set to only Administration Business Intelligence and Database those associated with Field Coordinators. 2. The current member of the Year attribute is set to 2013 further filtering the Time Record Fact rows result set to only those that occurred in 2013. 3. Then the calculated measure in the cell is evaluated: 1. First the filter is applied dim_Task[ServiceTypeName] = “Cleaning”, again further filtering the Fact row result set to only TimeRecord fact rows for cleaning Tasks 2. then the measure aggregation (SUM([Hours])) is applied to the 57 Hours column of the TimeRecord Fact table over the remaining rows in the result set, summing Hours to return the result 1.26.