Data Analytics PDF
Document Details
Uploaded by UnparalleledRegionalism
Tags
Summary
This document discusses data analytics, specifically data warehousing. It details the components of a data warehouse, including ETL tools, metadata, and access tools. Different types of data warehousing approaches like ROLAP and MOLAP are also explained in this handout. Examples, technical details and classifications are given in the document.
Full Transcript
IT2003 Data Analytics Components of Data Warehouse (DW) A. Data Warehouse Data Warehouse Database – This is a databank that stocks all...
IT2003 Data Analytics Components of Data Warehouse (DW) A. Data Warehouse Data Warehouse Database – This is a databank that stocks all enterprise data and makes it manageable for reporting. A data warehouse is a database designed to enable and support always implemented on the relational database management business intelligence (BI) activities, especially analytics. system (RDBMS) technology like SQL intended to perform queries and analysis Extraction, Transformation, and Loading Tools (ETL) – These optimized for data retrieval, not for transaction processing tools are used for performing all the conversions, summarizations, centralizes and consolidates large amounts of data from and all the changes needed to transform data into a unified format multiple sources in the data warehouse. These include: allows organizations to derive valuable business insights from In case of missing data, populating them with defaults their data to improve decision-making Calculating summaries and derived data can be considered an organization’s “single source of truth" Eliminating unwanted data in operational databases from loading into the data warehouse Characteristics of Data Warehouses (DW) Converting to common data names and definitions Subject-Oriented – The DW can analyze data about a particular Metadata – is data about data that describes the data warehouse. subject or functional area. It provides the source, transformation, integration, storage, usage, Subjects can be products, customers, departments, regions, relationships, and history of each data element. etc. Example: A line in a sales department database contains: The functional area can be sales, marketing, finance, SNY-JP-0010-15000 distribution, etc. This is meaningless data until we consult the meta that tell us the Focuses on the data rather than on the processes that modify following: the data Brand Model: Sony Integrated – The DW creates consistency among different data Country Manufactured: Japan types from different sources. Product ID: 0010 Example: A student’s level in the database might be defined as Price: ₱15,000 “freshman”, “sophomore”, “junior”, or “senior” in the accounting Metadata can be classified into two (2) categories: department, and “FR”, “SO”, “JR”, “SR” in the computer information 1. Technical Metadata – contains information about the systems department. warehouse, which is used by data warehouse designers and DW must conform to a common format that is acceptable administrators. throughout the organization. 2. Business Metadata – contains details that give end-users an Time-variant – Data in DW represents the flow of data through easy way to understand the information stored in the data time. It can be organized weekly, monthly, or annually, etc. warehouse. Example: When data for previous weekly sales is uploaded to the Data Warehouse Access Tools – Corporate users generally data warehouse, the weekly, monthly, yearly, and other time- cannot work with databases directly. They use the assistance of dependent subjects such as products, customers, stores, etc. are the following tools: also updated. Query and reporting tool – help users produce corporate Non-volatile – Once data is in a data warehouse, it is stable and reports for analysis that can be in the form of spreadsheets, does not change. calculations, or interactive visuals. Application development tools – In such cases, custom reports are developed using application development tools 04 Handout 1 *Property of STI [email protected] Page 1 of 6 IT2003 when built-in graphical and analytical tools do not satisfy the 4. Helps to reduce total turnaround time for analysis and reporting analytical needs of an organization. 5. Restructuring and integration make it easier for the user to use for Data mining – a process of discovering meaningful new reporting and analysis. correlations, patterns, and trends by mining a large amount of 6. Stores a large amount of historical data. This helps users to analyze data. Data mining tools are used to make this process different time periods and trends to make future predictions. automatic. OLAP tools – allow users to analyze the data using elaborate and complex multi-dimensional views. Data Marts – a small, single-subject data warehouse subset that provides decision support for the particular user group. Figure 1. Data Warehouse Architecture Benefits of a Data Warehouse Star Schema 1. Allows business users to quickly access critical data from some A star schema is a data-modeling technique used to map multi- sources all in one place. Therefore, it saves the user's time of dimensional decision support data into a relational database. retrieving data from multiple sources It has two (2) common components: 2. Provides consistent information on various cross-functional activities. Facts table – are data that will be included in reports and It is also supporting ad-hoc reporting and query. used as the basis of business decisions. It contains 3. Helps to integrate many sources of data to reduce stress on the measurement or facts to the data and foreign key to production system dimension table. 04 Handout 1 *Property of STI [email protected] Page 2 of 6 IT2003 Dimension table – are attributes that qualify and provide OLAP OLTP more information about facts. It contains dimensions of a fact Provides historical data for Manages day to day operations and they are joined to fact table via foreign key. reporting and planning Example: Uses complex queries for Uses standard queries for data retrieving a large amount of data such as inserting, deleting, and updating Characteristics of OLAP Multi-dimensional data analysis techniques – Data is processed and viewed as part of a multi-dimensional structure. Advanced Database support – To deliver efficient decision support, OLAP tools must have the following: Access to many kinds of DBMSs, flat files, and internal and external data sources Rapid and consistent query response times Support for very large databases because the data warehouse could easily and quickly grow to multiple terabytes in size Easy-to-use end-user interfaces – permit the user to navigate the data in a way that simplifies and accelerates decision making or data analysis with easy-to-use graphical interfaces B. Online Analytical Processing (OLAP) Types of OLAP Online Analytical Processing (OLAP) Relational OLAP (ROLAP) a software tool that is used for data analysis and reporting Works directly with relational databases purposes for business decisions Fact and dimension tables are stored as relations. used by business analysts, managers, and executives. Multi-dimensional OLAP (MOLAP) Example: In Netflix, OLAP was used for movie extends OLAP functionality to multi-dimensional database recommendations based on watch history. management systems (MDBMS) Online Transaction Processing (OLTP) best suited to manage, store, and analyze multi-dimensional an operational system that manages the day-to-day data transactions of an organization ROLAP vs. MOLAP used by the Database Administrator (DBA) and Database Characteristic ROLAP MOLAP Professionals Schema Uses Star Schema Uses data cubes Example: In ATM centers, OLTP is used for money Speed Good with small data Faster for large data sets withdrawals, transfers, deposits, and inquiries. sets Access Unlimited dimensions Limited to predefined dimensions OLAP vs. OLTP 04 Handout 1 *Property of STI [email protected] Page 3 of 6 IT2003 Campus NumberOfStudents Program Ortigas-Cainta 400 BSIT Ortigas-Cainta 200 BSCS Cubao 600 BSIT Cubao 300 BSCS Table 1. Enrolled_Students Using the ROLLUP operator, we will display the total number of students enrolled in specific campuses and the grand total of students enrolled in all campuses. SELECT Program, Campus, SUM(NumberOfStudents) AS 'TotalStudents' FROM Enrolled_Students GROUP BY ROLLUP (Campus, Program) Output: Figure 2. Multi-dimensional data/data cubes OLAP operations SQL has been enhanced with analytic functions that support OLAP-type processing. This includes: ROLLUP operator – an extension of the GROUP BY clause that is used to create subtotals and grand totals for a set of columns CUBE operator – Like ROLLUP, this generates subtotals for all Explanation: the combinations of grouping column s specified in the GROUP ROLLUP operator creates an additional row that represents BY clause. subtotals for each campus. In the last row, it represents the grand PIVOT operator – allows you to write a cross-tabulation, which total for all values in the NumberOfStudents column. means you can aggregate your results and rotate rows into (Note: To make the output more readable, you can use the COALESCE() function columns to substitute the appropriate value representing subtotal and grand total to the NULL values.) Using the CUBE operator, we will display all possible combinations of Example: columns in the Enrolled_Students table (see Table 1). Assume that we have a table named Enrolled_Students (see Table 1) having columns Campus, NumberOfStudents, and Program. SELECT COALESCE(Program, 'All Program') AS 'Program', 04 Handout 1 *Property of STI [email protected] Page 4 of 6 IT2003 COALESCE(Campus, 'All Campus') AS 'Campus', SELECT NumberOfStudents, Program FROM SUM(NumberOfStudents) AS 'TotalStudents' Enrolled_Students FROM Enrolled_Students ) AS SourceTable GROUP BY CUBE (Program, Campus) PIVOT ( Output: SUM(NumberOfStudents) FOR Program IN ([BSIT], [BSCS]) ) AS PivotTable Output: Explanation: The first query specifies the column for cross-tabulation results. We want to display the first column as the identifier of the Explanation: remaining column (second and third columns). We use the COALESCE function to specify the returning text of As for the source table, we specify the returning data that will be NULL values in a specific column. used for the pivot statement. It has similar output to ROLLUP, but it returns two (2) additional In the pivot statement, we used the SUM() function to get the total rows below the grand total. This is because the ROLLUP operator number of students that are enrolled. generates aggregated results for the selected columns like We need to specify what rows/values to include from the Program Campus in a hierarchical way, while the CUBE operator generates column as it will become our column headings in our pivot table. an aggregated result that contains all the possible combinations for the selected columns. C. Data Mining Data mining refers to analyzing massive amounts of data in a Using the PIVOT operator, we will turn the unique values/rows in the data warehouse or other sources to uncover hidden trends, Program column into multiple columns. patterns, and relationships. This explains the past and predicting SELECT 'Total students in all campus:' AS 'Program:', [BSIT], [BSCS] the future for analysis. FROM Data Mining Implementation Process ( 1. Business Understanding: In this step, the goals of the businesses are set, and the important factors that will help in achieving the goal are discovered. 04 Handout 1 *Property of STI [email protected] Page 5 of 6 IT2003 2. Data Understanding: This step will collect the entire data and Facilitates automated prediction of trends and behaviors as well populate the data in the tool (if using any tool). as the automated discovery of hidden patterns 3. Data Preparation: This step involves selecting the appropriate The speedy process which makes it easy for the users to analyze data, cleaning, constructing attributes from data, integrating data a huge amount of data in less time from multiple databases. 4. Modeling: Selection of the data mining technique such as decision-tree, generate test design for evaluating the selected model, building models from the dataset, and assessing the built model with experts to discuss the result is done in this step. 5. Evaluation: This step will determine the degree to which the resulting model meets the business requirements. The model is reviewed for any mistakes or steps that should be repeated. 6. Deployment: In this step, a deployment plan is made. The strategy to monitor and maintain the data mining model results to check for its usefulness is formed. Final reports are also made, and a review of the whole process is done to check any mistake and see if any step is repeated. Data Mining Techniques 1. Classification: used to retrieve important and relevant information Figure 3. Extracting knowledge from data about data and metadata. 2. Clustering: used to identify data that are like each other. This process helps to understand the differences and similarities REFERENCES Coronel, C. and Morris, S. (2018). Database systems design, implementation, & between the data. management (13th ed.). Cengage Learning. 3. Regression: used to identify and analyze the relationship Elmasri, R. & Navathe, S. (2016). Fundamentals of Database Systems (7th ed.). Pearson between variables. Higher Education. 4. Association Rules: used to help find the association between two Kroenke, D. & Auer, D. Database Processing: Fundamentals, Design, and Implementation or more Items. It discovers a hidden pattern in the data set. (12th ed.). Pearson Higher Education. 5. Outer detection: used to observe data items in the dataset that Silberschatz A., Korth H.F., & Sudarshan, S. (2019). Database system concepts (7th ed.). do not match an expected pattern or expected behavior. McGraw-Hill Education. 6. Sequential Patterns: used to discover or identify similar patterns or trends in transaction data for a certain period. 7. Prediction: used to combine other data mining techniques like trends, sequential patterns, clustering, classification, etc. It analyzes past events or instances in the right sequence for predicting a future event. Benefits of Data Mining Helps with the decision-making process Helps companies to get knowledge-based information 04 Handout 1 *Property of STI [email protected] Page 6 of 6