Introduction to Data Engineering PDF

Data Engineering and Analytics YARMOUK UNIVERSITY FACULTY OF INFORMATION TECHNOLOGY AND COMPUTER SCI...

Data Engineering and Analytics YARMOUK UNIVERSITY FACULTY OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCES DA 330: Data Engineering and Analysis Topic 1: Introduction to Data Engineering Dr. Rafat Hammad 1 Acknowledgements: Most of these slides have been prepared based on various online tutorials and presentations, with respect to their authors and adopted for our course. Additional slides have been added from the mentioned references in the syllabus 1 TOPIC 1 : OUTLINE ❑ What is Data Engineering? ❑ What is Data? ❑ Data Ecosystems ❑ Operation vs. Strategic Systems ❑ Data Warehouse Architectures ❑ Data Warehouse Design Methodologies ❑ Big Data Engineering 2 2 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics WHAT IS DATA ENGINEERING?  Data Engineering: the process of designing, building, and maintaining systems that collect, store, and process data.  Data engineering is a critical part of data science, as it ensures that data is collected, stored, and processed in a way that is efficient, reliable, and scalable. Without data engineering, data science would not be possible.  Data Engineer: develops and maintains data architecture and pipelines. Essentially, they build the programs that generate data and aim to do so in a way that ensures the output is meaningful for operations and analysis. 3 3 RESPONSIBILITIES OF DATA ENGINEER?  1) Data collection: This involves designing and executing systems to collect and extract data from different sources. These sources could be social media, transactional databases, sensor data from IoT devices, maps, texts, documents, images, stock prices, etc.  2) Data storage: Using data warehouses or data lakes to store large volumes of data and ensuring data is organized for easy accessibility.  3) Data processing: Creating distributed processing systems to clean, aggregate, and transform data, ensuring it’s ready for analysis.  4) Data integration: Developing data pipelines that integrate data from various sources to create a comprehensive view.  5) Data quality and governance: Ensuring that data is of high- quality, reliable and adheres/complies with regulatory standards.  6) Data provisioning: Ensuring the processed data is available to end users and applications. 4 4 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics WHAT IS A DATA ANALYST?  Data Analyst: brings together data sources in a way that makes it possible to drive consolidated insights. They do the work of building systems that can model data in a clean, clear way repeatedly so that everyone can use those systems to answer questions on an ongoing basis.  Responsibilities:  Descriptive statistics to summarize data.  Exploratory data analysis to understand patterns and relationships.  Creating visualizations to communicate findings.  Often involves using tools like Excel, SQL, or statistical software 5 5 WHAT IS A DATA SCIENTIST?  Data Scientist: studies large data sets using advanced statistical analysis and machine learning algorithms. In doing so, they identify patterns in data to drive critical business insights, and then typically use those patterns to develop machine learning solutions for more efficient and accurate insights at scale. Critically, they combine this statistics experience with software engineering experience.  Responsibilities:  Developing machine learning models for predictions and classifications.  Analyzing complex data sets to identify patterns and trends.  Extracting meaningful insights to inform business decisions.  Often involves coding in languages like Python or R. 6 6 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA ANALYST VS DATA SCIENTIST VS DATA ENGINEER  Data Engineers are primarily focused on building and maintaining the systems that data scientists and data analysts use to collect, store, and analyze data.  Data Analyst: Analyze data to summarize the past in visual form.  Data Scientist: Analyze data to identify patterns and trends to predict future outcomes. 7 7 THE DATA ENGINEERING LEARNING PATH  The following are the core topics a data engineering should master:  1) Programming- Programming is a fundamental skill for data engineers as most of the tasks performed by data engineers rely on writing programming scripts. Learning python is highly encouraged because it is widely used in many industries today. Python is easy to understand for beginners and has many modules and libraries that can be used for various tasks, including data wrangling, data analysis, machine learning, and deep learning. Lastly, python is commonly used for developing scripts. 8 8 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics THE DATA ENGINEERING LEARNING PATH (CONT.)  2)Scripting and Automation- In data engineering, scripting and automation refers to the process of automating the creation and maintenance of data pipelines. This can include tasks such as provisioning resources, configuring settings, and deploying code. It can also include more complex tasks such as monitoring data flows and managing data quality. Here learners should focus on the basics of scripting languages such as Ruby or Python and how to use automation tools such as Puppet. Additionally, one should learn how to integrate automation into the data engineering workflow. Lastly, how to troubleshot and debug automation scripts. 9 9 THE DATA ENGINEERING LEARNING PATH (CONT.)  3)Relational Databases and SQL- Relational databases and SQL are the fundamental technologies for storing and querying data. Data engineering need to understand the following concepts:  The basics of relational databases, including how to structure data in tables and how to query data using the SQL language.  The basics of SQL, including how to select data, how to insert and update data, and how to use SQL functions and operators.  How to design efficient and effective database schema, including how to normalize data and how to choose appropriate data types.  How to optimize SQL queries for performance, including how to use indexes and how to write efficient SQL code. 10 10 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics THE DATA ENGINEERING LEARNING PATH (CONT.)  4) NoSQL Databases and Map Reduce: There is a lot to learn in NoSQL databases and Map Reduce in data engineering. However, here are some key things to focus on:  How NoSQL databases work and their key features.  How to design data models for NoSQL databases.  How to query NoSQL databases using Map Reduce.  How to optimize Map Reduce jobs for performance.  How to troubleshoot and debug Map Reduce jobs. 11 11 THE DATA ENGINEERING LEARNING PATH (CONT.)  5) Data Analysis- There are a few key things to learn in data analysis when working in data engineering.  Firstly, it is important to understand the basics of statistical analysis and how to use various tools to effectively analyze data.  Secondly, it is also beneficial to learn how to effectively visualize data so that it can be easily interpreted.  Finally, it is also important to be familiar with the different types of data that can be collected and stored to effectively engineer data solutions. 12 12 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics THE DATA ENGINEERING LEARNING PATH (CONT.)  6) Data Processing Techniques- There are a few key things to learn in Data Processing Techniques for data engineering:  Batch Processing: This is a process where data is processed in batches, typically on a schedule. This can be used to process large amounts of data efficiently.  Building Data Pipelines: This involves creating a system to efficiently move data from one place to another. This is often done using ETL (Extract, Transform, Load) tools.  Debugging: This is a process of finding and fixing errors in data processing systems. This can be done using tools like Hadoop or Spark. 13 13 THE DATA ENGINEERING LEARNING PATH (CONT.)  7) Big Data- The most important thing is to learn how to effectively use the tools available to manage and process large data sets. The most popular tools for this purpose include Hadoop, HDFS, MapReduce, Spark, Hive, and Pig.  8) Workflows- There are a few key concepts that are important to learn to create efficient and effective data engineering workflows. These include understanding how to extract, transform, and load data (ETL), as well as how to create and use data pipelines. Additionally, it is important to have a solid understanding of data warehousing and how to optimize data storage and retrieval. 14 14 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics THE DATA ENGINEERING LEARNING PATH (CONT.)  9) Infrastructure- In infrastructure, data engineering refers to the process of designing, building, and maintaining data infrastructure. This includes the data warehouse, data lakes, data marts, and data pipelines that are necessary to support data-driven applications and analytics. Data engineers are responsible for ensuring that data is accessible, reliable, and scalable. They work with data architects to design and build data infrastructure, and with data scientists to optimize and tune it for performance. 15 15 THE DATA ENGINEERING LEARNING PATH (CONT.)  10) Cloud Computing-Cloud computing is a way to use technology to make it easier for businesses to work with large amounts of data. It allows businesses to store data in the cloud, which is a network of computers that can be accessed from anywhere in the world. Some key things to keep in mind include understanding how to use cloud-based data storage and processing services, as well as how to manage and monitor cloud-based data systems. Additionally, it is important to be familiar with the different types of cloud computing architectures and how they can be used to support data engineering workloads. 16 16 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TOPIC 2 : OUTLINE ❑ What is Data Engineering? ❑ What is Data? ❑ Data Ecosystems ❑ Operation vs. Strategic Systems ❑ Data Warehouse Architectures ❑ Data Warehouse Design Methodologies ❑ Big Data Engineering 17 17 WHAT IS DATA?  Data is defined as individual facts, such as numbers, words, measurements, observations or just descriptions of things.  For example, data might include individual prices, weights, addresses, ages, names, temperatures, dates, or distances.  There are two main types of data:  Quantitative data is provided in numerical form, like the weight, volume, or cost of an item.  Qualitative data is descriptive, but non-numerical, like the name, gender, or eye color of a person. 18 18 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics CHARACTERISTICS OF DATA  The following are six key characteristics of data which discussed below: 1) Accuracy 2) Validity 3) Reliability 4) Timeliness 5) Relevance 6) Completeness 19 19 CHARACTERISTICS OF DATA (CONT.)  Accuracy: Data should be sufficiently accurate for the intended use and should be captured only once, although it may have multiple uses. Data should be captured at the point of activity.  Validity: Data should be recorded and used in compliance with relevant requirements, including the correct application of any rules or definitions. This will ensure consistency between periods and with similar organizations, measuring what is intended to be measured.  Reliability: Data should reflect stable and consistent data collection processes across collection points and over time. Progress toward performance targets should reflect real changes rather than variations in data collection approaches or methods. Source data is clearly identified and readily available from manual, automated, or other systems and records. 20 20 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics CHARACTERISTICS OF DATA (CONT.)  Timeliness: Data should be captured as quickly as possible after the event or activity and must be available for the intended use within a reasonable time. Data must be available quickly and frequently enough to support information needs and to influence service or management decisions.  Relevance: Data captured should be relevant to the purposes for which it is to be used. This will require a periodic review of requirements to reflect changing needs.  Completeness: Data requirements should be clearly specified based on the information needs of the organization and data collection processes matched to these requirements. 21 21 TYPES OF DIGITAL DATA  Digital data is the electronic representation of information in a format or language that machines can read and understand.  In more technical terms, Digital data is a binary format of information that's converted into a machine-readable digital format.  The power of digital data is that any analog inputs, from very simple text documents to genome sequencing results, can be represented with the binary system.  Types of Digital Data:  Structured  Unstructured  Semi-Structured Data 22 22 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TYPES OF DIGITAL DATA (CONT.)  Structured Data: Any data that are accessible and are stored or processed in the form of fixed-format is termed structured data. The employee table in the Database is an example of structured data. Banking transaction data is an example of structured data. The attributes present in structured data must be related to each other in some form. These data are stored in a relational database.  Unstructured Data: Irregular and ambiguous data, having no predefined data model and no pre-defined structure, are referred to as unstructured data. These data can be a combination of text, numbers, audio, video, images, messages, social media posts and many more. Twitter, Instagram, Facebook, and Google all are made up of unstructured data. 23 23 TYPES OF DIGITAL DATA (CONT.)  Semi-structured Data: These kinds of data falls between structured and unstructured data. It is a combination of partly structured data and partly unstructured data. For example, XML, and JSON are all semi-structured data. 24 24 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TYPES OF DIGITAL DATA (CONT.) Structured Unstructured Semi-structured Well organised Not organised at Partially organised data all It is less flexible It is flexible and It is more flexible and and difficult to scalable. It is simpler to scale than scale. It is schema schema structured data but dependent. independent. lesser than unstructured data. It is based on It is based on It is based on XML/ relational character and RDF database. binary data. 25 25 TYPES OF DIGITAL DATA (CONT.) Structured Unstructured Semi-structured Versioning over Versioning is like Versioning over tuples tuples, row, tables as a whole data. is possible. Easy analysis Difficult analysis Difficult analysis compared to structured data but easier when compared to unstructured data. Examples: Examples: Media Examples: Tweets Financial data, logs, videos, organised by bar codes audios hashtags; folder organised by topics 26 26 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA LIFECYCLE MANAGEMENT (DLM) A data lifecycle refers to the stages that data goes through from its creation or acquisition, through its usage and maintenance, to its eventual disposal. The stages of the data lifecycle may vary depending on the organization, but they generally include the following: 27 27 DATA LIFECYCLE MANAGEMENT (CONT.) Data Creation  This is the first stage of data lifecycle. It refers to any input or source for generating data, including data acquisition, data capture, and data entry by applications, artificial intelligence (AI), machine learning (ML), and sensors.  That said, not all data that is generated is collected and utilized. Your team should be able to identify what information should be captured, the best way of capturing the data, and what’s irrelevant or unnecessary to the project at hand. 28 28 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA LIFECYCLE MANAGEMENT (CONT.) Data Storage  When an organization generates large volumes of data from multiple sources, it is common for them to use a data warehouse to store the data and prepare it for use. The data stored in the data warehouse is cleaned and analyzed such that it can be used to make informed decisions.  You should ensure that you store the data in a stable environment and properly maintain it to ensure its integrity, protection, and security. 29 29 DATA LIFECYCLE MANAGEMENT (CONT.) Data Usage  What value do you accrue from your data? How are you leveraging data analytics results? In this stage, you need to align value with action. How is data shared and used within your organization? You need to establish rules that define the management of the transfer and publication of data and who can access sensitive data 30 30 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA LIFECYCLE MANAGEMENT (CONT.) Data Archival  There are some data sets that can’t be destroyed immediately because they still have value from a compliance or historical perspective, and so they should be archived. The archived data is typically not active and is kept for long-term retention purposes. Most organizations leverage data warehousing capabilities for archived data that are rarely used for decision- making. They also use technology to retrieve such data if needed. 31 31 DATA LIFECYCLE MANAGEMENT (CONT.) Data Destruction  Keeping too much data increases the data management cost, thereby impacting the total cost of ownership and ROI (Return On Investment) of an organization’s products or services. While it’s mandatory to delete data at some point, you should also ensure that you free yourself by deleting active or archived data that doesn’t benefit your organization in any way. 32 32 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TOPIC 3 : OUTLINE ❑ What is Data Engineering? ❑ What is Data? ❑ Data Ecosystems ❑ Operation vs. Strategic Systems ❑ Data Warehouse Architectures ❑ Data Warehouse Design Methodologies ❑ Big Data Engineering 33 33 WHAT IS A DATA ECOSYSTEM? A data ecosystem is a combination of a company’s infrastructure and applications that is used to collect and analyze information. Data ecosystems enable businesses to better understand their customers and craft superior operations strategies. There are three key components data systems consist of: The people who use it. The technology that supports it. The processes that facilitate it.  There are no two organizations that make use of the same data in the same way. Each business has a unique data system. Of course, data systems may overlap in some cases, especially when data is pulled or scraped from a public source. 34 34 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics FILE FORMATS: THE LANGUAGE OF DATA  Understanding file formats is essential when working with data. Different formats are suitable for specific use cases and come with their own advantages and drawbacks. CSV (Comma-Separated Values)  CSV stores data as text and uses delimiters (such as commas, tabs, or colons) to separate values. It’s one of the most common and versatile data formats used in various applications. 35 35 FILE FORMATS: THE LANGUAGE OF DATA (CONT.) XLSX (Microsoft Excel)  XLSX files are Excel spreadsheets, each containing data organized in rows and columns. They are widely used in businesses and support various functions while maintaining compatibility across applications. 36 36 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics FILE FORMATS: THE LANGUAGE OF DATA (CONT.) XML (Extensible Markup Language)  XML is a markup language with predefined rules for encoding data. It’s both human-readable and machine-readable, making it ideal for sharing data between systems. 37 37 FILE FORMATS: THE LANGUAGE OF DATA (CONT.) JSON (JavaScript Object Notation)  JSON is a text-based format designed for transmitting data over the web. Its simplicity, compatibility with various programming languages, and ease of use make it a popular choice for data sharing. 38 38 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics FILE FORMATS: THE LANGUAGE OF DATA (CONT.) PDF (Portable Document Format)  PDF files, developed by Adobe, ensure that documents retain their formatting and appearance regardless of software or platform. They are often used for legal and financial documents 39 39 DATA SOURCES: WHERE DATA RESIDES  Data engineering starts with extracting data from various sources. Common data sources include: Relational Databases  Relational databases store structured data related to business activities, transactions, human resources, and more. Examples include SQL Server, Oracle, and MySQL, which are widely used for data analysis and projections. 40 40 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA SOURCES: WHERE DATA RESIDES (CONT.) Flat Files and XML Datasets  Flat files, such as CSVs, serve as public and private datasets, containing information like demographic data, financial records, or weather data. XML files offer flexibility for complex data structures, such as surveys or bank statements. APIs and Web Services  APIs (Application Programming Interfaces) and web services enable data retrieval through network requests. They play a vital role in applications like sentiment analysis, stock market analysis, and data validation. 41 41 DATA SOURCES: WHERE DATA RESIDES (CONT.) Web Scraping  Web scraping involves extracting data from unstructured sources on the internet. It can gather information ranging from text and contact details to images and product listings. Tools like Beautiful Soup and Scrapy facilitate this process. Data Streams and Feeds  Data streams aggregate real-time data from various sources, such as IoT devices, sensors, and social media. They are often geotagged and timestamped, supporting applications like stock market analysis and real-time event monitoring. 42 42 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics LANGUAGES FOR DATA PROFESSIONALS  Professionals in the data engineering field use various languages to accomplish their tasks. These languages fall into three main categories: Query Languages  Query languages, like SQL (Structured Query Language), allow users to access and manipulate data from relational databases. SQL is renowned for its simplicity and effectiveness in querying data 43 43 LANGUAGES FOR DATA PROFESSIONALS (CONT.) Programming Languages  Programming languages, including Python, R, and Java, enable the development and control of data engineering applications. Python stands out for its readability and extensive library support. Shell Scripting  Shell scripting languages like Unix and Linux are ideal for automating repetitive and time-consuming tasks. They are commonly used for file manipulation, system administration, and routine backups. 44 44 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics METADATA  Metadata plays a crucial role in data engineering, providing information about other data. There are three main types of metadata: Technical Metadata  Technical metadata defines data structures from a technical perspective. It includes information about tables, data catalogues, and other technical aspects of data storage. 45 45 METADATA Process Metadata  Process metadata describes the processes operating behind business systems, such as data warehouses. It tracks process start and end times, disk usage, and user access to systems. Business Metadata  Business metadata offers readily interpretable information about the data, such as its acquisition method, purpose, and relationships with other data sources. It also serves as documentation for the entire data warehouse system. 46 46 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA REPOSITORIES Data repositories store organized and isolated data for various purposes, including business operations and data analysis. Common types of data repositories include databases, data warehouses, and big data stores. Databases  Databases are collections of data designed for input, storage, search, retrieval, and modification. They come in two main types: relational (RDBMS) and non-relational. Each has its strengths and is chosen based on factors like data type, structure, querying mechanisms, and use cases. 47 47 DATA REPOSITORIES (CONT.) Data Warehouses  Data warehouses serve as central repositories that consolidate data from multiple sources. They facilitate analytics and business intelligence by using Extract, Transform, Load (ETL) processes to prepare data for analysis. Big Data Stores  Big data stores provide the infrastructure to store, scale, and process large datasets efficiently. They are essential for handling the high volume, velocity, and variety of big data 48 48 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TOPIC 4 : OUTLINE ❑ What is Data Engineering? ❑ What is Data? ❑ Data Ecosystems ❑ Operation vs. Strategic Systems ❑ Data Warehouse Architectures ❑ Data Warehouse Design Methodologies ❑ Big Data Engineering 49 49 OPERATION VS. STRATEGIC SYSTEMS  Operation systems gather, store, and process all the data needed to successfully perform the daily routine operations. They provide online information and produce a variety of reports to monitor and run the business.  Strategic decision-support systems are not for running the day-to-day operations of the business. It is to formulate the business strategies, establish goals, set objectives, and monitor results.  Here are some examples of business objectives:  Gain market share by 10% in the next 3 years  Enhance customer service level in shipments  Increase sales by 15% in the Northeast Division 50 50 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics OPERATION VS. STRATEGIC SYSTEMS (CONT.) 51 51 OPERATION VS. STRATEGIC SYSTEMS (CONT.) 52 52 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics CHARACTERISTICS OF STRATEGIC SYSTEMS 53 53 BUSINESS INTELLIGENCE (BI)  BI is “A set of processes, architectures, and technologies that convert raw data into meaningful information that drives profitable business actions. It is a suite of software and services to transform data into actionable intelligence and knowledge”.  BI allows a business to transform: Data into information Information into knowledge Knowledge into intelligence  BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and charts to provide users with detailed intelligence about the nature of the business. 54 54 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics WHAT IS DATA WAREHOUSE (DW)?  Data Warehouse (DW) is an information system which stores historical and commutative data from single or multiple sources. It is designed to analyze, report, integrate transaction data from different sources.  Data Warehousing: is defined as a technique for collecting, transforming, and managing data from varied sources to provide meaningful business insights.  The data warehousing concept was intended to provide an architectural model for the flow of data from operational (transactional) systems to decision support environments. 55 55 TOP 10 BENEFITS OF A DATA WAREHOUSE 56 56 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics MULTIPLE DATA TYPES IN A DATA WAREHOUSE 57 57 OPERATIONAL DATABASE  Operationaldatabase management systems are used to update data in real-time.  These types of databases allow users to do more than simply view archived data. Operational databases allow you to modify that data (add, change or delete data), doing it in real- time.  Operational databases provide transactions as main abstraction to guarantee data consistency that guarantee the so-called ACID properties (Atomicity, Consistency, Isolation, and Durability). 58 58 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics OPERATIONAL DATABASE VS. DATA WAREHOUSE (CONT.)  Operational database management systems is also known by the following names: Operational databases Transactional databases Online Transaction Processing (OLTP) databases.  Datawarehouse systems is also known by the following names: Data warehouse databases Data analytics databases Online Analytical Processing (OLAP) databases 59 59 OPERATIONAL DATABASE VS. DATA WAREHOUSE Operational Parameter Data Warehouse Database Data Content Current values Archived, derived, summarized Data Structure Optimized for Optimized for complex transactions queries Access High Medium to low Frequency Access Type Read, update, Read delete Usage Predictable, Ad hoc, random, repetitive heuristic Response Time Sub-seconds Several seconds to minutes Users Large number Relatively small number 60 60 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics ACID PROPERTIES IN DBMS  In the context of databases and data storage systems, a transaction is any operation that is treated as a single unit of work, which either completes fully or does not complete at all and leaves the storage system in a consistent state.  The classic example of a transaction is what occurs when you withdraw money from your bank account. Either the money has left your bank account, or it has not — there cannot be an in- between state.  To maintain consistency in a database, before and after the transaction, certain properties are followed. These are called ACID properties: Atomicity, Consistency, Isolation, and Durability. 61 61 ACID PROPERTIES - ATOMICITY  Transactions are often composed of multiple statements.  Atomicity guarantees that each transaction is treated as a single "unit", which either succeeds completely, or fails completely: if any of the statements constituting a transaction fails to complete, the entire transaction fails, and the database is left unchanged.  An atomic system must guarantee atomicity in each situation, including power failures, errors and crashes. 62 62 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics ACID PROPERTIES - CONSISTENCY  Consistency ensures that a transaction can only bring the database from one valid state to another, maintaining database invariants: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.  Thisprevents database corruption by an illegal transaction but does not guarantee that a transaction is correct.  Referentialintegrity guarantees the primary key - foreign key relationship. 63 63 ACID PROPERTIES - ISOLATION  Transactions are often executed concurrently (e.g., multiple transactions reading and writing to a table at the same time).  Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.  Isolation is the main goal of concurrency control; depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions. 64 64 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics ACID PROPERTIES - DURABILITY  Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash).  Thisusually means that completed transactions (or their effects) are recorded in non-volatile memory. 65 65 ACID PROPERTIES - COMPARISON 66 66 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TOPIC 5 : OUTLINE ❑ What is Data Engineering? ❑ What is Data? ❑ Data Ecosystems ❑ Operation vs. Strategic Systems ❑ Data Warehouse Architectures ❑ Data Warehouse Design Methodologies ❑ Big Data Engineering 67 67 DATA WAREHOUSE COMPONENTS  We identiﬁed the following major components of the data warehouse:  Source data  Data staging  Data storage  Metadata  Management and control  Information delivery 68 68 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE COMPONENTS (CONT.) 69 69 DATA WAREHOUSE COMPONENTS (CONT.) Metadata is like the data dictionary in a database system. Metadata repository is an integral part of a data warehouse system. It has the following information: 1) Definition of data warehouse − It includes the description of structure of data warehouse. The description is defined by schema, view, hierarchies, derived data definitions, and data mart locations and contents. 2) Business metadata − It contains has the data ownership information, business definition, and changing policies. 70 70 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE COMPONENTS (CONT.) 3) Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of data migrated and transformation applied on it. 4) Data for mapping from operational environment to data warehouse − It includes the source databases and their contents, data extraction, data partition cleaning, transformation rules, data refresh and purging rules. 5) Algorithms for summarization − It includes dimension algorithms, data on granularity, aggregation, summarizing, etc. 71 71 DATA WAREHOUSE COMPONENTS (CONT.)  The management & control component has the following tasks:  Manages and controls data acquisition functions, ensuring that extracts and transformations are carried out correctly and in a timely fashion.  Manages backing up signiﬁcant parts of the data warehouse and recovering from failures.  Monitoring the growth and periodically archiving data from the data warehouse.  Governs data security and provides authorized access to the data warehouse.  Interfaces with the end-user information delivery component to ensure that information delivery is carried out properly. 72 72 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE COMPONENTS (CONT.)  The information delivery component makes it easy for the users to access the information either directly from the enterprise-wide data warehouse, from the dependent data marts, or from the set of conformed data marts.  Most of the information access in a data warehouse is through online analytical processing (OLAP) queries and interactive analysis sessions.  With OLAP, the primary data warehouse feeds data to proprietary multidimensional databases (MDDBs) where summarized data is kept as multidimensional cubes of information.  The users perform complex multidimensional analysis using the information cubes in the MDDBs. 73 73 DATA WAREHOUSE ARCHITECTURES TYPES  Datawarehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are: Data Warehouse Architecture (Basic) Data Warehouse Architecture (with a Staging Area) Data Warehouse Architecture (with a Staging Area and Data Marts) 74 74 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE BASIC End users directly access data derived from several source systems through the data warehouse. 75 75 DATA WAREHOUSE WITH A STAGING AREA 76 76 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE WITH A STAGING AREA  This is the place where all the extracted data is put together and prepared for loading into the data warehouse.  A staging area simplifies building summaries and general warehouse management.  Thedata staging area is both a storage area and a set of processes commonly referred to as Extract-Transformation- Load or ETL. It is somewhat analogous (‫ )مشابه‬to a kitchen where raw materials are combined to form a fine meal. 77 77 DATA WAREHOUSE WITH A STAGING AREA AND DATA MARTS 78 78 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE WITH A STAGING AREA AND DATA MARTS  You may want to customize your warehouse's architecture for different groups within your organization.  You can do this by adding data marts, which are systems designed for a particular line of business.  The previous figure illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales. 79 79 TYPES OF DATA WAREHOUSE  Thereare three common types of data warehouses: Operational Data Store (ODS) Enterprise Data Warehouse (DW) Data Mart (DM) 80 80 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TYPES OF DATA WAREHOUSE (CONT.) 81 81 OPERATIONAL DATA STORE  Anoperational data store (ODS) is used for immediate reporting with current operational data.  AnODS contains lightly transformed and lightly integrated operational data with a short time window. It is used for real time and near real time reporting.  ODS is directly loaded from operational data, staging area, or incoming files. It can optionally serve as a data source for the data warehouse.  AnODS must be frequently refreshed so that it contains very current data. An ODS can be updated daily, hourly, or even immediately after transactions on operational data. 82 82 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics ENTERPRISE DATA WAREHOUSE  Enterprise Data Warehouse is a centralized place where all business information from different sources and applications are made available.  It offers a unified approach for organizing and representing data.  It also provide the ability to classify and analyze data according to the subject and give access according to those divisions. 83 83 DATA MART  Data marts contain a subset of organization-wide data. This data is valuable to a specific group of people in an organization.  It is cost-effective alternatives to a data warehouse, which can take high costs to build.  Data Mart allows faster access of Data.  Data Mart is easy to use as it is specifically designed for the needs of its users. Thus, a data mart can accelerate business processes.  Data Marts needs less implementation time compared to Data Warehouse systems.  It contains historical data which enables the analyst to determine data trends. 84 84 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TYPES OF DATA MARTS  Three basic types of data marts are:  Dependent  Independent  Hybrid  Thecategorization is based primarily on the data source that feeds the data mart. 85 85 DEPENDENT DATA MARTS  Dependent data marts draw data from a central data warehouse that has already been created.  This gives you the usual advantages of centralization 86 86 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics INDEPENDENT DATA MARTS  Independent data marts, in contrast, are standalone systems built by drawing data directly from operational or external sources of data or both.  This could be desirable for smaller groups within an organization. 87 87 HYBRID DATA MARTS A hybrid data mart allows you to combine input from sources other than a data warehouse.  This could be useful for many situations, especially when you need ad hoc integration, such as after a new group or product is added to the organization. 88 88 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TOPIC 6 : OUTLINE ❑ What is Data Engineering? ❑ What is Data? ❑ Data Ecosystems ❑ Operation vs. Strategic Systems ❑ Data Warehouse Architectures ❑ Data Warehouse Design Methodologies ❑ Big Data Engineering 89 89 DATA WAREHOUSE DESIGN METHODOLOGIES  Bill Inmon considered to be the father of data warehousing. He publishes his book Building the Data Warehouse (1991)  Ralph Kimball publishes his book The Data Warehousing Toolkit (1996)  Thereare two common approaches for designing data warehouses: Inmon’s Top-Down Approach Kimball’s Bottom-Up Approach 90 90 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics INMON’S TOP-DOWN APPROACH  Inmon defines a data warehouse as a centralized repository for the entire enterprise.  Inmon defines the data warehouse in the following terms: Subject-oriented, Integrated, Time-variant, and Non-Volatile  A data warehouse stores the “atomic” data at the lowest level of detail.  A normalized data model is designed first. Then the dimensional data marts, which contain data required for specific business processes or specific departments are created from the data warehouse.  Central data warehouse which follow the E-R model / normalized model. 91 91 INMON’S TOP-DOWN APPROACH (CONT.) 92 92 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics SUBJECT-ORIENTED DATA  Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or subject are linked together. These subjects can be sales, marketing, distributions, etc.  In operational systems, we store data by individual applications. In the data sets for an order processing application, we keep the data for that application. 93 93 SUBJECT-ORIENTED DATA (CONT.) 94 94 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics INTEGRATED DATA  Integrated:The database contains data from most or all of an organization’s operational applications, and that this data is made consistent. 95 95 NON-VOLATILE DATA  Non-Volatile: Data in the data warehouse is never overwritten or deleted. Once committed, the data is static, read-only and retained for future reporting. 96 96 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TIME-VARIANT  Time-variant: The changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time.  The data in the data warehouse is meant for analysis and decision making. If a user is looking at the buying pattern of a speciﬁc customer, the user needs data not only about the current purchase, but on the past purchases as well.  A data warehouse, because of the very nature of its purpose, has to contain historical data and current values. 97 97 KIMBALL’S BOTTOM-UP APPROACH  Kimballdefines data warehouse as “a copy of transaction data specifically structured for query and analysis”.  Thedata marts should be created first; these are then combined to create a broad data warehouse. The data warehouse is essentially a union of all the data marts.  Data marts are built using dimensional modelling approach  Dimensional modelling focuses on ease of end-user accessibility and provides a high level of performance to the data warehouse. 98 98 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics KIMBALL’S BOTTOM-UP APPROACH (CONT.) 99 99 INMON VS. KIMBALL Inmon (Data Kimball (Data Mart) Warehouse) Building Data Time Consuming Takes lesser time warehouse Maintenance Easy Difficult, often redundant and subject to revisions Cost High initial cost. Low Initial cost, Each Subsequent project subsequent phase will development costs cost almost the same will be much lower 100 100 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics INMON VS. KIMBALL (CONT.) Inmon (Data Kimball (Data Mart) Warehouse) Time Longer start-up time Shorter time to initial set-up Skill Specialist team Generalist team Requirement Data Enterprise-wide Individual business Integration areas requirements Data Model ER data model Dimensional data model 101 101 TOPIC 7 : OUTLINE ❑ What is Data Engineering? ❑ What is Data? ❑ Data Ecosystems ❑ Operation vs. Strategic Systems ❑ Data Warehouse Architectures ❑ Data Warehouse Design Methodologies ❑ Big Data Engineering 102 102 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics BIG DATA ENGINEERING  Big Data is a collection of data that is huge in volume yet growing exponentially with time. It is a data with so large size and complexity that none of traditional data management tools can store it or process it efficiently. Big data is also a data but with huge size..  Big data architecture differs from conventional data handling, as here we’re talking about such massive volumes of rapidly changing information streams that a data warehouse can’t accommodate. That’s where a data lake comes in handy. 103 103 DATA LAKE A data lake is a vast pool for saving data in its native, unprocessed form. It stands out for its high agility as it isn’t limited to a warehouse’s fixed configuration. 104 104 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA LAKE (CONT.) A data lake uses the ELT approach and starts data loading immediately after extracting it, handling raw (often unstructured) data.  A data lake is worth building in those projects that will scale and need a more advanced architecture. Besides, it’s very convenient when the purpose of the data hasn’t been determined yet. In this case, you can load data quickly, store it, and modify it as necessary.  Data lakes are also a powerful tool for data scientists and ML engineers, who would use raw data to prepare it for predictive analytics and machine learning.  Lakes are built on large, distributed clusters that would be able to store and process masses of data. A famous example of such a data lake platform is Hadoop. 105 105 HADOOP AND ITS ECOSYSTEM  Hadoop is a large-scale, Java-based data processing framework capable of analyzing massive datasets. The platform facilitates splitting data analysis jobs across various servers and running them in parallel. It consists of three components:  Hadoop Distributed File System (HDFS) capable of storing Big Data,  a processing engine MapReduce, and  a resource manager YARN to control and monitor workloads.  Also, Hadoop benefits from a vast ecosystem of open- source tools that enhance its capabilities and address various challenges of Big Data. 106 106 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics ENTERPRISE DATA HUB  When a big data pipeline is not managed correctly, data lakes quickly become data swamps (‫ )مستنقعات‬- a collection of miscellaneous data that is neither governable nor usable. A new data integration approach called a data hub emerged to tackle this problem.  Enterprise data hubs (EDHs) are the next generation of data architecture aiming at sharing managed data between systems. They connect multiple sources of information, including DWs and data lakes. Unlike DWs, the data hub supports all types of data and easily integrates systems. Besides that, it can be deployed within weeks or even days while DW deployment can last months and even years.  At the same time, data hubs come with additional capabilities for data management, harmonizing, exploration, and analysis — something data lakes lack. 107 107 SUMMARY  To sum it all up, a data warehouse is constructed to deal mainly with structured data for the purpose of self-service analytics and BI; a data lake is built to deal with sizable aggregates of both structured and unstructured data to support deep learning, machine learning, and AI in general; and a data hub is created for multi-structured data portability, easier exchange, and efficient processing. An EDH can be integrated with a DW and/or a data lake to streamline data processing and deal with these architectures' everyday challenges. 108 108 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics THE END 109 109 Dr. Rafat Hammad - Yarmouk University

Introduction to Data Engineering PDF

Document Details

Tags

Related

Summary

Full Transcript