Data Engineering PDF
Summary
This document provides a high-level overview of data engineering. It outlines the core concepts, stages, and best practices for data engineers. It touches on the history, principles, and architecture of data systems.
Full Transcript
Data Engineering Definition
Endless definitions of data engineering exist. Data engineering is all about the movement, manipulation, and management of data: a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. It takes dedicated specialists—data engineers—to maintain data so that it remains available and usable by others. The data engineering field can be thought of as a superset of business intelligence and data warehousing that brings in more elements from software engineering.

Definition: Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases such as analysis or machine learning.

The Data Engineering Lifecycle: History
- The early days, 1980 to 2000: from data warehousing to the web; roots in data warehousing and business intelligence.
- The early 2000s: the birth of contemporary data engineering. Innovations started decentralizing and breaking apart traditionally monolithic services. The “big data” era had begun (Apache Hadoop, AWS, Amazon S3, DynamoDB).
- The 2000s and 2010s: big data engineering and the simplification of open-source big data tools.
- The 2020s: engineering for the data lifecycle. The trend is moving toward decentralized, modularized, managed, and highly abstracted tools. The modern data stack represents a collection of off-the-shelf open source and third-party products assembled to make analysts’ lives easier.
https://mattturck.wpenginepowered.com/wp-content/uploads/2021/12/Data-and-AI-Landscape-2021-v3-small.jpg

Data Engineering and Data Science
Two competing views exist: data engineering is a subdiscipline of data science, or data engineering is a standalone discipline.

Data Maturity and the Data Engineer
The level of data engineering complexity within a company depends a great deal on the company’s data maturity. Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization. Data maturity models exist in many versions, such as Data Management Maturity (DMM).

The Background and Skills of a Data Engineer
Business responsibilities:
- Know how to communicate with nontechnical and technical people.
- Understand how to scope and gather business and product requirements.
- Understand the cultural foundations of Agile, DevOps, and DataOps.
- Control costs.
- Learn continuously.
Technical responsibilities:
- Understand how to build architectures that optimize performance and cost at a high level, using prepackaged or homegrown components.
- All skills related to the data engineering lifecycle.

Data Engineers and Other Technical Roles

The Data Engineering Lifecycle
Comprises stages that turn raw data ingredients into a useful product, ready for consumption by analysts, data scientists, ML engineers, and others.
Stages (a minimal end-to-end sketch follows after this overview):
- Generation
- Storage
- Ingestion
- Transformation
- Serving data
Undercurrents cut across multiple stages of the data engineering lifecycle:
- Security
- Data management
- DataOps
- Data architecture
- Orchestration and software engineering
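As a way to make the lifecycle stages concrete, here is a minimal, purely illustrative Python sketch of a single pass from raw source records to an analytics-ready file. The sample events, field names, and file paths are hypothetical stand-ins for real source systems and storage, not anything prescribed by the material above.

```python
import csv
import json
from pathlib import Path

# Hypothetical raw events, standing in for a source system (generation stage).
RAW_EVENTS = [
    {"user_id": "1", "amount": "19.99", "country": "de"},
    {"user_id": "2", "amount": "5.00", "country": "DE"},
]

def ingest(events, landing_dir: Path) -> Path:
    """Ingestion: pull raw records and land them, unchanged, in storage."""
    landing_dir.mkdir(parents=True, exist_ok=True)
    raw_path = landing_dir / "events_raw.json"
    raw_path.write_text(json.dumps(events))
    return raw_path

def transform(raw_path: Path) -> list[dict]:
    """Transformation: cast types, clean, and normalize values."""
    records = json.loads(raw_path.read_text())
    return [
        {
            "user_id": int(r["user_id"]),
            "amount": float(r["amount"]),
            "country": r["country"].upper(),
        }
        for r in records
    ]

def serve(records: list[dict], out_path: Path) -> None:
    """Serving: write an analytics-ready table for downstream consumers."""
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "amount", "country"])
        writer.writeheader()
        writer.writerows(records)

if __name__ == "__main__":
    raw = ingest(RAW_EVENTS, Path("landing"))
    serve(transform(raw), Path("orders_clean.csv"))
```

Real pipelines replace each function with dedicated systems (databases, object storage, warehouses), but the hand-offs between stages look much the same.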
Generation
A source system is the origin of the data used in the data engineering lifecycle. Examples include relational database systems, NoSQL databases, IoT devices, and data streams. All source systems must be evaluated thoroughly (essential characteristics, persistence, frequency of generation, errors, schema presence).

Storage
After ingesting data, you need a place to store it. Storage runs across the entire data engineering lifecycle, often occurring in multiple places in a data pipeline, with storage systems crossing over with source systems, ingestion, transformation, and serving. Key engineering questions focus on choosing a storage system, such as a data warehouse, data lakehouse, database, or object storage.
Temperatures of data describe data access frequency:
- Hot data is the most frequently accessed.
- Lukewarm data might be accessed every so often—say, every week or month.
- Cold data is seldom queried and is appropriate for storing in an archival system.
There is no one-size-fits-all universal storage recommendation.

Ingestion
Data is ingested from source systems, which are normally outside the data engineer’s direct control. Source systems and ingestion represent the most significant bottlenecks of the lifecycle.
Batch versus streaming:
- Data is inherently streaming; batch ingestion is simply a specialized and convenient way of processing this stream in large chunks (by size or interval).
- Real-time (or near real-time) means that the data is available to a downstream system a short time after it is produced.
Push versus pull:
- Push model: a source system writes data out to a target, whether a database, object store, or filesystem.
- Pull model: data is retrieved from the source system.
- Hybrid models combine both.

Transformation
Data needs to be changed from its original form into something useful for downstream use cases:
- Map data into the correct types
- Clean
- Normalize
- Select and create new features
Transformation is often included in other phases of the lifecycle.

Serving Data - Analytics
Data has value when it is used for practical purposes; data vanity projects are a major risk for companies. Analytics is the core of most data endeavors.
- BI describes a business’s past and current state.
- Operational analytics focuses on the present and on the fine-grained details of operations, consumed in real time.
- Embedded analytics (customer-facing): the request rate for reports goes up dramatically.
Multitenancy: data engineers may choose to house data for many customers in common tables to allow a unified view for internal analytics and ML. This data is presented externally to individual customers through logical views with appropriately defined controls and filters.

Serving Data - Machine Learning
The feature store is a recently developed tool that combines data engineering and ML engineering. Before investing a ton of resources into ML, take the time to build a solid data foundation.

Reverse ETL
Reverse ETL takes processed data from the output side of the data engineering lifecycle and feeds it back into source systems. It allows us to take analytics, scored models, etc., and feed these back into production systems or SaaS platforms. This is especially important as businesses rely increasingly on SaaS and external platforms.
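As a rough sketch of reverse ETL, the example below pushes scored records from the analytical side back into an operational SaaS system. The endpoint URL, token, and payload fields are hypothetical placeholders rather than any real vendor API, and the third-party requests library is assumed to be available.

```python
import requests

# Hypothetical scored output of the data pipeline (e.g., churn scores per customer).
SCORED_ROWS = [
    {"customer_id": 42, "churn_score": 0.87},
    {"customer_id": 43, "churn_score": 0.12},
]

# Hypothetical SaaS endpoint and credentials; a real integration would follow the
# vendor's documented API or use a dedicated reverse ETL tool.
CRM_URL = "https://crm.example.com/api/contacts/{customer_id}"
HEADERS = {"Authorization": "Bearer <token>"}

def sync_scores(rows):
    """Feed model scores back into the operational system (reverse ETL)."""
    for row in rows:
        resp = requests.patch(
            CRM_URL.format(customer_id=row["customer_id"]),
            headers=HEADERS,
            json={"churn_score": row["churn_score"]},
            timeout=10,
        )
        resp.raise_for_status()  # surface failed syncs instead of silently dropping them

if __name__ == "__main__":
    sync_scores(SCORED_ROWS)
```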
Major Undercurrents Across the Data Engineering Lifecycle
Data engineering now encompasses far more than tools and technology; it incorporates traditional enterprise practices, referred to here as undercurrents.

Data Management and Governance
Data management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycle. Its knowledge areas include:
- Data governance, including discoverability and accountability
- Data modeling and design
- Data lineage
- Storage and operations
- Data integration and interoperability
- Data lifecycle management
- Data systems for advanced analytics and ML
- Ethics and privacy
Data governance is, first and foremost, a data management function to ensure the quality, integrity, security, and usability of the data collected by an organization. Its main categories are discoverability, security, and accountability.

Orchestration
The process of coordinating many jobs to run as quickly and efficiently as possible on a scheduled cadence. Orchestration differs from schedulers (such as cron), which are aware only of time. An orchestration engine (such as Apache Airflow) builds in metadata on job dependencies, generally in the form of a directed acyclic graph (DAG); a minimal DAG sketch follows after the Software Engineering subsection below. Orchestration systems also provide job history, visualization, and alerting.

DataOps
DataOps maps the best practices of Agile methodology, DevOps, and statistical process control (SPC) to data. DataOps is a collection of technical practices, workflows, cultural norms, and architectural patterns that enable:
- Rapid innovation and experimentation, delivering new insights to customers with increasing velocity
- Extremely high data quality and very low error rates
- Collaboration across complex arrays of people, technology, and environments
- Clear measurement, monitoring, and transparency of results
DataOps aims to improve the release and quality of data products. Data products differ from software products because of the way data is used: a data product is built around sound business logic and metrics, and its users make decisions or build models that perform automated actions. DataOps has three core technical elements: automation, monitoring and observability, and incident response.

Software Engineering
- Core data processing code (e.g., SQL)
- Development of open-source frameworks (the Hadoop ecosystem)
- Streaming
- Infrastructure as Code (IaC): applies software engineering practices to the configuration and management of infrastructure
- Pipelines as code: the core concept of present-day orchestration systems, which touch every stage of the data engineering lifecycle
- General-purpose problem-solving
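Below is a minimal sketch of the DAG idea behind orchestration and pipelines as code, using only the Python standard library. The task names and dependency graph are invented for illustration; an engine such as Apache Airflow adds what this sketch omits: schedules, retries, job history, visualization, and alerting.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline tasks; real tasks would call out to ingestion,
# transformation, and serving systems.
def extract_orders():
    print("extracting orders")

def extract_users():
    print("extracting users")

def transform_joined():
    print("joining and transforming")

def publish_mart():
    print("publishing data mart")

TASKS = {
    "extract_orders": extract_orders,
    "extract_users": extract_users,
    "transform_joined": transform_joined,
    "publish_mart": publish_mart,
}

# The DAG: each task maps to the set of tasks it depends on.
DAG = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform_joined": {"extract_orders", "extract_users"},
    "publish_mart": {"transform_joined"},
}

if __name__ == "__main__":
    # Run every task in an order that respects its dependencies.
    for name in TopologicalSorter(DAG).static_order():
        TASKS[name]()
```

Unlike a plain cron entry, the dependency metadata tells the runner that transform_joined can only start once both extract tasks have finished.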
Enterprise Architecture
- TOGAF: “enterprise” in the context of “enterprise architecture” can denote an entire enterprise—encompassing all of its information and technology services, processes, and infrastructure—or a specific domain within the enterprise.
- Enterprise architecture (EA) is an organizational model: an abstract representation of an enterprise that aligns strategy, operations, and technology to create a roadmap for success.
- Enterprise architecture is the design of systems to support change in the enterprise, achieved by flexible and reversible decisions reached through careful evaluation of trade-offs.

Data Architecture
- Reflects the current and future state of data systems that support an organization’s long-term data needs and strategy.
- Is part of enterprise architecture.
- A description of the structure and interaction of the enterprise’s major types and sources of data, logical data assets, physical data assets, and data management resources.
- Data architecture is the design of systems to support the evolving data needs of an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs.
- Good data architecture serves business requirements with a common, widely reusable set of building blocks while maintaining flexibility and making appropriate trade-offs.
- It is flexible and easily maintainable. It is never finished.

Principles of Good Data Architecture
AWS Well-Architected Framework pillars: operational excellence, security, reliability, performance efficiency, cost optimization, sustainability.
Google Cloud principles for cloud-native architecture: design for automation; be smart with state; favor managed services; practice defense in depth; always be architecting.
Principles of good data architecture:
1. Choose common components wisely.
2. Plan for failure.
3. Architect for scalability.
4. Architecture is leadership.
5. Always be architecting.
6. Build loosely coupled systems.
7. Make reversible decisions.
8. Prioritize security.
9. Embrace FinOps.

Examples and Types of Data Architecture
- Data warehouse and data mart
- Data lake
- Data lakehouse
- Modern data stack
- Lambda architecture
- Kappa architecture
- Architecture for IoT
- Data mesh

Data Warehouse
A subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions. It is a central data hub used for reporting and analysis. Today the scalable, pay-as-you-go model has made cloud data warehouses accessible even to tiny companies. The organizational data warehouse architecture has two main characteristics:
- Separates analytics processes (OLAP) from production databases (online transaction processing)
- Centralizes and organizes data

ETL vs ELT
In the ELT data warehouse architecture, data is moved more or less directly from production systems into a staging area in the data warehouse. Staging in this setting indicates that the data is in a raw form. The data is then processed and transformed in batches, and the output is written into tables and views for analytics.

Data Marts
A data mart is a more refined subset of a warehouse designed to serve analytics and reporting, focused on a single suborganization, department, or line of business. It makes data more easily accessible to analysts and report developers. Data marts provide an additional stage of transformation beyond that provided by the initial ETL or ELT pipelines.

Data Lake vs Data Lakehouse
A data lake simply dumps all data—structured and unstructured—into a central location. Data lake 1.0 made solid contributions but generally failed due to complexity. The data lakehouse, introduced by Databricks, incorporates the controls, data management, and data structures found in a data warehouse while still housing data in object storage and supporting a variety of query and transformation engines. The lakehouse supports ACID transactions.

Modern Data Stack
Uses cloud-based, plug-and-play, easy-to-use, off-the-shelf components to create a modular and cost-effective data architecture. Typical components include data pipelines, storage, transformation, data management/governance, monitoring, visualization, and exploration.

Lambda Architecture
A reaction to the need to analyze streaming data. It consists of systems operating independently of each other: batch, streaming, and serving. The source system is ideally immutable and append-only, sending data to two destinations for processing: stream and batch. The architecture has several shortcomings.

Kappa Architecture
Why not just use a stream-processing platform as the backbone for all data handling: ingestion, storage, and serving? It represents a true event-based architecture. Real-time and batch processing can be applied seamlessly to the same data by reading the live event stream directly and replaying large chunks of data for batch processing. It is not yet widely adopted.
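To illustrate the core Kappa idea of one append-only event log serving both live reads and full replays for batch processing, here is a toy in-memory sketch. A production system would use a durable streaming platform; none of that operational machinery is modeled here.

```python
from dataclasses import dataclass, field

@dataclass
class EventLog:
    """Toy append-only log standing in for a stream platform's topic."""
    events: list = field(default_factory=list)

    def append(self, event: dict) -> None:
        self.events.append(event)

    def read_from(self, offset: int) -> list:
        """Both live consumers (tail the log) and batch jobs (replay from 0) use this."""
        return self.events[offset:]

log = EventLog()
for amount in (10, 25, 7):
    log.append({"type": "order", "amount": amount})

# Real-time view: pick up only the events after the last processed offset.
new_events = log.read_from(offset=2)

# Batch view: replay the entire log to rebuild an aggregate from scratch.
total_revenue = sum(e["amount"] for e in log.read_from(offset=0))
print(new_events, total_revenue)
```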
Architecture for IoT
The Internet of Things (IoT) is the distributed collection of devices, aka things: computers, sensors, mobile devices, smart home devices, and anything else with an internet connection.

Data Mesh
A recent response to sprawling monolithic data platforms. It attempts to invert the challenges of centralized data architecture, taking concepts from domain-driven design. A big part of the data mesh is decentralization.

Resources
Joe Reis and Matt Housley (2022). Fundamentals of Data Engineering: Plan and Build Robust Data Systems. O’Reilly Media. 554 pp.
Matt Bornstein, Jennifer Li, and Martin Casado (2020). Emerging Architectures for Modern Data Infrastructure.