Data Engineering: What? Why?
Summary
This document provides an overview of data engineering, including data systems paradigms and the modern data ecosystem. It also discusses the role of data engineering in data science projects, highlighting data scientists' focus on data engineering in real-world scenarios. The presentation explores the differences between data science and data engineering, emphasizes the importance of data engineering for machine learning, and showcases the emerging role of data engineers.
Outline: Data Engineering: What? Why?; Data System Paradigms; Modern Data Ecosystem

Data Engineering: What? Why?

Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean “rectangular” shape that fits in main memory, employing various statistical and ML algorithms on predefined objectives.
A valuable component, but sadly, missing the complete picture!

Now with Data Engineering
Data science today involves data engineering: a set of activities that includes collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
  Happens on large, messy, dynamic (and often non-rectangular) datasets.
  Happens across teams and across the organization.
  The team generating the data may not be the same team(s) consuming it.
  The objectives are often unclear and ill-defined.
A lot of data engineering needs to happen to support the conventional view!
Data systems are the tools that support data engineering.

The Data Science Industry Now
Vicki Boykis, 2019 [blog]:
…once these junior people get to the market, they come in with an unrealistic set of expectations about what data science work will look like. Everyone thinks they’re going to be doing machine learning, deep learning, … This is not their fault; this is what data science curriculums [sic] and the tech media emphasize….
The reality is that “data science” has never been as much about machine learning as it has about cleaning, shaping data, and moving it from place to place.
I personally like 2 more than 1!

[1/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
  Most of the time spent in real-world data science projects involves data engineering.
  Often underappreciated compared to other activities, e.g., ML.
  Data engineering activities, e.g., cleaning, moving, and processing data, occupy the majority of the time in data science.

[2/4] Why Learn Data Engineering?
2. Data engineer roles >> data scientist roles.
  Mihail Eric, Jan 2021 [blog]: “… 70% more open roles at companies in data engineering as compared to data science. As we train the next generation of data and ML practitioners, let’s place more emphasis on engineering skills.”
  “Data engineer” has emerged as a new specialized job category:
    Data scientist: Uses various techniques in statistics & ML to process & analyze data.
    Data engineer: Develops a robust and scalable set of data processing tools/platforms.
  Snarky follow-up: Devin Petersohn, Apr 2021 [blog].
  Estimates from PWC in 2015:
    data scientist positions: 50k
    data engineer positions: 500k
    data analyst positions: 125k
  That is 10x as many data engineer positions as data scientist positions! PWC, 2015 [link unknown].
  Even bolder claim: data science roles may disappear!?!
  Forbes, Feb 2019 [blog]: “Many data science teams have not delivered results that can be measured in ROI by executives.”
  Many teams have struggled because they can do “ML” but can’t do the data engineering needed to get to “ML”.
  “For complex data engineering tasks, you need five data engineers for every one data scientist.”

[3/4] Why Learn Data Engineering?
3. Data engineering is essential to ML/AI.
  Even when doing ML, the vast majority of an ML-powered system is not “ML code.”
  In most cases, “ML code” corresponds to calls to standard libraries, e.g., scikit-learn, pytorch, tensorflow, etc.
  The hard part is getting the data into the format and quality that these ML libraries expect!
  Sculley et al., SE4ML 2014 [google research].

Data Engineering is Essential in ML/AI
[Figure: Monica Rogati, 2017 [blog]]
Stuff you need to do first! A lot of this is data engineering.
In fact, for any sort of data-driven decision-making (ML/AI or not), you will need these skills.

Monica Rogati, 2017 [blog]:
“More often than not, companies are not ready for AI. Maybe they hired their first data scientist to less-than-stellar outcomes, or maybe data literacy is not central to their culture. But the most common scenario is that they have not yet built the infrastructure to implement (and reap the benefits of) the most basic data science algorithms and operations, much less ML.”
“However, under the strong influence of the current AI hype, people try to plug in data that’s dirty & full of gaps, that spans years while changing in format and meaning, that’s not understood yet, that’s structured in ways that don’t make sense, and expect those tools to magically handle it.”

New role: Machine Learning Engineer
Tomasz Dudek, 2018 [blog].
“ML Engineer”: a specialization of data engineer focused on operationalizing ML.
“A need for a person that would reunite two warring parties. One being fluent just enough in both fields [Data Science and Software Engineering] to get the product up and running. Somebody taking data scientists’ code and making it more effective and scalable.... Explaining the reasons behind architectural ideas to the devops team.”

Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.
4. Balance your data techniques with a systems perspective.
  Techniques: As a Data Science major, you are likely familiar with techniques: statistics/ML concepts & algorithms…
  Systems: …but you are likely less familiar with systems.
  You will learn systems and the infrastructure that enables these techniques.
  You’ll start thinking about efficiency, especially on large datasets.
  Various “plumbing analogies”: data pipelines, data flows, …
  Data engineering is as essential as plumbing! When it works well, you don’t realize it exists. When it doesn’t, you’ll really know.

All these Data Systems!!!
2023 MAD (ML/AI/Data) Landscape: blog, interactive

2023 MAD (ML/AI/Data) Landscape
Data systems is a difficult subject! There are many, many data systems – too many for us to cover.
In this course, we will try to cover the key categories and underlying principles. This way, you can make informed decisions about when to use what type of system.
The Bottom Line
Data engineering is an essential ingredient of real-world data science projects.
  A set of activities that includes collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
  The backbone, plumbing, or infrastructure that supports data science.
Understanding these skills will help you:
  Apply skills from intro data science classes to messy, large real-world datasets;
  Get your datasets to the point where you can apply AI/ML;
  Explore new, sought-after, & specialized roles, e.g., data engineer/ML engineer;
  Make informed decisions within the vast and confusing landscape of data systems; and
  Start worrying about efficiency :-)

Data System Paradigms

Data Engineering Lifecycle: What you mostly learn
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) -> Data Preparation -> Use-Case-Specific Data (Fit for purpose, Self-Service)]
Data preparation example: Research experiments
“Experts are close to the data and should be the ones extracting/analyzing”

Alternative picture, but more traditional enterprise
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) -> Data Preparation -> Use-Case-Specific Data (Fit for purpose, Self-Service); Raw Data -> Data Integration -> Source of Truth (Governed, Secure, Audited, Managed)]
Data must be integrated into the system, checked for correctness, verified, audited, and managed.
“Compute is expensive, data is precious”

E, T, and L
  Extract: Scrape raw data from all the source systems, e.g., transactions, sensors, log files, experiments, tables, bytestreams, …
  Transform: Apply a series of rules or functions, wrangle data into schema(s)/format(s)
  Load: Load data into a data storage solution
What do you think is an enterprise's preferred order of the steps Extract, Transform, and Load?
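Before settling the order, it may help to see each of the three steps as a separate function. The following is a minimal Python sketch under stated assumptions, not a production pipeline: the CSV contents, the `readings` table, the Celsius rule, and the function names are all hypothetical and chosen only for illustration, and an in-memory SQLite database stands in for the "data storage solution".

```python
import csv
import io
import sqlite3

# Hypothetical raw source: a CSV export standing in for sensor or log data.
RAW_CSV = """sensor_id,reading,unit
s1, 21.5 ,C
s2,70.7,F
s3,,C
"""


def extract(raw_text):
    """Extract: pull raw records out of the source, leaving values untouched."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(records):
    """Transform: apply a series of rules to wrangle records into one schema."""
    rows = []
    for rec in records:
        value = rec["reading"].strip()
        if not value:
            # Rule 1 (assumed): drop records that have no reading.
            continue
        celsius = float(value)
        if rec["unit"].strip() == "F":
            # Rule 2 (assumed): normalize Fahrenheit readings to Celsius.
            celsius = (celsius - 32) * 5 / 9
        rows.append((rec["sensor_id"].strip(), round(celsius, 1)))
    return rows


def load(rows, conn):
    """Load: write rows into a storage solution (here, a SQLite table)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, celsius REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
```

Calling `load(transform(extract(RAW_CSV)), conn)` with `conn = sqlite3.connect(":memory:")` composes the steps in one possible order. The functions are kept independent on purpose, so that other orderings remain possible.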
Answer: It depends
Two common enterprise data system implementations:

Data Warehouse, ~1990s
  “Single source of truth”: A central, organized repository of data used for analytics throughout an enterprise.
  Design the uber-schema up-front of all of the rectangular tables you’d ever want.
  Extract from trusted sources
  Transform to warehouse schema using custom tools
  Load data warehouse

Data Lake, ~2010s
  “Landing zone”: unconstrained storage for any and all data
  Data is then analyzed on demand
  Extract into files/storage
  Load into storage
  Transform on demand for any use.
    ○ Create new files in the lake, catalog files as they go for reuse

ETL for Data Warehouses
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) -> Extract -> Transform -> Load -> Data Warehouse (Data Integration; Source of Truth: Governed, Secure, Audited, Managed)]
  Complex transformations at scale, in parallel (high volume)
  Often relational data

ELT for Data Warehouses
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) -> Extract -> Load -> Transform, in the Data Warehouse (Data Integration; Source of Truth: Governed, Secure, Audited, Managed)]
  Transformations done in SQL
  Faster, scalable, allows unstructured data, but harder.
  Requires deep knowledge of warehousing tools

ET? For Data Lakes?
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) -> Extract -> Data Lake -> Transform -> Use-Case-Specific Data (Data Preparation; Fit for purpose, Self-Service)]
(joke)

Modern solution is likely Many-to-Many, ETLT
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) -> Extract -> Data Lake (Use-Case-Specific: Fit for purpose, Self-Service) and Data Warehouse (Data Preparation / Integration; Source of Truth: Governed, Secure, Audited, Managed), with Load and several Transform steps]
  Put data in a bunch of different places as needed.
  Transform and load when wanted.
  Find the right tool for the right job!

…but that was just the beginning….
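The difference between the ETL and ELT orderings above can be sketched in a few lines of Python. This is an illustrative sketch only: an in-memory SQLite database stands in for the warehouse, and the order records, table names, and cleaning rules are hypothetical. In `etl` the transform runs in application code before anything is loaded; in `elt` the raw rows are loaded first and the transform is a SQL statement executed inside the store.

```python
import sqlite3

# Hypothetical raw records: (order_id, amount as text, currency code).
RAW_ORDERS = [
    ("o1", "10.00", "usd"),
    ("o2", "n/a", "usd"),
    ("o3", "7.50", "USD"),
]


def etl(raw, conn):
    """ETL: transform outside the warehouse, then load only clean, typed rows."""
    clean = [
        (order_id, float(amount), currency.upper())
        for order_id, amount, currency in raw
        if amount != "n/a"
    ]
    conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)


def elt(raw, conn):
    """ELT: load raw data as-is, then transform inside the warehouse with SQL."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, amount TEXT, currency TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw)
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT order_id,
               CAST(amount AS REAL) AS amount,
               UPPER(currency) AS currency
        FROM raw_orders
        WHERE amount <> 'n/a'
        """
    )


def run(pipeline):
    """Run one pipeline against a fresh in-memory database and return `orders`."""
    conn = sqlite3.connect(":memory:")
    pipeline(RAW_ORDERS, conn)
    return conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

Both `run(etl)` and `run(elt)` produce the same cleaned `orders` table. The practical difference in this sketch is that the ELT variant also keeps `raw_orders` in the store, so a different transform can be applied later without re-extracting from the source.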
Really, really important considerations
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments); Data Lake (Use-Case-Specific: Fit for purpose, Self-Service); Data Warehouse (Data Preparation / Integration; Source of Truth: Governed, Secure, Audited, Managed); plus Data Discovery & Assessment and Data Quality & Integrity]

Important considerations
Data Discovery, Data Assessment
  Ad-Hoc: End-users land data, explore it, label it
  Systematic: Crawl the data lake for files
  Very content-centric: really a form of analytics/prediction
    ○ Try to figure out what type of data you have. AI + People!
Data Quality & Integrity
  Boolean integrity checks
  Often specified by people, also “mined” by AI
  Data changes ALL the time, especially from clients.
  Enforced: can “reject” or “sequester” data that violates the checks
    ○ e.g., no two products may have the same product ID!

Don’t forget about Metadata!!
Storing the data is not enough. Also need to store metadata! Generally three types:
Application Metadata:
  Data entities (e.g., students, courses, employees for a university)
  Relationships between data
  Constraints
Behavioral Metadata:
  Data Lineage – where did it come from?
  Audit Trails of Usage – who ran this job, and what did it do?
Change Metadata:
  Version info for all the above

Modern solutions
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments); Data Lake (Use-Case-Specific: Fit for purpose, Self-Service); Data Warehouse (Data Preparation / Integration; Source of Truth: Governed, Secure, Audited, Managed); Data Discovery & Assessment; Data Quality & Integrity; Metadata Store]

Operationalization and Feedback
Operationalization: Everything is an ongoing feed!
  When do jobs kick off, and what do they do?
  How are tests registered, exceptions handled, people alerted?
  How do experiments “graduate” into processes?
Feedback: Every data “product” is of interest!
  Some are datasets in their own right. If you produce a table, that’s also data!
  Many are new processes that generate new data feeds!
    ○ ML models: Constantly yielding predictions. Compare old predictions to new predictions?

Modern Data Ecosystem

Modern Data Ecosystem: Key Players in the Data Ecosystem

Modern Data Ecosystem
"The constant increase in data processing speeds and bandwidth, the nonstop invention of new tools for creating, sharing, and consuming data, and the steady addition of new data creators and consumers around the world, ensure that data growth continues unabated. Data begets more data in a constant virtuous cycle."
Forbes 2020 Report

The data ecosystem in use
Organizations are using data to uncover opportunities and apply that knowledge to differentiate themselves from their competitors.
  Analyzing financial data to detect patterns such as fraud
  Using recommendation engines to drive conversion
  Mining social media posts for the customer’s voice
  Analyzing customers’ behavior to personalize offers

Emerging technologies shaping the modern data ecosystem
  Every enterprise today has access to limitless storage, high-performance computing, open source technologies, machine learning technologies, and the latest tools and libraries
  Data Scientists are creating predictive models by training machine learning algorithms on past data
  Big Data is paving the way for new tools and techniques and also new knowledge and insights

Data Professionals:
  Data Engineers
  Data Analysts
  Data Scientists
  Business Analysts
  Business Intelligence Analysts

[Diagram labels: Data architectures; Business operations; Analysis; Business applications; Data Analysts & data scientists]

Data Engineers work within the data ecosystem to:
  Extract, integrate, and organize data from disparate sources
  Clean, transform, and prepare data
  Design, store, and manage data in data repositories
Skills:
  Good knowledge of programming
  Sound knowledge of systems and technology architectures
  In-depth understanding of relational databases and non-relational data stores

A Data Analyst translates data and numbers into plain language.
Responsibilities of a Data Analyst:
  Inspect and clean data to derive insights
  Identify correlations, find patterns, and apply statistical methods to analyze and mine data
  Visualize data to interpret and present the findings of data analysis
Example questions:
  "Are the users' search experiences generally good or bad with the search functionality on our site?"
  "What is the popular perception of people regarding our rebranding initiatives?"
  "Is there a correlation between sales of one product and another?"
Skills:
  Good knowledge of spreadsheets, writing queries, and using statistical tools to create charts and dashboards
  Programming skills
  Strong analytical and storytelling skills

Responsibilities of a Data Scientist:
  Analyze data for actionable insights
  Create predictive models using Machine Learning and Deep Learning
Example questions:
  "How many new social media followers am I likely to get next month?"
  "What percentage of my customers am I likely to lose to competition in the next quarter?"
  "Is this financial transaction unusual for this customer?"
Skills of a Data Scientist:
  Knowledge of Mathematics and Statistics
  Understanding of programming languages, databases, and building data models
  Domain Knowledge

Business Analysts leverage the work of Data Analysts and Data Scientists to look at possible implications for their business and the actions they need to take or recommend.

BI Analysts
  Focus on market forces and external influences that shape their business
  Organize and monitor data on different business functions
  Explore data to extract insights and actionables that improve business performance

To summarize
  Data Engineering converts raw data into usable data
  Data Analytics uses this data to generate insights
  Data Scientists use Data Analytics and Data Engineering to predict the future using data from the past
  Business Analysts and Business Intelligence Analysts use these insights and predictions to drive decisions that benefit and grow their business
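
As one concrete reading of "converts raw data into usable data", the sketch below applies the enforced integrity check mentioned under Data Quality & Integrity (no two products with the same product ID) and attaches a small amount of behavioral metadata. It is a minimal illustration under assumptions: the function names, the `product_id` field, and the `_source` / `_job` keys are hypothetical, and the choice to sequester every row involved in a conflict (rather than keep the first) is just one possible policy.

```python
from collections import Counter


def enforce_unique_product_id(rows):
    """Split rows into (accepted, sequestered) using a Boolean integrity check.

    Rule: no two products may share a product ID. Every row whose ID occurs
    more than once is sequestered for review instead of being silently dropped.
    """
    counts = Counter(row["product_id"] for row in rows)
    accepted = [row for row in rows if counts[row["product_id"]] == 1]
    sequestered = [row for row in rows if counts[row["product_id"]] > 1]
    return accepted, sequestered


def with_lineage(rows, source, job):
    """Attach minimal behavioral metadata: the row's source and the job that handled it."""
    return [{**row, "_source": source, "_job": job} for row in rows]
```

A typical use would be to call `enforce_unique_product_id` on a freshly landed batch, pass the accepted rows through `with_lineage`, and route the sequestered rows to a person or process that can resolve the conflict.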