Fundamentals of Data Engineering PDF
2022
Joe Reis & Matt Housley
Summary
Fundamentals of Data Engineering is a book about data engineering, providing a comprehensive understanding of this rapidly growing field. The book covers various cloud technologies, data concepts and processes.
Full Transcript
Fundamentals of Data Engineering
Plan and Build Robust Data Systems
Joe Reis & Matt Housley

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you’ll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available through the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You’ll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, and governance that are critical in any data environment regardless of the underlying technology.

This book will help you:
Get a concise overview of the entire data engineering landscape
Assess data engineering problems using an end-to-end framework of best practices
Cut through marketing hype when choosing data technologies, architecture, and processes
Use the data engineering lifecycle to design and build a robust architecture
Incorporate data governance and security across the data engineering lifecycle

Joe Reis is a “recovering data scientist,” and a data engineer and architect.
Matt Housley is a data engineering consultant and cloud specialist.

DATA
US $69.99 CAN $87.99
ISBN: 978-1-098-10830-4
Twitter: @oreillymedia
linkedin.com/company/oreilly-media
youtube.com/oreillymedia

Praise for Fundamentals of Data Engineering

The world of data has been evolving for a while now. First there were designers. Then database administrators. Then CIOs. Then data architects. This book signals the next step in the evolution and maturity of the industry. It is a must read for anyone who takes their profession and career honestly.
—Bill Inmon, creator of the data warehouse

Fundamentals of Data Engineering is a great introduction to the business of moving, processing, and handling data. It explains the taxonomy of data concepts, without focusing too heavily on individual tools or vendors, so the techniques and ideas should outlast any individual trend or product. I’d highly recommend it for anyone wanting to get up to speed in data engineering or analytics, or for existing practitioners who want to fill in any gaps in their understanding.
—Jordan Tigani, founder and CEO, MotherDuck, and founding engineer and cocreator of BigQuery

If you want to lead in your industry, you must build the capabilities required to provide exceptional customer and employee experiences. This is not just a technology problem. It’s a people opportunity. And it will transform your business.
Data engineers are at the center of this transformation. But today the discipline is misunderstood. This book will demystify data engineering and become your ultimate guide to succeeding with data.
—Bruno Aziza, Head of Data Analytics, Google Cloud

What a book! Joe and Matt are giving you the answer to the question, “What must I understand to do data engineering?” Whether you are getting started as a data engineer or strengthening your skills, you are not looking for yet another technology handbook. You are seeking to learn more about the underlying principles and the core concepts of the role, its responsibilities, its technical and organizational environment, its mission—that’s exactly what Joe and Matt offer in this book.
—Andy Petrella, founder of Kensu

This is the missing book in data engineering. A wonderfully thorough account of what it takes to be a good practicing data engineer, including thoughtful real-life considerations. I’d recommend all future education of data professionals include Joe and Matt’s work.
—Sarah Krasnik, data engineering leader

It is incredible to realize the breadth of knowledge a data engineer must have. But don’t let it scare you. This book provides a great foundational overview of various architectures, approaches, methodologies, and patterns that anyone working with data needs to be aware of. But what is even more valuable is that this book is full of golden nuggets of wisdom, best-practice advice, and things to consider when making decisions related to data engineering. It is a must read for both experienced and new data engineers.
—Veronika Durgin, data and analytics leader

I was honored and humbled to be asked by Joe and Matt to help technical review their masterpiece of data knowledge, Fundamentals of Data Engineering. Their ability to break down the key components that are critical to anyone wanting to move into a data engineering role is second to none.
Their writing style makes the information easy to absorb, and they leave no stone unturned. It was an absolute pleasure to work with some of the best thought leaders in the data space. I can’t wait to see what they do next.
—Chris Tabb, cofounder of LEIT DATA

Fundamentals of Data Engineering is the first book to take an in-depth and holistic look into the requirements of today’s data engineer. As you’ll see, the book dives into the critical areas of data engineering including skill sets, tools, and architectures used to manage, move, and curate data in today’s complex technical environments. More importantly, Joe and Matt convey their master of understanding data engineering and take the time to further dive into the more nuanced areas of data engineering and make it relatable to the reader. Whether you’re a manager, experienced data engineer, or someone wanting to get into the space, this book provides practical insight into today’s data engineering landscape.
—Jon King, Principal Data Architect

Two things will remain relevant to data engineers in 2042: SQL and this book. Joe and Matt cut through the hype around tools to extract the slowly changing dimensions of our discipline. Whether you’re starting your journey with data or adding stripes to your black belt, Fundamentals of Data Engineering lays the foundation for mastery.
—Kevin Hu, CEO of Metaplane

In a field that is rapidly changing, with new technology solutions popping up constantly, Joe and Matt provide clear, timeless guidance, focusing on the core concepts and foundational knowledge required to excel as a data engineer. This book is jam packed with information that will empower you to ask the right questions, understand trade-offs, and make the best decisions when designing your data architecture and implementing solutions across the data engineering lifecycle. Whether you’re just considering becoming a data engineer or have been in the field for years, I guarantee you’ll learn something from this book!
—Julie Price, Senior Product Manager, SingleStore

Fundamentals of Data Engineering isn’t just an instruction manual—it teaches you how to think like a data engineer. Part history lesson, part theory, and part acquired knowledge from Joe and Matt’s decades of experience, the book has definitely earned its place on every data professional’s bookshelf.
—Scott Breitenother, founder and CEO, Brooklyn Data Co.

There is no other book that so comprehensively covers what it means to be a data engineer. Joe and Matt dive deep into responsibilities, impacts, architectural choices, and so much more. Despite talking about such complex topics, the book is easy to read and digest. A very powerful combination.
—Danny Leybzon, MLOps Architect

I wish this book was around years ago when I started working with data engineers. The wide coverage of the field makes the involved roles clear and builds empathy with the many roles it takes to build a competent data discipline.
—Tod Hansmann, VP Engineering

A must read and instant classic for anyone in the data engineering field. This book fills a gap in the current knowledge base, discussing fundamental topics not found in other books. You will gain understanding of foundational concepts and insight into historical context about data engineering that will set up anyone to succeed.
—Matthew Sharp, Data and ML Engineer

Data engineering is the foundation of every analysis, machine learning model, and data product, so it is critical that it is done well. There are countless manuals, books, and references for each of the technologies used by data engineers, but very few (if any) resources that provide a holistic view of what it means to work as a data engineer. This book fills a critical need in the industry and does it well, laying the foundation for new and working data engineers to be successful and effective in their roles. This is the book that I’ll be recommending to anyone who wants to work with data at any level.
—Tobias Macey, host of The Data Engineering Podcast

Fundamentals of Data Engineering
Plan and Build Robust Data Systems
Joe Reis and Matt Housley
Beijing Boston Farnham Sebastopol Tokyo

Fundamentals of Data Engineering
by Joe Reis and Matt Housley
Copyright © 2022 Joseph Reis and Matthew Housley. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Jessica Haberman
Development Editor: Michele Cronin
Production Editor: Gregory Hyman
Copyeditor: Sharon Wilkey
Proofreader: Amnet Systems, LLC
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

July 2022: First Edition

Revision History for the First Edition
2022-06-22: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098108304 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Fundamentals of Data Engineering, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-10830-4
[LSI]

Table of Contents

Preface

Part I. Foundation and Building Blocks

1. Data Engineering Described
  What Is Data Engineering?
  Data Engineering Defined
  The Data Engineering Lifecycle
  Evolution of the Data Engineer
  Data Engineering and Data Science
  Data Engineering Skills and Activities
  Data Maturity and the Data Engineer
  The Background and Skills of a Data Engineer
  Business Responsibilities
  Technical Responsibilities
  The Continuum of Data Engineering Roles, from A to B
  Data Engineers Inside an Organization
  Internal-Facing Versus External-Facing Data Engineers
  Data Engineers and Other Technical Roles
  Data Engineers and Business Leadership
  Conclusion
  Additional Resources

2. The Data Engineering Lifecycle
  What Is the Data Engineering Lifecycle?
  The Data Lifecycle Versus the Data Engineering Lifecycle
  Generation: Source Systems
  Storage
  Ingestion
  Transformation
  Serving Data
  Major Undercurrents Across the Data Engineering Lifecycle
  Security
  Data Management
  DataOps
  Data Architecture
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

3. Designing Good Data Architecture
  What Is Data Architecture?
  Enterprise Architecture Defined
  Data Architecture Defined
  “Good” Data Architecture
  Principles of Good Data Architecture
  Principle 1: Choose Common Components Wisely
  Principle 2: Plan for Failure
  Principle 3: Architect for Scalability
  Principle 4: Architecture Is Leadership
  Principle 5: Always Be Architecting
  Principle 6: Build Loosely Coupled Systems
  Principle 7: Make Reversible Decisions
  Principle 8: Prioritize Security
  Principle 9: Embrace FinOps
  Major Architecture Concepts
  Domains and Services
  Distributed Systems, Scalability, and Designing for Failure
  Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
  User Access: Single Versus Multitenant
  Event-Driven Architecture
  Brownfield Versus Greenfield Projects
  Examples and Types of Data Architecture
  Data Warehouse
  Data Lake
  Convergence, Next-Generation Data Lakes, and the Data Platform
  Modern Data Stack
  Lambda Architecture
  Kappa Architecture
  The Dataflow Model and Unified Batch and Streaming
  Architecture for IoT
  Data Mesh
  Other Data Architecture Examples
  Who’s Involved with Designing a Data Architecture?
  Conclusion
  Additional Resources

4. Choosing Technologies Across the Data Engineering Lifecycle
  Team Size and Capabilities
  Speed to Market
  Interoperability
  Cost Optimization and Business Value
  Total Cost of Ownership
  Total Opportunity Cost of Ownership
  FinOps
  Today Versus the Future: Immutable Versus Transitory Technologies
  Our Advice
  Location
  On Premises
  Cloud
  Hybrid Cloud
  Multicloud
  Decentralized: Blockchain and the Edge
  Our Advice
  Cloud Repatriation Arguments
  Build Versus Buy
  Open Source Software
  Proprietary Walled Gardens
  Our Advice
  Monolith Versus Modular
  Monolith
  Modularity
  The Distributed Monolith Pattern
  Our Advice
  Serverless Versus Servers
  Serverless
  Containers
  How to Evaluate Server Versus Serverless
  Our Advice
  Optimization, Performance, and the Benchmark Wars
  Big Data...for the 1990s
  Nonsensical Cost Comparisons
  Asymmetric Optimization
  Caveat Emptor
  Undercurrents and Their Impacts on Choosing Technologies
  Data Management
  DataOps
  Data Architecture
  Orchestration Example: Airflow
  Software Engineering
  Conclusion
  Additional Resources

Part II. The Data Engineering Lifecycle in Depth

5. Data Generation in Source Systems
  Sources of Data: How Is Data Created?
  Source Systems: Main Ideas
  Files and Unstructured Data
  APIs
  Application Databases (OLTP Systems)
  Online Analytical Processing System
  Change Data Capture
  Logs
  Database Logs
  CRUD
  Insert-Only
  Messages and Streams
  Types of Time
  Source System Practical Details
  Databases
  APIs
  Data Sharing
  Third-Party Data Sources
  Message Queues and Event-Streaming Platforms
  Whom You’ll Work With
  Undercurrents and Their Impact on Source Systems
  Security
  Data Management
  DataOps
  Data Architecture
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

6. Storage
  Raw Ingredients of Data Storage
  Magnetic Disk Drive
  Solid-State Drive
  Random Access Memory
  Networking and CPU
  Serialization
  Compression
  Caching
  Data Storage Systems
  Single Machine Versus Distributed Storage
  Eventual Versus Strong Consistency
  File Storage
  Block Storage
  Object Storage
  Cache and Memory-Based Storage Systems
  The Hadoop Distributed File System
  Streaming Storage
  Indexes, Partitioning, and Clustering
  Data Engineering Storage Abstractions
  The Data Warehouse
  The Data Lake
  The Data Lakehouse
  Data Platforms
  Stream-to-Batch Storage Architecture
  Big Ideas and Trends in Storage
  Data Catalog
  Data Sharing
  Schema
  Separation of Compute from Storage
  Data Storage Lifecycle and Data Retention
  Single-Tenant Versus Multitenant Storage
  Whom You’ll Work With
  Undercurrents
  Security
  Data Management
  DataOps
  Data Architecture
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

7. Ingestion
  What Is Data Ingestion?
  Key Engineering Considerations for the Ingestion Phase
  Bounded Versus Unbounded Data
  Frequency
  Synchronous Versus Asynchronous Ingestion
  Serialization and Deserialization
  Throughput and Scalability
  Reliability and Durability
  Payload
  Push Versus Pull Versus Poll Patterns
  Batch Ingestion Considerations
  Snapshot or Differential Extraction
  File-Based Export and Ingestion
  ETL Versus ELT
  Inserts, Updates, and Batch Size
  Data Migration
  Message and Stream Ingestion Considerations
  Schema Evolution
  Late-Arriving Data
  Ordering and Multiple Delivery
  Replay
  Time to Live
  Message Size
  Error Handling and Dead-Letter Queues
  Consumer Pull and Push
  Location
  Ways to Ingest Data
  Direct Database Connection
  Change Data Capture
  APIs
  Message Queues and Event-Streaming Platforms
  Managed Data Connectors
  Moving Data with Object Storage
  EDI
  Databases and File Export
  Practical Issues with Common File Formats
  Shell
  SSH
  SFTP and SCP
  Webhooks
  Web Interface
  Web Scraping
  Transfer Appliances for Data Migration
  Data Sharing
  Whom You’ll Work With
  Upstream Stakeholders
  Downstream Stakeholders
  Undercurrents
  Security
  Data Management
  DataOps
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

8. Queries, Modeling, and Transformation
  Queries
  What Is a Query?
  The Life of a Query
  The Query Optimizer
  Improving Query Performance
  Queries on Streaming Data
  Data Modeling
  What Is a Data Model?
  Conceptual, Logical, and Physical Data Models
  Normalization
  Techniques for Modeling Batch Analytical Data
  Modeling Streaming Data
  Transformations
  Batch Transformations
  Materialized Views, Federation, and Query Virtualization
  Streaming Transformations and Processing
  Whom You’ll Work With
  Upstream Stakeholders
  Downstream Stakeholders
  Undercurrents
  Security
  Data Management
  DataOps
  Data Architecture
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

9. Serving Data for Analytics, Machine Learning, and Reverse ETL
  General Considerations for Serving Data
  Trust
  What’s the Use Case, and Who’s the User?
  Data Products
  Self-Service or Not?
  Data Definitions and Logic
  Data Mesh
  Analytics
  Business Analytics
  Operational Analytics
  Embedded Analytics
  Machine Learning
  What a Data Engineer Should Know About ML
  Ways to Serve Data for Analytics and ML
  File Exchange
  Databases
  Streaming Systems
  Query Federation
  Data Sharing
  Semantic and Metrics Layers
  Serving Data in Notebooks
  Reverse ETL
  Whom You’ll Work With
  Undercurrents
  Security
  Data Management
  DataOps
  Data Architecture
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

Part III. Security, Privacy, and the Future of Data Engineering

10. Security and Privacy
  People
  The Power of Negative Thinking
  Always Be Paranoid
  Processes
  Security Theater Versus Security Habit
  Active Security
  The Principle of Least Privilege
  Shared Responsibility in the Cloud
  Always Back Up Your Data
  An Example Security Policy
  Technology
  Patch and Update Systems
  Encryption
  Logging, Monitoring, and Alerting
  Network Access
  Security for Low-Level Data Engineering
  Conclusion
  Additional Resources

11. The Future of Data Engineering
  The Data Engineering Lifecycle Isn’t Going Away
  The Decline of Complexity and the Rise of Easy-to-Use Data Tools
  The Cloud-Scale Data OS and Improved Interoperability
  “Enterprisey” Data Engineering
  Titles and Responsibilities Will Morph...
  Moving Beyond the Modern Data Stack, Toward the Live Data Stack
  The Live Data Stack
  Streaming Pipelines and Real-Time Analytical Databases
  The Fusion of Data with Applications
  The Tight Feedback Between Applications and ML
  Dark Matter Data and the Rise of...Spreadsheets?!
  Conclusion

A. Serialization and Compression Technical Details
B. Cloud Networking
Index

Preface

How did this book come about? The origin is deeply rooted in our journey from data science into data engineering. We often jokingly refer to ourselves as recovering data scientists. We both had the experience of being assigned to data science projects, then struggling to execute these projects due to a lack of proper foundations. Our journey into data engineering began when we undertook data engineering tasks to build foundations and infrastructure.

With the rise of data science, companies splashed out lavishly on data science talent, hoping to reap rich rewards.
Very often, data scientists struggled with basic problems that their background and training did not address—data collection, data cleansing, data access, data transformation, and data infrastructure. These are problems that data engineering aims to solve.

What This Book Isn’t

Before we cover what this book is about and what you’ll get out of it, let’s quickly cover what this book isn’t. This book isn’t about data engineering using a particular tool, technology, or platform. While many excellent books approach data engineering technologies from this perspective, these books have a short shelf life. Instead, we focus on the fundamental concepts behind data engineering.

What This Book Is About

This book aims to fill a gap in current data engineering content and materials. While there’s no shortage of technical resources that address specific data engineering tools and technologies, people struggle to understand how to assemble these components into a coherent whole that applies in the real world. This book connects the dots of the end-to-end data lifecycle. It shows you how to stitch together various technologies to serve the needs of downstream data consumers such as analysts, data scientists, and machine learning engineers. This book works as a complement to O’Reilly books that cover the details of particular technologies, platforms, and programming languages.

The big idea of this book is the data engineering lifecycle: data generation, storage, ingestion, transformation, and serving. Since the dawn of data, we’ve seen the rise and fall of innumerable specific technologies and vendor products, but the data engineering lifecycle stages have remained essentially unchanged. With this framework, the reader will come away with a sound understanding for applying technologies to real-world business problems.

Our goal here is to map out principles that reach across two axes.
First, we wish to distill data engineering into principles that can encompass any relevant technology. Second, we wish to present principles that will stand the test of time. We hope that these ideas reflect lessons learned across the data technology upheaval of the last twenty years and that our mental framework will remain useful for a decade or more into the future.

One thing to note: we unapologetically take a cloud-first approach. We view the cloud as a fundamentally transformative development that will endure for decades; most on-premises data systems and workloads will eventually move to cloud hosting. We assume that infrastructure and systems are ephemeral and scalable, and that data engineers will lean toward deploying managed services in the cloud. That said, most concepts in this book will translate to non-cloud environments.

Who Should Read This Book

Our primary intended audience for this book consists of technical practitioners, mid- to senior-level software engineers, data scientists, or analysts interested in moving into data engineering; or data engineers working in the guts of specific technologies, but wanting to develop a more comprehensive perspective. Our secondary target audience consists of data stakeholders who work adjacent to technical practitioners—e.g., a data team lead with a technical background overseeing a team of data engineers, or a director of data warehousing wanting to migrate from on-premises technology to a cloud-based solution.

Ideally, you’re curious and want to learn—why else would you be reading this book? You stay current with data technologies and trends by reading books and articles on data warehousing/data lakes, batch and streaming systems, orchestration, modeling, management, analysis, developments in cloud technologies, etc. This book will help you weave what you’ve read into a complete picture of data engineering across technologies and paradigms.
Prerequisites

We assume a good deal of familiarity with the types of data systems found in a corporate setting. In addition, we assume that readers have some familiarity with SQL and Python (or some other programming language), and experience with cloud services.

Numerous resources are available for aspiring data engineers to practice Python and SQL. Free online resources abound (blog posts, tutorial sites, YouTube videos), and many new Python books are published every year.

The cloud provides unprecedented opportunities to get hands-on experience with data tools. We suggest that aspiring data engineers set up accounts with cloud services such as AWS, Azure, Google Cloud Platform, Snowflake, Databricks, etc. Note that many of these platforms have free tier options, but readers should keep a close eye on costs and work with small quantities of data and single node clusters as they study.

Developing familiarity with corporate data systems outside of a corporate environment remains difficult, and this creates certain barriers for aspiring data engineers who have yet to land their first data job. This book can help. We suggest that data novices read for high-level ideas and then look at materials in the Additional Resources section at the end of each chapter. On a second read through, note any unfamiliar terms and technologies. You can utilize Google, Wikipedia, blog posts, YouTube videos, and vendor sites to become familiar with new terms and fill gaps in your understanding.

What You’ll Learn and How It Will Improve Your Abilities

This book aims to help you build a solid foundation for solving real-world data engineering problems.
By the end of this book you will understand:
How data engineering impacts your current role (data scientist, software engineer, or data team lead)
How to cut through the marketing hype and choose the right technologies, data architecture, and processes
How to use the data engineering lifecycle to design and build a robust architecture
Best practices for each stage of the data lifecycle

And you will be able to:
Incorporate data engineering principles in your current role (data scientist, analyst, software engineer, data team lead, etc.)
Stitch together a variety of cloud technologies to serve the needs of downstream data consumers
Assess data engineering problems with an end-to-end framework of best practices
Incorporate data governance and security across the data engineering lifecycle

Navigating This Book

This book is composed of four parts:
Part I, “Foundation and Building Blocks”
Part II, “The Data Engineering Lifecycle in Depth”
Part III, “Security, Privacy, and the Future of Data Engineering”
Appendices A and B: covering serialization and compression, and cloud networking, respectively

In Part I, we begin by defining data engineering in Chapter 1, then map out the data engineering lifecycle in Chapter 2. In Chapter 3, we discuss good architecture. In Chapter 4, we introduce a framework for choosing the right technology—while we frequently see technology and architecture conflated, these are in fact very different topics.

Part II builds on Chapter 2 to cover the data engineering lifecycle in depth; each lifecycle stage—data generation, storage, ingestion, transformation and serving—is covered in its own chapter. Part II is arguably the heart of the book, and the other chapters exist to support the core ideas covered here.

Part III covers additional topics. In Chapter 10, we discuss security and privacy.
While security has always been an important part of the data engineering profession, it has only become more critical with the rise of for-profit hacking and state-sponsored cyber attacks. And what can we say of privacy? The era of corporate privacy nihilism is over—no company wants to see its name appear in the headline of an article on sloppy privacy practices. Reckless handling of personal data can also have significant legal ramifications with the advent of GDPR, CCPA, and other regulations. In short, security and privacy must be top priorities in any data engineering work.

In the course of working in data engineering, doing research for this book, and interviewing numerous experts, we thought a good deal about where the field is going in the near and long term. Chapter 11 outlines our highly speculative ideas on the future of data engineering. By its nature, the future is a slippery thing. Time will tell if some of our ideas are correct. We would love to hear from our readers on how their visions of the future agree with or differ from our own.

In the appendices, we cover a handful of technical topics that are extremely relevant to the day-to-day practice of data engineering but didn’t fit into the main body of the text. Specifically, engineers need to understand serialization and compression (see Appendix A) both to work directly with data files and to assess performance considerations in data systems, and cloud networking (see Appendix B) is a critical topic as data engineering shifts into the cloud.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords

This element signifies a tip or suggestion.
This element signifies a general note.

This element indicates a warning or caution.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/fundamentals-of-data.

Email [email protected] to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://www.youtube.com/oreillymedia

Acknowledgments

When we started writing this book, we were warned by many people that we faced a hard task. A book like this has a lot of moving parts, and due to its comprehensive view of the field of data engineering, it required a ton of research, interviews, discussions, and deep thinking. We won’t claim to have captured every nuance of data engineering, but we hope that the results resonate with you. Numerous individuals contributed to our efforts, and we’re grateful for the support we received from many experts.

First, thanks to our amazing crew of technical reviewers. They slogged through many readings and gave invaluable (and often ruthlessly blunt) feedback. This book would be a fraction of itself without their efforts. In no particular order, we give endless thanks to Bill Inmon, Andy Petrella, Matt Sharp, Tod Hanseman, Chris Tabb, Danny Lebzyon, Martin Kleppman, Scott Lorimor, Nick Schrock, Lisa Steckman, Veronika Durgin, and Alex Woolford.

Second, we’ve had a unique opportunity to talk with the leading experts in the field of data on our live shows, podcasts, meetups, and endless private calls.
Their ideas helped shape our book. There are too many people to name individually, but we’d like to give shoutouts to Jordan Tigani, Zhamak Dehghani, Ananth Packkildurai, Shruti Bhat, Eric Tschetter, Benn Stancil, Kevin Hu, Michael Rogove, Ryan Wright, Adi Polak, Shinji Kim, Andreas Kretz, Egor Gryaznov, Chad Sanderson, Julie Price, Matt Turck, Monica Rogati, Mars Lan, Pardhu Gunnam, Brian Suk, Barr Moses, Lior Gavish, Bruno Aziza, Gian Merlino, DeVaris Brown, Todd Beauchene, Tudor Girba, Scott Taylor, Ori Rafael, Lee Edwards, Bryan Offutt, Ollie Hughes, Gilbert Eijkelenboom, Chris Bergh, Fabiana Clemente, Andreas Kretz, Ori Reshef, Nick Singh, Mark Balkenende, Kenten Danas, Brian Olsen, Lior Gavish, Rhaghu Murthy, Greg Coquillo, David Aponte, Demetrios Brinkmann, Sarah Catanzaro, Michel Tricot, Levi Davis, Ted Walker, Carlos Kemeny, Josh Benamram, Chanin Nantasenamat, George Firican, Jordan Goldmeir, Minhaaj Rehmam, Luigi Patruno, Vin Vashista, Danny Ma, Jesse Anderson, Alessya Visnjic, Vishal Singh, Dave Langer, Roy Hasson, Todd Odess, Che Sharma, Scott Breitenother, Ben Taylor, Thom Ives, John Thompson, Brent Dykes, Josh Tobin, Mark Kosiba, Tyler Pugliese, Douwe Maan, Martin Traverso, Curtis Kowalski, Bob Davis, Koo Ping Shung, Ed Chenard, Matt Sciorma, Tyler Folkman, Jeff Baird, Tejas Manohar, Paul Singman, Kevin Stumpf, Willem Pineaar, and Michael Del Balso from Tecton, Emma Dahl, Harpreet Sahota, Ken Jee, Scott Taylor, Kate Strachnyi, Kristen Kehrer, Taylor Miller, Abe Gong, Ben Castleton, Ben Rogojan, David Mertz, Emmanuel Raj, Andrew Jones, Avery Smith, Brock Cooper, Jeff Larson, Jon King, Holden Ackerman, Miriah Peterson, Felipe Hoffa, David Gonzalez, Richard Wellman, Susan Walsh, Ravit Jain, Lauren Balik, Mikiko Bazeley, Mark Freeman, Mike Wimmer, Alexey Shchedrin, Mary Clair Thompson, Julie Burroughs, Jason Pedley, Freddy Drennan, Jason Pedley, Kelly and Matt Phillipps, Brian Campbell, Faris Chebib, Dylan Gregerson, Ken Myers, Jake
Carter, Seth Paul, Ethan Aaron, and many others. If you’re not mentioned specifically, don’t take it personally. You know who you are. Let us know and we’ll get you on the next edition.

We’d also like to thank the Ternary Data team (Colleen McAuley, Maike Wells, Patrick Dahl, Aaron Hunsaker, and others), our students, and the countless people around the world who’ve supported us. It’s a great reminder the world is a very small place.

Working with the O’Reilly crew was amazing! Special thanks to Jess Haberman for having confidence in us during the book proposal process, and to our amazing and extremely patient development editors Nicole Taché and Michele Cronin for invaluable editing, feedback, and support. Thank you also to the superb production team at O’Reilly (Greg and crew).

Joe would like to thank his family—Cassie, Milo, and Ethan—for letting him write a book. They had to endure a ton, and Joe promises to never write a book again. ;)

Matt would like to thank his friends and family for their enduring patience and support. He’s still hopeful that Seneca will deign to give a five-star review after a good deal of toil and missed family time around the holidays.

PART I
Foundation and Building Blocks

CHAPTER 1
Data Engineering Described

If you work in data or software, you may have noticed data engineering emerging from the shadows and now sharing the stage with data science. Data engineering is one of the hottest fields in data and technology, and for a good reason. It builds the foundation for data science and analytics in production. This chapter explores what data engineering is, how the field was born and has evolved, the skills of data engineers, and with whom they work.

What Is Data Engineering?

Despite the current popularity of data engineering, there’s a lot of confusion about what data engineering means and what data engineers do.
Data engineering has existed in some form since companies started doing things with data—such as predictive analysis, descriptive analytics, and reports—and came into sharp focus alongside the rise of data science in the 2010s. For the purpose of this book, it’s critical to define what data engineering and data engineer mean.

First, let’s look at the landscape of how data engineering is described and develop some terminology we can use throughout this book. Endless definitions of data engineering exist. In early 2022, a Google exact-match search for “what is data engineering?” returns over 91,000 unique results. Before we give our definition, here are a few examples of how some experts in the field define data engineering:

Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. It takes dedicated specialists—data engineers—to maintain data so that it remains available and usable by others. In short, data engineers set up and operate the organization’s data infrastructure, preparing it for further analysis by data analysts and scientists.
—From “Data Engineering and Its Main Concepts” by AlexSoft[1]

The first type of data engineering is SQL-focused. The work and primary storage of the data is in relational databases. All of the data processing is done with SQL or a SQL-based language. Sometimes, this data processing is done with an ETL tool.[2]

The second type of data engineering is Big Data–focused. The work and primary storage of the data is in Big Data technologies like Hadoop, Cassandra, and HBase. All of the data processing is done in Big Data frameworks like MapReduce, Spark, and Flink. While SQL is used, the primary processing is done with programming languages like Java, Scala, and Python.
—Jesse Anderson[3]

In relation to previously existing roles, the data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.
—Maxime Beauchemin[4]

Data engineering is all about the movement, manipulation, and management of data.
—Lewis Gavin[5]

Wow! It’s entirely understandable if you’ve been confused about data engineering. That’s only a handful of definitions, and they contain an enormous range of opinions about the meaning of data engineering.

Data Engineering Defined

When we unpack the common threads of how various people define data engineering, an obvious pattern emerges: a data engineer gets data, stores it, and prepares it for consumption by data scientists, analysts, and others. We define data engineering and data engineer as follows:

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.

[1] “Data Engineering and Its Main Concepts,” AlexSoft, last updated August 26, 2021, https://oreil.ly/e94py.
[2] ETL stands for extract, transform, load, a common pattern we cover in the book.
[3] Jesse Anderson, “The Two Types of Data Engineering,” June 27, 2018, https://oreil.ly/dxDt6.
[4] Maxime Beauchemin, “The Rise of the Data Engineer,” January 20, 2017, https://oreil.ly/kNDmd.
[5] Lewis Gavin, What Is Data Engineering? (Sebastopol, CA: O’Reilly, 2020), https://oreil.ly/ELxLi.

The Data Engineering Lifecycle

It is all too easy to fixate on technology and myopically miss the bigger picture. This book centers around a big idea called the data engineering lifecycle (Figure 1-1), which we believe gives data engineers the holistic context to view their role.

Figure 1-1. The data engineering lifecycle

The data engineering lifecycle shifts the conversation away from technology and toward the data itself and the end goals that it must serve. The stages of the data engineering lifecycle are as follows:

• Generation
• Storage
• Ingestion
• Transformation
• Serving

The data engineering lifecycle also has a notion of undercurrents—critical ideas across the entire lifecycle. These include security, data management, DataOps, data architecture, orchestration, and software engineering. We cover the data engineering lifecycle and its undercurrents more extensively in Chapter 2. Still, we introduce it here because it is essential to our definition of data engineering and the discussion that follows in this chapter.

Now that you have a working definition of data engineering and an introduction to its lifecycle, let’s take a step back and look at a bit of history.

Evolution of the Data Engineer

History doesn’t repeat itself, but it rhymes.
—A famous adage often attributed to Mark Twain

Understanding data engineering today and tomorrow requires a context of how the field evolved. This section is not a history lesson, but looking at the past is invaluable in understanding where we are today and where things are going. A common theme constantly reappears: what’s old is new again.
The early days: 1980 to 2000, from data warehousing to the web

The birth of the data engineer arguably has its roots in data warehousing, dating as far back as the 1970s, with the business data warehouse taking shape in the 1980s and Bill Inmon officially coining the term data warehouse in 1989. After engineers at IBM developed the relational database and Structured Query Language (SQL), Oracle popularized the technology. As nascent data systems grew, businesses needed dedicated tools and data pipelines for reporting and business intelligence (BI). To help people correctly model their business logic in the data warehouse, Ralph Kimball and Inmon developed their respective eponymous data-modeling techniques and approaches, which are still widely used today.

Data warehousing ushered in the first age of scalable analytics, as new massively parallel processing (MPP) databases, which use multiple processors to crunch large amounts of data, came on the market and supported unprecedented volumes of data. Roles such as BI engineer, ETL developer, and data warehouse engineer addressed the various needs of the data warehouse. Data warehouse and BI engineering were a precursor to today’s data engineering and still play a central role in the discipline.

The internet went mainstream around the mid-1990s, creating a whole new generation of web-first companies such as AOL, Yahoo, and Amazon. The dot-com boom spawned a ton of activity in web applications and the backend systems to support them—servers, databases, and storage. Much of the infrastructure was expensive, monolithic, and heavily licensed. The vendors selling these backend systems likely didn’t foresee the sheer scale of the data that web applications would produce.

The early 2000s: The birth of contemporary data engineering

Fast-forward to the early 2000s, when the dot-com boom of the late ’90s went bust, leaving behind a tiny cluster of survivors.
Some of these companies, such as Yahoo, Google, and Amazon, would grow into powerhouse tech companies. Initially, these companies continued to rely on the traditional monolithic, relational databases and data warehouses of the 1990s, pushing these systems to the limit. As these systems buckled, updated approaches were needed to handle data growth. The new generation of systems had to be cost-effective, scalable, available, and reliable.

Coinciding with the explosion of data, commodity hardware—such as servers, RAM, disks, and flash drives—also became cheap and ubiquitous. Several innovations allowed distributed computation and storage on massive computing clusters at a vast scale. These innovations started decentralizing and breaking apart traditionally monolithic services. The “big data” era had begun.

The Oxford English Dictionary defines big data as “extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.” Another famous and succinct description of big data is the three Vs of data: velocity, variety, and volume.

In 2003, Google published a paper on the Google File System, and shortly after that, in 2004, a paper on MapReduce, an ultra-scalable data-processing paradigm. In truth, big data has earlier antecedents in MPP data warehouses and data management for experimental physics projects, but Google’s publications constituted a “big bang” for data technologies and the cultural roots of data engineering as we know it today. You’ll learn more about MPP systems and MapReduce in Chapters 3 and 8, respectively.

The Google papers inspired engineers at Yahoo to develop and later open source Apache Hadoop in 2006.[6] It’s hard to overstate the impact of Hadoop. Software engineers interested in large-scale data problems were drawn to the possibilities of this new open source technology ecosystem.
As companies of all sizes and types saw their data grow into many terabytes and even petabytes, the era of the big data engineer was born.

Around the same time, Amazon had to keep up with its own exploding data needs and created elastic computing environments (Amazon Elastic Compute Cloud, or EC2), infinitely scalable storage systems (Amazon Simple Storage Service, or S3), highly scalable NoSQL databases (Amazon DynamoDB), and many other core data building blocks.[7] Amazon elected to offer these services for internal and external consumption through Amazon Web Services (AWS), becoming the first popular public cloud.

AWS created an ultra-flexible pay-as-you-go resource marketplace by virtualizing and reselling vast pools of commodity hardware. Instead of purchasing hardware for a data center, developers could simply rent compute and storage from AWS.

[6] Cade Metz, “How Yahoo Spawned Hadoop, the Future of Big Data,” Wired, October 18, 2011, https://oreil.ly/iaD9G.
[7] Ron Miller, “How AWS Came to Be,” TechCrunch, July 2, 2016, https://oreil.ly/VJehv.

As AWS became a highly profitable growth engine for Amazon, other public clouds would soon follow, such as Google Cloud, Microsoft Azure, and DigitalOcean. The public cloud is arguably one of the most significant innovations of the 21st century and spawned a revolution in the way software and data applications are developed and deployed.

The early big data tools and public cloud laid the foundation for today’s data ecosystem. The modern data landscape—and data engineering as we know it now—would not exist without these innovations.

The 2000s and 2010s: Big data engineering

Open source big data tools in the Hadoop ecosystem rapidly matured and spread from Silicon Valley to tech-savvy companies worldwide. For the first time, any business had access to the same bleeding-edge data tools used by the top tech companies.
Another revolution occurred with the transition from batch computing to event streaming, ushering in a new era of big “real-time” data. You’ll learn about batch and event streaming throughout this book.

Engineers could choose the latest and greatest—Hadoop, Apache Pig, Apache Hive, Dremel, Apache HBase, Apache Storm, Apache Cassandra, Apache Spark, Presto, and numerous other new technologies that came on the scene. Traditional enterprise-oriented and GUI-based data tools suddenly felt outmoded, and code-first engineering was in vogue with the ascendance of MapReduce. We (the authors) were around during this time, and it felt like old dogmas died a sudden death upon the altar of big data.

The explosion of data tools in the late 2000s and 2010s ushered in the big data engineer. To effectively use these tools and techniques—namely, the Hadoop ecosystem including Hadoop, YARN, Hadoop Distributed File System (HDFS), and MapReduce—big data engineers had to be proficient in software development and low-level infrastructure hacking, but with a shifted emphasis. Big data engineers typically maintained massive clusters of commodity hardware to deliver data at scale. While they might occasionally submit pull requests to Hadoop core code, they shifted their focus from core technology development to data delivery.

Big data quickly became a victim of its own success. As a buzzword, big data gained popularity during the early 2000s through the mid-2010s. Big data captured the imagination of companies trying to make sense of the ever-growing volumes of data and the endless barrage of shameless marketing from companies selling big data tools and services. Because of the immense hype, it was common to see companies using big data tools for small data problems, sometimes standing up a Hadoop cluster to process just a few gigabytes. It seemed like everyone wanted in on the big data action.
Dan Ariely tweeted, “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”

Figure 1-2 shows a snapshot of Google Trends for the search term “big data” to get an idea of the rise and fall of big data.

Figure 1-2. Google Trends for “big data” (March 2022)

Despite the term’s popularity, big data has lost steam. What happened? One word: simplification. Despite the power and sophistication of open source big data tools, managing them was a lot of work and required constant attention. Often, companies employed entire teams of big data engineers, costing millions of dollars a year, to babysit these platforms. Big data engineers often spent excessive time maintaining complicated tooling and arguably not as much time delivering the business’s insights and value.

Open source developers, clouds, and third parties started looking for ways to abstract, simplify, and make big data available without the high administrative overhead and cost of managing their clusters, and installing, configuring, and upgrading their open source code. The term big data is essentially a relic to describe a particular time and approach to handling large amounts of data.

Today, data is moving faster than ever and growing ever larger, but big data processing has become so accessible that it no longer merits a separate term; every company aims to solve its data problems, regardless of actual data size. Big data engineers are now simply data engineers.

The 2020s: Engineering for the data lifecycle

At the time of this writing, the data engineering role is evolving rapidly. We expect this evolution to continue at a rapid clip for the foreseeable future.
Whereas data engineers historically tended to the low-level details of monolithic frameworks such as Hadoop, Spark, or Informatica, the trend is moving toward decentralized, modularized, managed, and highly abstracted tools.

Indeed, data tools have proliferated at an astonishing rate (see Figure 1-3). Popular trends in the early 2020s include the modern data stack, representing a collection of off-the-shelf open source and third-party products assembled to make analysts’ lives easier. At the same time, data sources and data formats are growing both in variety and size. Data engineering is increasingly a discipline of interoperation, connecting various technologies like LEGO bricks to serve ultimate business goals.

Figure 1-3. Matt Turck’s Data Landscape in 2012 versus 2021

The data engineer we discuss in this book can be described more precisely as a data lifecycle engineer. With greater abstraction and simplification, a data lifecycle engineer is no longer encumbered by the gory details of yesterday’s big data frameworks. While data engineers maintain skills in low-level data programming and use these as required, they increasingly find their role focused on things higher in the value chain: security, data management, DataOps, data architecture, orchestration, and general data lifecycle management.[8]

[8] DataOps is an abbreviation for data operations. We cover this topic in Chapter 2. For more information, read the DataOps Manifesto.

As tools and workflows simplify, we’ve seen a noticeable shift in the attitudes of data engineers. Instead of focusing on who has the “biggest data,” open source projects and services are increasingly concerned with managing and governing data, making it easier to use and discover, and improving its quality. Data engineers are
now conversant in acronyms such as CCPA and GDPR;[9] as they engineer pipelines, they concern themselves with privacy, anonymization, data garbage collection, and compliance with regulations.

What’s old is new again. While “enterprisey” stuff like data management (including data quality and governance) was common for large enterprises in the pre-big-data era, it wasn’t widely adopted in smaller companies. Now that many of the challenging problems of yesterday’s data systems are solved, neatly productized, and packaged, technologists and entrepreneurs have shifted focus back to the “enterprisey” stuff, but with an emphasis on decentralization and agility, which contrasts with the traditional enterprise command-and-control approach.

We view the present as a golden age of data lifecycle management. Data engineers managing the data engineering lifecycle have better tools and techniques than ever before. We discuss the data engineering lifecycle and its undercurrents in greater detail in the next chapter.

Data Engineering and Data Science

Where does data engineering fit in with data science? There’s some debate, with some arguing data engineering is a subdiscipline of data science. We believe data engineering is separate from data science and analytics. They complement each other, but they are distinctly different. Data engineering sits upstream from data science (Figure 1-4), meaning data engineers provide the inputs used by data scientists (downstream from data engineering), who convert these inputs into something useful.

Figure 1-4. Data engineering sits upstream from data science

Consider the Data Science Hierarchy of Needs (Figure 1-5). In 2017, Monica Rogati published this hierarchy in an article that showed where AI and machine learning (ML) sat in proximity to more “mundane” areas such as data movement/storage, collection, and infrastructure.
[9] These acronyms stand for California Consumer Privacy Act and General Data Protection Regulation, respectively.

Figure 1-5. The Data Science Hierarchy of Needs

Although many data scientists are eager to build and tune ML models, the reality is an estimated 70% to 80% of their time is spent toiling in the bottom three parts of the hierarchy—gathering data, cleaning data, processing data—and only a tiny slice of their time on analysis and ML. Rogati argues that companies need to build a solid data foundation (the bottom three levels of the hierarchy) before tackling areas such as AI and ML.

Data scientists aren’t typically trained to engineer production-grade data systems, and they end up doing this work haphazardly because they lack the support and resources of a data engineer. In an ideal world, data scientists should spend more than 90% of their time focused on the top layers of the pyramid: analytics, experimentation, and ML. When data engineers focus on these bottom parts of the hierarchy, they build a solid foundation for data scientists to succeed.

With data science driving advanced analytics and ML, data engineering straddles the divide between getting data and getting value from data (see Figure 1-6). We believe data engineering is of equal importance and visibility to data science, with data engineers playing a vital role in making data science successful in production.

Figure 1-6. A data engineer gets data and provides value from the data

Data Engineering Skills and Activities

The skill set of a data engineer encompasses the “undercurrents” of data engineering: security, data management, DataOps, data architecture, and software engineering. This skill set requires an understanding of how to evaluate data tools and how they fit together across the data engineering lifecycle.
It’s also critical to know how data is produced in source systems and how analysts and data scientists will consume and create value after processing and curating data. Finally, a data engineer juggles a lot of complex moving parts and must constantly optimize along the axes of cost, agility, scalability, simplicity, reuse, and interoperability (Figure 1-7). We cover these topics in more detail in upcoming chapters.

Figure 1-7. The balancing act of data engineering

As we discussed, in the recent past, a data engineer was expected to know and understand how to use a small handful of powerful and monolithic technologies (Hadoop, Spark, Teradata, Hive, and many others) to create a data solution. Utilizing these technologies often required a sophisticated understanding of software engineering, networking, distributed computing, storage, or other low-level details. Their work would be devoted to cluster administration and maintenance, managing overhead, and writing pipeline and transformation jobs, among other tasks.

Nowadays, the data-tooling landscape is dramatically less complicated to manage and deploy. Modern data tools considerably abstract and simplify workflows. As a result, data engineers are now focused on balancing the simplest and most cost-effective, best-of-breed services that deliver value to the business. The data engineer is also expected to create agile data architectures that evolve as new trends emerge.

What are some things a data engineer does not do? A data engineer typically does not directly build ML models, create reports or dashboards, perform data analysis, build key performance indicators (KPIs), or develop software applications. A data engineer should have a good functioning understanding of these areas to serve stakeholders best.

Data Maturity and the Data Engineer

The level of data engineering complexity within a company depends a great deal on the company’s data maturity.
This significantly impacts a data engineer’s day-to-day job responsibilities and career progression. What is data maturity, exactly?

Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization, but data maturity does not simply depend on the age or revenue of a company. An early-stage startup can have greater data maturity than a 100-year-old company with annual revenues in the billions. What matters is the way data is leveraged as a competitive advantage.

Data maturity models have many versions, such as Data Management Maturity (DMM) and others, and it’s hard to pick one that is both simple and useful for data engineering. So, we’ll create our own simplified data maturity model. Our data maturity model (Figure 1-8) has three stages: starting with data, scaling with data, and leading with data. Let’s look at each of these stages and at what a data engineer typically does at each stage.

Figure 1-8. Our simplified data maturity model for a company

Stage 1: Starting with data

A company getting started with data is, by definition, in the very early stages of its data maturity. The com