
Big Data: Principles and Best Practices of Scalable Real-time Data Systems


Summary

This book details a new paradigm for large-scale data systems: the principles and best practices for building scalable real-time data systems. Using worked examples, it covers the batch, serving, and speed layers, explains why data systems need fault tolerance, scalability, and generality, and discusses how to compute on such systems, including the trade-off between recomputation and incremental computation.

Full Transcript


Principles and best practices of scalable real-time data systems Nathan Marz WITH James Warren MANNING Big Data PRINCIPLES AND BEST PRACTICES OF SCALABLE REAL-TIME DATA SYSTEMS NATHAN MARZ with JAMES WARREN MANNING Shelter Island Licensed to Mark Watson For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact Special Sales Department Manning Publications Co. 20 Baldwin Road PO Box 761 Shelter Island, NY 11964 Email: [email protected] ©2015 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps. Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine. Manning Publications Co. Development editors: Renae Gregoire, Jennifer Stout 20 Baldwin Road Technical development editor: Jerry Gaines PO Box 761 Copyeditor: Andy Carroll Shelter Island, NY 11964 Proofreader: Katie Tennant Technical proofreader: Jerry Kuch Typesetter: Gordan Salinovic Cover designer: Marija Tudor ISBN 9781617290343 Printed in the United States of America 1 2 3 4 5 6 7 8 9 10 – EBM – 20 19 18 17 16 15 Licensed to Mark Watson brief contents 1 A new paradigm for Big Data 1 PART 1 BATCH LAYER.................................................................25 2 Data model for Big Data 27 3 Data model for Big Data: Illustration 47 4 Data storage on the batch layer 54 5 Data storage on the batch layer: Illustration 65 6 Batch layer 83 7 Batch layer: Illustration 111 8 An example batch layer: Architecture and algorithms 139 9 An example batch layer: Implementation 156 PART 2 SERVING LAYER............................................................177 10 Serving layer 179 11 Serving layer: Illustration 196 iii Licensed to Mark Watson iv BRIEF CONTENTS PART 3 SPEED LAYER................................................................205 12 Realtime views 207 13 Realtime views: Illustration 220 14 Queuing and stream processing 225 15 Queuing and stream processing: Illustration 242 16 Micro-batch stream processing 254 17 Micro-batch stream processing: Illustration 269 18 Lambda Architecture in depth 284 Licensed to Mark Watson contents preface xiii acknowledgments xv about this book xviii 1 A new paradigm for Big Data 1 1.1 How this book is structured 2 1.2 Scaling with a traditional database 3 Scaling with a queue 3 Scaling by sharding the database 4 Fault-tolerance issues begin 5 Corruption issues 5 What went wrong? 5 How will Big Data techniques help? 
6 1.3 NoSQL is not a panacea 6 1.4 First principles 6 1.5 Desired properties of a Big Data system 7 Robustness and fault tolerance 7 Low latency reads and updates 8 Scalability 8 Generalization 8 Extensibility 8 Ad hoc queries 8 Minimal maintenance 9 Debuggability 9 1.6 The problems with fully incremental architectures 9 Operational complexity 10 Extreme complexity of achieving eventual consistency 11 Lack of human-fault tolerance 12 Fully incremental solution vs. Lambda Architecture solution 13 v Licensed to Mark Watson vi CONTENTS 1.7 Lambda Architecture 14 Batch layer 16 Serving layer 17 Batch and serving layers satisfy almost all properties 17 Speed layer 18 1.8 Recent trends in technology 20 CPUs aren’t getting faster 20 Elastic clouds 21 Vibrant open source ecosystem for Big Data 21 1.9 Example application: SuperWebAnalytics.com 22 1.10 Summary 23 PART 1 BATCH LAYER.......................................................25 2 Data model for Big Data 27 2.1 The properties of data 29 Data is raw 31 Data is immutable 34 Data is eternally true 36 2.2 The fact-based model for representing data 37 Example facts and their properties 37 Benefits of the fact-based model 39 2.3 Graph schemas 43 Elements of a graph schema 43 The need for an enforceable schema 44 2.4 A complete data model for SuperWebAnalytics.com 45 2.5 Summary 46 3 Data model for Big Data: Illustration 3.1 Why a serialization framework? 48 47 3.2 Apache Thrift 48 Nodes 49 Edges 49 Properties 50 Tying everything together into data objects 51 Evolving your schema 51 3.3 Limitations of serialization frameworks 52 3.4 Summary 53 4 Data storage on the batch layer 4.1 54 Storage requirements for the master dataset 55 4.2 Choosing a storage solution for the batch layer 56 Using a key/value store for the master dataset 56 Distributed filesystems 57 Licensed to Mark Watson CONTENTS vii 4.3 How distributed filesystems work 58 4.4 Storing a master dataset with a distributed filesystem 59 4.5 Vertical partitioning 61 4.6 Low-level nature of distributed filesystems 62 4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem 64 4.8 Summary 64 5 Data storage on the batch layer: Illustration 65 5.1 Using the Hadoop Distributed File System 66 The small-files problem 67 Towards a higher-level abstraction 67 5.2 Data storage in the batch layer with Pail 68 Basic Pail operations 69 Serializing objects into pails 70 Batch operations using Pail 72 Vertical partitioning with Pail 73 Pail file formats and compression 74 Summarizing the benefits of Pail 75 5.3 Storing the master dataset for SuperWebAnalytics.com 76 A structured pail for Thrift objects 77 A basic pail for SuperWebAnalytics.com 78 A split pail to vertically partition the dataset 78 5.4 Summary 82 6 Batch layer 83 6.1 Motivating examples 84 Number of pageviews over time 84 Gender inference 85 Influence score 85 6.2 Computing on the batch layer 86 6.3 Recomputation algorithms vs. 
incremental algorithms 88 Performance 89 Human-fault tolerance 90 Generality of the algorithms 91 Choosing a style of algorithm 91 6.4 Scalability in the batch layer 92 6.5 MapReduce: a paradigm for Big Data computing 93 Scalability 94 Fault-tolerance 96 Generality of MapReduce 97 6.6 Low-level nature of MapReduce 99 Multistep computations are unnatural 99 Joins are very complicated to implement manually 99 Logical and physical execution tightly coupled 101 Licensed to Mark Watson viii CONTENTS 6.7 Pipe diagrams: a higher-level way of thinking about batch computation 102 Concepts of pipe diagrams 102 Executing pipe diagrams via MapReduce 106 Combiner aggregators 107 Pipe diagram examples 108 6.8 Summary 109 7 Batch layer: Illustration 111 7.1 An illustrative example 112 7.2 Common pitfalls of data-processing tools 114 Custom languages 114 Poorly composable abstractions 115 7.3 An introduction to JCascalog 115 The JCascalog data model 116 The structure of a JCascalog query 117 Querying multiple datasets 119 Grouping and aggregators 121 Stepping though an example query 122 Custom predicate operations 125 7.4 Composition 130 Combining subqueries 130 Dynamically created subqueries 131 Predicate macros 134 Dynamically created predicate macros 136 7.5 Summary 138 8 An example batch layer: Architecture and algorithms 139 8.1 Design of the SuperWebAnalytics.com batch layer 140 Supported queries 140 Batch views 141 8.2 Workflow overview 144 8.3 Ingesting new data 145 8.4 URL normalization 146 8.5 User-identifier normalization 146 8.6 Deduplicate pageviews 151 8.7 Computing batch views 151 Pageviews over time 151 Unique visitors over time 152 Bounce-rate analysis 152 8.8 Summary 154 Licensed to Mark Watson CONTENTS ix 9 An example batch layer: Implementation 9.1 Starting point 157 156 9.2 Preparing the workflow 158 9.3 Ingesting new data 158 9.4 URL normalization 162 9.5 User-identifier normalization 163 9.6 Deduplicate pageviews 168 9.7 Computing batch views 169 Pageviews over time 169 Uniques over time 171 Bounce- rate analysis 172 9.8 Summary 175 PART 2 SERVING LAYER...................................................177 10 Serving layer 179 10.1 Performance metrics for the serving layer 181 10.2 The serving layer solution to the normalization/ denormalization problem 183 10.3 Requirements for a serving layer database 185 10.4 Designing a serving layer for SuperWebAnalytics.com 186 Pageviews over time 186 Uniques over time 187 Bounce- rate analysis 188 10.5 Contrasting with a fully incremental solution 188 Fully incremental solution to uniques over time 188 Comparing to the Lambda Architecture solution 194 10.6 Summary 195 11 Serving layer: Illustration 11.1 Basics of ElephantDB 196 197 View creation in ElephantDB 197 View serving in ElephantDB 197 Using ElephantDB 198 11.2 Building the serving layer for SuperWebAnalytics.com 200 Pageviews over time 200 Uniques over time 202 Bounce- rate analysis 203 11.3 Summary 204 Licensed to Mark Watson x CONTENTS PART 3 SPEED LAYER......................................................205 12 Realtime views 207 12.1 Computing realtime views 209 12.2 Storing realtime views 210 Eventual accuracy 211 Amount of state stored in the speed layer 211 12.3 Challenges of incremental computation 212 Validity of the CAP theorem 213 The complex interaction between the CAP theorem and incremental algorithms 214 12.4 Asynchronous versus synchronous updates 216 12.5 Expiring realtime views 217 12.6 Summary 219 13 Realtime views: Illustration 220 13.1 Cassandra’s data model 220 
13.2 Using Cassandra 222 Advanced Cassandra 224 13.3 Summary 224 14 Queuing and stream processing 225 14.1 Queuing 226 Single-consumer queue servers 226 Multi-consumer queues 228 14.2 Stream processing 229 Queues and workers 230 Queues-and-workers pitfalls 231 14.3 Higher-level, one-at-a-time stream processing 231 Storm model 232 Guaranteeing message processing 236 14.4 SuperWebAnalytics.com speed layer 238 Topology structure 240 14.5 Summary 241 15 Queuing and stream processing: Illustration 242 15.1 Defining topologies with Apache Storm 242 15.2 Apache Storm clusters and deployment 245 15.3 Guaranteeing message processing 247 Licensed to Mark Watson CONTENTS xi 15.4 Implementing the SuperWebAnalytics.com uniques-over-time speed layer 249 15.5 Summary 253 16 Micro-batch stream processing 254 16.1 Achieving exactly-once semantics 255 Strongly ordered processing 255 Micro-batch stream processing 256 Micro-batch processing topologies 257 16.2 Core concepts of micro-batch stream processing 259 16.3 Extending pipe diagrams for micro-batch processing 260 16.4 Finishing the speed layer for SuperWebAnalytics.com 262 Pageviews over time 262 Bounce-rate analysis 263 16.5 Another look at the bounce-rate-analysis example 267 16.6 Summary 268 17 Micro-batch stream processing: Illustration 269 17.1 Using Trident 270 17.2 Finishing the SuperWebAnalytics.com speed layer 273 Pageviews over time 273 Bounce-rate analysis 275 17.3 Fully fault-tolerant, in-memory, micro-batch processing 281 17.4 Summary 283 18 Lambda Architecture in depth 284 18.1 Defining data systems 285 18.2 Batch and serving layers 286 Incremental batch processing 286 Measuring and optimizing batch layer resource usage 293 18.3 Speed layer 297 18.4 Query layer 298 18.5 Summary 299 index 301 Licensed to Mark Watson Licensed to Mark Watson preface When I first entered the world of Big Data, it felt like the Wild West of software devel- opment. Many were abandoning the relational database and its familiar comforts for NoSQL databases with highly restricted data models designed to scale to thousands of machines. The number of NoSQL databases, many of them with only minor differ- ences between them, became overwhelming. A new project called Hadoop began to make waves, promising the ability to do deep analyses on huge amounts of data. Mak- ing sense of how to use these new tools was bewildering. At the time, I was trying to handle the scaling problems we were faced with at the company at which I worked. The architecture was intimidatingly complex—a web of sharded relational databases, queues, workers, masters, and slaves. Corruption had worked its way into the databases, and special code existed in the application to han- dle the corruption. Slaves were always behind. I decided to explore alternative Big Data technologies to see if there was a better design for our data architecture. One experience from my early software-engineering career deeply shaped my view of how systems should be architected. A coworker of mine had spent a few weeks col- lecting data from the internet onto a shared filesystem. He was waiting to collect enough data so that he could perform an analysis on it. One day while doing some routine maintenance, I accidentally deleted all of my coworker’s data, setting him behind weeks on his project. I knew I had made a big mistake, but as a new software engineer I didn’t know what the consequences would be. Was I going to get fired for being so careless? 
I sent out an email to the team apologizing profusely—and to my great surprise, everyone was very sympathetic. I’ll never forget when a coworker came to my desk, patted my back, and said “Congratulations. You’re now a professional software engineer.” xiii Licensed to Mark Watson xiv PREFACE In his joking statement lay a deep unspoken truism in software development: we don’t know how to make perfect software. Bugs can and do get deployed to production. If the application can write to the database, a bug can write to the database as well. When I set about redesigning our data architecture, this experience profoundly affected me. I knew our new architecture not only had to be scalable, tolerant to machine failure, and easy to reason about—but tolerant of human mistakes as well. My experience re-architecting that system led me down a path that caused me to question everything I thought was true about databases and data management. I came up with an architecture based on immutable data and batch computation, and I was astonished by how much simpler the new system was compared to one based solely on incremental computation. Everything became easier, including operations, evolving the system to support new features, recovering from human mistakes, and doing per- formance optimization. The approach was so generic that it seemed like it could be used for any data system. Something confused me though. When I looked at the rest of the industry, I saw that hardly anyone was using similar techniques. Instead, daunting amounts of com- plexity were embraced in the use of architectures based on huge clusters of incremen- tally updated databases. So many of the complexities in those architectures were either completely avoided or greatly softened by the approach I had developed. Over the next few years, I expanded on the approach and formalized it into what I dubbed the Lambda Architecture. When working on a startup called BackType, our team of five built a social media analytics product that provided a diverse set of realtime analytics on over 100 TB of data. Our small team also managed deployment, opera- tions, and monitoring of the system on a cluster of hundreds of machines. When we showed people our product, they were astonished that we were a team of only five people. They would often ask “How can so few people do so much?” My answer was simple: “It’s not what we’re doing, but what we’re not doing.” By using the Lambda Architecture, we avoided the complexities that plague traditional architectures. By avoiding those complexities, we became dramatically more productive. The Big Data movement has only magnified the complexities that have existed in data architectures for decades. Any architecture based primarily on large databases that are updated incrementally will suffer from these complexities, causing bugs, bur- densome operations, and hampered productivity. Although SQL and NoSQL data- bases are often painted as opposites or as duals of each other, at a fundamental level they are really the same. They encourage this same architecture with its inevitable complexities. Complexity is a vicious beast, and it will bite you regardless of whether you acknowledge it or not. This book is the result of my desire to spread the knowledge of the Lambda Archi- tecture and how it avoids the complexities of traditional architectures. It is the book I wish I had when I started working with Big Data. 
I hope you treat this book as a jour- ney—a journey to challenge what you thought you knew about data systems, and to discover that working with Big Data can be elegant, simple, and fun. NATHAN MARZ Licensed to Mark Watson acknowledgments This book would not have been possible without the help and support of so many individuals around the world. I must start with my parents, who instilled in me from a young age a love of learning and exploring the world around me. They always encour- aged me in all my career pursuits. Likewise, my brother Iorav encouraged my intellectual interests from a young age. I still remember when he taught me Algebra while I was in elementary school. He was the one to introduce me to programming for the first time—he taught me Visual Basic as he was taking a class on it in high school. Those lessons sparked a passion for programming that led to my career. I am enormously grateful to Michael Montano and Christopher Golda, the found- ers of BackType. From the moment they brought me on as their first employee, I was given an extraordinary amount of freedom to make decisions. That freedom was essential for me to explore and exploit the Lambda Architecture to its fullest. They never questioned the value of open source and allowed me to open source our tech- nology liberally. Getting deeply involved with open source has been one of the great privileges of my life. Many of my professors from my time as a student at Stanford deserve special thanks. Tim Roughgarden is the best teacher I’ve ever had—he radically improved my ability to rigorously analyze, deconstruct, and solve difficult problems. Taking as many classes as possible with him was one of the best decisions of my life. I also give thanks to Monica Lam for instilling within me an appreciation for the elegance of Datalog. Many years later I married Datalog with MapReduce to produce my first significant open source project, Cascalog. xv Licensed to Mark Watson xvi ACKNOWLEDGMENTS Chris Wensel was the first one to show me that processing data at scale could be elegant and performant. His Cascading library changed the way I looked at Big Data processing. None of my work would have been possible without the pioneers of the Big Data field. Special thanks to Jeffrey Dean and Sanjay Ghemawat for the original MapReduce paper, Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels for the original Dynamo paper, and Michael Cafarella and Doug Cutting for founding the Apache Hadoop project. Rich Hickey has been one of my biggest inspirations during my programming career. Clojure is the best language I have ever used, and I’ve become a better pro- grammer having learned it. I appreciate its practicality and focus on simplicity. Rich’s philosophy on state and complexity in programming has influenced me deeply. When I started writing this book, I was not nearly the writer I am now. Renae Gre- goire, one of my development editors at Manning, deserves special thanks for helping me improve as a writer. She drilled into me the importance of using examples to lead into general concepts, and she set off many light bulbs for me on how to effectively structure technical writing. The skills she taught me apply not only to writing techni- cal books, but to blogging, giving talks, and communication in general. For gaining an important life skill, I am forever grateful. 
This book would not be nearly of the same quality without the efforts of my co- author James Warren. He did a phenomenal job absorbing the theoretical concepts and finding even better ways to present the material. Much of the clarity of the book comes from his great communication skills. My publisher, Manning, was a pleasure to work with. They were patient with me and understood that finding the right way to write on such a big topic takes time. Through the whole process they were supportive and helpful, and they always gave me the resources I needed to be successful. Thanks to Marjan Bace and Michael Stephens for all the support, and to all the other staff for their help and guidance along the way. I try to learn as much as possible about writing from studying other writers. Brad- ford Cross, Clayton Christensen, Paul Graham, Carl Sagan, and Derek Sivers have been particularly influential. Finally, I can’t give enough thanks to the hundreds of people who reviewed, com- mented, and gave feedback on our book as it was being written. That feedback led us to revise, rewrite, and restructure numerous times until we found ways to present the material effectively. Special thanks to Aaron Colcord, Aaron Crow, Alex Holmes, Arun Jacob, Asif Jan, Ayon Sinha, Bill Graham, Charles Brophy, David Beckwith, Derrick Burns, Douglas Duncan, Hugo Garza, Jason Courcoux, Jonathan Esterhazy, Karl Kuntz, Kevin Martin, Leo Polovets, Mark Fisher, Massimo Ilario, Michael Fogus, Michael G. Noll, Patrick Dennis, Pedro Ferrera Bertran, Philipp Janert, Rodrigo Abreu, Rudy Bonefas, Sam Ritchie, Siva Kalagarla, Soren Macbeth, Timothy Chk- lovski, Walid Farid, and Zhenhua Guo. NATHAN MARZ Licensed to Mark Watson ACKNOWLEDGMENTS xvii I’m astounded when I consider everyone who contributed in some manner to this book. Unfortunately, I can’t provide an exhaustive list, but that doesn’t lessen my appreciation. Nonetheless, there are individuals to whom I wish to explicitly express my gratitude: My wife, Wen-Ying Feng—for your love, encouragement and support, not only for this book but for everything we do together. My parents, James and Gretta Warren—for your endless faith in me and the sac- rifices you made to provide me with every opportunity. My sister, Julia Warren-Ulanch—for setting a shining example so I could follow in your footsteps. My professors and mentors, Ellen Toby and Sue Geller—for your willingness to answer my every question and for demonstrating the joy in sharing knowledge, not just acquiring it. Chuck Lam—for saying “Hey, have you heard of this thing called Hadoop?” to me so many years ago. My friends and colleagues at RockYou!, Storm8, and Bina—for the experiences we shared together and the opportunity to put theory into practice. Marjan Bace, Michael Stephens, Jennifer Stout, Renae Gregoire, and the entire Manning editorial and publishing staff—for your guidance and patience in see- ing this book to completion. The reviewers and early readers of this book—for your comments and critiques that pushed us to clarify our words; the end result is so much better for it. Finally, I want to convey my greatest appreciation to Nathan for inviting me to come along on this journey. I was already a great admirer of your work before joining this venture, and working with you has only deepened my respect for your ideas and phi- losophy. It has been an honor and a privilege. 
JAMES WARREN Licensed to Mark Watson about this book Services like social networks, web analytics, and intelligent e-commerce often need to manage data at a scale too big for a traditional database. Complexity increases with scale and demand, and handling Big Data is not as simple as just doubling down on your RDBMS or rolling out some trendy new technology. Fortunately, scalability and simplicity are not mutually exclusive—you just need to take a different approach. Big Data systems use many machines working in parallel to store and process data, which introduces fundamental challenges unfamiliar to most developers. Big Data teaches you to build these systems using an architecture that takes advan- tage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to Big Data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of Big Data systems and how to imple- ment them in practice. Big Data requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful, though not required. The goal of the book is to teach you how to think about data systems and how to break down difficult problems into simple solutions. We start from first principles and from those deduce the necessary properties for each component of an architecture. Roadmap An overview of the 18 chapters in this book follows. Chapter 1 introduces the principles of data systems and gives an overview of the Lambda Architecture: a generalized approach to building any data system. Chapters 2 through 17 dive into all the pieces of the Lambda Architecture, with chapters alternating between theory and illustration chapters. Theory chapters demonstrate the xviii Licensed to Mark Watson ABOUT THIS BOOK xix concepts that hold true regardless of existing tools, while illustration chapters use real-world tools to demonstrate the concepts. Don’t let the names fool you, though— all chapters are highly example-driven. Chapters 2 through 9 focus on the batch layer of the Lambda Architecture. Here you will learn about modeling your master dataset, using batch processing to create arbitrary views of your data, and the trade-offs between incremental and batch processing. Chapters 10 and 11 focus on the serving layer, which provides low latency access to the views produced by the batch layer. Here you will learn about specialized databases that are only written to in bulk. You will discover that these databases are dramatically simpler than traditional databases, giving them excellent performance, operational, and robustness properties. Chapters 12 through 17 focus on the speed layer, which compensates for the batch layer’s high latency to provide up-to-date results for all queries. Here you will learn about NoSQL databases, stream processing, and managing the complexities of incre- mental computation. Chapter 18 uses your new-found knowledge to review the Lambda Architecture once more and fill in any remaining gaps. You’ll learn about incremental batch pro- cessing, variants of the basic Lambda Architecture, and how to get the most out of your resources. Code downloads and conventions The source code for the book can be found at https://github.com/Big-Data-Manning. We have provided source code for the running example SuperWebAnalytics.com. Much of the source code is shown in numbered listings. 
These listings are meant to provide complete segments of code. Some listings are annotated to help highlight or explain certain parts of the code. In other places throughout the text, code frag- ments are used when necessary. Courier typeface is used to denote code for Java. In both the listings and fragments, we make use of a bold code font to help identify key parts of the code that are being explained in the text. Author Online Purchase of Big Data includes free access to a private web forum run by Manning Pub- lications where you can make comments about the book, ask technical questions, and receive help from the authors and other users. To access the forum and subscribe to it, point your web browser to www.manning.com/BigData. This Author Online (AO) page provides information on how to get on the forum once you’re registered, what kind of help is available, and the rules of conduct on the forum. Manning’s commitment to our readers is to provide a venue where a meaningful dialog among individual readers and between readers and the authors can take place. It’s not a commitment to any specific amount of participation on the part of the authors, whose contribution to the AO forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions, lest their interest stray! Licensed to Mark Watson xx ABOUT THIS BOOK The AO forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print. About the cover illustration The figure on the cover of Big Data is captioned “Le Raccommodeur de Fiance,” which means a mender of clayware. His special talent was mending broken or chipped pots, plates, cups, and bowls, and he traveled through the countryside, visiting the towns and villages of France, plying his trade. The illustration is taken from a nineteenth-century edition of Sylvain Maréchal’s four-volume compendium of regional dress customs published in France. Each illus- tration is finely drawn and colored by hand. The rich variety of Maréchal’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify where they lived and what their trade or station in life was just by their dress. Dress codes have changed since then, and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different conti- nents, let alone different towns or regions. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technolog- ical life. At a time when it is hard to tell one computer book from another, Manning cele- brates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Maréchal’s pictures. Licensed to Mark Watson A new paradigm for Big Data This chapter covers Typical problems encountered when scaling a traditional database Why NoSQL is not a panacea Thinking about Big Data systems from first principles Landscape of Big Data tools Introducing SuperWebAnalytics.com In the past decade the amount of data being created has skyrocketed. More than 30,000 gigabytes of data are generated every second, and the rate of data creation is only accelerating. The data we deal with is diverse. 
Users create content like blog posts, tweets, social network interactions, and photos. Servers continuously log messages about what they’re doing. Scientists create detailed measurements of the world around us. The internet, the ultimate source of data, is almost incomprehensibly large. This astonishing growth in data has profoundly affected businesses. Traditional database systems, such as relational databases, have been pushed to the limit. In an 1 Licensed to Mark Watson 2 CHAPTER 1 A new paradigm for Big Data increasing number of cases these systems are breaking under the pressures of “Big Data.” Traditional systems, and the data management techniques associated with them, have failed to scale to Big Data. To tackle the challenges of Big Data, a new breed of technologies has emerged. Many of these new technologies have been grouped under the term NoSQL. In some ways, these new technologies are more complex than traditional databases, and in other ways they’re simpler. These systems can scale to vastly larger sets of data, but using these technologies effectively requires a fundamentally new set of techniques. They aren’t one-size-fits-all solutions. Many of these Big Data systems were pioneered by Google, including distributed filesystems, the MapReduce computation framework, and distributed locking services. Another notable pioneer in the space was Amazon, which created an innovative dis- tributed key/value store called Dynamo. The open source community responded in the years following with Hadoop, HBase, MongoDB, Cassandra, RabbitMQ, and count- less other projects. This book is about complexity as much as it is about scalability. In order to meet the challenges of Big Data, we’ll rethink data systems from the ground up. You’ll dis- cover that some of the most basic ways people manage data in traditional systems like relational database management systems (RDBMSs) are too complex for Big Data sys- tems. The simpler, alternative approach is the new paradigm for Big Data that you’ll explore. We have dubbed this approach the Lambda Architecture. In this first chapter, you’ll explore the “Big Data problem” and why a new para- digm for Big Data is needed. You’ll see the perils of some of the traditional techniques for scaling and discover some deep flaws in the traditional way of building data sys- tems. By starting from first principles of data systems, we’ll formulate a different way to build data systems that avoids the complexity of traditional techniques. You’ll take a look at how recent trends in technology encourage the use of new kinds of systems, and finally you’ll take a look at an example Big Data system that we’ll build through- out this book to illustrate the key concepts. 1.1 How this book is structured You should think of this book as primarily a theory book, focusing on how to approach building a solution to any Big Data problem. The principles you’ll learn hold true regardless of the tooling in the current landscape, and you can use these principles to rigorously choose what tools are appropriate for your application. This book is not a survey of database, computation, and other related technolo- gies. Although you’ll learn how to use many of these tools throughout this book, such as Hadoop, Cassandra, Storm, and Thrift, the goal of this book is not to learn those tools as an end in themselves. Rather, the tools are a means of learning the underlying principles of architecting robust and scalable data systems. 
Doing an involved compare-and-contrast between the tools would not do you justice, as that just distracts from learning the underlying principles. Put another way, you're going to learn how to fish, not just how to use a particular fishing rod.

In that vein, we have structured the book into theory and illustration chapters. You can read just the theory chapters and gain a full understanding of how to build Big Data systems—but we think the process of mapping that theory onto specific tools in the illustration chapters will give you a richer, more nuanced understanding of the material. Don't be fooled by the names though—the theory chapters are very much example-driven. The overarching example in the book—SuperWebAnalytics.com—is used in both the theory and illustration chapters. In the theory chapters you'll see the algorithms, index designs, and architecture for SuperWebAnalytics.com. The illustration chapters will take those designs and map them onto functioning code with specific tools.

1.2 Scaling with a traditional database

Let's begin our exploration of Big Data by starting from where many developers start: hitting the limits of traditional database technologies.

Suppose your boss asks you to build a simple web analytics application. The application should track the number of pageviews for any URL a customer wishes to track. The customer's web page pings the application's web server with its URL every time a pageview is received. Additionally, the application should be able to tell you at any point what the top 100 URLs are by number of pageviews.

You start with a traditional relational schema for the pageviews that looks something like figure 1.1. Your back end consists of an RDBMS with a table of that schema and a web server. Whenever someone loads a web page being tracked by your application, the web page pings your web server with the pageview, and your web server increments the corresponding row in the database.

Figure 1.1 Relational schema for simple analytics application (columns: id integer, user_id integer, url varchar(255), pageviews bigint)

Let's see what problems emerge as you evolve the application. As you're about to see, you'll run into problems with both scalability and complexity.

1.2.1 Scaling with a queue

The web analytics product is a huge success, and traffic to your application is growing like wildfire. Your company throws a big party, but in the middle of the celebration you start getting lots of emails from your monitoring system. They all say the same thing: "Timeout error on inserting to the database."

You look at the logs and the problem is obvious. The database can't keep up with the load, so write requests to increment pageviews are timing out. You need to do something to fix the problem, and you need to do something quickly. You realize that it's wasteful to only perform a single increment at a time to the database. It can be more efficient if you batch many increments in a single request. So you re-architect your back end to make this possible.

Instead of having the web server hit the database directly, you insert a queue between the web server and the database. Whenever you receive a new pageview, that event is added to the queue. You then create a worker process that reads 100 events at a time off the queue, and batches them into a single database update. This is illustrated in figure 1.2.

Figure 1.2 Batching updates with queue and worker (web server -> queue -> worker -> DB, 100 events at a time)
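The queue-and-worker arrangement can be sketched in a few lines of Java. This is only an illustration of the idea, not code from the book: the in-process queue and the print statement stand in for a real queue server and a real batched database write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the queue-and-worker batching idea (illustrative only).
public class PageviewBatcher {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Web server side: enqueue the pageview instead of writing to the database.
    public void recordPageview(String url) {
        queue.offer(url);
    }

    // Worker side: pull up to 100 pageviews and apply them as one database update.
    public void processBatch() throws InterruptedException {
        List<String> batch = new ArrayList<>();
        batch.add(queue.take());      // block until at least one event is available
        queue.drainTo(batch, 99);     // then grab up to 99 more without blocking
        flushToDatabase(batch);       // one round trip instead of one per pageview
    }

    // Placeholder for a single batched UPDATE (e.g., a JDBC batch statement that
    // groups the URLs and increments each counter once per occurrence).
    private void flushToDatabase(List<String> batch) {
        System.out.println("Applying " + batch.size() + " increments in one update");
    }
}
```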
This scheme works well, and it resolves the timeout issues you were getting. It even has the added bonus that if the database ever gets overloaded again, the queue will just get bigger instead of timing out to the web server and potentially losing data.

1.2.2 Scaling by sharding the database

Unfortunately, adding a queue and doing batch updates was only a band-aid for the scaling problem. Your application continues to get more and more popular, and again the database gets overloaded. Your worker can't keep up with the writes, so you try adding more workers to parallelize the updates. Unfortunately that doesn't help; the database is clearly the bottleneck.

You do some Google searches for how to scale a write-heavy relational database. You find that the best approach is to use multiple database servers and spread the table across all the servers. Each server will have a subset of the data for the table. This is known as horizontal partitioning or sharding. This technique spreads the write load across multiple machines.

The sharding technique you use is to choose the shard for each key by taking the hash of the key modded by the number of shards. Mapping keys to shards using a hash function causes the keys to be uniformly distributed across the shards. You write a script to map over all the rows in your single database instance, and split the data into four shards. It takes a while to run, so you turn off the worker that increments pageviews to let it finish. Otherwise you'd lose increments during the transition.

Finally, all of your application code needs to know how to find the shard for each key. So you wrap a library around your database-handling code that reads the number of shards from a configuration file, and you redeploy all of your application code. You have to modify your top-100-URLs query to get the top 100 URLs from each shard and merge those together for the global top 100 URLs.

As the application gets more and more popular, you keep having to reshard the database into more shards to keep up with the write load. Each time gets more and more painful because there's so much more work to coordinate. And you can't just run one script to do the resharding, as that would be too slow. You have to do all the resharding in parallel and manage many active worker scripts at once. You forget to update the application code with the new number of shards, and it causes many of the increments to be written to the wrong shards. So you have to write a one-off script to manually go through the data and move whatever was misplaced.
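A minimal sketch of the hash-mod shard routing just described (the class and method names are invented here for illustration; in the scenario above the shard count would come from a configuration file):

```java
// Sketch of hash-mod shard routing (illustrative only).
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // Map a key (here, a URL) to a shard. Math.floorMod keeps the result
    // non-negative even when hashCode() returns a negative value.
    public int shardFor(String key) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        System.out.println(router.shardFor("http://example.com/about"));  // 0..3
    }
}
```

Note how the routing logic lives in application code and depends on the configured shard count, which is exactly why forgetting to update that count after resharding sends increments to the wrong shards.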
Now it seems I’m spending all my time just dealing with problems reading and writing the data.” 1.2.4 Corruption issues While working on the queue/worker code, you accidentally deploy a bug to produc- tion that increments the number of pageviews by two, instead of by one, for every URL. You don’t notice until 24 hours later, but by then the damage is done. Your weekly backups don’t help because there’s no way of knowing which data got cor- rupted. After all this work trying to make your system scalable and tolerant of machine failures, your system has no resilience to a human making a mistake. And if there’s one guarantee in software, it’s that bugs inevitably make it to production, no matter how hard you try to prevent it. 1.2.5 What went wrong? As the simple web analytics application evolved, the system continued to get more and more complex: queues, shards, replicas, resharding scripts, and so on. Developing applications on the data requires a lot more than just knowing the database schema. Your code needs to know how to talk to the right shards, and if you make a mistake, there’s nothing preventing you from reading from or writing to the wrong shard. One problem is that your database is not self-aware of its distributed nature, so it can’t help you deal with shards, replication, and distributed queries. All that complexity got pushed to you both in operating the database and developing the application code. But the worst problem is that the system is not engineered for human mistakes. Quite the opposite, actually: the system keeps getting more and more complex, mak- ing it more and more likely that a mistake will be made. Mistakes in software are inevi- table, and if you’re not engineering for it, you might as well be writing scripts that randomly corrupt data. Backups are not enough; the system must be carefully thought out to limit the damage a human mistake can cause. Human-fault tolerance is not optional. It’s essential, especially when Big Data adds so many more complexities to building applications. Licensed to Mark Watson 6 CHAPTER 1 A new paradigm for Big Data 1.2.6 How will Big Data techniques help? The Big Data techniques you’re going to learn will address these scalability and com- plexity issues in a dramatic fashion. First of all, the databases and computation systems you use for Big Data are aware of their distributed nature. So things like sharding and replication are handled for you. You’ll never get into a situation where you acciden- tally query the wrong shard, because that logic is internalized in the database. When it comes to scaling, you’ll just add nodes, and the systems will automatically rebalance onto the new nodes. Another core technique you’ll learn about is making your data immutable. Instead of storing the pageview counts as your core dataset, which you continuously mutate as new pageviews come in, you store the raw pageview information. That raw pageview information is never modified. So when you make a mistake, you might write bad data, but at least you won’t destroy good data. This is a much stronger human-fault tol- erance guarantee than in a traditional system based on mutation. With traditional databases, you’d be wary of using immutable data because of how fast such a dataset would grow. But because Big Data techniques can scale to so much data, you have the ability to design systems in different ways. 1.3 NoSQL is not a panacea The past decade has seen a huge amount of innovation in scalable data systems. 
These include large-scale computation systems like Hadoop and databases such as Cassandra and Riak. These systems can handle very large amounts of data, but with serious trade-offs. Hadoop, for example, can parallelize large-scale batch computations on very large amounts of data, but the computations have high latency. You don’t use Hadoop for anything where you need low-latency results. NoSQL databases like Cassandra achieve their scalability by offering you a much more limited data model than you’re used to with something like SQL. Squeezing your application into these limited data models can be very complex. And because the databases are mutable, they’re not human-fault tolerant. These tools on their own are not a panacea. But when intelligently used in con- junction with one another, you can produce scalable systems for arbitrary data prob- lems with human-fault tolerance and a minimum of complexity. This is the Lambda Architecture you’ll learn throughout the book. 1.4 First principles To figure out how to properly build data systems, you must go back to first principles. At the most fundamental level, what does a data system do? Let’s start with an intuitive definition: A data system answers questions based on infor- mation that was acquired in the past up to the present. So a social network profile answers questions like “What is this person’s name?” and “How many friends does this person have?” A bank account web page answers questions like “What is my current balance?” and “What transactions have occurred on my account recently?” Licensed to Mark Watson Desired properties of a Big Data system 7 Data systems don’t just memorize and regurgitate information. They combine bits and pieces together to produce their answers. A bank account balance, for example, is based on combining the information about all the transactions on the account. Another crucial observation is that not all bits of information are equal. Some infor- mation is derived from other pieces of information. A bank account balance is derived from a transaction history. A friend count is derived from a friend list, and a friend list is derived from all the times a user added and removed friends from their profile. When you keep tracing back where information is derived from, you eventually end up at information that’s not derived from anything. This is the rawest information you have: information you hold to be true simply because it exists. Let’s call this infor- mation data. You may have a different conception of what the word data means. Data is often used interchangeably with the word information. But for the remainder of this book, when we use the word data, we’re referring to that special information from which everything else is derived. If a data system answers questions by looking at past data, then the most general- purpose data system answers questions by looking at the entire dataset. So the most general-purpose definition we can give for a data system is the following: query = function  all data  Anything you could ever imagine doing with data can be expressed as a function that takes in all the data you have as input. Remember this equation, because it’s the crux of everything you’ll learn. We’ll refer to this equation over and over. The Lambda Architecture provides a general-purpose approach to implementing an arbitrary function on an arbitrary dataset and having the function return its results with low latency. 
That doesn’t mean you’ll always use the exact same technologies every time you implement a data system. The specific technologies you use might change depending on your requirements. But the Lambda Architecture defines a con- sistent approach to choosing those technologies and to wiring them together to meet your requirements. Let’s now discuss the properties a data system must exhibit. 1.5 Desired properties of a Big Data system The properties you should strive for in Big Data systems are as much about complexity as they are about scalability. Not only must a Big Data system perform well and be resource- efficient, it must be easy to reason about as well. Let’s go over each property one by one. 1.5.1 Robustness and fault tolerance Building systems that “do the right thing” is difficult in the face of the challenges of distributed systems. Systems need to behave correctly despite machines going down randomly, the complex semantics of consistency in distributed databases, duplicated data, concurrency, and more. These challenges make it difficult even to reason about Licensed to Mark Watson 8 CHAPTER 1 A new paradigm for Big Data what a system is doing. Part of making a Big Data system robust is avoiding these com- plexities so that you can easily reason about the system. As discussed before, it’s imperative for systems to be human-fault tolerant. This is an oft-overlooked property of systems that we’re not going to ignore. In a production sys- tem, it’s inevitable that someone will make a mistake sometime, such as by deploying incorrect code that corrupts values in a database. If you build immutability and recomputation into the core of a Big Data system, the system will be innately resilient to human error by providing a clear and simple mechanism for recovery. This is described in depth in chapters 2 through 7. 1.5.2 Low latency reads and updates The vast majority of applications require reads to be satisfied with very low latency, typi- cally between a few milliseconds to a few hundred milliseconds. On the other hand, the update latency requirements vary a great deal between applications. Some applications require updates to propagate immediately, but in other applications a latency of a few hours is fine. Regardless, you need to be able to achieve low latency updates when you need them in your Big Data systems. More importantly, you need to be able to achieve low latency reads and updates without compromising the robustness of the system. You’ll learn how to achieve low latency updates in the discussion of the speed layer, starting in chapter 12. 1.5.3 Scalability Scalability is the ability to maintain performance in the face of increasing data or load by adding resources to the system. The Lambda Architecture is horizontally scalable across all layers of the system stack: scaling is accomplished by adding more machines. 1.5.4 Generalization A general system can support a wide range of applications. Indeed, this book wouldn’t be very useful if it didn’t generalize to a wide range of applications! Because the Lambda Architecture is based on functions of all data, it generalizes to all applica- tions, whether financial management systems, social media analytics, scientific appli- cations, social networking, or anything else. 1.5.5 Extensibility You don’t want to have to reinvent the wheel each time you add a related feature or make a change to how your system works. Extensible systems allow functionality to be added with a minimal development cost. 
Oftentimes a new feature or a change to an existing feature requires a migration of old data into a new format. Part of making a system extensible is making it easy to do large-scale migrations. Being able to do big migrations quickly and easily is core to the approach you'll learn.

1.5.6 Ad hoc queries

Being able to do ad hoc queries on your data is extremely important. Nearly every large dataset has unanticipated value within it. Being able to mine a dataset arbitrarily gives opportunities for business optimization and new applications. Ultimately, you can't discover interesting things to do with your data unless you can ask arbitrary questions of it. You'll learn how to do ad hoc queries in chapters 6 and 7 when we discuss batch processing.

1.5.7 Minimal maintenance

Maintenance is a tax on developers. Maintenance is the work required to keep a system running smoothly. This includes anticipating when to add machines to scale, keeping processes up and running, and debugging anything that goes wrong in production. An important part of minimizing maintenance is choosing components that have as little implementation complexity as possible. You want to rely on components that have simple mechanisms underlying them. In particular, distributed databases tend to have very complicated internals. The more complex a system, the more likely something will go wrong, and the more you need to understand about the system to debug and tune it.

You combat implementation complexity by relying on simple algorithms and simple components. A trick employed in the Lambda Architecture is to push complexity out of the core components and into pieces of the system whose outputs are discardable after a few hours. The most complex components used, like read/write distributed databases, are in this layer where outputs are eventually discardable. We'll discuss this technique in depth when we discuss the speed layer in chapter 12.

1.5.8 Debuggability

A Big Data system must provide the information necessary to debug the system when things go wrong. The key is to be able to trace, for each value in the system, exactly what caused it to have that value. "Debuggability" is accomplished in the Lambda Architecture through the functional nature of the batch layer and by preferring to use recomputation algorithms when possible.

Achieving all these properties together in one system may seem like a daunting challenge. But by starting from first principles, as the Lambda Architecture does, these properties emerge naturally from the resulting system design. Before diving into the Lambda Architecture, let's take a look at more traditional architectures—characterized by a reliance on incremental computation—and at why they're unable to satisfy many of these properties.

1.6 The problems with fully incremental architectures

At the highest level, traditional architectures look like figure 1.3. What characterizes these architectures is the use of read/write databases and maintaining the state in those databases incrementally as new data is seen. For example, an incremental approach to counting pageviews would be to process a new pageview by adding one to the counter for its URL.

Figure 1.3 Fully incremental architecture (application -> database)
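As a concrete illustration (not from the book), the fully incremental counting approach boils down to mutating a single piece of derived state per URL; the in-memory map below stands in for a row in a read/write database.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of fully incremental pageview counting: the only state kept is the
// current counter per URL, mutated on every event (illustrative only).
public class IncrementalPageviewCounts {
    private final Map<String, Long> countsByUrl = new HashMap<>();

    public void onPageview(String url) {
        countsByUrl.merge(url, 1L, Long::sum);   // read-modify-write of derived state
    }

    public long pageviews(String url) {
        return countsByUrl.getOrDefault(url, 0L);
    }
}
```

The raw pageview is discarded as soon as the counter is bumped, which is part of what makes this style vulnerable to the human mistakes discussed earlier.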
This characterization of architectures is a lot more fundamental than just relational versus non-relational—in fact, the vast majority of both relational and non-relational database deployments are done as fully incremental architectures. This has been true for many decades.

It's worth emphasizing that fully incremental architectures are so widespread that many people don't realize it's possible to avoid their problems with a different architecture. These are great examples of familiar complexity—complexity that's so ingrained, you don't even think to find a way to avoid it.

The problems with fully incremental architectures are significant. We'll begin our exploration of this topic by looking at the general complexities brought on by any fully incremental architecture. Then we'll look at two contrasting solutions for the same problem: one using the best possible fully incremental solution, and one using a Lambda Architecture. You'll see that the fully incremental version is significantly worse in every respect.

1.6.1 Operational complexity

There are many complexities inherent in fully incremental architectures that create difficulties in operating production infrastructure. Here we'll focus on one: the need for read/write databases to perform online compaction, and what you have to do operationally to keep things running smoothly.

In a read/write database, as a disk index is incrementally added to and modified, parts of the index become unused. These unused parts take up space and eventually need to be reclaimed to prevent the disk from filling up. Reclaiming space as soon as it becomes unused is too expensive, so the space is occasionally reclaimed in bulk in a process called compaction.

Compaction is an intensive operation. The server places substantially higher demand on the CPU and disks during compaction, which dramatically lowers the performance of that machine during that time period. Databases such as HBase and Cassandra are well-known for requiring careful configuration and management to avoid problems or server lockups during compaction. The performance loss during compaction is a complexity that can even cause cascading failure—if too many machines compact at the same time, the load they were supporting will have to be handled by other machines in the cluster. This can potentially overload the rest of your cluster, causing total failure. We have seen this failure mode happen many times.

To manage compaction correctly, you have to schedule compactions on each node so that not too many nodes are affected at once. You have to be aware of how long a compaction takes—as well as the variance—to avoid having more nodes undergoing compaction than you intended. You have to make sure you have enough disk capacity on your nodes to last them between compactions. In addition, you have to make sure you have enough capacity on your cluster so that it doesn't become overloaded when resources are lost during compactions.

All of this can be managed by a competent operational staff, but it's our contention that the best way to deal with any sort of complexity is to get rid of that complexity
Dealing with online compaction is a complexity inherent to fully incremental architectures, but in a Lambda Architecture the primary databases don't require any online compaction.

1.6.2 Extreme complexity of achieving eventual consistency
Another complexity of incremental architectures results when trying to make systems highly available. A highly available system allows for queries and updates even in the presence of machine or partial network failure.
It turns out that achieving high availability competes directly with another important property called consistency. A consistent system returns results that take into account all previous writes. A theorem called the CAP theorem has shown that it's impossible to achieve both high availability and consistency in the same system in the presence of network partitions. So a highly available system sometimes returns stale results during a network partition.
The CAP theorem is discussed in depth in chapter 12—here we wish to focus on how the inability to achieve full consistency and high availability at all times affects your ability to construct systems. It turns out that if your business requirements demand high availability over full consistency, there is a huge amount of complexity you have to deal with.
In order for a highly available system to return to consistency once a network partition ends (known as eventual consistency), a lot of help is required from your application. Take, for example, the basic use case of maintaining a count in a database. The obvious way to go about this is to store a number in the database and increment that number whenever an event is received that requires the count to go up. You may be surprised that if you were to take this approach, you'd suffer massive data loss during network partitions.
The reason lies in the way distributed databases achieve high availability: by keeping multiple replicas of all information stored. When you keep many copies of the same information, that information is still available even if a machine goes down or the network gets partitioned, as shown in figure 1.4. During a network partition, a system that chooses to be highly available has clients update whatever replicas are reachable to them. This causes replicas to diverge and receive different sets of updates. Only when the partition goes away can the replicas be merged together into a common value.

Figure 1.4 Using replication to increase availability (two replicas holding the same data, with clients on either side of a network partition)

Suppose you have two replicas with a count of 10 when a network partition begins. Suppose the first replica gets two increments and the second gets one increment. When it comes time to merge these replicas together, with values of 12 and 11, what should the merged value be? Although the correct answer is 13, there's no way to know just by looking at the numbers 12 and 11. They could have diverged at 11 (in which case the answer would be 12), or they could have diverged at 0 (in which case the answer would be 23).
To do highly available counting correctly, it's not enough to just store a count. You need a data structure that's amenable to merging when values diverge, and you need to implement the code that will repair values once partitions end. This is an amazing amount of complexity you have to deal with just to maintain a simple count.
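To make "a data structure that's amenable to merging" concrete, here is a minimal sketch of one such structure: a counter that tracks increments per replica instead of storing a single number. This is a generic illustration (such structures are often called grow-only counters), not the technique this book develops later:

    import java.util.HashMap;
    import java.util.Map;

    // One replica of the counter. Instead of a single number, each replica keeps
    // a count of the increments it has seen itself, keyed by replica ID; the
    // observable value is the sum over all replicas.
    public class MergeableCounter {
        private final String replicaId;
        private final Map<String, Long> perReplica = new HashMap<>();

        public MergeableCounter(String replicaId) {
            this.replicaId = replicaId;
        }

        public void increment() {
            perReplica.merge(replicaId, 1L, Long::sum);
        }

        public long value() {
            long total = 0;
            for (long count : perReplica.values()) {
                total += count;
            }
            return total;
        }

        // Repair after a partition: take the per-replica maximum rather than
        // overwriting. The two replicas from the example above, which both read 10
        // before the partition and then saw two and one local increments, merge
        // back to 13 rather than 12 or 23.
        public void merge(MergeableCounter other) {
            for (Map.Entry<String, Long> entry : other.perReplica.entrySet()) {
                perReplica.merge(entry.getKey(), entry.getValue(), Math::max);
            }
        }
    }

Even this toy version shows the price: the application now has to carry replica IDs, merge logic, and repair logic just to count things.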
In general, handling eventual consistency in incremental, highly available systems is unintuitive and prone to error. This complexity is innate to highly available, fully incremental systems. You'll see later how the Lambda Architecture structures itself in a different way that greatly lessens the burdens of achieving highly available, eventually consistent systems.

1.6.3 Lack of human-fault tolerance
The last problem with fully incremental architectures we wish to point out is their inherent lack of human-fault tolerance. An incremental system is constantly modifying the state it keeps in the database, which means a mistake can also modify the state in the database. Because mistakes are inevitable, the database in a fully incremental architecture is guaranteed to be corrupted.
It's important to note that this is one of the few complexities of fully incremental architectures that can be resolved without a complete rethinking of the architecture. Consider the two architectures shown in figure 1.5: a synchronous architecture, where the application makes updates directly to the database, and an asynchronous architecture, where events go to a queue before updating the database in the background. In both cases, every event is permanently logged to an events datastore. By keeping every event, if a human mistake causes database corruption, you can go back to the events store and reconstruct the proper state for the database. Because the events store is immutable and constantly growing, redundant checks, like permissions, can be put in to make it highly unlikely for a mistake to trample over the events store. This technique is also core to the Lambda Architecture and is discussed in depth in chapters 2 and 3.

Figure 1.5 Adding logging to fully incremental architectures (synchronous and asynchronous variants)

Although fully incremental architectures with logging can overcome the human-fault tolerance deficiencies of fully incremental architectures without logging, the logging does nothing to handle the other complexities that have been discussed. And as you'll see in the next section, any architecture based purely on fully incremental computation, including those with logging, will struggle to solve many problems.

1.6.4 Fully incremental solution vs. Lambda Architecture solution
One of the example queries implemented throughout the book serves as a great contrast between fully incremental and Lambda architectures. There's nothing contrived about this query—in fact, it's based on real-world problems we have faced in our careers multiple times. The query has to do with pageview analytics and is done on two kinds of data coming in:

Pageviews, which contain a user ID, URL, and timestamp.
Equivs, which contain two user IDs. An equiv indicates the two user IDs refer to the same person. For example, you might have an equiv between the email [email protected] and the username sally. If [email protected] also registers for the username sally2, then you would have an equiv between [email protected] and sally2. By transitivity, you know that the usernames sally and sally2 refer to the same person.

The goal of the query is to compute the number of unique visitors to a URL over a range of time. Queries should be up to date with all data and respond with minimal latency (less than 100 milliseconds).
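To make the two kinds of incoming data concrete, here is one possible shape for the records. The field names are illustrative assumptions, not the schema the book defines later:

    // Illustrative record types only; not the schema the book develops later.
    public class AnalyticsRecords {

        public static class Pageview {
            public final String userId;          // whichever identifier the visitor used
            public final String url;
            public final long timestampMillis;

            public Pageview(String userId, String url, long timestampMillis) {
                this.userId = userId;
                this.url = url;
                this.timestampMillis = timestampMillis;
            }
        }

        public static class Equiv {
            public final String userId1;         // asserts that userId1 and userId2
            public final String userId2;         // belong to the same person

            public Equiv(String userId1, String userId2) {
                this.userId1 = userId1;
                this.userId2 = userId2;
            }
        }
    }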
Here's the interface for the query:

    long uniquesOverTime(String url, int startHour, int endHour)

What makes implementing this query tricky are those equivs. If a person visits the same URL in a time range with two user IDs connected via equivs (even transitively), that should only count as one visit. A new equiv coming in can change the results for any query over any time range for any URL.
We'll refrain from showing the details of the solutions at this point, as too many concepts must be covered to understand them: indexing, distributed databases, batch processing, HyperLogLog, and many more. Overwhelming you with all these concepts at this point would be counterproductive. Instead, we'll focus on the characteristics of the solutions and the striking differences between them. The best possible fully incremental solution is shown in detail in chapter 10, and the Lambda Architecture solution is built up in chapters 8, 9, 14, and 15.
The two solutions can be compared on three axes: accuracy, latency, and throughput. The Lambda Architecture solution is significantly better in all respects. Both must make approximations, but the fully incremental version is forced to use an inferior approximation technique with a 3–5x worse error rate. Performing queries is significantly more expensive in the fully incremental version, affecting both latency and throughput. But the most striking difference between the two approaches is the fully incremental version's need to use special hardware to achieve anywhere close to reasonable throughput. Because the fully incremental version must do many random access lookups to resolve queries, it's practically required to use solid state drives to avoid becoming bottlenecked on disk seeks.
That a Lambda Architecture can produce solutions with higher performance in every respect, while also avoiding the complexity that plagues fully incremental architectures, shows that something very fundamental is going on. The key is escaping the shackles of fully incremental computation and embracing different techniques. Let's now see how to do that.

1.7 Lambda Architecture
Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem. There's no single tool that provides a complete solution. Instead, you have to use a variety of tools and techniques to build a complete Big Data system.
The main idea of the Lambda Architecture is to build Big Data systems as a series of layers, as shown in figure 1.6. Each layer satisfies a subset of the properties and builds upon the functionality provided by the layers beneath it. You'll spend the whole book learning how to design, implement, and deploy each layer, but the high-level ideas of how the whole system fits together are fairly easy to understand.

Figure 1.6 Lambda Architecture (speed, serving, and batch layers)

Everything starts from the query = function(all data) equation. Ideally, you could run the functions on the fly to get the results. Unfortunately, even if this were possible, it would take a huge amount of resources to do and would be unreasonably expensive. Imagine having to read a petabyte dataset every time you wanted to answer the query of someone's current location.
The most obvious alternative approach is to precompute the query function. Let's call the precomputed query function the batch view.
Instead of computing the query on the fly, you read the results from the precomputed view. The precomputed view is indexed so that it can be accessed with random reads. This system looks like this:

    batch view = function(all data)
    query = function(batch view)

In this system, you run a function on all the data to get the batch view. Then, when you want to know the value for a query, you run a function on that batch view. The batch view makes it possible to get the values you need from it very quickly, without having to scan everything in it.
Because this discussion is somewhat abstract, let's ground it with an example. Suppose you're building a web analytics application (again), and you want to query the number of pageviews for a URL on any range of days. If you were computing the query as a function of all the data, you'd scan the dataset for pageviews for that URL within that time range, and return the count of those results. The batch view approach instead runs a function on all the pageviews to precompute an index from a key of [url, day] to the count of the number of pageviews for that URL for that day. Then, to resolve the query, you retrieve all values from that view for all days within that time range, and sum up the counts to get the result. This approach is shown in figure 1.7.

Figure 1.7 Architecture of the batch layer
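As a concrete (and deliberately simplified) illustration of the two styles, here is a sketch of the pageview query written both ways. The in-memory list and map stand in for the master dataset and the indexed batch view, and the Pageview type is defined only for this sketch; none of this is the book's actual implementation:

    import java.util.List;
    import java.util.Map;

    public class PageviewQuerySketch {

        // Minimal pageview record for this sketch only.
        public static class Pageview {
            public final String url;
            public final int day;   // day on which the pageview occurred

            public Pageview(String url, int day) {
                this.url = url;
                this.day = day;
            }
        }

        // query = function(all data): scan every pageview ever recorded.
        // Correct, but far too slow to run on every query against a huge dataset.
        public static long countByScanning(List<Pageview> allPageviews,
                                           String url, int startDay, int endDay) {
            long total = 0;
            for (Pageview pv : allPageviews) {
                if (pv.url.equals(url) && pv.day >= startDay && pv.day <= endDay) {
                    total++;
                }
            }
            return total;
        }

        // query = function(batch view): the batch layer has already produced an
        // index from URL to per-day counts, so the query is a few lookups and a sum.
        public static long countFromBatchView(Map<String, Map<Integer, Long>> batchView,
                                              String url, int startDay, int endDay) {
            Map<Integer, Long> perDay = batchView.getOrDefault(url, Map.of());
            long total = 0;
            for (int day = startDay; day <= endDay; day++) {
                total += perDay.getOrDefault(day, 0L);
            }
            return total;
        }
    }

The second method is the shape of the work done at query time; producing the batchView index in the first place is the batch layer's job.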
It should be clear that there's something missing from this approach, as described so far. Creating the batch view is clearly going to be a high-latency operation, because it's running a function on all the data you have. By the time it finishes, a lot of new data will have collected that's not represented in the batch views, and the queries will be out of date by many hours. But let's ignore this issue for the moment, because we'll be able to fix it. Let's pretend that it's okay for queries to be out of date by a few hours and continue exploring this idea of precomputing a batch view by running a function on the complete dataset.

1.7.1 Batch layer
The portion of the Lambda Architecture that implements the batch view = function(all data) equation is called the batch layer. The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset (see figure 1.8). The master dataset can be thought of as a very large list of records.

Figure 1.8 Batch layer (stores the master dataset and computes arbitrary views)

The batch layer needs to be able to do two things: store an immutable, constantly growing master dataset, and compute arbitrary functions on that dataset. This type of processing is best done using batch-processing systems. Hadoop is the canonical example of a batch-processing system, and Hadoop is what we'll use in this book to demonstrate the concepts of the batch layer.
The simplest form of the batch layer can be represented in pseudo-code like this:

    function runBatchLayer():
        while(true):
            recomputeBatchViews()

The batch layer runs in a while(true) loop and continuously recomputes the batch views from scratch. In reality, the batch layer is a little more involved, but we'll come to that later in the book. This is the best way to think about the batch layer at the moment.
The nice thing about the batch layer is that it's so simple to use. Batch computations are written like single-threaded programs, and you get parallelism for free. It's easy to write robust, highly scalable computations on the batch layer. The batch layer scales by adding new machines.
Here's an example of a batch layer computation. Don't worry about understanding this code—the point is to show what an inherently parallel program looks like:

    Api.execute(Api.hfsSeqfile("/tmp/pageview-counts"),
        new Subquery("?url", "?count")
            .predicate(Api.hfsSeqfile("/data/pageviews"),
                "?url", "?user", "?timestamp")
            .predicate(new Count(), "?count"));

This code computes the number of pageviews for every URL given an input dataset of raw pageviews. What's interesting about this code is that all the concurrency challenges of scheduling work and merging results are handled for you. Because the algorithm is written in this way, it can be arbitrarily distributed on a MapReduce cluster, scaling to however many nodes you have available. At the end of the computation, the output directory will contain some number of files with the results. You'll learn how to write programs like this in chapter 7.

1.7.2 Serving layer
The batch layer emits batch views as the result of its functions. The next step is to load the views somewhere so that they can be queried. This is where the serving layer comes in. The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it (see figure 1.9). When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available.

Figure 1.9 Serving layer (random access to batch views; updated by the batch layer)

A serving layer database supports batch updates and random reads. Most notably, it doesn't need to support random writes. This is a very important point, as random writes cause most of the complexity in databases. By not supporting random writes, these databases are extremely simple. That simplicity makes them robust, predictable, easy to configure, and easy to operate. ElephantDB, the serving layer database you'll learn to use in this book, is only a few thousand lines of code.
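The contract just described is small enough to capture in an interface: bulk loads of new view versions plus random reads, and deliberately no random writes. The sketch below is a generic illustration of that contract, not ElephantDB's actual API:

    import java.nio.file.Path;

    // Sketch of what a serving-layer database exposes: batch loads and random
    // reads, with no operation for random writes at all.
    public interface ServingLayerStore<K, V> {

        // Bulk-load a freshly computed batch view (for example, the files written
        // by a batch job) and atomically swap it in as the version queries see.
        void loadNewVersion(Path batchViewFiles);

        // Random read against the currently active version of the view.
        V get(K key);
    }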
1.7.3 Batch and serving layers satisfy almost all properties
The batch and serving layers support arbitrary queries on an arbitrary dataset with the trade-off that queries will be out of date by a few hours. It takes a new piece of data a few hours to propagate through the batch layer into the serving layer where it can be queried. The important thing to notice is that other than low latency updates, the batch and serving layers satisfy every property desired in a Big Data system, as outlined in section 1.5. Let's go through them one by one:

Robustness and fault tolerance—Hadoop handles failover when machines go down. The serving layer uses replication under the hood to ensure availability when servers go down. The batch and serving layers are also human-fault tolerant, because when a mistake is made, you can fix your algorithm or remove the bad data and recompute the views from scratch.
Scalability—Both the batch and serving layers are easily scalable. They're both fully distributed systems, and scaling them is as easy as adding new machines.
Generalization—The architecture described is as general as it gets. You can compute and update arbitrary views of an arbitrary dataset.
Extensibility—Adding a new view is as easy as adding a new function of the master dataset. Because the master dataset can contain arbitrary data, new types of data can be easily added. If you want to tweak a view, you don't have to worry about supporting multiple versions of the view in the application. You can simply recompute the entire view from scratch.
Ad hoc queries—The batch layer supports ad hoc queries innately. All the data is conveniently available in one location.
Minimal maintenance—The main component to maintain in this system is Hadoop. Hadoop requires some administration knowledge, but it's fairly straightforward to operate. As explained before, the serving layer databases are simple because they don't do random writes. Because a serving layer database has so few moving parts, there's a lot less that can go wrong. As a consequence, it's much less likely that anything will go wrong with a serving layer database, so they're easier to maintain.
Debuggability—You'll always have the inputs and outputs of computations run on the batch layer. In a traditional database, an output can replace the original input—such as when incrementing a value. In the batch and serving layers, the input is the master dataset and the output is the views. Likewise, you have the inputs and outputs for all the intermediate steps. Having the inputs and outputs gives you all the information you need to debug when something goes wrong.

The beauty of the batch and serving layers is that they satisfy almost all the properties you want with a simple and easy-to-understand approach. There are no concurrency issues to deal with, and it scales trivially. The only property missing is low latency updates. The final layer, the speed layer, fixes this problem.

1.7.4 Speed layer
The serving layer updates whenever the batch layer finishes precomputing a batch view. This means the only data not yet represented in the batch views is the data that arrived while the precomputation was running.
