Mastering Kafka Streams and ksqlDB Building Real-Time Data Systems by Example Compliments of Mitch Seymour ksqlDB + Confluent Build and launch stream processing applications faster with fully managed ksqlDB from Confluent Simplify your stream processing Enable your business to Empower developers to start architecture in one fully innovate faster and operate in building apps faster on top of managed solution with two real-time by processing data in Kafka with ksqlDB’s simple components: Kafka and ksqlDB motion rather than data at rest SQL syntax USE ANY POPULAR CLOUD PROVIDER Try ksqlDB on Confluent Free for 90 Days T RY F R E E New signups get up to $200 off their first three monthly bills, plus use promo code KSQLDB2021 for an additional $200 credit! Claim your promo code in-product Mastering Kafka Streams and ksqlDB Building Real-Time Data Systems by Example Mitch Seymour Beijing Boston Farnham Sebastopol Tokyo Mastering Kafka Streams and ksqlDB by Mitch Seymour Copyright © 2021 Mitch Seymour. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Acquisitions Editor: Jessica Haberman Indexer: Ellen Troutman-Zaig Development Editor: Jeff Bleiel Interior Designer: David Futato Production Editor: Daniel Elfanbaum Cover Designer: Karen Montgomery Copyeditor: Kim Cofer Illustrator: Kate Dullea Proofreader: JM Olejarz February 2021: First Edition Revision History for the First Edition 2021-02-04: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781492062493 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Mastering Kafka Streams and ksqlDB, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Confluent. See our statement of editorial inde‐ pendence. 978-1-098-10982-0 [LSI] Table of Contents Foreword.................................................................... xiii Preface....................................................................... xv Part I. Kafka 1. A Rapid Introduction to Kafka................................................ 1 Communication Model 2 How Are Streams Stored? 6 Topics and Partitions 9 Events 11 Kafka Cluster and Brokers 12 Consumer Groups 13 Installing Kafka 15 Hello, Kafka 16 Summary 19 Part II. Kafka Streams 2. Getting Started with Kafka Streams.......................................... 
23 The Kafka Ecosystem 23 Before Kafka Streams 24 Enter Kafka Streams 25 Features at a Glance 27 Operational Characteristics 28 v Scalability 28 Reliability 29 Maintainability 30 Comparison to Other Systems 30 Deployment Model 30 Processing Model 31 Kappa Architecture 32 Use Cases 33 Processor Topologies 35 Sub-Topologies 37 Depth-First Processing 39 Benefits of Dataflow Programming 41 Tasks and Stream Threads 41 High-Level DSL Versus Low-Level Processor API 44 Introducing Our Tutorial: Hello, Streams 45 Project Setup 46 Creating a New Project 46 Adding the Kafka Streams Dependency 47 DSL 48 Processor API 51 Streams and Tables 53 Stream/Table Duality 57 KStream, KTable, GlobalKTable 57 Summary 58 3. Stateless Processing....................................................... 61 Stateless Versus Stateful Processing 62 Introducing Our Tutorial: Processing a Twitter Stream 63 Project Setup 65 Adding a KStream Source Processor 65 Serialization/Deserialization 69 Building a Custom Serdes 70 Defining Data Classes 71 Implementing a Custom Deserializer 72 Implementing a Custom Serializer 73 Building the Tweet Serdes 74 Filtering Data 75 Branching Data 77 Translating Tweets 79 Merging Streams 81 Enriching Tweets 82 vi | Table of Contents Avro Data Class 83 Sentiment Analysis 85 Serializing Avro Data 87 Registryless Avro Serdes 88 Schema Registry–Aware Avro Serdes 88 Adding a Sink Processor 90 Running the Code 91 Empirical Verification 91 Summary 94 4. Stateful Processing........................................................ 95 Benefits of Stateful Processing 96 Preview of Stateful Operators 97 State Stores 98 Common Characteristics 99 Persistent Versus In-Memory Stores 101 Introducing Our Tutorial: Video Game Leaderboard 102 Project Setup 104 Data Models 104 Adding the Source Processors 106 KStream 106 KTable 107 GlobalKTable 109 Registering Streams and Tables 110 Joins 111 Join Operators 112 Join Types 113 Co-Partitioning 114 Value Joiners 117 KStream to KTable Join (players Join) 119 KStream to GlobalKTable Join (products Join) 120 Grouping Records 121 Grouping Streams 121 Grouping Tables 122 Aggregations 123 Aggregating Streams 123 Aggregating Tables 126 Putting It All Together 127 Interactive Queries 129 Materialized Stores 129 Accessing Read-Only State Stores 131 Table of Contents | vii Querying Nonwindowed Key-Value Stores 131 Local Queries 134 Remote Queries 134 Summary 142 5. Windows and Time....................................................... 143 Introducing Our Tutorial: Patient Monitoring Application 144 Project Setup 146 Data Models 147 Time Semantics 147 Timestamp Extractors 150 Included Timestamp Extractors 150 Custom Timestamp Extractors 152 Registering Streams with a Timestamp Extractor 153 Windowing Streams 154 Window Types 154 Selecting a Window 158 Windowed Aggregation 159 Emitting Window Results 161 Grace Period 163 Suppression 163 Filtering and Rekeying Windowed KTables 166 Windowed Joins 167 Time-Driven Dataflow 168 Alerts Sink 170 Querying Windowed Key-Value Stores 170 Summary 173 6. Advanced State Management.............................................. 
175 Persistent Store Disk Layout 176 Fault Tolerance 177 Changelog Topics 178 Standby Replicas 180 Rebalancing: Enemy of the State (Store) 180 Preventing State Migration 181 Sticky Assignment 182 Static Membership 185 Reducing the Impact of Rebalances 186 Incremental Cooperative Rebalancing 187 Controlling State Size 189 Deduplicating Writes with Record Caches 195 viii | Table of Contents State Store Monitoring 196 Adding State Listeners 196 Adding State Restore Listeners 198 Built-in Metrics 199 Interactive Queries 200 Custom State Stores 201 Summary 202 7. Processor API............................................................ 203 When to Use the Processor API 204 Introducing Our Tutorial: IoT Digital Twin Service 205 Project Setup 208 Data Models 209 Adding Source Processors 211 Adding Stateless Stream Processors 213 Creating Stateless Processors 214 Creating Stateful Processors 217 Periodic Functions with Punctuate 221 Accessing Record Metadata 223 Adding Sink Processors 225 Interactive Queries 225 Putting It All Together 226 Combining the Processor API with the DSL 230 Processors and Transformers 231 Putting It All Together: Refactor 235 Summary 236 Part III. ksqlDB 8. Getting Started with ksqlDB............................................... 239 What Is ksqlDB? 240 When to Use ksqlDB 241 Evolution of a New Kind of Database 243 Kafka Streams Integration 243 Connect Integration 246 How Does ksqlDB Compare to a Traditional SQL Database? 247 Similarities 248 Differences 249 Architecture 251 ksqlDB Server 251 Table of Contents | ix ksqlDB Clients 253 Deployment Modes 255 Interactive Mode 255 Headless Mode 256 Tutorial 257 Installing ksqlDB 257 Running a ksqlDB Server 258 Precreating Topics 259 Using the ksqlDB CLI 259 Summary 262 9. Data Integration with ksqlDB.............................................. 263 Kafka Connect Overview 264 External Versus Embedded Connect 265 External Mode 266 Embedded Mode 267 Configuring Connect Workers 268 Converters and Serialization Formats 270 Tutorial 272 Installing Connectors 272 Creating Connectors with ksqlDB 273 Showing Connectors 275 Describing Connectors 276 Dropping Connectors 277 Verifying the Source Connector 277 Interacting with the Kafka Connect Cluster Directly 278 Introspecting Managed Schemas 279 Summary 279 10. Stream Processing Basics with ksqlDB....................................... 281 Tutorial: Monitoring Changes at Netflix 281 Project Setup 284 Source Topics 284 Data Types 285 Custom Types 287 Collections 288 Creating Source Collections 289 With Clause 291 Working with Streams and Tables 292 Showing Streams and Tables 292 Describing Streams and Tables 294 x | Table of Contents Altering Streams and Tables 295 Dropping Streams and Tables 295 Basic Queries 296 Insert Values 296 Simple Selects (Transient Push Queries) 298 Projection 299 Filtering 300 Flattening/Unnesting Complex Structures 302 Conditional Expressions 302 Coalesce 303 IFNULL 303 Case Statements 303 Writing Results Back to Kafka (Persistent Queries) 304 Creating Derived Collections 304 Putting It All Together 308 Summary 309 11. Intermediate and Advanced Stream Processing with ksqlDB.................... 
311 Project Setup 312 Bootstrapping an Environment from a SQL File 312 Data Enrichment 314 Joins 314 Windowed Joins 319 Aggregations 322 Aggregation Basics 323 Windowed Aggregations 325 Materialized Views 331 Clients 332 Pull Queries 333 Curl 335 Push Queries 336 Push Queries via Curl 336 Functions and Operators 337 Operators 337 Showing Functions 338 Describing Functions 339 Creating Custom Functions 340 Additional Resources for Custom ksqlDB Functions 345 Summary 346 Table of Contents | xi Part IV. The Road to Production 12. Testing, Monitoring, and Deployment....................................... 349 Testing 350 Testing ksqlDB Queries 350 Testing Kafka Streams 352 Behavioral Tests 359 Benchmarking 362 Kafka Cluster Benchmarking 364 Final Thoughts on Testing 366 Monitoring 366 Monitoring Checklist 367 Extracting JMX Metrics 367 Deployment 370 ksqlDB Containers 370 Kafka Streams Containers 372 Container Orchestration 373 Operations 374 Resetting a Kafka Streams Application 374 Rate-Limiting the Output of Your Application 376 Upgrading Kafka Streams 377 Upgrading ksqlDB 378 Summary 379 A. Kafka Streams Configuration................................................ 381 B. ksqlDB Configuration....................................................... 387 Index....................................................................... 391 xii | Table of Contents Foreword Businesses are increasingly built around events—the real-time activity data of what is happening in a company—but what is the right infrastructure for harnessing the power of events? This is a question I have been thinking about since 2009, when I started the Apache Kafka project at LinkedIn. In 2014, I cofounded Confluent to definitively answer it. Beyond providing a way to store and access discrete events, an event streaming platform needs a mechanism to connect with a myriad of external systems. It also requires global schema management, metrics, and monitoring. But perhaps most important of all is stream processing—continuous computation over never-ending streams of data—without which an event streaming platform is simply incomplete. Now more than ever, stream processing plays a key role in how businesses interact with the world. In 2011, Marc Andreessen wrote an article titled “Why Software Is Eating the World.” The core idea is that any process that can be moved into software eventually will be. Marc turned out to be prescient. The most obvious outcome is that software has permeated every industry imaginable. But a lesser understood and more important outcome is that businesses are increas‐ ingly defined in software. Put differently, the core processes a business executes— from how it creates a product, to how it interacts with customers, to how it delivers services—are increasingly specified, monitored, and executed in software. What has changed because of that dynamic? Software, in this new world, is far less likely to be directly interacting with a human. Instead, it is more likely that its purpose is to programmatically trigger actions or react to other pieces of software that carry out business directly. It begs the question: are our traditional application architectures, centered around existing databases, sufficient for this emerging world? Virtually all databases, from the most established relational databases to the newest key-value stores, follow a paradigm in which data is passively stored and the database waits for commands to retrieve or modify it. 
This paradigm was driven by human-facing applications in xiii which a user looks at an interface and initiates actions that are translated into data‐ base queries. We think that is only half the problem, and the problem of storing data needs to be complemented with the ability to react to and process events. Events and stream processing are the keys to succeeding in this new world. Events support the continuous flow of data throughout a business, and stream processing automatically executes code in response to change at any level of detail—doing it in concert with knowledge of all changes that came before it. Modern stream processing systems like Kafka Streams and ksqlDB make it easy to build applications for a world that speaks software. In this book, Mitch Seymour lucidly describes these state-of-the-art systems from first principles. Mastering Kafka Streams and ksqlDB surveys core concepts, details the nuances of how each system works, and provides hands-on examples for using them for business in the real world. Stream processing has never been a more essen‐ tial programming paradigm—and Mastering Kafka Streams and ksqlDB illuminates the path to succeeding at it. — Jay Kreps Cocreator of Apache Kafka, Cofounder and CEO of Confluent xiv | Foreword Preface For data engineers and data scientists, there’s never a shortage of technologies that are competing for our attention. Whether we’re perusing our favorite subreddits, scan‐ ning Hacker News, reading tech blogs, or weaving through hundreds of tables at a tech conference, there are so many things to look at that it can start to feel overwhelming. But if we can find a quiet corner to just think for a minute, and let all of the buzz fade into the background, we can start to distinguish patterns from the noise. You see, we live in the age of explosive data growth, and many of these technologies were created to help us store and process data at scale. We’re told that these are modern solutions for modern problems, and we sit around discussing “big data” as if the idea is avant- garde, when really the focus on data volume is only half the story. Technologies that only solve for the data volume problem tend to have batch-oriented techniques for processing data. This involves running a job on some pile of data that has accumulated for a period of time. In some ways, this is like trying to drink the ocean all at once. With modern computing power and paradigms, some technologies actually manage to achieve this, though usually at the expense of high latency. Instead, there’s another property of modern data that we focus on in this book: data moves over networks in steady and never-ending streams. The technologies we cover in this book, Kafka Streams and ksqlDB, are specifically designed to process these continuous data streams in real time, and provide huge competitive advantages over the ocean-drinking variety. After all, many business problems are time-sensitive, and if you need to enrich, transform, or react to data as soon as it comes in, then Kafka Streams and ksqlDB will help get you there with ease and efficiency. Learning Kafka Streams and ksqlDB is also a great way to familiarize yourself with the larger concepts involved in stream processing. This includes modeling data in dif‐ ferent ways (streams and tables), applying stateless transformations of data, using local state for more advanced operations (joins, aggregations), understanding the dif‐ ferent time semantics and methods for grouping data into time buckets/windows, xv and more. 
In other words, your knowledge of Kafka Streams and ksqlDB will help you distinguish and evaluate different stream processing solutions that currently exist and may come into existence sometime in the future. I’m excited to share these technologies with you because they have both made an impact on my own career and helped me accomplish technological feats that I thought were beyond my own capabilities. In fact, by the time you finish reading this sentence, one of my Kafka Streams applications will have processed nine million events. The feeling you’ll get by providing real business value without having to invest exorbitant amounts of time on the solution will keep you working with these technol‐ ogies for years to come, and the succinct and expressive language constructs make the process feel more like an art form than a labor. And just like any other art form, whether it be a life-changing song or a beautiful painting, it’s human nature to want to share it. So consider this book a mixtape from me to you, with my favorite compi‐ lations from the stream processing space available for your enjoyment: Kafka Streams and ksqlDB, Volume 1. Who Should Read This Book This book is for data engineers who want to learn how to build highly scalable stream processing applications for moving, enriching, and transforming large amounts of data in real time. These skills are often needed to support business intelligence initia‐ tives, analytic pipelines, threat detection, event processing, and more. Data scientists and analysts who want to upgrade their skills by analyzing real-time data streams will also find value in this book, which is an exciting departure from the batch processing space that has typically dominated these fields. Prior experience with Apache Kafka is not required, though some familiarity with the Java programming language will make the Kafka Streams tutorials easier to follow. Navigating This Book This book is organized roughly as follows: Chapter 1 provides an introduction to Kafka and a tutorial for running a single- node Kafka cluster. Chapter 2 provides an introduction to Kafka Streams, starting with a background and architectural review, and ending with a tutorial for running a simple Kafka Streams application. Chapters 3 and 4 discuss the stateless and stateful operators in the Kafka Streams high-level DSL (domain-specific language). Each chapter includes a tutorial that will demonstrate how to use these operators to solve an interesting business problem. xvi | Preface Chapter 5 discusses the role that time plays in our stream processing applica‐ tions, and demonstrates how to use windows to perform more advanced stateful operations, including windowed joins and aggregations. A tutorial inspired by predictive healthcare will demonstrate the key concepts. Chapter 6 describes how stateful processing works under the hood, and provides some operational tips for stateful Kafka Streams applications. Chapter 7 dives into Kafka Streams’ lower-level Processor API, which can be used for scheduling periodic functions, and provides more granular access to application state and record metadata. The tutorial in this chapter is inspired by IoT (Internet of Things) use cases. Chapter 8 provides an introduction to ksqlDB, and discusses the history and architecture of this technology. The tutorial in this chapter will show you how to install and run a ksqlDB server instance, and work with the ksqlDB CLI. Chapter 9 discusses ksqlDB’s data integration features, which are powered by Kafka Connect. 
Chapters 10 and 11 discuss the ksqlDB SQL dialect in detail, demonstrating how to work with different collection types, perform push queries and pull queries, and more. The concepts will be introduced using a tutorial based on a Netflix use case: tracking changes to various shows/films, and making these changes avail‐ able to other applications. Chapter 12 provides the information you need to deploy your Kafka Streams and ksqlDB applications to production. This includes information on monitoring, testing, and containerizing your applications. Source Code The source code for this book can be found on GitHub at https://github.com/mitch- seymour/mastering-kafka-streams-and-ksqldb. Instructions for building and running each tutorial will be included in the repository. Kafka Streams Version At the time of this writing, the latest version of Kafka Streams was version 2.7.0. This is the version we use in this book, though in many cases, the code will also work with older or newer versions of the Kafka Streams library. We will make efforts to update the source code when newer versions introduce breaking changes, and will stage these updates in a dedicated branch (e.g., kafka-streams-2.8). Preface | xvii ksqlDB Version At the time of this writing, the latest version of ksqlDB was version 0.14.0. Compati‐ bility with older and newer versions of ksqlDB is less guaranteed due to the ongoing and rapid development of this technology, and the lack of a major version (e.g., 1.0) at the time of this book’s publication. We will make efforts to update the source code when newer versions introduce breaking changes, and will stage these updates in a dedicated branch (e.g., ksqldb-0.15). However, it is recommended to avoid versions older than 0.14.0 when running the examples in this book. Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width Used for program listings, as well as within paragraphs to refer to program ele‐ ments such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. Constant width italic Shows text that should be replaced with user-supplied values or by values deter‐ mined by context. This element signifies a tip or suggestion. This element signifies a general note. This element indicates a warning or caution. xviii | Preface Using Code Examples Supplemental material (code examples, exercises, etc.) can be found on the book’s GitHub page, https://github.com/mitch-seymour/mastering-kafka-streams-and-ksqldb. If you have a technical question or a problem using the code examples, please email [email protected]. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission. We appreciate, but generally do not require, attribution. 
An attribution usually includes the title, author, publisher, and ISBN. For example: “Mastering Kafka Streams and ksqlDB by Mitch Seymour (O’Reilly). Copyright 2021 Mitch Seymour, 978-1-492-06249-3.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected]. O’Reilly Online Learning For more than 40 years, O’Reilly Media has provided technol‐ ogy and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com. Preface | xix How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/mastering-kafka-streams. Email [email protected] to comment or ask technical questions about this book. For news and information about our books and courses, visit http://oreilly.com. Find us on Facebook: http://facebook.com/oreilly. Follow us on Twitter: http://twitter.com/oreillymedia. Watch us on YouTube: http://www.youtube.com/oreillymedia. Acknowledgments First and foremost, I want to thank my wife, Elyse, and my daughter, Isabelle. Writing a book is a huge time investment, and your patience and support through the entire process helped me immensely. As much as I enjoyed writing this book, I missed you both greatly, and I look forward to having more date nights and daddy-daughter time again. I also want to thank my parents, Angie and Guy, for teaching me the value of hard work and for being a never-ending source of encouragement. Your support has hel‐ ped me overcome many challenges over the years, and I am eternally grateful for you both. This book would not be possible without the following people, who dedicated a lot of their time to reviewing its content and providing great feedback and advice along the way: Matthias J. Sax, Robert Yokota, Nitin Sharma, Rohan Desai, Jeff Bleiel, and Danny Elfanbaum. Thank you all for helping me create this book, it’s just as much yours as it is mine. Many of the tutorials were informed by actual business use cases, and I owe a debt of gratitude to everyone in the community who openly shared their experiences with Kafka Streams and ksqlDB, whether it be through conferences, podcasts, blogs, or xx | Preface even in-person interviews. Your experiences helped shape this book, which puts a special emphasis on practical stream processing applications. Nitin Sharma also provided ideas for the Netflix-inspired ksqlDB tutorials, and Ramesh Sringeri shared his stream processing experiences at Children’s Healthcare of Atlanta, which inspired the predictive healthcare tutorial. Thank you both. Special thanks to Michael Drogalis for being a huge supporter of this book, even when it was just an outline of ideas. Also, thank you for putting me in touch with many of this book’s reviewers, and also Jay Kreps, who graciously wrote the foreword. 
The technical writings of Yeva Byzek and Bill Bejeck have also set a high bar for what this book should be. Thank you both for your contributions in this space. There have been many people in my career that helped get me to this point. Mark Conde and Tom Stanley, thank you for opening the doors to my career as a software engineer. Barry Bowden, for helping me become a better engineer, and for being a great mentor. Erin Fusaro, for knowing exactly what to say whenever I felt over‐ whelmed, and for just being a rock in general. Justin Isasi, for your continuous encouragement, and making sure my efforts don’t go unrecognized. Sean Sawyer, for a suggestion you made several years ago, that I try a new thing called “Kafka Streams,” which has clearly spiraled out of control. Thomas Holmes and Matt Farmer, for shar‐ ing your technical expertise with me on many occasions, and helping me become a better engineer. And to the Data Services team at Mailchimp, thanks for helping me solve some really cool problems, and for inspiring me with your own work. Finally, to my friends and family, who continue to stick by me even when I disappear for months at a time to work on a new project. Thanks for sticking around, this was a long one. Preface | xxi PART I Kafka CHAPTER 1 A Rapid Introduction to Kafka The amount of data in the world is growing exponentially and, according to the World Economic Forum, the number of bytes being stored in the world already far exceeds the number of stars in the observable universe. When you think of this data, you might think of piles of bytes sitting in data ware‐ houses, in relational databases, or on distributed filesystems. Systems like these have trained us to think of data in its resting state. In other words, data is sitting some‐ where, resting, and when you need to process it, you run some query or job against the pile of bytes. This view of the world is the more traditional way of thinking about data. However, while data can certainly pile up in places, more often than not, it’s moving. You see, many systems generate continuous streams of data, including IoT sensors, medical sensors, financial systems, user and customer analytics software, application and server logs, and more. Even data that eventually finds a nice place to rest likely travels across the network at some point before it finds its forever home. If we want to process data in real time, while it moves, we can’t simply wait for it to pile up somewhere and then run a query or job at some interval of our choosing. That approach can handle some business use cases, but many important use cases require us to process, enrich, transform, and respond to data incrementally as it becomes available. Therefore, we need something that has a very different worldview of data: a technology that gives us access to data in its flowing state, and which allows us to work with these continuous and unbounded data streams quickly and effi‐ ciently. This is where Apache Kafka comes in. Apache Kafka (or simply, Kafka) is a streaming platform for ingesting, storing, accessing, and processing streams of data. While the entire platform is very interest‐ ing, this book focuses on what I find to be the most compelling part of Kafka: 1 the stream processing layer. However, to understand Kafka Streams and ksqlDB (both of which operate at this layer, and the latter of which also operates at the stream ingestion layer), it is necessary to have a working knowledge of how Kafka, as a platform, works. 
Therefore, this chapter will introduce you to some important concepts and terminol‐ ogy that you will need for the rest of the book. If you already have a working knowl‐ edge of Kafka, feel free to skip this chapter. Otherwise, keep reading. Some of the questions we will answer in this chapter include: How does Kafka simplify communication between systems? What are the main components in Kafka’s architecture? Which storage abstraction most closely models streams? How does Kafka store data in a fault-tolerant and durable manner? How is high availability and fault tolerance achieved at the data processing layers? We will conclude this chapter with a tutorial showing how to install and run Kafka. But first, let’s start by looking at Kafka’s communication model. Communication Model Perhaps the most common communication pattern between systems is the synchro‐ nous, client-server model. When we talk about systems in this context, we mean applications, microservices, databases, and anything else that reads and writes data over a network. The client-server model is simple at first, and involves direct commu‐ nication between systems, as shown in Figure 1-1. Figure 1-1. Point-to-point communication is simple to maintain and reason about when you have a small number of systems For example, you may have an application that synchronously queries a database for some data, or a collection of microservices that talk to each other directly. 2 | Chapter 1: A Rapid Introduction to Kafka However, when more systems need to communicate, point-to-point communication becomes difficult to scale. The result is a complex web of communication pathways that can be difficult to reason about and maintain. Figure 1-2 shows just how confus‐ ing it can get, even with a relatively small number of systems. Figure 1-2. The result of adding more systems is a complex web of communication chan‐ nels, which is difficult to maintain Some of the drawbacks of the client-server model include: Systems become tightly coupled because their communication depends on knowl‐ edge of each other. This makes maintaining and updating these systems more difficult than it needs to be. Synchronous communication leaves little room for error since there are no deliv‐ ery guarantees if one of the systems goes offline. Systems may use different communication protocols, scaling strategies to deal with increased load, failure-handling strategies, etc. As a result, you may end up with multiple species of systems to maintain (software speciation), which hurts maintainability and defies the common wisdom that we should treat applications like cattle instead of pets. Receiving systems can easily be overwhelmed, since they don’t control the pace at which new requests or data comes in. Without a request buffer, they operate at the whims of the applications that are making requests. There isn’t a strong notion for what is being communicated between these sys‐ tems. The nomenclature of the client-server model has put too much emphasis on requests and responses, and not enough emphasis on the data itself. Data should be the focal point of data-driven systems. Communication is not replayable. This makes it difficult to reconstruct the state of a system. Communication Model | 3 Kafka simplifies communication between systems by acting as a centralized commu‐ nication hub (often likened to a central nervous system), in which systems can send and receive data without knowledge of each other. 
The communication pattern it implements is called the publish-subscribe pattern (or simply, pub/sub), and the result is a drastically simpler communication model, as shown in Figure 1-3. Figure 1-3. Kafka removes the complexity of point-to-point communication by acting as a communication hub between systems If we add more detail to the preceding diagram, we can begin to identify the main components involved in Kafka’s communication model, as shown in Figure 1-4. Figure 1-4. The Kafka communication model, redrawn with more detail to show the main components of the Kafka platform Instead of having multiple systems communicate directly with each other, pro‐ ducers simply publish their data to one or more topics, without caring who comes along to read the data. Topics are named streams (or channels) of related data that are stored in a Kafka cluster. They serve a similar purpose as tables in a database (i.e., to group related 4 | Chapter 1: A Rapid Introduction to Kafka data). However, they do not impose a particular schema, but rather store the raw bytes of data, which makes them very flexible to work with.1 Consumers are processes that read (or subscribe) to data in one or more topics. They do not communicate directly with the producers, but rather listen to data on any stream they happen to be interested in. Consumers can work together as a group (called a consumer group) in order to distribute work across multiple processes. Kafka’s communication model, which puts more emphasis on flowing streams of data that can easily be read from and written to by multiple processes, comes with several advantages, including: Systems become decoupled and easier to maintain because they can produce and consume data without knowledge of other systems. Asynchronous communication comes with stronger delivery guarantees. If a con‐ sumer goes down, it will simply pick up from where it left off when it comes back online again (or, when running with multiple consumers in a consumer group, the work will be redistributed to one of the other members). Systems can standardize on the communication protocol (a high-performance binary TCP protocol is used when talking to Kafka clusters), as well as scaling strategies and fault-tolerance mechanisms (which are driven by consumer groups). This allows us to write software that is broadly consistent, and which fits in our head. Consumers can process data at a rate they can handle. Unprocessed data is stored in Kafka, in a durable and fault-tolerant manner, until the consumer is ready to process it. In other words, if the stream your consumer is reading from suddenly turns into a firehose, the Kafka cluster will act as a buffer, preventing your con‐ sumers from being overwhelmed. A stronger notion of what data is being communicated, in the form of events. An event is a piece of data with a certain structure, which we will discuss in “Events” on page 11. The main point, for now, is that we can focus on the data flowing through our streams, instead of spending so much time disentangling the com‐ munication layer like we would in the client-server model. Systems can rebuild their state anytime by replaying the events in a topic. 1 We talk about the raw byte arrays that are stored in topics, as well as the process of deserializing the bytes into higher-level structures like JSON objects/Avro records, in Chapter 3. 
Communication Model | 5 One important difference between the pub/sub model and the client-server model is that communication is not bidirectional in Kafka’s pub/sub model. In other words, streams flow one way. If a system produces some data to a Kafka topic, and relies on another system to do something with the data (i.e., enrich or transform it), the enriched data will need to be written to another topic and subsequently consumed by the original process. This is simple to coordinate, but it changes the way we think about communication. As long as you remember the communication channels (topics) are stream-like in nature (i.e., flowing unidirectionally, and may have multiple sources and multiple downstream consumers), it’s easy to design systems that simply listen to whatever stream of flowing bytes they are interested in, and produce data to topics (named streams) whenever they want to share data with one or more systems. We will be working a lot with Kafka topics in the following chapters (each Kafka Streams and ksqlDB application we build will read, and usually write to, one or more Kafka top‐ ics), so by the time you reach the end of this book, this will be second nature for you. Now that we’ve seen how Kafka’s communication model simplifies the way systems communicate with each other, and that named streams called topics act as the com‐ munication medium between systems, let’s gain a deeper understanding of how streams come into play in Kafka’s storage layer. How Are Streams Stored? When a team of LinkedIn engineers2 saw the potential in a stream-driven data plat‐ form, they had to answer an important question: how should unbounded and contin‐ uous data streams be modeled at the storage layer? Ultimately, the storage abstraction they identified was already present in many types of data systems, including traditional databases, key-value stores, version control sys‐ tems, and more. The abstraction is the simple, yet powerful commit log (or simply, log). When we talk about logs in this book, we’re not referring to applica‐ tion logs, which emit information about a running process (e.g., HTTP server logs). Instead, we are referring to a specific data structure that is described in the following paragraphs. Logs are append-only data structures that capture an ordered sequence of events. Let’s examine the italicized attributes in more detail, and build some intuition around logs, by creating a simple log from the command line. For example, let’s create a log 2 Jay Kreps, Neha Narkhede, and Jun Rao initially led the development of Kafka. 6 | Chapter 1: A Rapid Introduction to Kafka called user_purchases, and populate it with some dummy data using the following command: # create the logfile touch users.log # generate four dummy records in our log echo "timestamp=1597373669,user_id=1,purchases=1" >> users.log echo "timestamp=1597373669,user_id=2,purchases=1" >> users.log echo "timestamp=1597373669,user_id=3,purchases=1" >> users.log echo "timestamp=1597373669,user_id=4,purchases=1" >> users.log Now if we look at the log we created, it contains four users that have made a single purchase: # print the contents of the log cat users.log # output timestamp=1597373669,user_id=1,purchases=1 timestamp=1597373669,user_id=2,purchases=1 timestamp=1597373669,user_id=3,purchases=1 timestamp=1597373669,user_id=4,purchases=1 The first attribute of logs is that they are written to in an append-only manner. 
This means that if user_id=1 comes along and makes a second purchase, we do not update the first record, since each record is immutable in a log. Instead, we just append the new record to the end: # append a new record to the log echo "timestamp=1597374265,user_id=1,purchases=2" >> users.log # print the contents of the log cat users.log # output timestamp=1597373669,user_id=1,purchases=1 timestamp=1597373669,user_id=2,purchases=1 timestamp=1597373669,user_id=3,purchases=1 timestamp=1597373669,user_id=4,purchases=1 timestamp=1597374265,user_id=1,purchases=2 Once a record is written to the log, it is considered immutable. Therefore, if we need to perform an update (e.g., to change the purchase count for a user), then the original record is left untouched. In order to model the update, we simply append the new value to the end of the log. The log will contain both the old record and the new record, both of which are immutable. How Are Streams Stored? | 7 Any system that wants to examine the purchase counts for each user can simply read each record in the log, in order, and the last record they will see for user_id=1 will contain the updated purchase amount. This brings us to the second attribute of logs: they are ordered. The preceding log happens to be in timestamp order (see the first column), but that’s not what we mean by ordered. In fact, Kafka does store a timestamp for each record in the log, but the records do not have to be in timestamp order. When we say a log is ordered, what we mean is that a record’s position in the log is fixed, and never changes. If we reprint the log again, this time with line numbers, you can see the posi‐ tion in the first column: # print the contents of the log, with line numbers cat -n users.log # output 1 timestamp=1597373669,user_id=1,purchases=1 2 timestamp=1597373669,user_id=2,purchases=1 3 timestamp=1597373669,user_id=3,purchases=1 4 timestamp=1597373669,user_id=4,purchases=1 5 timestamp=1597374265,user_id=1,purchases=2 Now, imagine a scenario where ordering couldn’t be guaranteed. Multiple processes could read the user_id=1 updates in a different order, creating disagreement about the actual purchase count for this user. By ensuring the logs are ordered, the data can be processed deterministically3 by multiple processes.4 Furthermore, while the position of each log entry in the preceding example uses line numbers, Kafka refers to the position of each entry in its distributed log as an offset. Offsets start at 0 and they enable an important behavior: they allow multiple con‐ sumer groups to each read from the same log, and maintain their own positions in the log/stream they are reading from. This is shown in Figure 1-5. Now that we’ve gained some intuition around Kafka’s log-based storage layer by creat‐ ing our own log from the command line, let’s tie these ideas back to the higher-level constructs we identified in Kafka’s communication model. We’ll start by continuing our discussion of topics, and learning about something called partitions. 3 Deterministic means the same inputs will produce the same outputs. 4 This is why traditional databases use logs for replication. Logs are used to capture each write operation on the leader database, and process the same writes, in order, on a replica database in order to deterministically re- create the same dataset on another machine. 8 | Chapter 1: A Rapid Introduction to Kafka Figure 1-5. 
Multiple consumer groups can read from the same log, each maintaining their position based on the offset they have read/processed Topics and Partitions In our discussion of Kafka’s communication model, we learned that Kafka has the concept of named streams called topics. Furthermore, Kafka topics are extremely flexible with what you store in them. For example, you can have homogeneous topics that contain only one type of data, or heterogeneous topics that contain multiple types of data.5 A depiction of these different strategies is shown in Figure 1-6. Figure 1-6. Different strategies exist for storing events in topics; homogeneous topics gen‐ erally contain one event type (e.g., clicks) while heterogeneous topics contain multiple event types (e.g., clicks and page_views) We have also learned that append-only commit logs are used to model streams in Kafka’s storage layer. So, does this mean that each topic correlates with a log file? Not exactly. You see, Kafka is a distributed log, and it’s hard to distribute just one of some‐ thing. So if we want to achieve some level of parallelism with the way we distribute and process logs, we need to create lots of them. This is why Kafka topics are broken into smaller units called partitions. 5 Martin Kleppmann has an interesting article on this topic, which can be found at https://oreil.ly/tDZMm. He talks about the various trade-offs and the reasons why one might choose one strategy over another. Also, Rob‐ ert Yokota’s follow-up article goes into more depth about how to support multiple event types when using Confluent Schema Registry for schema management/evolution. Topics and Partitions | 9 Partitions are individual logs (i.e., the data structures we discussed in the previous section) where data is produced and consumed from. Since the commit log abstrac‐ tion is implemented at the partition level, this is the level at which ordering is guaran‐ teed, with each partition having its own set of offsets. Global ordering is not supported at the topic level, which is why producers often route related records to the same partition.6 Ideally, data will be distributed relatively evenly across all partitions in a topic. But you could also end up with partitions of different sizes. Figure 1-7 shows an example of a topic with three different partitions. Figure 1-7. A Kafka topic configured with three partitions The number of partitions for a given topic is configurable, and having more parti‐ tions in a topic generally translates to more parallelism and throughput, though there are some trade-offs of having too many partitions.7 We’ll talk about this more throughout the book, but the important takeaway is that only one consumer per con‐ sumer group can consume from a partition (individual members across different consumer groups can consume from the same partition, however, as shown in Figure 1-5). Therefore, if you want to spread the processing load across N consumers in a single consumer group, you need N partitions. If you have fewer members in a consumer group than there are partitions on the source topic (i.e., the topic that is being read from), that’s OK: each consumer can process multiple partitions. If you have more members in a consumer group than there are partitions on the source topic, then some consumers will be idle. 
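Because ordering is only guaranteed within a partition, producers that want related records (say, all purchases for one user) to stay in order typically route them to the same partition by hashing the record key. The snippet below is a simplified, hypothetical illustration of that idea, not Kafka's actual partitioner (the Java client's default partitioner hashes the serialized key bytes with a murmur2 hash and takes the result modulo the partition count):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyPartitioningSketch {

  // Toy stand-in for key-based partitioning: the same key always maps to the
  // same partition, so all events for that key land in one ordered log.
  static int partitionForKey(String key, int numPartitions) {
    byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
    int hash = Arrays.hashCode(keyBytes); // a real partitioner uses a hash like murmur2
    return Math.floorMod(hash, numPartitions);
  }

  public static void main(String[] args) {
    int numPartitions = 3;
    for (String userId : new String[] {"1", "2", "3", "1"}) {
      System.out.printf("user_id=%s -> partition %d%n", userId, partitionForKey(userId, numPartitions));
    }
  }
}

Both occurrences of user_id=1 map to the same partition, so a single consumer in the group will see that user's events in the order they were produced.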
6 The partitioning strategy is configurable, but a popular strategy, including the one that is implemented in Kafka Streams and ksqlDB, involves setting the partition based on the record key (which can be extracted from the payload of the record or set explicitly). We’ll discuss this in more detail over the next few chapters. 7 The trade-offs include longer recovery periods after certain failure scenarios, increased resource utilization (file descriptors, memory), and increased end-to-end latency. 10 | Chapter 1: A Rapid Introduction to Kafka With this in mind, we can improve our definition of what a topic is. A topic is a named stream, composed of multiple partitions. And each partition is modeled as a commit log that stores data in a totally ordered and append-only sequence. So what exactly is stored in a topic partition? We’ll explore this in the next section. Events In this book, we spend a lot of time talking about processing data in topics. However, we still haven’t developed a full understanding of what kind of data is stored in a Kafka topic (and, more specifically, in a topic’s partitions). A lot of the existing literature on Kafka, including the official documentation, uses a variety of terms to describe the data in a topic, including messages, records, and events. These terms are often used interchangeably, but the one we have favored in this book (though we still use the other terms occasionally) is event. An event is a timestamped key-value pair that records something that happened. The basic anatomy of each event captured in a topic partition is shown in Figure 1-8. Figure 1-8. Anatomy of an event, which is what is stored in topic partitions Application-level headers contain optional metadata about an event. We don’t work with these very often in this book. Keys are also optional, but play an important role in how data is distributed across partitions. We will see this over the next few chapters, but generally speak‐ ing, they are used to identify related records. Each event is associated with a timestamp. We’ll learn more about timestamps in Chapter 5. The value contains the actual message contents, encoded as a byte array. It’s up to clients to deserialize the raw bytes into a more meaningful structure (e.g., a JSON Events | 11 object or Avro record). We will talk about byte array deserialization in detail in “Serialization/Deserialization” on page 69. Now that we have a good understanding of what data is stored in a topic, let’s get a deeper look at Kafka’s clustered deployment model. This will provide more informa‐ tion about how data is physically stored in Kafka. Kafka Cluster and Brokers Having a centralized communication point means reliability and fault tolerance are extremely important. It also means that the communication backbone needs to be scalable, i.e., able to handle increased amounts of load. This is why Kafka operates as a cluster, and multiple machines, called brokers, are involved in the storage and retrieval of data. Kafka clusters can be quite large, and can even span multiple data centers and geo‐ graphic regions. However, in this book, we will usually work with a single-node Kafka cluster since that is all we need to start working with Kafka Streams and ksqlDB. In production, you’ll likely want at least three brokers, and you will want to set the repli‐ cation of your Kafka topic so that your data is replicated across multiple brokers (we’ll see this later in this chapter’s tutorial). 
This allows us to achieve high availability and to avoid data loss in case one machine goes down. Now, when we talk about data being stored and replicated across brokers, we’re really talking about individual partitions in a topic. For example, a topic may have three partitions that are spread across three brokers, as shown in Figure 1-9. Figure 1-9. Partitions are spread across the available brokers, meaning that a topic can span multiple machines in the Kafka cluster 12 | Chapter 1: A Rapid Introduction to Kafka As you can see, this allows topics to be quite large, since they can grow beyond the capacity of a single machine. To achieve fault tolerance and high availability, you can set a replication factor when configuring the topic. For example, a replication factor of 2 will allow the partition to be stored on two different brokers. This is shown in Figure 1-10. Figure 1-10. Increasing the replication factor to 2 will cause the partitions to be stored on two different brokers Whenever a partition is replicated across multiple brokers, one broker will be desig‐ nated as the leader, which means it will process all read/write requests from produc‐ ers/consumers for the given partition. The other brokers that contain the replicated partitions are called followers, and they simply copy the data from the leader. If the leader fails, then one of the followers will be promoted as the new leader. Furthermore, as the load on your cluster increases over time, you can expand your cluster by adding even more brokers, and triggering a partition reassignment. This will allow you to migrate data from the old machines to a fresh, new machine. Finally, brokers also play an important role with maintaining the membership of con‐ sumer groups. We’ll explore this in the next section. Consumer Groups Kafka is optimized for high throughput and low latency. To take advantage of this on the consumer side, we need to be able to parallelize work across multiple processes. This is accomplished with consumer groups. Consumer Groups | 13 Consumer groups are made up of multiple cooperating consumers, and the member‐ ship of these groups can change over time. For example, new consumers can come online to scale the processing load, and consumers can also go offline either for plan‐ ned maintenance or due to unexpected failure. Therefore, Kafka needs some way of maintaining the membership of each group, and redistributing work when necessary. To facilitate this, every consumer group is assigned to a special broker called the group coordinator, which is responsible for receiving heartbeats from the consumers, and triggering a rebalance of work whenever a consumer is marked as dead. A depic‐ tion of consumers heartbeating back to a group coordinator is shown in Figure 1-11. Figure 1-11. Three consumers in a group, heartbeating back to group coordinator Every active member of the consumer group is eligible to receive a partition assign‐ ment. For example, the work distribution across three healthy consumers may look like the diagram in Figure 1-12. Figure 1-12. Three healthy consumers splitting the read/processing workload of a three- partition Kafka topic 14 | Chapter 1: A Rapid Introduction to Kafka However, if a consumer instance becomes unhealthy and cannot heartbeat back to the cluster, then work will automatically be reassigned to the healthy consumers. For example, in Figure 1-13, the middle consumer has been assigned the partition that was previously being handled by the unhealthy consumer. Figure 1-13. 
Work is redistributed when consumer processes fail

As you can see, consumer groups are extremely important in achieving high availability and fault tolerance at the data processing layer. With this, let's now commence our tutorial by learning how to install Kafka.

Installing Kafka

There are detailed instructions for installing Kafka manually in the official documentation. However, to keep things as simple as possible, most of the tutorials in this book utilize Docker, which allows us to deploy Kafka and our stream processing applications inside a containerized environment. Therefore, we will be installing Kafka using Docker Compose, and we'll be using Docker images that are published by Confluent. There are many Docker images to choose from for running Kafka, but the Confluent images are a convenient choice since Confluent also provides Docker images for some of the other technologies we will use in this book, including ksqlDB and Confluent Schema Registry.

The first step is to download and install Docker from the Docker install page. Next, save the following configuration to a file called docker-compose.yml:

---
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:6.0.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-enterprise-kafka:6.0.0
    hostname: kafka
    container_name: kafka
    depends_on:
      - zookeeper
    ports:
      - "29092:29092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: |
        PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: |
        PLAINTEXT://kafka:9092,PLAINTEXT_HOST://localhost:29092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1

The first container, which we've named zookeeper, will contain the ZooKeeper installation. We haven't talked about ZooKeeper in this introduction since, at the time of this writing, it is being actively removed from Kafka. However, it is a centralized service for storing metadata such as topic configuration. Soon, it will no longer be included in Kafka, but we are including it here since this book was published before ZooKeeper was fully removed.

The second container, called kafka, will contain the Kafka installation. This is where our broker (which comprises our single-node cluster) will run and where we will execute some of Kafka's console scripts for interacting with the cluster.

Finally, run the following command to start a local Kafka cluster:

docker-compose up

With our Kafka cluster running, we are now ready to proceed with our tutorial.

Hello, Kafka

In this simple tutorial, we will demonstrate how to create a Kafka topic, write data to a topic using a producer, and finally, read data from a topic using a consumer.

The first thing we need to do is log in to the container that has Kafka installed. We can do this by running the following command:

docker-compose exec kafka bash
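Before moving on to the console scripts, note that everything we do from this point in the tutorial can also be done programmatically. Purely as a reference sketch (it is not part of the original walkthrough), here is roughly how the users topic created in the next step could be created from Java with Kafka's AdminClient; the bootstrap address assumes the PLAINTEXT_HOST listener advertised in the docker-compose file above, and the partition and replication settings mirror the console example that follows.

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateUsersTopicSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // 29092 is the listener our docker-compose file advertises to the host machine
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:29092");

    try (AdminClient admin = AdminClient.create(props)) {
      // users topic: 4 partitions, replication factor 1 (we're on a single-node cluster)
      NewTopic users = new NewTopic("users", 4, (short) 1);
      admin.createTopics(List.of(users)).all().get(); // blocks until the topic is created
      System.out.println("Created topic " + users.name());
    }
  }
}

With that reference in hand, let's continue with the console-based tutorial.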
Once the topic has been created, you can print a description of the topic, including its configuration, using the following command:

kafka-topics \
  --bootstrap-server localhost:9092 \
  --describe \
  --topic users

# output
Topic: users    PartitionCount: 4    ReplicationFactor: 1    Configs:
    Topic: users    Partition: 0    Leader: 1    Replicas: 1    Isr: 1
    Topic: users    Partition: 1    Leader: 1    Replicas: 1    Isr: 1
    Topic: users    Partition: 2    Leader: 1    Replicas: 1    Isr: 1
    Topic: users    Partition: 3    Leader: 1    Replicas: 1    Isr: 1

The --describe flag allows us to view configuration information for a given topic.

Now, let's produce some data using the built-in kafka-console-producer script:

kafka-console-producer \
  --bootstrap-server localhost:9092 \
  --property key.separator=, \
  --property parse.key=true \
  --topic users

The kafka-console-producer script, which is included with Kafka, can be used to produce data to a topic. However, once we start working with Kafka Streams and ksqlDB, the producer processes will be embedded in the underlying Java library, so we won't need to use this script outside of testing and development purposes. We will be producing a set of key-value pairs to our users topic, and the key.separator property states that our keys and values will be separated using the , character.

The previous command will drop you in an interactive prompt. From here, we can input several key-value pairs to produce to the users topic. When you are finished, press Control-C on your keyboard to exit the prompt:

>1,mitch
>2,elyse
>3,isabelle
>4,sammy
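For comparison, here is a minimal sketch of how the same records could be written with Kafka's Java Producer API, the client that Kafka Streams and ksqlDB ultimately rely on for their own writes. The class name and the localhost:29092 bootstrap address (the listener our Docker Compose file advertises to the host machine) are assumptions for illustration only.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UsersProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:29092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

    // the same key-value pairs we typed into the console producer
    String[][] users = {{"1", "mitch"}, {"2", "elyse"}, {"3", "isabelle"}, {"4", "sammy"}};

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      for (String[] user : users) {
        // records that share a key always land in the same partition of the topic
        producer.send(new ProducerRecord<>("users", user[0], user[1]));
      }
      // block until all buffered records have actually been sent to the broker
      producer.flush();
    }
  }
}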
After producing the data to our topic, we can use the kafka-console-consumer script to read the data. The following command shows how:

kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic users \
  --from-beginning

# output
mitch
elyse
isabelle
sammy

The kafka-console-consumer script is also included in the Kafka distribution. Similar to what we mentioned for the kafka-console-producer script, most of the tutorials in this book will leverage consumer processes that are built into Kafka Streams and ksqlDB, instead of using this standalone console script (which is useful for testing purposes). The --from-beginning flag indicates that we should start consuming from the beginning of the Kafka topic.

By default, the kafka-console-consumer will only print the message value. But as we learned earlier, events actually contain more information, including a key, a timestamp, and headers. Let's pass in some additional properties to the console consumer so that we can see the timestamp and key values as well:9

kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic users \
  --property print.timestamp=true \
  --property print.key=true \
  --property print.value=true \
  --from-beginning

# output
CreateTime:1598226962606    1    mitch
CreateTime:1598226964342    2    elyse
CreateTime:1598226966732    3    isabelle
CreateTime:1598226968731    4    sammy

9 As of version 2.7, you can also use the --property print.headers=true flag to print the message headers.
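The same metadata is available through the Java Consumer API: each ConsumerRecord exposes its key, value, timestamp, and headers. The following is a rough, illustrative sketch only; the group.id users-printer and the single five-second poll are arbitrary choices, and localhost:29092 again assumes the Compose setup above.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class UsersMetadataSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:29092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "users-printer");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    // read the topic from the beginning, like the --from-beginning flag
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("users"));
      // a single generous poll is enough for this tiny example; real consumers poll in a loop
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
      for (ConsumerRecord<String, String> record : records) {
        // each record carries a timestamp, key, value, and (possibly empty) headers
        System.out.printf("%d %s %s %s%n",
            record.timestamp(), record.key(), record.value(), record.headers());
      }
    }
  }
}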
That's it! You have now learned how to perform some very basic interactions with a Kafka cluster. The final step is to tear down our local cluster using the following command:

docker-compose down

Summary

Kafka's communication model makes it easy for multiple systems to communicate, and its fast, durable, and append-only storage layer makes it possible to work with fast-moving streams of data with ease. By using a clustered deployment, Kafka can achieve high availability and fault tolerance at the storage layer by replicating data across multiple machines, called brokers. Furthermore, the cluster's ability to receive heartbeats from consumer processes, and update the membership of consumer groups, allows for high availability, fault tolerance, and workload scalability at the stream processing and consumption layer. All of these features have made Kafka one of the most popular stream processing platforms in existence.

You now have enough background on Kafka to get started with Kafka Streams and ksqlDB. In the next section, we will begin our journey with Kafka Streams by seeing how it fits in the wider Kafka ecosystem, and by learning how we can use this library to work with data at the stream processing layer.

Part II. Kafka Streams

Chapter 2. Getting Started with Kafka Streams

Kafka Streams is a lightweight, yet powerful Java library for enriching, transforming, and processing real-time streams of data. In this chapter, you will be introduced to Kafka Streams at a high level. Think of it as a first date, where you will learn a little about Kafka Streams' background and get an initial glance at its features. By the end of this date, er…I mean chapter, you will understand the following:

• Where Kafka Streams fits in the Kafka ecosystem
• Why Kafka Streams was built in the first place
• What kinds of features and operational characteristics are present in this library
• Who Kafka Streams is appropriate for
• How Kafka Streams compares to other stream processing solutions
• How to create and run a basic Kafka Streams application

So without further ado, let's get our metaphorical date started with a simple question for Kafka Streams: where do you live (…in the Kafka ecosystem)?

The Kafka Ecosystem

Kafka Streams lives among a group of technologies that are collectively referred to as the Kafka ecosystem. In Chapter 1, we learned that at the heart of Apache Kafka is a distributed, append-only log that we can produce messages to and read messages from. Furthermore, the core Kafka code base includes some important APIs for interacting with this log (which is separated into categories of messages called topics). Three APIs in the Kafka ecosystem, which are summarized in Table 2-1, are concerned with the movement of data to and from Kafka.

Table 2-1. APIs for moving data to and from Kafka

  Producer API
    Topic interaction: Writing messages to Kafka topics.
    Examples: Filebeat, rsyslog, custom producers
  Consumer API
    Topic interaction: Reading messages from Kafka topics.
    Examples: Logstash, kafkacat, custom consumers
  Connect API
    Topic interaction: Connecting external data stores, APIs, and filesystems to Kafka topics. Involves both reading from topics (sink connectors) and writing to topics (source connectors).
    Examples: JDBC source connector, Elasticsearch sink connector, custom connectors

However, while moving data through Kafka is certainly important for creating data pipelines, some busine