Lecture #1.3 - Streaming Data Storage with Apache Kafka
MODERN DATA ARCHITECTURES FOR BIG DATA I
STREAMING DATA STORAGE WITH KAFKA

AGENDA
- What's Kafka?
- Core Concepts
  - Messages and Schemas
  - Topics and Partitions
  - Producers and Consumers
  - Brokers and Clusters

1. WHAT'S KAFKA?

STREAMING DATA STORAGE
Keep the Data Value Chain context in mind all the time.

THE ORIGINS
An event streaming solution with a different approach to real-time:
- Modern distributed system → massively scalable
- True storage system → data is:
  - Replicated multiple times, and in different places if needed
  - Persisted so it can be reused as many times, and by as many different applications, as needed
  - Kept around (data retention) for as long as the business needs
- Raises the level of abstraction → allows computation with less code

Developed at LinkedIn back in 2010* because:
- Low-latency ingestion of large amounts of event data was needed
- Data ingestion tools at the time were not designed for real-time use cases
- Traditional real-time systems were overkill and unable to scale

* Kafka's origin story at LinkedIn

THE CORE OF REAL-TIME
Kafka sits at the core of Big Data architectures in every industry. The main ideas behind the design of this streaming platform are:
- Allow publishing and subscribing to streams of data
- Store streamed data for as long as needed so it can be properly processed

BUILDING A REAL-TIME DATA PIPELINE
Building a Real-time Data Pipeline: Apache Kafka at LinkedIn*

* Talk delivered at Hadoop Summit 2013 North America

ANOTHER APACHE PROJECT
Kafka is OSS supported by the Apache Software Foundation.

CONFLUENT, ENTERPRISE KAFKA
Confluent develops an enterprise version of Kafka and provides a managed Kafka solution on the main cloud providers.

2. CORE CONCEPTS

KAFKA'S CORE CONCEPTS
Consider Kafka a massively scalable message queue. Its core concepts range from the more functional to the more operational:
- Messages & Schemas - the actual streaming data and its format
- Producers & Consumers - add and use streaming data within the pipeline
- Topics & Partitions - organize and make streaming data available
- Brokers & Clusters - the organization of the distributed system's nodes

2.1 MESSAGES & SCHEMAS

MESSAGES
A message or event is the unit of data in Kafka:
- A message doesn't have to have a format → unstructured data
- At the implementation level, it's just an array of bytes
- Even if you send formatted text (e.g. CSV), it's still an array of bytes

SCHEMAS
Metadata with rules describing the structure of a message or event. There are two options for specifying schemas:
- Embedded within the message - typically text data:
  - Data in CSV or JSON format implicitly embeds the schema
  - Easier for humans to read and understand
  - Less efficient for storage and transmission, as it's bigger than binary data
- Decoupled from the message - typically binary data:
  - Apache Avro is a compact serialization format used to specify schemas in Kafka
  - It brings strong data typing, as it explicitly specifies the rules messages must follow
  - Versioning and schema evolution allow projects and their needs to evolve easily over time

* Schemas, Contracts and Compatibility

SCHEMA REGISTRY
A centralized repository for managing and validating schemas; it is what actually enables decoupling messages from their schemas. We won't use it due to lack of time, therefore we'll:
- Write/send text in CSV/JSON format to Kafka (a minimal producer sketch follows below)
- Read/receive text in CSV/JSON format from Kafka

* Schema Registry Overview
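Since we'll be writing plain JSON text, here is a minimal producer sketch of what "a message is just an array of bytes" means in practice. It assumes the kafka-python client, a broker at localhost:9092, and a topic named truck-events; none of these come from the lecture itself.

```python
import json

from kafka import KafkaProducer  # kafka-python client, assumed installed

# Kafka only ever stores arrays of bytes, so we serialize the event
# ourselves before sending it. Broker address and topic name are
# illustrative assumptions.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

event = {"truck_id": 17, "speed_kmh": 92.5}  # schema implicitly embedded in the JSON

# json.dumps(...).encode() makes the point explicit: even "formatted text"
# travels over the wire as nothing more than a byte array.
producer.send("truck-events", value=json.dumps(event).encode("utf-8"))
producer.flush()  # block until the broker has acknowledged the message
```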
2.2 TOPICS & PARTITIONS

TOPICS
Messages in Kafka are organized in topics:
- Messages are written at the end of the topic (appended)
- Messages are read from beginning to end by default (this can be changed)

PARTITIONS
Topics can additionally be organized in multiple partitions:
- Each message in a topic partition has a unique identifier called its offset
- Messages within a topic partition are ordered ascending by time

Topic partitions allow redundancy and scalability within Kafka:
- The same message or event can be published to different partitions (redundancy)
- Each partition can live on a completely different machine (scalability)

2.3 PRODUCERS & CONSUMERS

PRODUCERS
Write or publish new messages or events into Kafka. Similar solutions call them publishers or writers. Messages or events are produced into a specific Kafka topic:
- Producers don't care which partition within the topic is used
- Producers evenly balance messages over all partitions
- The default behaviour can be modified to take full control

CONSUMERS
Read or consume messages or events from Kafka. Similar solutions call them subscribers or readers. Consumers subscribe to one or more topics and:
- Read messages in the order they were produced
- Keep track of the messages already read by remembering their offsets

CONSUMER GROUPS
Consumer Group → many consumers reading a topic together, where each topic partition is read by only one consumer in the group. Consumer groups allow massive scalability and provide reliability by:
- Scaling horizontally to consume topics with a large number of messages
- Rebalancing partition reading across the group when a consumer fails
A consumer-group sketch follows below.
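A sketch of a consumer joining a group, under the same assumptions as before (kafka-python, a local broker, and the hypothetical truck-events topic and fleet-dashboard group):

```python
import json

from kafka import KafkaConsumer  # kafka-python client, assumed installed

# Every instance started with the same group_id joins the same consumer
# group: each topic partition is assigned to exactly one member, and
# partitions are rebalanced across the group if a member fails.
consumer = KafkaConsumer(
    "truck-events",                      # hypothetical topic
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="fleet-dashboard",          # same id => same group
    auto_offset_reset="earliest",        # read from the beginning on first run
)

for record in consumer:
    # partition and offset show how a consumer keeps track of what it has read
    event = json.loads(record.value.decode("utf-8"))
    print(f"partition={record.partition} offset={record.offset} {event}")
```

Run two copies of this script with the same group_id and each receives a disjoint subset of the topic's partitions; kill one and its partitions are rebalanced to the survivor.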
2.4 BROKERS & CLUSTERS

BROKERS
Broker → a single Kafka machine responsible for:
- Receiving messages from producers and organizing them into topics
- Assigning offsets to messages
- Committing messages to storage on disk (message durability)
- Serving messages to consumers upon request

A single broker can cope with large scenarios, as it can handle:
- Thousands of partitions per topic
- Millions of messages per second

CLUSTERS
Cluster → a set of brokers working together. One broker within the cluster acts as the Cluster Controller, which:
- Assigns partitions to brokers
- Monitors for broker failures
- Runs any other administrative operations

With regard to partitions and clusters:
- A partition must be owned by a single broker (the leader)
- A partition may be replicated on multiple brokers (redundancy)
- Consumers and producers must always connect to the leader

2-BROKER KAFKA CLUSTER
This is what a 2-broker Kafka cluster looks like; a sketch of creating a replicated topic on such a cluster follows the wrap-up.

CONGRATS, WE'RE DONE!
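As a closing sketch tied to the 2-broker cluster above, this is one way to create a topic whose partitions are replicated across both brokers, using kafka-python's admin client; the topic name, partition count, and broker address are illustrative assumptions.

```python
from kafka.admin import KafkaAdminClient, NewTopic  # kafka-python, assumed installed

# On a 2-broker cluster we can ask for replication_factor=2: each of the
# three partitions gets a leader on one broker and a replica on the other.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")  # assumed address

admin.create_topics([
    NewTopic(name="truck-events", num_partitions=3, replication_factor=2)
])
admin.close()
```

Kafka rejects a replication factor larger than the number of available brokers, so 2 is the ceiling on this cluster.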