Apache Kafka Overview
Summary
This document provides an overview of Apache Kafka, a distributed streaming platform for real-time data processing. It details key features, architecture, and the role of ZooKeeper in managing Kafka clusters. The document also discusses Kafka's use in the big data ecosystem and its advantages.
Full Transcript
Apache Kafka

What is Apache Kafka?
Kafka is a distributed streaming platform designed for real-time data processing. It enables publish-subscribe messaging and is fault-tolerant, scalable, and high-throughput.

Kafka key features:
- High throughput and scalability: Kafka supports processing millions of messages per second with low latency, and scales horizontally across clusters using partitions.
- Fault tolerance: Message replication keeps data reliable through hardware or node failures; the leader-and-follower architecture ensures continuous operation.
- Durability: Kafka writes messages to disk and retains them according to user-defined retention policies.
- Flexible messaging model: Kafka supports both publish-subscribe and point-to-point messaging, suiting a variety of use cases.
- Stream processing: Libraries such as Kafka Streams provide real-time data transformation and aggregation.
- Decoupled architecture: Producers and consumers operate independently, enabling seamless integration across distributed systems.
- Integration-friendly: Kafka Connect supports plugins for databases, file systems, and other external systems.

How Kafka Fits into the Big Data Ecosystem
- Central data hub: Kafka is a backbone for managing and transferring real-time data between the components of a big data stack.
- Key integrations:
  - Apache Hadoop/Spark: Kafka feeds raw data streams into batch or stream processing.
  - Machine learning pipelines: Kafka streams data for real-time model updates and predictions.

Why Kafka Is Preferred
- Designed for real-time streaming: Kafka excels in scenarios where high throughput, fault tolerance, and real-time data processing are essential.
- Integration with the big data ecosystem: Kafka works seamlessly with systems like Apache Spark, Flink, and Hadoop for building advanced data pipelines.
- Adoption by industry leaders: Companies such as LinkedIn, Netflix, Uber, and Spotify use Kafka for scalability, speed, and reliability in mission-critical systems.

Who Uses Kafka?
- LinkedIn: activity data and operational metrics
- Twitter: part of its Storm stream-processing infrastructure
- Spotify, Uber, Tumblr, Goldman Sachs, PayPal, Box, Cisco, Cloudflare, Datadog, Lucidworks, Mailchimp, Netflix, and others
- As of 2018, over 35% of Fortune 500 companies were using Kafka.

Kafka Architecture: Brokers
A Kafka broker is a server that stores and manages incoming and outgoing messages.
Key features:
- Message storage: brokers retain messages for a specified retention period.
- Request handling: brokers serve read and write requests from producers and consumers.
- Scalability: a Kafka cluster consists of multiple brokers to handle large workloads; each broker has a unique ID and coordinates with the others to form a distributed system.

Kafka Architecture: Producers
Producers are client applications responsible for publishing messages to Kafka topics.
Key features:
- Partitioning: producers send messages to specific partitions based on custom logic or an effectively random assignment.
- Acknowledgment: producers can request acknowledgments from brokers to ensure data delivery.
- Compression: producers can compress messages to reduce storage and network usage.

Kafka Architecture: Consumers
Consumers are applications that subscribe to topics and process messages.
Key features:
- Consumer groups: multiple consumers can join a group, and Kafka ensures each partition is assigned to only one consumer in the group.
- Offset management: Kafka tracks the offset of each message, so consumers process data sequentially and can resume from the last read point.
- Pull model: consumers request data from brokers rather than having it pushed to them.

Kafka Architecture: Partitions
Topics in Kafka are divided into partitions for scalability and parallelism.
Key features:
- Parallel processing: multiple partitions enable distributed message handling.
- Ordering: Kafka guarantees order only within a single partition, not across partitions.
- Load distribution: producers distribute messages across partitions for even workload handling.

ZooKeeper's Role
ZooKeeper is a distributed coordination service that Kafka uses for:
- Cluster management: maintaining metadata about brokers, topics, and partitions.
- Leader election: facilitating leader election for partitions and brokers.
- Configuration management: tracking and propagating configuration changes across the cluster.
- Fault tolerance: detecting broker failures and enabling failover mechanisms.