Apache Kafka Overview
Summary
This document provides an overview of Apache Kafka, a distributed streaming platform for real-time data processing. It details key features, architecture, and the role of ZooKeeper in managing Kafka clusters. The document also discusses Kafka's use in the big data ecosystem and its advantages.
Full Transcript
Apache Kafka

What is Apache Kafka?
Kafka is a distributed streaming platform designed for real-time data processing. It enables publish-subscribe messaging and is fault-tolerant, scalable, and high-throughput.

Kafka key features:
- High throughput and scalability: Kafka supports processing millions of messages per second with low latency, and scales horizontally across clusters using partitions.
- Fault tolerance: Message replication keeps data reliable through hardware or node failures; the leader-and-follower architecture ensures continuous operation.
- Durability: Kafka writes messages to disk and retains them according to user-defined retention policies.
- Flexible messaging model: Kafka supports both publish-subscribe and point-to-point messaging, suiting a variety of use cases.
- Stream processing: Libraries such as Kafka Streams provide real-time data transformation and aggregation.
- Decoupled architecture: Producers and consumers operate independently, enabling seamless integration across distributed systems.
- Integration-friendly: Kafka Connect supports plugins for databases, file systems, and other external systems.

How Kafka Fits into the Big Data Ecosystem
- Central data hub: Kafka is a backbone for managing and transferring real-time data between the components of a big data stack.
- Key integrations:
  - Apache Hadoop/Spark: Kafka feeds raw data streams into batch or stream processing.
  - Machine learning pipelines: Kafka streams data for real-time model updates and predictions.

Why Kafka Is Preferred
- Designed for real-time streaming: Kafka excels in scenarios where high throughput, fault tolerance, and real-time data processing are essential.
- Integration with the big data ecosystem: Kafka works seamlessly with systems like Apache Spark, Flink, and Hadoop for building advanced data pipelines.
- Adoption by industry leaders: Companies such as LinkedIn, Netflix, Uber, and Spotify use Kafka for scalability, speed, and reliability in mission-critical systems.

Who Uses Kafka?
- LinkedIn: activity data and operational metrics
- Twitter: part of its Storm stream-processing infrastructure
- Spotify, Uber, Tumblr, Goldman Sachs, PayPal, Box, Cisco, Cloudflare, Datadog, Lucidworks, Mailchimp, Netflix, and others
- As of 2018, over 35% of Fortune 500 companies were using Kafka.

Kafka Architecture: Brokers
A Kafka broker is a server that stores and manages incoming and outgoing messages.
Key features:
- Message storage: brokers retain messages for a specified retention period.
- Request handling: brokers serve read and write requests from producers and consumers.
- Scalability: a Kafka cluster consists of multiple brokers to handle large workloads; each broker has a unique ID and coordinates with the others to form a distributed system.

Kafka Architecture: Producers
Producers are client applications responsible for publishing messages to Kafka topics.
Key features:
- Partitioning: producers send messages to specific partitions based on custom logic or an effectively random assignment.
- Acknowledgment: producers can request acknowledgments from brokers to ensure data delivery.
- Compression: producers can compress messages to reduce storage and network usage.

Kafka Architecture: Consumers
Consumers are applications that subscribe to topics and process messages.
Key features:
- Consumer groups: multiple consumers can join a group, and Kafka ensures each partition is assigned to only one consumer in the group.
- Offset management: Kafka tracks the offset of each message, so consumers process data sequentially and can resume from the last read point.
- Pull model: consumers request data from brokers rather than having it pushed to them.

Kafka Architecture: Partitions
Topics in Kafka are divided into partitions for scalability and parallelism.
Key features:
- Parallel processing: multiple partitions enable distributed message handling.
- Ordering: Kafka guarantees order only within a single partition, not across partitions.
- Load distribution: producers distribute messages across partitions for even workload handling.

ZooKeeper's Role
ZooKeeper is a distributed coordination service that Kafka uses for:
- Cluster management: maintaining metadata about brokers, topics, and partitions.
- Leader election: facilitating leader election for partitions and brokers.
- Configuration management: tracking and propagating configuration changes across the cluster.
- Fault tolerance: detecting broker failures and enabling failover mechanisms.