Structured Streaming (Spark Streams)
This document provides an overview of Structured Streaming, a real-time data streaming engine built on Spark SQL. It highlights key concepts like unbounded data streams and append-only tables, and compares it to Spark Streaming. The document also discusses different operation modes and the performance advantages of Structured Streaming.
STRUCTURED STREAMING (AKA SPARK STREAMS)

Structured Streaming

Structured Streaming is a stream-processing engine built on the Spark SQL engine. It lets you process real-time data streams using high-level abstractions such as DataFrames and Datasets. Unlike Spark Streaming, which exposes micro-batches of RDDs directly to the user, Structured Streaming presents a continuous data model: the stream is treated as a table that never stops growing. (By default the engine still executes queries as a series of small micro-batches; a row-at-a-time continuous processing mode has been available as an experimental option since Spark 2.3.)

Key Concepts

Unbounded Data Streams: Data streams are treated as unbounded data sources that flow continuously into the system. Examples: logs generated by web servers, financial transactions, or sensor data from IoT devices.

Streaming as an Append-Only Table: Streams are modeled as continuously growing tables, where each new record is appended as a row when it arrives.

Modes of Operation

Complete Mode: Outputs the entire result table every time it is updated.
Update Mode: Outputs only the rows of the result table that changed since the last trigger.
Append Mode: Outputs only the new rows appended to the result table since the last trigger.

Advantages of Structured Streaming

Unified API: Uses the same DataFrame/Dataset APIs for batch and streaming processing. Example: you can write a query for a batch job and extend it to handle streaming data.
Declarative Query Language: You write high-level, SQL-like queries for stream transformations.
Optimized Performance: Queries are planned by the Catalyst Optimizer, and resources are used more efficiently than in older systems such as Spark Streaming (DStream).

Differences Between Structured Streaming and Spark Streaming

Feature            Structured Streaming                                 Spark Streaming (DStream)
API                DataFrame/Dataset API                                DStream API (RDD-based)
Processing Model   Micro-batch by default; continuous (row-by-row)      Micro-batches
                   mode is experimental
Abstraction        Infinite Table                                       Batch RDDs
Performance        More optimized                                       Less optimized
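
The three output modes listed under Modes of Operation can be illustrated with a toy model in plain Python. This is only a sketch of the idea and does not use Spark: the function names `run_word_count` and `run_filter_append` are hypothetical, each inner list stands in for the rows that arrived since the previous trigger, and a `Counter` plays the role of the result table.

```python
from collections import Counter


def run_word_count(batches):
    """Toy running count over an append-only input table (not Spark code).

    `batches` is a list of lists; each inner list holds the rows (words)
    that arrived since the previous trigger. For every trigger, return
    what Complete mode and Update mode would emit.
    """
    result_table = Counter()  # result table, kept across triggers
    emitted = []
    for new_rows in batches:
        changed = set()
        for word in new_rows:
            result_table[word] += 1
            changed.add(word)
        emitted.append({
            # Complete mode: the whole result table after this trigger
            "complete": dict(result_table),
            # Update mode: only the rows whose value changed in this trigger
            "update": {word: result_table[word] for word in changed},
        })
    return emitted


def run_filter_append(batches, predicate):
    """Toy non-aggregating query: keep the rows that satisfy `predicate`.

    Result rows never change once written, so Append mode emits only the
    rows added to the result table in each trigger.
    """
    return [[row for row in new_rows if predicate(row)] for new_rows in batches]


if __name__ == "__main__":
    triggers = [["a", "b", "a"], ["b", "c"]]
    for i, out in enumerate(run_word_count(triggers), start=1):
        print(f"trigger {i}: complete={out['complete']} update={out['update']}")

    logs = [["error x", "info y"], ["error z"]]
    print(run_filter_append(logs, lambda row: row.startswith("error")))
```

In the second trigger of the word count, Complete mode re-emits the untouched count for "a", while Update mode emits only "b" and "c". In Spark itself the mode is chosen on the stream writer, for example `df.writeStream.outputMode("complete")`.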