
Lecture #2.2 - Spark Structured Streaming API.pdf


Full Transcript


MODERN DATA ARCHITECTURES FOR BIG DATA II
APACHE SPARK STRUCTURED STREAMING API

AGENDA
- Spark Streaming
- Structured Streaming
- Key elements
- Structured Streaming - API

TIME TO TURN OSBDET ON!
We'll use the course environment by the end of the lesson.

1. SPARK STREAMING

BUILDING UP ON THE FOUNDATIONS
Spark Streaming is built on top of the core APIs we've already learned.

SPARK STREAMING DESIGN OPTIONS
Spark Streaming's design options are summarized as follows:
- Declarative API → you state what to do with events, not how to do it
- Event & processing time → full timing information on the events
- Micro-batch & continuous execution → until Spark 2.3 only micro-batch was available:
  - Micro-batch = ↑ throughput & ↑ latency
  - Continuous execution = ↓ throughput & ↓ latency

SPARK STREAMING FLAVOURS
Two flavours based on the two main APIs:
- DStreams API, based on RDDs - the first available API, which comes with limitations:
  - Only micro-batching
  - Only processing time; there is no notion of event time (e.g. watermarks)
  - More complex data abstractions (e.g. Java/Python objects)
- Structured Streaming, based on DataFrames - our option, fully featured:
  - Micro-batching and continuous processing
  - Event and processing time to fully control event processing
  - One single structured data abstraction (DataFrame/Dataset) for everything
This course will only consider Structured Streaming.

2. STRUCTURED STREAMING

STRUCTURED STREAMING APPLICATIONS
They rely on the Structured/High-level APIs already studied; the transformations studied, plus some more, are supported.
Steps to create and run a streaming application/job (see Sketch 1 below):
1. Write the code implementing the business logic
2. Connect that code to the:
   - Source → where the events come from (e.g. Kafka, file, socket, ...)
   - Sink/Destination → where the insights are sent to (e.g. Kafka, file, console, ...)
3. Start the application/job on the Spark cluster, which:
   - Evaluates the code continuously as new events arrive
   - Produces insights incrementally as the code evaluates events
   - Runs for some time or indefinitely, depending on the setup

UNBOUNDED/INFINITE TABLES
A stream of data is modeled as an unbounded/infinite table: new data arriving is continuously appended to the Input Table*.
* Our Spark applications take this Input Table and transform it into a Result Table.

SPARK CONTINUOUS APPLICATIONS
Spark is unique among unified computing engines, as it lets you:
- Run streaming applications/jobs
- Run batch applications/jobs
- Combine/join streaming and batch data simultaneously
- Run interactive ad-hoc queries (e.g. SQL via a visualization tool)
Scenarios like the following one are very common:
1. Continuously update a DataFrame with Spark Structured Streaming
2. Have users query that table interactively with Spark SQL
3. Serve a Machine Learning model trained with Spark MLlib
4. Join streaming DataFrames with static DataFrames to enrich events

SPARK CONTINUOUS APPLICATIONS
Visually, Spark fits at the core of company analytics like this*:
* "A Beginner's Guide to Spark Streaming For Data Engineers" - an article based on DStreams (RDDs), but still a recommended read.

2.1 KEY ELEMENTS

KEY ELEMENTS OF STRUCTURED STREAMING
Streaming applications/jobs rely on the following elements:
- Transformations & actions → the business logic to consider
- Input sources → where the events come from
- Output sinks & modes → where and how the results/insights go
- Triggers → when to check for new available data
- Event-time processing → how to deal with delayed data

TRANSFORMATIONS & ACTIONS
Transformations & actions are reused concepts that apply here:
- The same transformations as studied, with some restrictions.
- Only one action, start(), which initiates the application/job.
* Picture taken from the article "6 recommendations for optimizing a Spark job" - a bit advanced to go over yet, but keep it for later.

INPUT SOURCES
Input sources* feed Spark applications/jobs with events (see Sketch 2 below):
- File source → reads files written to a directory as a stream of data
- Kafka source → reads data from Kafka topics
- Socket source** → reads data from a network endpoint
- Rate source** → generates mock rows (timestamp & incremental number)
- Rate Per Micro-Batch source** → generates mock rows per micro-batch
We'll see the Socket & Kafka sources in action during the course.
* Part of the "Structured Streaming Programming Guide", the official guide to master Structured Streaming.
** These input sources are meant for testing/development purposes; they shouldn't be used in any production scenario.

OUTPUT SINKS & MODES
Output sinks* handle the outcomes of Spark applications/jobs (see Sketch 3 below):
- File sink → stored as files in one or different folders
- Kafka sink → stored as events in a Kafka topic
- Foreach sink → custom logic (function) applied to every row
- ForeachBatch sink → custom logic (function) applied to the rows of each micro-batch
- Console sink** → displayed on the standard output at every trigger
- Memory sink** → stored as an in-memory table (e.g. a DataFrame)
The available output modes are not valid for all scenarios:
- Append - only new rows added to the Result Table are sent to the sink
- Complete - the whole Result Table is sent to the sink after every trigger
- Update - only updated rows in the Result Table are sent to the sink
We'll see the Kafka, Foreach & ForeachBatch sinks in action.
* Part of the "Structured Streaming Programming Guide", the official guide to master Structured Streaming.
** These output sinks are meant for testing/development purposes; they shouldn't be used in any production scenario.

TRIGGERS
Triggers* check for new input data & update the Result Table** (see Sketch 4 below):
- Micro-batch → a new micro-batch is generated as soon as the previous one has been processed
- Fixed-interval micro-batches → a micro-batch is generated on a timely basis
- Available-now micro-batch → the logic runs once over the existing data (testing)
- Continuous → low-latency (~1 ms) continuous (event-by-event) processing
Micro-batch is the default mode, with ~100 ms latency.
* Part of the "Structured Streaming Programming Guide", the official guide to master Structured Streaming.
** Triggers sound like the windowing techniques already studied, but they're not: trigger = get new data + reevaluate the Spark logic.

EVENT-TIME PROCESSING
Structured Streaming supports event-time processing (see Sketch 5 below).
Event-time is part of the processed data, specified as:
- Timestamp - time explicitly provided: Fri, 25 Aug 2023 14:12:33 GMT
- Epoch - seconds elapsed since January 1st, 1970
Events can arrive late for many reasons (e.g. network latency).
Watermark → how late an event can arrive and still be considered for processing.

3. STRUCTURED STREAMING - API

EXPLORE THE API IN JUPYTER NOTEBOOK
Jump to OSBDET and explore the Structured Streaming API (the sketches below give a reference for the kind of code to try there).

CONGRATS, WE'RE DONE!
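
SKETCH 1 - MINIMAL STREAMING APPLICATION
A minimal sketch of the three steps above (business logic, source/sink wiring, start), assuming a local Spark session and a socket server on localhost:9999 (e.g. started with "nc -lk 9999"); the host, port and word-count logic are illustrative choices, not prescribed by the slides.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("WordCountStream").getOrCreate()

# Source: every new line arriving on the socket is appended to the Input Table
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Business logic: the same DataFrame transformations we use in batch jobs
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Sink + the single action, start(), which launches the continuous query
query = (counts.writeStream
         .outputMode("complete")   # send the whole Result Table at every trigger
         .format("console")
         .start())

query.awaitTermination()

Calling query.stop() (or stopping the socket server) ends the job; otherwise it keeps evaluating new events indefinitely, as described above.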
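
SKETCH 2 - INPUT SOURCES
Illustrative readStream calls for the sources listed in the INPUT SOURCES slide; the broker address, topic name, directory and schema are placeholders, and "spark" is the session created in Sketch 1.

from pyspark.sql.types import StructType, StructField, StringType

# Rate source (testing): generates mock rows with a timestamp and an incrementing value
rate_df = (spark.readStream
           .format("rate")
           .option("rowsPerSecond", 5)
           .load())

# Kafka source: subscribe to a topic (broker and topic names are examples)
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "events")
            .load())

# File source: picks up new files dropped into a directory; an explicit schema is required
schema = StructType([StructField("user", StringType()),
                     StructField("action", StringType())])
files_df = (spark.readStream
            .format("json")
            .schema(schema)
            .load("/tmp/incoming/"))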
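
SKETCH 3 - OUTPUT SINKS & MODES
A sketch of the ForeachBatch and Kafka sinks combined with output modes; the paths, topic name and checkpoint locations are illustrative, and "counts" is the aggregated DataFrame from Sketch 1.

# ForeachBatch sink: custom logic applied to the rows of each micro-batch
def handle_batch(batch_df, batch_id):
    batch_df.write.mode("append").parquet("/tmp/stream-output")

update_query = (counts.writeStream
                .outputMode("update")   # only rows changed since the last trigger
                .foreachBatch(handle_batch)
                .option("checkpointLocation", "/tmp/checkpoints/batches")
                .start())

# Kafka sink: expects a string or binary column named "value"
kafka_query = (counts.selectExpr("to_json(struct(*)) AS value")
               .writeStream
               .outputMode("update")
               .format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")
               .option("topic", "word-counts")
               .option("checkpointLocation", "/tmp/checkpoints/kafka")
               .start())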
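
SKETCH 4 - TRIGGERS
Trigger variants attached to the same writeStream; the intervals are examples, and the availableNow/continuous triggers depend on the Spark version available in OSBDET.

# Default trigger: a new micro-batch starts as soon as the previous one finishes
default_query = (counts.writeStream
                 .outputMode("complete")
                 .format("console")
                 .start())

# Fixed-interval micro-batches: one micro-batch every 10 seconds
interval_query = (counts.writeStream
                  .outputMode("complete")
                  .format("console")
                  .trigger(processingTime="10 seconds")
                  .start())

# Other options (one trigger per query):
#   .trigger(availableNow=True)       # process all currently available data, then stop (testing)
#   .trigger(continuous="1 second")   # continuous processing; map-like queries only, not aggregations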
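
SKETCH 5 - EVENT TIME & WATERMARKS
A sketch of event-time windowing with a watermark, reusing the rate source from Sketch 2; the column alias and the window/watermark intervals are illustrative.

from pyspark.sql.functions import window, col

# The rate source already carries an event timestamp per mock row
events = rate_df.select(col("timestamp").alias("event_time"), col("value"))

# Tolerate events arriving up to 10 minutes late, then count per 5-minute window
windowed_counts = (events
                   .withWatermark("event_time", "10 minutes")
                   .groupBy(window("event_time", "5 minutes"))
                   .count())

watermark_query = (windowed_counts.writeStream
                   .outputMode("append")   # append mode on aggregations requires a watermark
                   .format("console")
                   .start())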
