Questions and Answers
Which of the following scenarios would LEAST benefit from a real-time data processing system?
In the context of Microsoft Fabric, which component facilitates the ingestion of data from various sources into a real-time data processing pipeline?
When configuring a 'Group By' transformation in Event Streams, what critical element related to time must be defined to handle continuous data inflow effectively?
Which type of window is characterized by a fixed length and overlaps, providing more data points and quicker decision-making capabilities?
In Spark Structured Streaming, how does it treat incoming data streams so that it can be queried using Spark SQL?
When choosing between using Event Streams and Spark Structured Streaming in Microsoft Fabric, what is a key advantage of Event Streams?
In KQL, what is the primary purpose of the where operator?
Which KQL operator is used to perform aggregated calculations for groups of data?
You need to set up a process that automatically updates a summary table whenever new data is ingested into a raw data table in a KQL database. Which feature should you use?
In a streaming architecture within Fabric, leveraging an Event House, how is data typically visualized in real-time?
Flashcards
Batch Processing
Real-Time Systems
Event Streams
Azure Event Hub and IoT Hub
Aggregation Operation
Union Operation
Edit Mode
Tumbling Windows
Hopping Windows
Spark Structured Streaming
Study Notes
Introduction to Real-Time Systems
- Batch processing and real-time systems are the two big paradigms in data analytics.
- Batch processing involves getting batches of data from source systems, transforming them, and visualizing them, usually on a schedule.
- Examples of batch processing tools include data pipelines and data flows.
- Batch processing is suitable for historical analysis, with some lag between data collection and ingestion.
- For most businesses and use cases, batch processing is sufficient.
- Real-time systems are worth the cost for certain industries and use cases.
- Real-time systems require a different data infrastructure.
- Real-time systems necessitate ingesting data 24/7, different analysis principles, and a different mindset.
Industries and Use Cases for Real-Time Systems
- Cyber-physical systems: Logistics, transportation, IoT, and agriculture are increasingly using real-time systems because sensors provide real-time data.
- Digital applications: Financial applications, fraud detection, and cybersecurity applications log data into real-time systems.
- The KQL community and Azure Data Explorer are significant in cybersecurity and real-time logging applications.
- Big Tech: Companies like Netflix, Uber, and Airbnb use real-time systems for website optimization, such as clickthrough rates.
Real-Time Systems in Microsoft Fabric
- Microsoft Fabric includes batch processing tools like data pipelines and data flows.
- Spark in Fabric allows for structured streaming.
- Fabric-specific tools:
- Event Streams for ingesting data from various sources.
- Event Streams allow you to perform transformations and write to different destinations.
- Event Houses, which hold KQL databases.
- Analysis tools:
- Kusto Query Language (KQL) to analyze data in KQL databases.
- Real-time dashboards for data visualization.
- Alerting, monitoring, and optimizations are crucial but not the focus.
- Spark Structured Streaming, Event Streams, and KQL databases will be explored in detail.
Event Streams Overview
- Event streams allow data ingestion from various sources.
- Event streams allow a variety of transformations.
- Event streams allow routing streams to specific endpoints.
Event Streams - Sources
- Azure real-time platform-as-a-service solutions: Azure Event Hubs and Azure IoT Hub integrate for data streaming into event streams.
- Cloud databases with change data capture enabled: Stream data from Azure SQL databases and Azure Cosmos DB.
- Third-party streaming platforms: Kafka, Google Pub/Sub, and Amazon Kinesis can stream data into event streams.
- Fabric and Azure event triggers: Trigger an event stream from updates to a OneLake folder or from workspace events.
- Custom endpoints: Allow data ingestion directly from a device into an event stream without Azure Event Hubs.
- Sample data is also available for testing.
Event Streams - Operations/Transformations
- Operations are optional on event streams.
- Aggregation: Performs an aggregate on the data over a specified time window with every new event.
- Group by: Similar to aggregation, but requires a defined time window (type and duration).
- Expand: Expands a list of data or an array of values into different data points.
- Filter: Filters streams of data based on a specified rule.
- Join: Joins two streams together on a matching key.
- Union: Combines two streams with the same schema vertically.
- Manage Fields: Reduces columns and changes data types, useful for converting data types for KQL databases.
Event Streams - Destinations
- Custom endpoint: Direct the stream back to a custom endpoint (owned by the user).
- Fabric data stores: Stream into a lakehouse, an event house, or both.
- Stream: Chain streams together; the output of one stream becomes the input of another.
- Activator: Set up alerting on the stream based on certain conditions.
Event Streams - Practical Example in UI
- Event streams operate in two modes: edit mode and live mode.
- Edit mode is for configuring and investigating the event stream.
- Live mode is for ingesting data with ingestion on or off.
- The UI is structured as a directed acyclic graph.
- The source is on the left and the event stream flows to the right.
- Sample Data Source: Example uses the "bicycles" sample data set.
- The bicycle data set includes BikePoint ID, street, neighborhood, latitude, longitude, number of bikes, and number of empty docks.
- The root node displays the name of the event stream.
- Direct Ingestion: Writes the raw stream into a destination (e.g., event house).
- Manage Fields Node: Used to select specific fields and add system time.
Ingestion and Data Selection
- System timestamp captures ingestion time, added as a column.
- Bike Point ID, number of bikes, and empty docks are ingested.
- Latitude and longitude might be obtained from a reference table, reducing data streamed.
Optimizing Streaming Data
- Minimize stream structure changes for efficiency, especially with large data volumes.
- Avoid streaming unnecessary columns to prevent processing and storing excessive data.
- That said, data storage is cheap, so it can be worth keeping even seemingly irrelevant variables for potential future use.
- Bike Point ID, number of bikes, and number of empty docks are obtained directly from the source.
- Data types for these fields don't need to be changed.
Filtering Data Streams
- Filter the data stream to identify empty bike stations (number of bikes = 0).
- Filtering can trigger processes to redistribute bikes or alert users.
- Filtered data is written into a KQL table named "empty racks".
KQL Table Destination
- Data can be streamed into an existing KQL table, which requires the schemas to match.
- Alternatively, a new KQL table can be created during event stream authoring.
- Activate the event stream to start the ingestion process.
- Choose to ingest from the last stopped time or from the current time onward.
- Active status for all stream sections indicates successful streaming from source to destination.
Group By and Aggregations
- Group by transformations in event streams require consideration of continuous data inflow.
- Unlike batch data, event streams necessitate defining time periods (windows) for aggregations.
- Windows are categorized by fixed or variable length, and whether they overlap.
Tumbling Windows
- Fixed window length with non-overlapping windows.
- Example: Calculate average air temperature hourly, resulting in 24 readings per day.
Hopping Windows
- Fixed window length with overlapping windows.
- Windows overlap, providing finer-grained analysis.
- Hop size: distance between the start of consecutive windows.
- Smaller hop size yields more data points and quicker decision-making.
Sliding Windows
- Fixed window length with overlapping windows at every possible interval.
- Requires filtering to generate meaningful output events.
- Example: Emit events when more than 50 cars pass a junction in 1 minute.
- Filter must be a separate node due to numerous possible windows.
Snapshot Windows
- Non-overlapping windows with variable length.
- Used for bursty sensor data, aggregating data within a snapshot.
- Aggregates calculated over data received at a specific timestamp.
Session Windows
- Non-overlapping windows with variable length.
- Groups events based on locality or arrival time.
- Inspired by website analytics, where a session tracks user navigation.
Event Stream Configuration for Aggregations
- Three main components: aggregation, group by, and time window.
- Aggregation: numeric calculation performed (e.g., sum, average, count).
- It is mandatory to define the column on which to apply calculations.
- Group by: optional grouping of data (e.g., by vendor ID).
- Time window: defines the time period for aggregation (Hopping, Tumbling, Sliding).
Hopping Window Configuration
- Requires specifying duration (window size) and hop size.
- Example: 60-second window size with a 30-second hop size.
Tumbling Window Configuration
- Requires only window length, as windows are non-overlapping.
- Offset parameter: adjusts window inclusivity at the beginning and end.
Sliding Window Configuration
- Requires fixed window length.
- Slides the window across every possible start and end point.
- Often paired with a filter to output events based on criteria.
Spark Structured Streaming
- Fabric Spark uses Spark Structured Streaming for ingesting data.
- Transforms data streams into unbounded tables by appending new data as rows.
- These tables are built on top of the Spark SQL API; Fabric Spark is an implementation of open-source Spark.
- The development process often involves coding and transforming in batch scenarios before adapting the code for structured streaming.
- The code for Spark Structured Streaming is very similar to the DataFrame API in the batch world.
- The code starts with spark.readStream instead of spark.read.
Spark Structured Streaming and Event Streams
- Spark Structured Streaming integrates well with Azure Event Hubs.
- readStream and writeStream are essential elements in Spark Structured Streaming.
- Event Streams require no code; they are simple, easy to learn, and easy to manage.
- Event Streams are suitable for streaming from Azure Event Hub, IoT Hub, CDC sources, or Kafka/Kinesis.
- Event Streams are useful when complex transformations are not needed.
- Event Streams are integrated into Fabric and can write to lakehouses, KQL databases, and other event streams.
- Spark Structured Streaming is beneficial if you are already using Spark for data engineering.
- Spark Structured Streaming is a pro-code approach allowing unit testing and more complex stream transformations.
- Spark Structured Streaming is useful when you have existing Spark Structured Streaming code or want to stream directly into a lakehouse.
- The KQL language is a strong point for real-time experiences in Fabric, and it doesn't require Event Streams.
KQL Database and Event House
- The DP-700 exam includes a significant amount of KQL.
- Familiarize yourself with event houses, KQL databases, KQL basics, filtering, joins, window functions, and management commands.
- Monitoring and optimization of KQL queries are important.
Event House UI
- An event house is a collection of KQL databases.
- You can create new databases within an event house.
- You can control event house consumption by setting a minimum consumption level, measured in capacity units.
- Python language extensions are available as plugins.
- The UI provides details on region, query URI, and ingestion URI.
- You can view the size and compressed size of data in the event house.
- Activity levels and database summaries are visible.
KQL Database UI
- At the KQL database level, you can create tables, materialized views, functions, and update policies.
- You can bring in data from other areas in OneLake, like lakehouse tables.
- Data can be ingested from supported sources.
- You can query data in KQL tables using KQL code.
- You can set data retention policies, specifying retention and caching (hot and cold cache); a sketch of these commands follows this list.
- OneLake integration can be turned on or off for tables.
- The left-hand side shows KQL query sets, tables, shortcuts, materialized views, and functions.
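A hedged sketch of what those retention and caching settings look like as management commands; MyTable and the time spans are hypothetical, illustrative values:

    // Keep data for one year before soft deletion (illustrative value)
    .alter-merge table MyTable policy retention softdelete = 365d

    // Keep the most recent 7 days of data in the hot cache
    .alter table MyTable policy caching hot = 7d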
KQL Queries
- KQL query sets can exist outside of a KQL database.
- The UI includes a canvas for writing queries and a run button.
- Tables are usually the starting point for KQL queries.
- Shift+Enter, or selecting the table name and clicking Run, will display all table records.
- The pipe operator feeds the output of one line into the next for transformations.
- count counts the number of rows.
- take samples data and returns a specified number of values.
- top gets the top N events based on a specified column and order.
- project selects specific columns.
- extend adds new calculated columns.
- sort by sorts the data, with descending order as the default; use asc for ascending.
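A minimal sketch combining these operators in one query; the Bikes table and its columns (BikepointID, Neighbourhood, No_Bikes) are hypothetical names chosen for illustration:

    Bikes
    | where No_Bikes > 0                                   // keep only stocked stations
    | project BikepointID, Neighbourhood, No_Bikes         // select specific columns
    | extend Label = strcat("Bikes: ", tostring(No_Bikes)) // add a calculated column
    | sort by No_Bikes desc                                // descending is the default order
    | take 10                                              // return a sample of 10 rows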
Filtering in KQL
- The where operator filters data.
- The equality operator (==) is case-sensitive; use =~ for case-insensitive comparison.
- has searches for whole terms based on how KQL indexes data; contains searches for substrings.
- startswith and endswith are standard string operations.
- != is the not-equal operator.
- Numeric filtering includes equality, less than, greater than, and ranges using between.
- Ranges are written as the start value, two dots, and the end value.
- Datetime filtering includes equality, less than/greater than, and keywords like ago and now.
- Multiple conditions can be combined with and and or.
- The in operator checks for membership in a list.
- where not negates the filter condition.
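A sketch of these filter patterns in a single query; Logs and its columns (EventTime, Level, Message, StatusCode, Source) are hypothetical:

    Logs
    | where EventTime > ago(1h)                           // datetime keyword
    | where Level in ("Error", "Critical")                // membership in a list
    | where Message has "disk" or Message contains "out"  // term vs. substring match
    | where StatusCode between (400 .. 499)               // range: start, two dots, end
    | where not (Source =~ "test")                        // negation; =~ is case-insensitive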
Summarize and Aggregations
- The summarize operator aggregates data.
- Aggregations include counts, mins, maxes, and other standard functions.
- You can group by columns to perform aggregated calculations for each group.
- countif is a conditional count based on a specified condition.
- Calculations like percentages can be added using extend and type casting for division.
- Bins can be created based on time intervals to summarize events within those intervals.
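A sketch of summarize with grouping, countif, time bins, and a percentage via extend; the Trips table and its columns are hypothetical:

    Trips
    | summarize
        TripCount = count(),
        AvgFare = avg(FareAmount),
        ZeroFares = countif(FareAmount <= 0)   // conditional count
        by VendorID, bin(PickupTime, 1h)       // one group per vendor per hour
    | extend ZeroFarePct = todouble(ZeroFares) / TripCount * 100  // cast to avoid integer division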
Management Commands
- Management commands start with a dot (.).
- .show tables lists tables in the current KQL database.
- .create table creates a new table with specified column names and data types.
- .ingest inline adds data directly into a table, useful for data pipeline logging.
- You cannot clear table data if OneLake availability is turned on.
- .drop materialized-view deletes a materialized view.
- Materialized views are created using the .create materialized-view command.
- Materialized views write the output of a summarize statement in KQL to a new table.
KQL Table Update Policies
- KQL table update policies propagate changes from one table to another: a policy is defined on the destination table.
- When new data is ingested into the source table, the policy query runs and writes the result to the destination table (a sketch follows).
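A hedged sketch of such a policy; Raw, Cleaned, and the column names are hypothetical, and the Cleaned table must already exist with a schema matching the query output:

    // Whenever new data lands in Raw, run the query and write the result to Cleaned
    .alter table Cleaned policy update
    @'[{"IsEnabled": true, "Source": "Raw", "Query": "Raw | where isnotempty(BikepointID) | project BikepointID, No_Bikes, IngestedAt", "IsTransactional": false}]'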
KQL Functions
- Functions are created using the .create function command.
- KQL functions work similarly to stored procedures in T-SQL.
- You can pass parameters into a function and return data.
- You can call a function by its name and use it in queries like a table.
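A minimal sketch of a parameterized function and a call to it; EmptyRacks and the hypothetical Bikes table and columns are illustrative names:

    .create function EmptyRacks(minEmpty: int) {
        Bikes
        | where No_Empty_Docks >= minEmpty
        | project BikepointID, No_Empty_Docks
    }

    // Call it like a table in a query
    EmptyRacks(5) | count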
Window Functions in KQL
- There are KQL-specific window functions, such as prev(), next(), and row_cumsum(), which operate over a serialized (ordered) row set.
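A sketch, assuming a hypothetical Readings table with SensorID, ReadingTime, and Value columns; window functions need a serialized row order, which sort by provides:

    Readings
    | sort by SensorID asc, ReadingTime asc   // serializes the row order
    | extend PrevValue = prev(Value)          // value from the previous row
    | extend Delta = Value - PrevValue        // change since the previous reading
    | extend RunningTotal = row_cumsum(Value, SensorID != prev(SensorID))  // restart per sensor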
Streaming Architectures in Fabric
- Architecture 1: CDC and other sources are ingested via Event Streams and written to an Event House, with KQL query sets for analysis. Visualizations are in real-time dashboards or Power BI; data in KQL databases is not replicated into OneLake by default, so this must be enabled.
- Architecture 2: Medallion architecture with bronze, silver, and gold layers. The Event House holds raw tables, update policies propagate changes to other tables, materialized views build aggregate tables, and data in the silver layer is mirrored into OneLake for machine learning.