Questions and Answers
Which of the following scenarios would LEAST benefit from a real-time data processing system?
In the context of Microsoft Fabric, which component facilitates the ingestion of data from various sources into a real-time data processing pipeline?
When configuring a 'Group By' transformation in Event Streams, what critical element related to time must be defined to handle continuous data inflow effectively?
Which type of window is characterized by a fixed length and overlaps, providing more data points and quicker decision-making capabilities?
In Spark Structured Streaming, how does it treat incoming data streams so that it can be queried using Spark SQL?
When choosing between using Event Streams and Spark Structured Streaming in Microsoft Fabric, what is a key advantage of Event Streams?
In KQL, what is the primary purpose of the where operator?
Which KQL operator is used to perform aggregated calculations for groups of data?
You need to set up a process that automatically updates a summary table whenever new data is ingested into a raw data table in a KQL database. Which feature should you use?
In a streaming architecture within Fabric, leveraging an Event House, how is data typically visualized in real-time?
Flashcards
Batch Processing
Real-Time Systems
Event Streams
Azure Event Hub and IoT Hub
Aggregation Operation
Union Operation
Edit Mode
Tumbling Windows
Hopping Windows
Spark Structured Streaming
Study Notes
Introduction to Real-Time Systems
- Batch processing and real-time systems are the two big paradigms in data analytics.
- Batch processing involves getting batches of data from source systems, transforming them, and visualizing them, usually on a schedule.
- Examples of batch processing tools include data pipelines and data flows.
- Batch processing is suitable for historical analysis, with some lag between data collection and ingestion.
- For most businesses and use cases, batch processing is sufficient.
- Real-time systems are worth the cost for certain industries and use cases.
- Real-time systems require a different data infrastructure.
- Real-time systems necessitate ingesting data 24/7, different analysis principles, and a different mindset.
Industries and Use Cases for Real-Time Systems
- Cyber-physical systems: Logistics, transportation, IoT, and agriculture are increasingly using real-time systems because sensors provide real-time data.
- Digital applications: Financial applications, fraud detection, and cybersecurity applications log data into real-time systems.
- The KQL community and Azure Data Explorer are significant in cybersecurity and real-time logging applications.
- Big Tech: Companies like Netflix, Uber, and Airbnb use real-time systems for website optimization, such as clickthrough rates.
Real-Time Systems in Microsoft Fabric
- Microsoft Fabric includes batch processing tools like data pipelines and data flows.
- Spark in Fabric allows for structured streaming.
- Fabric-specific tools:
- Event Streams for ingesting data from various sources.
- Event Streams allow you to perform transformations and write to different destinations.
- Event Houses, which hold KQL databases.
- Analysis tools:
- Kusto Query Language (KQL) to analyze data in KQL databases.
- Real-time dashboards for data visualization.
- Alerting, monitoring, and optimizations are crucial but not the focus.
- Spark Structured Streaming, Event Streams, and KQL databases will be explored in detail.
Event Streams Overview
- Event streams allow data ingestion from various sources.
- Event streams allow a variety of transformations.
- Event streams allow routing streams to specific endpoints.
Event Streams - Sources
- Azure real-time platform-as-a-service solutions: Azure Event Hubs and Azure IoT Hub integrate for data streaming into event streams.
- Cloud databases with change data capture enabled: Stream data from Azure SQL databases and Azure Cosmos DB.
- Third-party streaming platforms: Kafka, Google Pub/Sub, and Amazon Kinesis can stream data into event streams.
- Fabric and Azure event triggers: Trigger an event stream from updates to a OneLake folder or from workspace events.
- Custom endpoints: Allow data ingestion directly from a device into an event stream without Azure Event Hubs.
- Sample data is also available for testing.
Event Streams - Operations/Transformations
- Operations are optional on event streams.
- Aggregation: Performs an aggregate on the data over a specified time window with every new event.
- Group by: Similar to aggregation, but requires a defined time window (type and duration).
- Expand: Expands a list of data or an array of values into different data points.
- Filter: Filters streams of data based on a specified rule.
- Join: Joins two streams together on a matching key.
- Union: Combines two streams with the same schema vertically.
- Manage Fields: Reduces columns and changes data types, useful for converting data types for KQL databases.
Event Streams - Destinations
- Custom endpoint: Direct the stream back to a custom endpoint (owned by the user).
- Fabric data stores: Stream into a lakehouse, an event house, or both.
- Stream: Chain streams together; the output of one stream becomes the input of another.
- Activator: Set up alerting on the stream based on certain conditions.
Event Streams - Practical Example in UI
- Event streams operate in two modes: edit mode and live mode.
- Edit mode is for configuring and investigating the event stream.
- Live mode is for ingesting data with ingestion on or off.
- The UI is structured as a directed acyclic graph.
- The source is on the left and the event stream flows to the right.
- Sample Data Source: Example uses the "bicycles" sample data set.
- The bicycle data set includes BikePoint ID, street, neighborhood, latitude, longitude, number of bikes, and number of empty docks.
- The root node displays the name of the event stream.
- Direct Ingestion: Writes the raw stream into a destination (e.g., event house).
- Manage Fields Node: Used to select specific fields and add system time.
Ingestion and Data Selection
- System timestamp captures ingestion time, added as a column.
- Bike Point ID, number of bikes, and empty docks are ingested.
- Latitude and longitude might be obtained from a reference table, reducing data streamed.
Optimizing Streaming Data
- Minimize stream structure changes for efficiency, especially with large data volumes.
- Avoid streaming unnecessary columns to prevent processing and storing excessive data.
- That said, data storage is cheap, so it can be worth keeping even seemingly irrelevant variables for potential future use.
- Bike Point ID, number of bikes, and number of empty docks are obtained directly from the source.
- Data types for these fields don't need to be changed.
Filtering Data Streams
- Filter the data stream to identify empty bike stations (number of bikes = 0).
- Filtering can trigger processes to redistribute bikes or alert users.
- Filtered data is written into a KQL table named "empty racks".
KQL Table Destination
- Data can be streamed into an existing KQL table, which requires the schemas to match.
- Alternatively, a new KQL table can be created during event stream authoring.
- Activate the event stream to start the ingestion process.
- Choose to ingest from the last stopped time or from the current time onward.
- Active status for all stream sections indicates successful streaming from source to destination.
Group By and Aggregations
- Group by transformations in event streams require consideration of continuous data inflow.
- Unlike batch data, event streams necessitate defining time periods (windows) for aggregations.
- Windows are categorized by fixed or variable length, and whether they overlap.
Tumbling Windows
- Fixed window length with non-overlapping windows.
- Example: Calculate average air temperature hourly, resulting in 24 readings per day.
Hopping Windows
- Fixed window length with overlapping windows.
- Windows overlap, providing finer-grained analysis.
- Hop size: distance between the start of consecutive windows.
- Smaller hop size yields more data points and quicker decision-making.
Sliding Windows
- Fixed window length with overlapping windows at every possible interval.
- Requires filtering to generate meaningful output events.
- Example: Emit events when more than 50 cars pass a junction in 1 minute.
- Filter must be a separate node due to numerous possible windows.
Snapshot Windows
- Non-overlapping windows with variable length.
- Used for bursty sensor data, aggregating data within a snapshot.
- Aggregates calculated over data received at a specific timestamp.
Session Windows
- Non-overlapping windows with variable length.
- Groups events based on locality or arrival time.
- Inspired by website analytics, where a session tracks user navigation.
Event Stream Configuration for Aggregations
- Three main components: aggregation, group by, and time window.
- Aggregation: numeric calculation performed (e.g., sum, average, count).
- It is mandatory to define the column on which to apply calculations.
- Group by: optional grouping of data (e.g., by vendor ID).
- Time window: defines the time period for aggregation (Hopping, Tumbling, Sliding).
Hopping Window Configuration
- Requires specifying duration (window size) and hop size.
- Example: 60-second window size with a 30-second hop size.
Tumbling Window Configuration
- Requires only window length, as windows are non-overlapping.
- Offset parameter: adjusts window inclusivity at the beginning and end.
Sliding Window Configuration
- Requires fixed window length.
- Slides the window across every possible start and end point.
- Often paired with a filter to output events based on criteria.
Spark Structured Streaming
- Fabric Spark uses Spark Structured Streaming for ingesting data.
- Transforms data streams into unbounded tables by appending new data as rows.
- These tables are built on top of the Spark SQL API; Fabric Spark is an implementation of open-source Spark.
- The development process often involves coding and transforming in batch scenarios before adapting the code for structured streaming.
- The code for Spark Structured Streaming is very similar to the DataFrame API in the batch world.
- The code starts with spark.readStream instead of spark.read.
Spark Structured Streaming and Event Streams
- Spark Structured Streaming integrates well with Azure Event Hubs.
- readStream and writeStream are essential elements in Spark Structured Streaming.
- Event Streams require no code; they are simple, easy to learn, and easy to manage.
- Event Streams are suitable for streaming from Azure Event Hub, IoT Hub, CDC sources, or Kafka/Kinesis.
- Event Streams are useful when complex transformations are not needed.
- Event Streams are integrated into Fabric and can write to lakehouses, KQL databases, and other event streams.
- Spark Structured Streaming is beneficial if you are already using Spark for data engineering.
- Spark Structured Streaming is a pro-code approach allowing unit testing and more complex stream transformations.
- Spark Structured Streaming is useful when you have existing Spark Structured Streaming code or want to stream directly into a lakehouse.
- The KQL language is a strong point for real-time experiences in Fabric, and it doesn't require Event Streams.
KQL Database and Event House
- The DP-700 exam includes a significant amount of KQL.
- Familiarize yourself with event houses, KQL databases, KQL basics, filtering, joins, window functions, and management commands.
- Monitoring and optimization of KQL queries are important.
Event House UI
- An event house is a collection of KQL databases.
- You can create new databases within an event house.
- You can control event house consumption by setting a minimum consumption level, measured in capacity units.
- Python language extensions are available as plugins.
- The UI provides details on region, query URI, and ingestion URI.
- You can view the size and compressed size of data in the event house.
- Activity levels and database summaries are visible.
KQL Database UI
- At the KQL database level, you can create tables, materialized views, functions, and update policies.
- You can bring in data from other areas in OneLake, like lakehouse tables.
- Data can be ingested from supported sources.
- You can query data in KQL tables using KQL code.
- You can set data retention policies, specifying retention and caching (hot and cold cache); a sketch of these commands follows this list.
- OneLake integration can be turned on or off for tables.
- The left-hand side shows KQL query sets, tables, shortcuts, materialized views, and functions.
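A hedged sketch of what those retention and caching settings look like as management commands; MyTable and the time spans are hypothetical, illustrative values:

    // Keep data for one year before soft deletion (illustrative value)
    .alter-merge table MyTable policy retention softdelete = 365d

    // Keep the most recent 7 days of data in the hot cache
    .alter table MyTable policy caching hot = 7d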
KQL Queries
- KQL query sets can exist outside of a KQL database.
- The UI includes a canvas for writing queries and a run button.
- Tables are usually the starting point for KQL queries.
- Shift+Enter, or selecting the table name and clicking Run, will display all table records.
- The pipe operator feeds the output of one line into the next for transformations.
- count counts the number of rows.
- take samples data and returns a specified number of values.
- top gets the top N events based on a specified column and order.
- project selects specific columns.
- extend adds new calculated columns.
- sort by sorts the data, with descending order as the default; use asc for ascending.
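A minimal sketch combining these operators in one query; the Bikes table and its columns (BikepointID, Neighbourhood, No_Bikes) are hypothetical names chosen for illustration:

    Bikes
    | where No_Bikes > 0                                   // keep only stocked stations
    | project BikepointID, Neighbourhood, No_Bikes         // select specific columns
    | extend Label = strcat("Bikes: ", tostring(No_Bikes)) // add a calculated column
    | sort by No_Bikes desc                                // descending is the default order
    | take 10                                              // return a sample of 10 rows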
Filtering in KQL
- The where operator filters data.
- The equality operator (==) is case-sensitive; use =~ for case-insensitive comparison.
- has searches for whole terms based on how KQL indexes data; contains searches for substrings.
- startswith and endswith are standard string operations.
- != is the not-equal operator.
- Numeric filtering includes equality, less than, greater than, and ranges using between.
- Ranges are written as the start value, two dots, and the end value.
- Datetime filtering includes equality, less than/greater than, and keywords like ago and now.
- Multiple conditions can be combined with and and or.
- The in operator checks for membership in a list.
- where not negates the filter condition.
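A sketch of these filter patterns in a single query; Logs and its columns (EventTime, Level, Message, StatusCode, Source) are hypothetical:

    Logs
    | where EventTime > ago(1h)                           // datetime keyword
    | where Level in ("Error", "Critical")                // membership in a list
    | where Message has "disk" or Message contains "out"  // term vs. substring match
    | where StatusCode between (400 .. 499)               // range: start, two dots, end
    | where not (Source =~ "test")                        // negation; =~ is case-insensitive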
Summarize and Aggregations
- The summarize operator aggregates data.
- Aggregations include counts, mins, maxes, and other standard functions.
- You can group by columns to perform aggregated calculations for each group.
- countif is a conditional count based on a specified condition.
- Calculations like percentages can be added using extend and type casting for division.
- Bins can be created based on time intervals to summarize events within those intervals.
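A sketch of summarize with grouping, countif, time bins, and a percentage via extend; the Trips table and its columns are hypothetical:

    Trips
    | summarize
        TripCount = count(),
        AvgFare = avg(FareAmount),
        ZeroFares = countif(FareAmount <= 0)   // conditional count
        by VendorID, bin(PickupTime, 1h)       // one group per vendor per hour
    | extend ZeroFarePct = todouble(ZeroFares) / TripCount * 100  // cast to avoid integer division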
Management Commands
- Management commands start with a dot (.).
- .show tables lists tables in the current KQL database.
- .create table creates a new table with specified column names and data types.
- .ingest inline adds data directly into a table, useful for data pipeline logging.
- You cannot clear table data if OneLake availability is turned on.
- .drop materialized-view deletes a materialized view.
- Materialized views are created using the .create materialized-view command.
- Materialized views write the output of a summarize statement in KQL to a new table.
KQL Table Update Policies
- KQL table update policies propagate changes from one table to another: a policy is defined on the destination table.
- When new data is ingested into the source table, the policy query runs and writes the result to the destination table (a sketch follows).
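A hedged sketch of such a policy; Raw, Cleaned, and the column names are hypothetical, and the Cleaned table must already exist with a schema matching the query output:

    // Whenever new data lands in Raw, run the query and write the result to Cleaned
    .alter table Cleaned policy update
    @'[{"IsEnabled": true, "Source": "Raw", "Query": "Raw | where isnotempty(BikepointID) | project BikepointID, No_Bikes, IngestedAt", "IsTransactional": false}]'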
KQL Functions
- Functions are created using the .create function command.
- KQL functions work similarly to stored procedures in T-SQL.
- You can pass parameters into a function and return data.
- You can call a function by its name and use it in queries like a table.
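A minimal sketch of a parameterized function and a call to it; EmptyRacks and the hypothetical Bikes table and columns are illustrative names:

    .create function EmptyRacks(minEmpty: int) {
        Bikes
        | where No_Empty_Docks >= minEmpty
        | project BikepointID, No_Empty_Docks
    }

    // Call it like a table in a query
    EmptyRacks(5) | count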
Window Functions in KQL
- There are KQL-specific window functions, such as prev(), next(), and row_cumsum(), which operate over a serialized (ordered) row set.
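A sketch, assuming a hypothetical Readings table with SensorID, ReadingTime, and Value columns; window functions need a serialized row order, which sort by provides:

    Readings
    | sort by SensorID asc, ReadingTime asc   // serializes the row order
    | extend PrevValue = prev(Value)          // value from the previous row
    | extend Delta = Value - PrevValue        // change since the previous reading
    | extend RunningTotal = row_cumsum(Value, SensorID != prev(SensorID))  // restart per sensor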
Streaming Architectures in Fabric
- Architecture 1: CDC and other sources are ingested via Event Streams and written to an Event House, with KQL query sets for analysis. Visualizations are in real-time dashboards or Power BI; data in KQL databases is not replicated into OneLake by default, so this must be enabled.
- Architecture 2: Medallion architecture with bronze, silver, and gold layers. The Event House holds raw tables, update policies propagate changes to other tables, materialized views build aggregate tables, and data in the silver layer is mirrored into OneLake for machine learning.