Questions and Answers
Which characteristic of big data is primarily addressed by stream data processing?
- Volume
- Veracity
- Velocity (correct)
- Variety
A data stream can be fully stored locally for processing without any limitations on the item arrival order.
False
In stream processing, what characteristic must a sequence of items possess?
Ordered
In the context of data streams, sub-linear space-time algorithms are used because we can only afford to store a _______ of the data.
Match the streaming algorithm task with the corresponding calculation performed on the data stream:
In the streaming algorithms framework, what should be done with x after retrieving the next element x in stream S?
In the context of mining approximate frequent patterns in stream data, obtaining precise frequent patterns is always feasible.
In the context of the majority problem, will there always be a majority element present in every infinite stream?
In the 'Finding the majority element' algorithm, there are ____ negatives, but there are false _____.
Match each value of s for the s-heavy hitters problem with its correct implication:
Which statement is true regarding the s-heavy hitters problem?
Window-based approaches extract an infinite relation from a finite stream.
In window-based approaches, what is a common attribute used for ordering the windows?
In Lossy Counting, frequencies can be underestimated by at most _____.
Match the following guarantee with the appropriate lossy-counting parameter:
In lossy counting, what action is performed at a window boundary?
In Lossy Counting, false negatives have a true frequency greater than $(s - ε)N$.
In Lossy Counting, what happens if a counter is zero for a specific element during processing?
Sticky Sampling is named because it sweeps over the stream like a _____, attracting all elements which already have an entry in R.
Match each parameter with its correct definition in sticky sampling:
In Lossy Counting, what is the condition to provide guarantees?
In sticky sampling, if there are no changes to the sampling rate, there is no need to scan entries in R.
In the context of Sticky Sampling, which elements are tracked exactly?
Lossy Counting is good at pruning ____ frequency elements quickly.
Match the name of the algorithm with its expected number of entries:
For Lossy Counting, the frequency error ≤ the number of windows × what value?
The Zipf distribution satisfies Zipf's law: the frequency of an item is inversely proportional to its rank.
If you need to implement fraud detection, is it crucial to avoid false positives or false negatives?
In heavy hitters, false positives are problematic if you are billing someone or need to issue _____.
Associate each of the following algorithms with their expected outcome:
Assume the user knows a stream of data has s = 10% and that delta needs to be small. Which value represents the highest probability of failure?
Heavy Hitters can have false negatives but should minimize the number of false positives.
What is the primary advantage of using sub-linear space-time algorithms in data stream processing?
In the streaming algorithms framework, a key step involves _______ summary information for the incoming stream data.
Match each window-based approach with its corresponding description:
Which approach provides more precise frequent patterns in stream data, albeit unrealistically?
The purpose of lossy counting is data compression.
In Sticky Sampling, if unsuccessful coin tosses reduce a given element's count to zero, what happens?
The dynamic window sizes used with Sticky Sampling are equal to __________.
Associate each stream data algorithm with its main target:
Flashcards
Data Stream
A real-time, continuous, ordered sequence of items.
Big Data Characteristics
Data that is characterized by 5Vs: Volume, Velocity, Variety, Veracity, Value
Streaming Algorithms
Processing data as it arrives, using sub-linear space-time algorithms.
Majority Element
The element that occurs strictly more than half of the time in a stream; each stream has at most one.
S-heavy Hitters Objective
Maintain a structure holding all elements whose count exceeds s·t, with no false negatives (false positives are acceptable).
Window-Based Approaches
Mechanisms for extracting a relation from an infinite stream by restricting the processing scope.
Lossy Counting
A window-based algorithm that outputs elements with counters exceeding (s − ε)N; frequencies are underestimated by at most εN.
Sticky Sampling
A probabilistic counting technique that sweeps over the stream like a magnet, attracting all elements which already have an entry in R.
Limit in StreamingHeavyHitter
Study Notes
- This presentation is about stream data processing
Agenda
- Discussion of an introduction to data stream processing
- Look at data stream algorithms
- Discussion of heavy hitters
- A look into window-based approaches
- Including lossy counting and sticky sampling
Big Data Characteristics
- Big data is characterized by the 5Vs: Volume, Velocity, Variety, Veracity, Value
- Most discussions deal with Volume, which is handled by using more than one machine
- The focus is Velocity, as the speed of incoming data is faster than a machine's processing power
- The concept of the data stream is introduced
Data Streams Defined
- A data stream constitutes a real-time, continuous, ordered sequence of items, implicitly ordered by arrival time or explicitly by timestamp
- It is impossible to control the order in which items arrive or locally store a stream in its entirety
- A stream is composed of structured records, as opposed to unstructured data like audio or video
- High data volumes and record arrival rates are expected, with records arriving at a few GB/s
Streaming Algorithms
- With streaming algorithms, there is too much data, and too little space
- Only a summary or sketch of the data can be stored
- Data streams feature a huge amount of data that is never-ending
- Data is run over AT MOST once
- Only data summaries can be stored by using sub-linear space-time algorithms
Data Streams
- A stream of data elements $S = a_1, a_2, \ldots$, where $a_t$ arrives at time $t$
- Represented mathematically as $a[1 \ldots t] = [a_1, a_2, \ldots, a_t]$
- For simplicity, each $a_t$ is a number
- A function can be computed continuously while only using limited space
- At any time $t$, the function can be queried with the value from the stream seen so far, i.e., $a[1 \ldots t]$
Computing the Sum
- Where S is a stream of numbers, the output at each step is the running total, computed with a Python implementation
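A minimal sketch of the running total (the slides' exact Python code is not reproduced here, so this is an illustrative version):

# Running sum: the summary is a single number, updated per element.
def streaming_sum(stream):
    total = 0
    for x in stream:        # each element is seen at most once
        total += x
        yield total         # the running total at time t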
Finding The Minimum
- For an infinite stream of numbers, the minimum of the observed elements at timestamp t can be found
- A min() update maintains the minimum value seen so far
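An illustrative running-minimum sketch, assuming the min() update described above:

# Running minimum: store only the smallest element seen so far.
def streaming_min(stream):
    current_min = float("inf")
    for x in stream:
        current_min = min(current_min, x)   # the min() update
        yield current_min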
Finding the mean
- For an infinite stream of numbers, the mean of the observed elements at timestamp t can be computed
- The slides use an equation and a Python implementation
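A sketch of the running mean, assuming it is maintained from a running sum and a count (illustrative, not the slides' exact code):

# Running mean: store only the running sum and the element count.
def streaming_mean(stream):
    total, count = 0, 0
    for x in stream:
        total += x
        count += 1
        yield total / count   # mean of a[1..t] at time t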
Streaming Algorithms: Framework
- The algorithm works by these steps
- First initialising summary information
- A loop continues while the stream S is not done
- The algorithm takes the next element from the stream, denoted as x
- The algorithm then computes something with x
- It then updates the summary information
- Then outputs something if needed
- Finally the algorithm returns summary information
- With some restrictions, computations of interesting functions are possible, given some tolerance for error
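The steps above can be written as a generic skeleton; the helper names (initialize, update, output) are placeholders, not names from the slides:

# Generic streaming framework: the summary must fit in sub-linear space.
def streaming_framework(stream, initialize, update, output):
    summary = initialize()            # initialise summary information
    for x in stream:                  # while S is not done: take next x
        summary = update(summary, x)  # compute with x, update the summary
        result = output(summary)      # output something if needed
        if result is not None:
            yield result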
Mining Approximate Frequent Patterns
- Obtaining precise frequent patterns is unrealistic
- Approximate answers are often sufficient, e.g., for trend/pattern analysis
- A router may be interested in knowing which flows make up 1% of the entire traffic stream
- An error of 1/10 of s is acceptable, i.e., ε = 0.1%
Finding The Majority Element
- Given an infinite stream of numbers, a problem is finding the element that occurs STRICTLY more than half of the time, the majority element
- Each stream will have at most one such element
- Worked examples show which element (if any) is the majority at particular times
- The algorithm has no false negatives, but can report false positives (see the sketch below)
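A sketch of the classic Boyer-Moore majority-vote algorithm, which matches the behaviour described above (one candidate, O(1) space; illustrative, not the slides' exact code):

# Boyer-Moore majority vote: one candidate and one counter.
def majority_candidate(stream):
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1   # adopt a new candidate
        elif x == candidate:
            count += 1                # another vote for the candidate
        else:
            count -= 1                # a different element cancels one vote
        yield candidate               # may be a false positive

If a true majority element exists it is never lost (no false negatives), but when no majority exists the reported candidate is a false positive.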
S - Heavy Hitters
- Heavy hitters can be extracted using Structured Query Language (SQL) as:
SELECT R.EventType, COUNT(*)
FROM R
GROUP BY R.EventType
HAVING COUNT(*) > s * t
- Where s is a user-specified percentage and t is the length of the stream
S - Heavy Hitters Defined
- For a stream $S = a_1, a_2, \ldots$, define the count of element $e$ at any time $t$ to be $\mathrm{count}(e) = |\{i \le t \mid a_i = e\}|$
- An element $e$ is called an s-heavy hitter if $\mathrm{count}(e) > st$
- The objective is to maintain a structure holding all the s-heavy hitters
- There can be at most $1/s - 1$ such elements
- False positives are acceptable, but no false negatives
- You must not miss any heavy hitters, but you may store non-heavy-hitters
Examples and Applications
- If s = 1/2, then this is the majority element problem
- Worked examples show which elements are heavy hitters at any particular time
- If s = 1/3, then the 1/3-heavy hitters are found
Heavy Hitters Algorithm
- Set k = ceiling( 1/s ) - 1, if s = 1/2, then k = 1
- Keep an array T[1, ... , k] to hold elements
- Keep an array C[1, ... k] to hold their counters
- Initialise C[i] = 0 and T[i] = 0 for all i
- For each a in stream S:
- If a equals T[j] for some j ≤ k, then increment C[j]
- Else, if some counter C[j] equals 0, where j ≤ k, set T[j] to a and C[j] to 1
- Else, decrement all counters C[j] by 1
Heavy Hitters - Python Implementation
- The slides give a Python implementation of heavy hitters over streams; an illustrative version follows
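A minimal sketch of the Misra-Gries-style algorithm described above (illustrative; the slides' own implementation is not reproduced here):

import math

# Misra-Gries-style heavy hitters: k = ceil(1/s) - 1 candidate slots.
def heavy_hitters(stream, s):
    k = math.ceil(1 / s) - 1
    T = [None] * k                       # candidate elements
    C = [0] * k                          # their counters
    for a in stream:
        if a in T:                       # a matches some T[j]: increment C[j]
            C[T.index(a)] += 1
        elif 0 in C:                     # some counter is 0: install a there
            j = C.index(0)
            T[j], C[j] = a, 1
        else:                            # otherwise decrement every counter
            C = [c - 1 for c in C]
    # surviving candidates: all s-heavy hitters, possibly plus false positives
    return [t for t, c in zip(T, C) if c > 0]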
Types of Errors
- A one-sided error means there are no false negatives
- The algorithm must count everything that needs to be considered
- There may be false positives; the algorithm may overcount
- This may not be a problem for certain use cases
- For use cases like billing or punishment, however, overcounting could be problematic
Window-Based Approaches
- Mechanism for extracting a relation from an infinite stream
- Various window proposals exist for restricting the processing scope
- Includes windows based on ordering attributes like time
- Includes windows based on the item (record) counts
- Includes windows based on explicit markers specifying the beginning and end
- Some variants may use a semantic partitioning constraint
Lossy Counting Defined
- N: current length of the stream
- Support threshold s ∈ (0, 1)
- Error parameter ε ∈ (0, 1), with ε ≪ s
- Output: elements with counter values exceeding (s − ε)N
- Approximation guarantees:
  - Frequencies are underestimated by at most εN
  - No false negatives
  - False positives have a true frequency of at least (s − ε)N
Lossy Counting Example
- A user can identify any items whose frequency is at least 0.1% of the entire stream
- Thus s = 0.1%.
- Epsilon = 0.01%
- All elements with a frequency of more than 0.1% are returned
- Elements with a frequency of less than 0.09% (= s − ε) are not
Steps for Lossy Counting
- First, divide the stream into "windows" of size w = ceiling(1/ε); taking ε = 0.01 for purposes of example, w = ceiling(1/0.01) = 100
- Then, at each window boundary, decrement all counters by 1 before processing the next window
- If a counter reaches 0 for a specific element, that element is dropped
- Then count the elements appearing in the next window and adjust the counts at its boundary in the same way (see the sketch below)
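A minimal sketch of the window-based procedure just described (illustrative, not the slides' exact code):

import math

# Lossy counting: frequencies are underestimated by at most eps * N.
def lossy_counting(stream, s, eps):
    w = math.ceil(1 / eps)          # window size, e.g. ceil(1/0.01) = 100
    counts, n = {}, 0
    for x in stream:
        counts[x] = counts.get(x, 0) + 1
        n += 1
        if n % w == 0:              # window boundary:
            for e in list(counts):
                counts[e] -= 1      # decrement all counters by 1
                if counts[e] == 0:  # a zero counter drops the element
                    del counts[e]
    # output elements whose counter exceeds (s - eps) * N
    return [e for e, c in counts.items() if c > (s - eps) * n]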
Error Analysis for Lossy Counting
- What is the undercount?
- N = current size of the stream, window size w = 1/ε
- Each window boundary decrements every counter by at most 1, so the frequency error ≤ number of windows = N/w = εN
- A rule of thumb sets ε equal to 10% of the support s
- Where given a support frequency s = 1%
- The set error frequency ε = 0.1%
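A worked instance of the bound with these rule-of-thumb parameters (the value of N is chosen purely for illustration):
- s = 1%, ε = 0.1%, N = 1,000,000
- Window size w = 1/ε = 1,000, so the number of windows is N/w = 1,000
- Each boundary removes at most 1 from a counter, so the undercount is at most εN = 1,000 (0.1% of the stream)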
Output Considerations
- Only elements with counter values exceeding (s − ε)N are output
- Approximation guarantees:
  - Frequencies are underestimated by at most εN
  - There should be no false negatives
  - Any false positives will have a true frequency of at least (s − ε)N
Sticky Sampling
- Sticky Sampling is probabilistic: whether a counter for a distinct element gets created is decided by sampling
- It is named because it sweeps over the stream like a magnet, attracting all elements which already have an entry in R
- If a counter exists for a certain element, every future instance of that element will be counted
- Users need to specify the support s, the error ε, and the probability of failure δ
- The data structure R is a set of entries of the form (e, f), where e is an element of the stream and f estimates its frequency
Sticky Sampling Rates
- A sampling rate r means that an element is selected with probability 1/r
- The rate schedule depends on t = (1/ε) ln(1/(sδ)), where ε is the error and δ the probability of failure
- If an incoming element already has an entry in R, its frequency estimate f is incremented; otherwise the element is sampled with probability 1/r and, if selected, the entry (e, 1) is created
- The sampling rate varies over the stream; when it changes, we scan the entries in R and update them
- For each entry, the update tosses an unbiased coin (described in the next section)
Sticky Sampling and Error Correction
- Whenever the sampling rate changes, we scan the entries in R
- For each entry, an unbiased coin is tossed repeatedly until the toss succeeds, diminishing f by one for every unsuccessful outcome
- If f drops to 0, the entry is removed from R
- After this correction, R is as if its entries had been sampled at the new rate from the beginning of the stream, preserving unbiasedness
Benefits and Drawbacks to Sticky Sampling
- Benefit: the expected number of entries in R does not depend on the stream length, and the sampling probability of each entry is known
- Drawback: the guarantees are probabilistic; with probability δ the coin-toss-based estimates may fail to meet the error bound
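A simplified sketch of sticky sampling as described above, with the doubling rate schedule and coin-toss correction (illustrative only):

import math
import random

# Sticky sampling: R maps element -> estimated frequency f.
def sticky_sampling(stream, s, eps, delta):
    t = (1 / eps) * math.log(1 / (s * delta))  # t = (1/eps) ln(1/(s*delta))
    r = 1                       # current rate: select with probability 1/r
    window_end = 2 * t          # first window w0 = 2t
    R, n = {}, 0
    for x in stream:
        n += 1
        if n > window_end:      # rate doubles; next window has length r*t
            r *= 2
            window_end += r * t
            for e in list(R):   # correction: re-toss for existing entries
                while random.random() < 0.5:   # unsuccessful coin toss
                    R[e] -= 1                  # diminish f by one
                    if R[e] == 0:
                        del R[e]               # entry vanishes at f = 0
                        break
        if x in R:
            R[x] += 1           # existing entries "stick": always counted
        elif random.random() < 1 / r:
            R[x] = 1            # new entry sampled with probability 1/r
    # report entries with estimated frequency at least (s - eps) * n
    return [e for e, f in R.items() if f >= (s - eps) * n]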
Using Dynamic Window Size
- Employing a dynamic window size: $w_k = 2^k t$ for k ≥ 1 and $w_0 = 2t$, where $t = (1/\varepsilon)\ln(1/(s\delta))$
- With parameters defined: s = 10%, ε = 1%, δ = 0.1%, this gives t ≈ 921
Memory usage compared
- Lossy counting is expected to yield (1/ε) · log(εN) entries
- Sticky sampling is expected to yield (2/ε) · log(1/(sδ)) entries
- Sticky sampling thus requires space that is independent of the stream length N
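Plugging in the parameters from the dynamic-window example (s = 10%, ε = 1%, δ = 0.1%) for a rough comparison:
- Sticky sampling: (2/ε) · ln(1/(sδ)) = 200 · ln(10,000) ≈ 1,842 expected entries, regardless of N
- Lossy counting: (1/ε) · log(εN) keeps growing with the stream length N, so for long enough streams it exceeds any fixed bound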
Experiment Setup
- Stream 1 (Uniq): no duplicates
- Stream 2 (Zipf): The Zipf distribution (also known as the zeta distribution) is a discrete probability distribution that satisfies Zipf’s law: the frequency of an item is inversely proportional to its rank in a frequency table.
- The PMF can be expressed as $P(k) = C / k^a$, where P(k) is the probability that an object occurs k times in the data, a is a parameter that controls the shape of the distribution, and C is a normalization constant.
- For both streams, we assumed uniformly random arrival order.
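To reproduce such a setup, Zipf-distributed and duplicate-free streams can be generated, for example, with NumPy (the shape value a = 2.0 and the stream length are illustrative choices, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
zipf_stream = rng.zipf(a=2.0, size=1_000_000)   # Stream 2: frequency ~ 1/rank^a
uniq_stream = rng.permutation(1_000_000)        # Stream 1: no duplicates, random order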
Experiment Results
- Assuming s = 1% for both streams
- The error is set to ε = 10% of s for both
- The probability of failure is δ = 0.01
- The results are strongly influenced by how much each algorithm can prune
- Sticky sampling performs worse because of its tendency to remember every unique element that gets sampled
- Lossy Counting is proficient at pruning low-frequency elements quickly; the entries that survive are mostly the elements with higher counts
- In these experiments, the space used stays within the expected bounds
Summary
- Introduction to data stream processing
- Data stream algorithms
- Heavy hitters
- Window-based approaches
  - Lossy counting
  - Sticky sampling