Questions and Answers
Which characteristic of big data is primarily addressed by stream data processing?
- Volume
- Veracity
- Velocity (correct)
- Variety
A data stream can be fully stored locally for processing without any limitations on the item arrival order.
False
In stream processing, what characteristic must a sequence of items possess?
Ordered
In the context of data streams, sub-linear space-time algorithms are used because we can only afford to store a _______ of the data.
Match the streaming algorithm task with the corresponding calculation performed on the data stream:
In the streaming algorithms framework, what should be done with x after retrieving the next element x in stream S?
In the context of mining approximate frequent patterns in stream data, obtaining precise frequent patterns is always feasible.
In the context of the majority problem, will there always be a majority element present in every infinite stream?
In the 'Finding the majority element' algorithm, there are ____ negatives, but there are false _____.
Match each value of s for the s-heavy hitters problem with its correct implication:
Which statement is true regarding the s-heavy hitters problem?
Window-based approaches extract an infinite relation from a finite stream.
In window-based approaches, what is a common attribute used for ordering the windows?
In Lossy Counting, frequencies can be underestimated by at most _____.
Match the following guarantee with the appropriate lossy-counting parameter:
In lossy counting, what action is performed at a window boundary?
In Lossy Counting, false negatives have a true frequency greater than $(s - ε)N$.
In Lossy Counting, what happens if a counter is zero for a specific element during processing?
Sticky Sampling is named because it sweeps over the stream like a _____, attracting all elements which already have an entry in R.
Match each parameter with its correct definition in sticky sampling:
In Lossy Counting, what is the condition to provide guarantees?
In sticky sampling, if there are no changes to the sampling rate, there is no need to scan entries in R.
In the context of Sticky Sampling, which elements are tracked exactly?
Lossy Counting is good at pruning ____ frequency elements quickly.
Match the name of the algorithm with its expected number of entries:
For Lossy Counting, the frequency error ≤ the number of windows × what value?
The Zipf distribution satisfies Zipf's law: the frequency of an item is inversely proportional to its rank.
If you need to implement fraud detection, is it crucial to avoid false positives or false negatives?
In heavy hitters, false positives are problematic if you are billing someone or need to issue _____.
Associate each of the following algorithms with their expected outcome:
Assume the user knows a stream of data has s = 10% and that delta needs to be small. Which value represents the highest probability of failure?
Heavy Hitters can have false negatives but should minimize the number of false positives.
What is the primary advantage of using sub-linear space-time algorithms in data stream processing?
In the streaming algorithms framework, a key step involves _______ summary information for the incoming stream data.
Match each window-based approach with its corresponding description:
Which approach provides more precise frequent patterns in stream data, albeit unrealistically?
The purpose of lossy counting is data compression.
In Sticky Sampling, if unsuccessful coin tosses reduce a given element's count to zero, what happens?
The dynamic window sizes used with Sticky Sampling are equal to __________.
Associate each stream data algorithm with its main target:
Flashcards
Data Stream
A real-time, continuous, ordered sequence of items.
Big Data Characteristics
Data that is characterized by 5Vs: Volume, Velocity, Variety, Veracity, Value
Streaming Algorithms
Processing data as it arrives, using sub-linear space-time algorithms.
Majority Element
The element that occurs strictly more than half of the time in a stream; each stream has at most one.
S-heavy Hitters Objective
Maintain a structure holding all elements whose count exceeds s·t, with no false negatives (false positives are acceptable).
Window-Based Approaches
Mechanisms for extracting a relation from an infinite stream by restricting the processing scope.
Lossy Counting
A window-based algorithm that outputs elements with counters exceeding (s − ε)N; frequencies are underestimated by at most εN.
Sticky Sampling
A probabilistic counting technique that sweeps over the stream like a magnet, attracting all elements which already have an entry in R.
Limit in StreamingHeavyHitter
Study Notes
- This presentation is about stream data processing
Agenda
- Discussion of an introduction to data stream processing
- Look at data stream algorithms
- Discussion of heavy hitters
- A look into window-based approaches
- Including lossy counting and sticky sampling
Big Data Characteristics
- Big data is characterized by the 5Vs: Volume, Velocity, Variety, Veracity, Value
- Most discussions deal with Volume, which is handled by using more than one machine
- The focus is Velocity, as the speed of incoming data is faster than a machine's processing power
- The concept of the data stream is introduced
Data Streams Defined
- A data stream constitutes a real-time, continuous, ordered sequence of items, implicitly ordered by arrival time or explicitly by timestamp
- It is impossible to control the order in which items arrive or locally store a stream in its entirety
- A stream is composed of structured records, as opposed to unstructured data like audio or video
- High data volumes and record arrival rates are expected, with records arriving at a few GB/s
Streaming Algorithms
- With streaming algorithms, there is too much data, and too little space
- Only a summary or sketch of the data can be stored
- Data streams feature a huge amount of data that is never-ending
- Data is run over AT MOST once
- Only data summaries can be stored by using sub-linear space-time algorithms
Data Streams
- A stream of data elements $S = a_1, a_2, \ldots$, where $a_t$ arrives at time $t$
- Represented mathematically as $a[1 \ldots t] = [a_1, a_2, \ldots, a_t]$
- For simplicity, each $a_t$ is a number
- A function can be computed continuously while only using limited space
- At any time $t$, the function can be queried with the value from the stream seen so far, i.e., $a[1 \ldots t]$
Computing the Sum
- Where S is a stream of numbers, the output at each step is the running total, computed with a Python implementation
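A minimal sketch of the running total (the slides' exact Python code is not reproduced here, so this is an illustrative version):

# Running sum: the summary is a single number, updated per element.
def streaming_sum(stream):
    total = 0
    for x in stream:        # each element is seen at most once
        total += x
        yield total         # the running total at time t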
Finding The Minimum
- For an infinite stream of numbers, the minimum of the observed elements at timestamp t can be found
- A min() update maintains the minimum value seen so far
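An illustrative running-minimum sketch, assuming the min() update described above:

# Running minimum: store only the smallest element seen so far.
def streaming_min(stream):
    current_min = float("inf")
    for x in stream:
        current_min = min(current_min, x)   # the min() update
        yield current_min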
Finding the mean
- For an infinite stream of numbers, the mean of the observed elements at timestamp t can be computed
- The slides use an equation and a Python implementation
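A sketch of the running mean, assuming it is maintained from a running sum and a count (illustrative, not the slides' exact code):

# Running mean: store only the running sum and the element count.
def streaming_mean(stream):
    total, count = 0, 0
    for x in stream:
        total += x
        count += 1
        yield total / count   # mean of a[1..t] at time t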
Streaming Algorithms: Framework
- The algorithm works by these steps
- First initialising summary information
- A loop continues while the stream S is not done
- The algorithm takes the next element from the stream, denoted as x
- The algorithm then computes something with x
- It then updates the summary information
- Then outputs something if needed
- Finally the algorithm returns summary information
- With some restrictions, computations of interesting functions are possible, given some tolerance for error
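The steps above can be written as a generic skeleton; the helper names (initialize, update, output) are placeholders, not names from the slides:

# Generic streaming framework: the summary must fit in sub-linear space.
def streaming_framework(stream, initialize, update, output):
    summary = initialize()            # initialise summary information
    for x in stream:                  # while S is not done: take next x
        summary = update(summary, x)  # compute with x, update the summary
        result = output(summary)      # output something if needed
        if result is not None:
            yield result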
Mining Approximate Frequent Patterns
- Obtaining precise frequent patterns is unrealistic
- Approximate answers are often sufficient, e.g., for trend/pattern analysis
- A router may be interested in knowing which flows make up 1% of the entire traffic stream
- An error of 1/10 of s is acceptable, i.e., ε = 0.1%
Finding The Majority Element
- Given an infinite stream of numbers, a problem is finding the element that occurs STRICTLY more than half of the time, the majority element
- Each stream will have at most one such element
- Worked examples show which element (if any) is the majority at particular times
- The algorithm has no false negatives, but can report false positives (see the sketch below)
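A sketch of the classic Boyer-Moore majority-vote algorithm, which matches the behaviour described above (one candidate, O(1) space; illustrative, not the slides' exact code):

# Boyer-Moore majority vote: one candidate and one counter.
def majority_candidate(stream):
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1   # adopt a new candidate
        elif x == candidate:
            count += 1                # another vote for the candidate
        else:
            count -= 1                # a different element cancels one vote
        yield candidate               # may be a false positive

If a true majority element exists it is never lost (no false negatives), but when no majority exists the reported candidate is a false positive.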
S - Heavy Hitters
- Heavy hitters can be extracted using Structured Query Language (SQL) as:
SELECT R.EventType, COUNT(*)
FROM R
GROUP BY R.EventType
HAVING COUNT(*) > s * t
- Where s is a user-specified percentage and t is the length of the stream
S - Heavy Hitters Defined
- For a stream $S = a_1, a_2, \ldots$, define the count of element $e$ at any time $t$ to be $\mathrm{count}(e) = |\{i \le t \mid a_i = e\}|$
- An element $e$ is called an s-heavy hitter if $\mathrm{count}(e) > st$
- The objective is to maintain a structure holding all the s-heavy hitters
- There can be at most $1/s - 1$ such elements
- False positives are acceptable, but no false negatives
- You must not miss any heavy hitters, but you may store non-heavy-hitters
Examples and Applications
- If s = 1/2, then this is the majority element problem
- Worked examples show which elements are heavy hitters at any particular time
- If s = 1/3, then the 1/3-heavy hitters are found
Heavy Hitters Algorithm
- Set k = ceiling( 1/s ) - 1, if s = 1/2, then k = 1
- Keep an array T[1, ... , k] to hold elements
- Keep an array C[1, ... k] to hold their counters
- Initialise C[i] = 0 and T[i] = 0 for all i
- For each a in stream S:
- If a equals T[j] for some j ≤ k, then increment C[j]
- Else, if some counter C[j] equals 0, where j ≤ k, set T[j] to a and C[j] to 1
- Else, decrement all counters C[j] by 1
Heavy Hitters - Python Implementation
- The slides give a Python implementation of heavy hitters over streams; an illustrative version follows
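A minimal sketch of the Misra-Gries-style algorithm described above (illustrative; the slides' own implementation is not reproduced here):

import math

# Misra-Gries-style heavy hitters: k = ceil(1/s) - 1 candidate slots.
def heavy_hitters(stream, s):
    k = math.ceil(1 / s) - 1
    T = [None] * k                       # candidate elements
    C = [0] * k                          # their counters
    for a in stream:
        if a in T:                       # a matches some T[j]: increment C[j]
            C[T.index(a)] += 1
        elif 0 in C:                     # some counter is 0: install a there
            j = C.index(0)
            T[j], C[j] = a, 1
        else:                            # otherwise decrement every counter
            C = [c - 1 for c in C]
    # surviving candidates: all s-heavy hitters, possibly plus false positives
    return [t for t, c in zip(T, C) if c > 0]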
Types of Errors
- A one-sided error means there are no false negatives
- The algorithm must count everything that needs to be considered
- There may be false positives; the algorithm may overcount
- This may not be a problem for certain use cases
- For use cases like billing or punishment, however, overcounting could be problematic
Window-Based Approaches
- Mechanism for extracting a relation from an infinite stream
- Various window proposals exist for restricting the processing scope
- Includes windows based on ordering attributes like time
- Includes windows based on the item (record) counts
- Includes windows based on explicit markers specifying the beginning and end
- Some variants may use a semantic partitioning constraint
Lossy Counting Defined
- N: current length of the stream
- Support threshold s ∈ (0, 1)
- Error parameter ε ∈ (0, 1), with ε ≪ s
- Output: elements with counter values exceeding (s − ε)N
- Approximation guarantees:
  - Frequencies are underestimated by at most εN
  - No false negatives
  - False positives have a true frequency of at least (s − ε)N
Lossy Counting Example
- A user can identify any items whose frequency is at least 0.1% of the entire stream
- Thus s = 0.1%.
- Epsilon = 0.01%
- All elements with a frequency of more than 0.1% are returned
- Elements with a frequency of less than 0.09% (= s − ε) are not
Steps for Lossy Counting
- First, divide the stream into "windows" of size w = ceiling(1/ε); taking ε = 0.01 for purposes of example, w = ceiling(1/0.01) = 100
- Then, at each window boundary, decrement all counters by 1 before processing the next window
- If a counter reaches 0 for a specific element, that element is dropped
- Then count the elements appearing in the next window and adjust the counts at its boundary in the same way (see the sketch below)
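A minimal sketch of the window-based procedure just described (illustrative, not the slides' exact code):

import math

# Lossy counting: frequencies are underestimated by at most eps * N.
def lossy_counting(stream, s, eps):
    w = math.ceil(1 / eps)          # window size, e.g. ceil(1/0.01) = 100
    counts, n = {}, 0
    for x in stream:
        counts[x] = counts.get(x, 0) + 1
        n += 1
        if n % w == 0:              # window boundary:
            for e in list(counts):
                counts[e] -= 1      # decrement all counters by 1
                if counts[e] == 0:  # a zero counter drops the element
                    del counts[e]
    # output elements whose counter exceeds (s - eps) * N
    return [e for e, c in counts.items() if c > (s - eps) * n]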
Error Analysis for Lossy Counting
- What is the undercount?
- N = current size of the stream, window size w = 1/ε
- Each window boundary decrements every counter by at most 1, so the frequency error ≤ number of windows = N/w = εN
- A rule of thumb sets ε equal to 10% of the support s
- Where given a support frequency s = 1%
- The set error frequency ε = 0.1%
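A worked instance of the bound with these rule-of-thumb parameters (the value of N is chosen purely for illustration):
- s = 1%, ε = 0.1%, N = 1,000,000
- Window size w = 1/ε = 1,000, so the number of windows is N/w = 1,000
- Each boundary removes at most 1 from a counter, so the undercount is at most εN = 1,000 (0.1% of the stream)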
Output Considerations
- Only elements with counter values exceeding (s − ε)N are output
- Approximation guarantees:
  - Frequencies are underestimated by at most εN
  - There should be no false negatives
  - Any false positives will have a true frequency of at least (s − ε)N
Sticky Sampling
- Sticky Sampling is probabilistic: whether a counter for a distinct element gets created is decided by sampling
- It is named because it sweeps over the stream like a magnet, attracting all elements which already have an entry in R
- If a counter exists for a certain element, every future instance of that element will be counted
- Users need to specify the support s, the error ε, and the probability of failure δ
- The data structure R is a set of entries of the form (e, f), where e is an element of the stream and f estimates its frequency
Sticky Sampling Rates
- A sampling rate r means that an element is selected with probability 1/r
- The rate schedule depends on t = (1/ε) ln(1/(sδ)), where ε is the error and δ the probability of failure
- If an incoming element already has an entry in R, its frequency estimate f is incremented; otherwise the element is sampled with probability 1/r and, if selected, the entry (e, 1) is created
- The sampling rate varies over the stream; when it changes, we scan the entries in R and update them
- For each entry, the update tosses an unbiased coin (described in the next section)
Sticky Sampling and Error Correction
- Whenever the sampling rate changes, we scan the entries in R
- For each entry, an unbiased coin is tossed repeatedly until the toss succeeds, diminishing f by one for every unsuccessful outcome
- If f drops to 0, the entry is removed from R
- After this correction, R is as if its entries had been sampled at the new rate from the beginning of the stream, preserving unbiasedness
Benefits and Drawbacks to Sticky Sampling
- Benefit: the expected number of entries in R does not depend on the stream length, and the sampling probability of each entry is known
- Drawback: the guarantees are probabilistic; with probability δ the coin-toss-based estimates may fail to meet the error bound
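A simplified sketch of sticky sampling as described above, with the doubling rate schedule and coin-toss correction (illustrative only):

import math
import random

# Sticky sampling: R maps element -> estimated frequency f.
def sticky_sampling(stream, s, eps, delta):
    t = (1 / eps) * math.log(1 / (s * delta))  # t = (1/eps) ln(1/(s*delta))
    r = 1                       # current rate: select with probability 1/r
    window_end = 2 * t          # first window w0 = 2t
    R, n = {}, 0
    for x in stream:
        n += 1
        if n > window_end:      # rate doubles; next window has length r*t
            r *= 2
            window_end += r * t
            for e in list(R):   # correction: re-toss for existing entries
                while random.random() < 0.5:   # unsuccessful coin toss
                    R[e] -= 1                  # diminish f by one
                    if R[e] == 0:
                        del R[e]               # entry vanishes at f = 0
                        break
        if x in R:
            R[x] += 1           # existing entries "stick": always counted
        elif random.random() < 1 / r:
            R[x] = 1            # new entry sampled with probability 1/r
    # report entries with estimated frequency at least (s - eps) * n
    return [e for e, f in R.items() if f >= (s - eps) * n]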
Using Dynamic Window Size
- Employing a dynamic window size: $w_k = 2^k t$ for k ≥ 1 and $w_0 = 2t$, where $t = (1/\varepsilon)\ln(1/(s\delta))$
- With parameters defined: s = 10%, ε = 1%, δ = 0.1%, this gives t ≈ 921
Memory usage compared
- Lossy counting is expected to yield (1/ε) · log(εN) entries
- Sticky sampling is expected to yield (2/ε) · log(1/(sδ)) entries
- Sticky sampling thus requires space that is independent of the stream length N
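Plugging in the parameters from the dynamic-window example (s = 10%, ε = 1%, δ = 0.1%) for a rough comparison:
- Sticky sampling: (2/ε) · ln(1/(sδ)) = 200 · ln(10,000) ≈ 1,842 expected entries, regardless of N
- Lossy counting: (1/ε) · log(εN) keeps growing with the stream length N, so for long enough streams it exceeds any fixed bound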
Experiment Setup
- Stream 1 (Uniq): no duplicates
- Stream 2 (Zipf): The Zipf distribution (also known as the zeta distribution) is a discrete probability distribution that satisfies Zipf’s law: the frequency of an item is inversely proportional to its rank in a frequency table.
- The PMF can be expressed as $P(k) = C / k^a$, where P(k) is the probability that an object occurs k times in the data, a is a parameter that controls the shape of the distribution, and C is a normalization constant.
- For both streams, we assumed uniformly random arrival order.
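To reproduce such a setup, Zipf-distributed and duplicate-free streams can be generated, for example, with NumPy (the shape value a = 2.0 and the stream length are illustrative choices, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
zipf_stream = rng.zipf(a=2.0, size=1_000_000)   # Stream 2: frequency ~ 1/rank^a
uniq_stream = rng.permutation(1_000_000)        # Stream 1: no duplicates, random order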
Experiment Results
- Assuming s = 1% for both streams
- The error is set to ε = 10% of s for both
- The probability of failure is δ = 0.01
- The results are strongly influenced by how much each algorithm can prune
- Sticky sampling performs worse because of its tendency to remember every unique element that gets sampled
- Lossy Counting is proficient at pruning low-frequency elements quickly; the entries that survive are mostly the elements with higher counts
- In these experiments, the space used stays within the expected bounds
Summary
- Introduction to data stream processing
- Data stream algorithms
- Heavy hitters
- Window-based approaches
  - Lossy counting
  - Sticky sampling