Data Stream Processing


Questions and Answers

Which characteristic of big data is primarily addressed by stream data processing?

  • Volume
  • Veracity
  • Velocity (correct)
  • Variety

A data stream can be fully stored locally for processing without any limitations on the item arrival order.

False (B)

In stream processing, what characteristic must a sequence of items possess?

Ordered

In the context of data streams, sub-linear space-time algorithms are used because we can only afford to store a _______ of the data.

summary

Match the streaming algorithm task with the corresponding calculation performed on the data stream:

Computing the sum = Σᵢ₌₁..ₜ aᵢ
Finding the minimum = min(aₜ, a₁, a₂, …, aₜ₋₁)
Finding the mean = (mean([a₁, a₂, …, aₜ₋₁]) × (t − 1) + aₜ) / t

In the streaming algorithms framework, what should be done with x after retrieving the next element x in stream S?

Do something with x and update the summary information (A)

In the context of mining approximate frequent patterns in stream data, obtaining precise frequent patterns is always feasible.

False (B)

In the context of the majority problem, will there always be a majority element present in every infinite stream?

No

In the 'Finding the majority element' algorithm, there are ____ negatives, but there are false _____.

no, positives

Match each value of s for the s-heavy hitters problem with its correct implication:

s = 1/2 = Majority element problem
s = 1/3 = 1/3-heavy hitters problem

Which statement is true regarding the s-heavy hitters problem?

False positives are acceptable, but false negatives are not. (A)

Window-based approaches extract an infinite relation from a finite stream.

False (B)

In window-based approaches, what is a common attribute used for ordering the windows?

time

In Lossy Counting, frequencies can be underestimated by at most _____.

εN

Match the following guarantee with the appropriate lossy-counting parameter:

s = 0.1% = All elements with frequency ≥ 0.1% will be returned.
ε = 0.01% = All individual frequencies underestimate their true frequencies by at most 0.01% × N.

In lossy counting, what action is performed at a window boundary?

Decrement all counters by one. (A)

In Lossy Counting, false negatives have a true frequency greater than (s − ε)N.

False (B)

In Lossy Counting, what happens if a counter is zero for a specific element during processing?

Drop the element

Sticky Sampling is named because it sweeps over the stream like a _____, attracting all elements which already have an entry in R.

magnet

Match each parameter with its correct definition in sticky sampling:

s = Support
ε = Error
δ = Probability of failure

In Lossy Counting, what condition is required to provide the guarantees?

ε ≪ s (C)

In sticky sampling, if there are no changes to the sampling rate, there is no need to scan the entries in R.

True (A)

In the context of Sticky Sampling, which elements are tracked exactly?

Existing elements (those that already have an entry in R)

Lossy Counting is good at pruning ____ frequency elements quickly.

low

Match the name of algorithm with its expected number of entries:

Lossy Counting = (1/ε) log(εN)
Sticky Sampling = (2/ε) log(1/(sδ))

For Lossy Counting, the frequency error is at most the number of windows, which equals what value?

εN (C)

The Zipf distribution satisfies Zipf's law: frequency is inversely proportional to rank.

True (A)

If you need to implement fraud detection, is it crucial to avoid false positives or false negatives?

false negatives

In heavy hitters, false positives are problematic if you are billing someone or need to issue _____.

punishment

Associate each of the following algorithms with their expected outcome:

s-heavy hitters = Output all items that occur more than a fraction s of the time.
Lossy Counting = Output elements with counter values exceeding (s − ε)N.
Sticky Sampling = Probabilistically decides whether a counter should be created for an element.

Assume the user knows a stream of data has s = 10% and that δ needs to be small. Which value represents the highest probability of failure?

10% (B)

Heavy Hitters can have false negatives but should minimize the number of false positives.

False (B)

What is the primary advantage of using sub-linear space-time algorithms in data stream processing?

Efficiency

In the streaming algorithms framework, a key step involves _______ summary information for the incoming stream data.

initializing

Match each window-based approach with its corresponding description:

Ordering attributes = Defining windows based on, e.g., time.
Item/record counts = Defining windows based on the number of items or records processed.
Explicit markers = Defining windows based on markers signifying the beginning and end.

Which approach provides more precise frequent patterns in stream data, albeit unrealistically?

Mining precise frequent patterns (C)

The purpose of lossy counting is data compression.

False (B)

In Sticky Sampling, what happens to an element whose count reaches zero after the unsuccessful coin tosses?

It is deleted from R

The dynamic window sizes used with Sticky Sampling are equal to __________.

2ᵏt

Associate stream data algorithm with its main target:

Lossy Counting = Elements with support above a user-specified percentage.
Majority problem = Find the element that appears more than half the time, if any.
Heavy hitters = Maintain all s-heavy hitters seen so far.

Flashcards

Data Stream

A real-time, continuous, ordered sequence of items.

Big Data Characteristics

Data that is characterized by 5Vs: Volume, Velocity, Variety, Veracity, Value

Streaming Algorithms

Processing data as it arrives, using sub-linear space-time algorithms.

Majority Element

A number that occurs more than half the time in a stream.


S-heavy Hitters Objective

Objective: Maintain a structure containing all the s-heavy hitters so far.


Window-Based Approaches

A mechanism for extracting a finite relation from an infinite stream.


Lossy Counting

Output elements with counter values exceeding (s – ε)N.


Sticky Sampling

A probabilistic sampling approach that creates counters for distinct elements.


Limit in StreamingHeavyHitter

It determines the number of items to track when finding heavy hitters


Study Notes

  • This presentation is about stream data processing

Agenda

  • An introduction to data stream processing
  • A look at data stream algorithms
  • Discussion of heavy hitters
  • A look into window-based approaches
  • Including lossy counting and sticky sampling

Big Data Characteristics

  • Big data is characterized by the 5Vs: Volume, Velocity, Variety, Veracity, Value
  • Most discussions deal with Volume, which is handled by using more than one machine
  • The focus here is Velocity: the speed of incoming data can exceed a single machine's processing power
  • The concept of the data stream is introduced

Data Streams Defined

  • A data stream constitutes a real-time, continuous, ordered sequence of items, implicitly ordered by arrival time or explicitly by timestamp
  • It is impossible to control the order in which items arrive or locally store a stream in its entirety
  • A stream is composed of structured records, as opposed to unstructured data like audio or video
  • High data volumes and record arrival rates are expected, with records arriving at a few GB/s

Streaming Algorithms

  • With streaming algorithms, there is too much data, and too little space
  • Only a summary or sketch of the data can be stored
  • Data streams feature a huge amount of data that is never-ending
  • The data is passed over AT MOST once
  • Only data summaries can be stored by using sub-linear space-time algorithms

Data Streams

  • A stream of data elements 𝑆 = 𝑎1, 𝑎2, …, where 𝑎𝑡 arrives at time 𝑡
  • Represented mathematically as 𝑎[1…𝑡] = [𝑎1, 𝑎2, … , 𝑎𝑡 ]
  • For simplicity, each 𝑎𝑡 is a number
  • A function can be computed continuously while only using limited space
  • At any time 𝑡, the function can be queried with the value from the stream seen so far, shown mathematically as 𝑎[1 … 𝑡]

Computing the Sum

  • Given a stream S of numbers, output the running total after each arrival; a Python sketch follows
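
A minimal Python sketch of the running total (an assumed illustration, not the slide's exact code):

def running_sum(stream):
    # Yield the sum of all elements seen so far, one value per arrival
    total = 0
    for x in stream:
        total += x      # update the single-number summary
        yield total     # the queryable value at time t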

Finding The Minimum

  • For an infinite stream of numbers, the minimum of the observed elements at timestamp t can be found
  • At each step, take min(aₜ, minimum so far); a Python sketch follows
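
A minimal Python sketch (an assumed illustration, not the slide's exact code):

def running_min(stream):
    # Yield min(a_1, ..., a_t) at each time t using constant space
    current_min = float("inf")
    for x in stream:
        current_min = min(x, current_min)  # min(a_t, minimum of earlier elements)
        yield current_min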

Finding the mean

  • For an infinite stream of numbers, the mean of the observed elements at timestamp t can be computed incrementally
  • Using the update mean ← (mean × (t − 1) + aₜ) / t; a Python sketch follows
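
A minimal Python sketch of the incremental update (an assumed illustration, not the slide's exact code):

def running_mean(stream):
    # Yield the mean of a[1..t] using only the previous mean and t
    mean, t = 0.0, 0
    for x in stream:
        t += 1
        mean = (mean * (t - 1) + x) / t  # (mean of a[1..t-1]) * (t-1) + a_t, over t
        yield mean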

Streaming Algorithms: Framework

  • The algorithm follows these steps:
  • First, initialise the summary information
  • Loop while the stream S is not done
  • Take the next element from the stream, denoted x
  • Compute something with x
  • Update the summary information
  • Output something if needed
  • Finally, return the summary information
  • With some restrictions, computations of interesting functions are possible, given some tolerance for error
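
The framework can be sketched in Python as follows (init, update, and output are caller-supplied placeholders, not names from the slides):

def run_stream(stream, init, update, output=None):
    summary = init()                  # initialise summary information
    for x in stream:                  # while stream S is not done: take next element x
        summary = update(summary, x)  # do something with x; update the summary
        if output:                    # output something if needed
            output(summary)
    return summary                    # finally, return the summary information

# For example, the running sum fits the framework:
# run_stream([3, 1, 4], init=lambda: 0, update=lambda s, x: s + x, output=print)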

Mining Approximate Frequent Patterns

  • Mining precise frequent patterns is unrealistic
  • Approximate answers are often sufficient, e.g., for trend/pattern analysis
  • A router may want to know which flows make up 1% of the entire traffic stream
  • An error of 1/10 of s is acceptable, i.e., ε = 0.1%

Finding The Majority Element

  • Given an infinite stream of numbers, the problem is to find the element that occurs STRICTLY more than half of the time: the majority element
  • Each stream has at most one such element
  • Examples show the majority element at particular times in the stream
  • The algorithm has no false negatives, but can have false positives; a Python sketch follows
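
The standard single-counter solution is the Boyer-Moore majority vote; a Python sketch (an assumed illustration, not the slide's exact code):

def majority_candidate(stream):
    # Keep one candidate and one counter: O(1) space
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1   # adopt x as the new candidate
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # If a majority element exists, this is it (no false negatives);
    # otherwise the returned candidate is a false positive
    return candidate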

S - Heavy Hitters

  • In SQL, heavy hitters can be extracted as:
SELECT R.EventType, COUNT(*)
FROM R
GROUP BY R.EventType
HAVING COUNT(*) > s * t
  • Where s is a user-specified percentage and t is the length of the stream

S - Heavy Hitters Defined

  • For a stream 𝑆 = 𝑎1, 𝑎2, define the count of element 𝑒 at any time 𝑡 to be count(𝑒) = |{𝑖 ≤ 𝑡|𝑎𝑖 = 𝑒 }|
  • Called an s-heavy hitter if count(𝑒) > 𝑠𝑡
  • The objective is to maintain a structure holding all the s-heavy hitters so far
  • There can be at most 1/s − 1 such elements
  • False positives are acceptable, but no false negatives
  • You must not miss any heavy-hitters, but can store non-heavy-hitters

Examples and Applications

  • If s = 1/2, then this is the majority element problem
  • Examples show the tracked values at particular elements of the stream
  • If s = 1/3, then this is the 1/3-heavy hitters problem

Heavy Hitters Algorithm

  • Set k = ceiling( 1/s ) - 1, if s = 1/2, then k = 1
  • Keep an array T[1, ... , k] to hold elements
  • Keep an array C[1, ... k] to hold their counters
  • Initialise C[i] = 0 and T[i] = 0 for all i
  • For each a in stream S:
    • If a equals T[j] for some j ≤ k, then increment C[j]
    • Else, if C[j] = 0 for some j ≤ k, set T[j] = a and C[j] = 1
    • Else, decrement every counter C[j] by 1

Heavy Hitters - Python Implementation

  • A Python implementation for heavy hitters in streams is sketched below
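
A sketch following the array-based steps above (this is the Misra-Gries algorithm; the code is an assumed illustration, not the slide's exact implementation):

import math

def heavy_hitters(stream, s):
    k = math.ceil(1 / s) - 1        # k = ceiling(1/s) - 1
    T = [None] * k                  # tracked elements
    C = [0] * k                     # their counters
    for a in stream:
        if a in T:                  # a equals T[j] for some j: increment C[j]
            C[T.index(a)] += 1
        elif 0 in C:                # some counter is 0: start tracking a
            j = C.index(0)
            T[j], C[j] = a, 1
        else:                       # otherwise decrement all counters
            C = [c - 1 for c in C]
    return {T[j]: C[j] for j in range(k) if C[j] > 0}

The returned candidates may include false positives, so when false positives matter a second pass with exact counts is needed to verify them.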

Types of Errors

  • A one-sided error means there are no false negatives
  • The algorithm counts everything that needs to be considered
  • There may be false positives: the algorithm may overcount
  • This may not be a problem for certain use cases
  • For use cases like billing or issuing punishment, overcounting could be problematic

Window-Based Approaches

  • Mechanism for extracting a finite relation from an infinite stream
  • Various window proposals exist for restricting the processing scope
  • Includes windows based on ordering attributes like time
  • Includes windows based on the item (record) counts
  • Includes windows based on explicit markers specifying the beginning and end
  • Some variants may use semantic partitioning constraints

Lossy Counting Defined

  • 𝑁 ∶ current length of the stream
  • Support threshold 𝑠 ∈ (0, 1)
  • Error parameter 𝜀 ∈ (0, 1) and 𝜀 ≪ 𝑠
  • Output: Elements with counter values exceeding (𝑠 − 𝜀)𝑁
  • Approximation guarantees:
    • Frequencies are underestimated by at most εN
    • No false negatives
    • False positives have a true frequency of at least (s − ε)N

Lossy Counting Example

  • A user can identify any items whose frequency is at least 0.1% of the entire stream
  • Thus s = 0.1%.
  • Epsilon = 0.01%
  • All elements with a frequency of more than 0.1% are returned
  • An element with a frequency below (s − ε) = 0.09% is not returned

Steps for Lossy Counting

  • First, divide the stream into "windows" of size w = ceiling(1/ε); with ε = 0.01 for purposes of example, w = ceiling(1/0.01) = 100
  • At each window boundary, decrement all counters by one before processing the next window
  • If a counter reaches 0 for a specific element, drop that element
  • Then count the elements appearing in the next window and adjust the counts at its boundary; a Python sketch follows
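
A minimal sketch of these steps, assuming the simple decrement-all-counters variant described above (an assumed illustration, not the slide's exact code):

import math

def lossy_counting(stream, epsilon):
    w = math.ceil(1 / epsilon)          # window size
    counts = {}
    for i, x in enumerate(stream, start=1):
        counts[x] = counts.get(x, 0) + 1
        if i % w == 0:                  # window boundary
            for e in list(counts):
                counts[e] -= 1          # decrement all counters by one
                if counts[e] == 0:
                    del counts[e]       # drop elements whose counter reaches zero
    return counts

At query time, with N elements seen, output the elements e with counts[e] >= (s - epsilon) * N.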

Error Analysis for Lossy Counting

  • What is the undercount?
    • N = current size of stream
    • Window = 1/𝜖
    • Then the frequency error ≤ the number of windows = N/w = εN
  • A rule of thumb has 𝜀, equal to 10% of support s
    • Where given a support frequency s = 1%
    • The set error frequency ε = 0.1%

Output Considerations

  • Only elements with counter values exceeding (s − ε)N are output
  • Approximation guarantees:
    • Frequencies are underestimated by at most εN
    • There are no false negatives
    • Any false positives have a true frequency of at least (s − ε)N

Sticky Sampling

  • Sticky Sampling is a probabilistic approach: a counter for a distinct element may or may not be created, as decided by sampling
  • The name comes from the way it sweeps over the stream like a magnet, attracting all elements that already have an entry in R
  • Once a counter exists for an element, every future instance of that element is counted
  • Users specify the support s, the error ε, and the probability of failure δ
  • The data structure R is a set of entries of the form (e, f), where e is an element of the stream and f estimates its frequency

Sticky Sampling Rates

  • With sampling rate r, an element that has no entry in R is selected with probability 1/r; if selected, an entry (e, 1) is added to R
  • t = (1/ε) log(1/(sδ)), where s is the support, ε the error, and δ the probability of failure
  • The sampling rate varies over the stream, doubling from window to window
  • When the sampling rate changes, we scan the entries in R and update them:
    • For each entry, repeatedly toss an unbiased coin until the toss succeeds
    • f is diminished by one for every unsuccessful toss

Sticky Sampling and Error Correction

  • Whenever the sampling rate changes, scan R and adjust each entry:
    • Repeatedly toss an unbiased coin until the toss succeeds
    • Diminish f by one for every unsuccessful outcome
  • If f reaches zero for an entry, the entry is deleted from R
  • After this adjustment, the counts in R are as if the whole stream had been sampled at the new rate
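
Putting the pieces together, a Python sketch of Sticky Sampling (an assumed illustration, not the slide's exact code):

import math, random

def sticky_sampling(stream, s, eps, delta):
    t = math.ceil((1 / eps) * math.log(1 / (s * delta)))
    R = {}                                    # entries (e, f): element -> estimated frequency
    r = 1                                     # current sampling rate
    boundary = 2 * t                          # first window w0 = 2t
    for i, x in enumerate(stream, start=1):
        if x in R:
            R[x] += 1                         # existing entries attract like a magnet
        elif random.random() < 1 / r:
            R[x] = 1                          # new element sampled with probability 1/r
        if i == boundary:                     # sampling rate doubles at window boundaries
            r *= 2
            boundary += r * t                 # window w_k has size 2^k * t
            for e in list(R):                 # rescan R after the rate change
                while random.random() < 0.5:  # toss an unbiased coin until it succeeds
                    R[e] -= 1                 # diminish f per unsuccessful toss
                    if R[e] == 0:
                        del R[e]              # delete entries whose count reaches zero
                        break
    return R

At query time, with N elements seen, report the elements e with R[e] >= (s - eps) * N.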

Benefits and Drawbacks of Sticky Sampling

  • The expected number of entries in R is independent of the current stream length N, so Sticky Sampling supports infinitely long streams
  • The guarantees are probabilistic: with probability δ, the algorithm may fail to meet them

Using Dynamic Window Size

  • Employing a dynamic window size: w₀ = 2t and wₖ = 2ᵏt for k ≥ 1, where t = (1/ε) log(1/(sδ))
  • With parameters s = 10%, ε = 1%, δ = 0.1%, this gives t ≈ 921

Memory usage compared

  • Lossy Counting is expected to yield (1/ε) log(εN) entries
  • Sticky Sampling is expected to yield (2/ε) log(1/(sδ)) entries
  • Sticky Sampling thus requires less space

Experiment Setup

  • Stream 1 (Uniq): no duplicates
  • Stream 2 (Zipf): The Zipf distribution (also known as the zeta distribution) is a discrete probability distribution that satisfies Zipf’s law: the frequency of an item is inversely proportional to its rank in a frequency table.
  • The PMF can be expressed as P(k) = C / kᵃ, where P(k) is the probability that an object occurs k times in the data, a is a parameter controlling the shape of the distribution, and C is a normalization constant.
  • For both streams, we assumed uniformly random arrival order.
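
One way the two test streams might be generated (an assumed sketch, not the experiment's actual code; numpy's zipf sampler draws values with probability proportional to k⁻ᵃ):

import numpy as np

rng = np.random.default_rng(seed=42)

stream_uniq = np.arange(1_000_000)             # Stream 1 (Uniq): no duplicates
rng.shuffle(stream_uniq)                       # uniformly random arrival order

stream_zipf = rng.zipf(a=2.0, size=1_000_000)  # Stream 2 (Zipf): skewed frequencies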

Experiment Results

  • Assuming s = 1% for both streams
  • The expected error for both is ε = 10% of s, with probability of failure δ = 0.01
  • For the Uniq stream, Sticky Sampling performs poorly because of its tendency to remember every element it samples; with no duplicates, the number of entries becomes very large
  • Lossy Counting is proficient at pruning low-frequency elements quickly; the elements that survive are mostly those with higher counts
  • In these experiments, the number of entries for both algorithms stayed within limited space bounds

Summary

  • Introduction to data stream processing
  • Data stream algorithms
  • Heavy hitters
  • Window-based approaches, including lossy counting and sticky sampling
