Introduction to Big Data and Data Streams
Questions and Answers

Which characteristic distinguishes structured data from unstructured data?

  • Structured data cannot be stored in databases.
  • Structured data is easily searchable and organized in defined formats. (correct)
  • Unstructured data is always in the form of numerical values.
  • Unstructured data is always produced by machines.

What is the primary purpose of the Flajolet–Martin algorithm in stream processing?

  • To count distinct elements efficiently within a stream. (correct)
  • To analyze time series data in real-time.
  • To obtain a representative sample from a data stream.
  • To filter noisy data from the streams.

What technique is commonly used to reduce dimensionality in data analytics?

  • Discriminant analysis.
  • Data normalization.
  • Cluster analysis.
  • Principal Component Analysis (PCA). (correct)

Which of the following is a key feature of Transformer networks?

  • They avoid the use of recurrent layers for processing sequences. (correct)

    In the context of Natural Language Processing, what does the term 'Bag-of-Words' refer to?

  • A representation that ignores word order and focuses solely on the frequency of words. (correct)

    Study Notes

    Big Data Introduction

    • Big data is characterized by volume, velocity, variety, veracity, and value
    • Types of big data include structured, unstructured, and semi-structured data
    • Structured data is organized in a predefined format (e.g., relational databases)
    • Unstructured data lacks a predefined format (e.g., images, videos, emails)
    • Semi-structured data carries organizational markers but is less rigid than structured data (e.g., JSON, XML)
    • Traditional business approaches often struggle with processing and managing big data
    • Big Data solutions offer innovative approaches to manage and extract value from enormous datasets
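The structured vs. semi-structured distinction above can be made concrete with a short sketch. The records below are hypothetical; the point is that, unlike rows in a relational table, semi-structured documents may each carry a different set of fields:

```python
import json

# Hypothetical semi-structured records: each has named fields,
# but the fields and nesting vary from record to record.
records = [
    '{"id": 1, "name": "Ada", "tags": ["sensor", "iot"]}',
    '{"id": 2, "name": "Ben", "address": {"city": "Oslo"}}',
]

parsed = [json.loads(r) for r in records]

# Unlike a relational table, each record may expose different keys.
keys_per_record = [sorted(p.keys()) for p in parsed]
print(keys_per_record)  # [['id', 'name', 'tags'], ['address', 'id', 'name']]
```

A relational schema would force both records into identical columns; JSON tolerates the variation, which is why it is the standard example of semi-structured data.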

    Mining Data Streams

    • Data stream management systems efficiently process continuous incoming data
    • Stream sources include sensor data, financial transactions, and social media posts
    • Stream processing deals with queries targeting continuous data
    • Issues in stream processing include handling high velocity data and preserving accuracy
    • Sampling data streams allows analyzing subsets of data
    • Bloom filters are used in stream processing for efficient approximate filtering
    • Counting distinct elements within a stream utilizes algorithms like Flajolet–Martin
    • Counting events within a time window involves trade-offs between accuracy and cost
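The Flajolet–Martin idea mentioned above can be sketched in a few lines. This is a minimal single-hash version for illustration (production sketches average many hash functions to reduce variance); the hash choice and helper names are assumptions, not part of the original algorithm's specification:

```python
import hashlib

def trailing_zeros(x: int) -> int:
    # Count trailing zero bits of x (returns 0 for x == 0 by convention here).
    tz = 0
    while x > 0 and x & 1 == 0:
        x >>= 1
        tz += 1
    return tz

def fm_estimate(stream) -> int:
    """Flajolet-Martin sketch: estimate the number of distinct elements.

    Track R, the maximum number of trailing zeros over all hashed items;
    2**R approximates the distinct count, using O(1) memory.
    """
    R = 0
    for item in stream:
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

# Duplicates do not change the sketch: the same item hashes identically.
stream = [1, 2, 3, 2, 1, 4, 5, 3, 2]
print(fm_estimate(stream))  # a power of two, roughly near the 5 distinct values
```

The key property is that repeated items leave R unchanged, so memory stays constant no matter how long the stream runs, at the cost of only an approximate answer.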

    Big Data Analytics

    • Big data plays a pivotal role in business decision-making
    • Drivers of big data include increasing data volumes, evolving technologies, and analytical requirements
    • Big data optimization techniques encompass various approaches
    • Dimensionality reduction techniques are used for simplifying data analysis
    • Time series analysis is used to understand how events and measurements evolve over time
    • Social media mining and social network analysis are crucial techniques for extracting insights
    • Tools like Hadoop, Pig, Hive, MongoDB, Spark, and Mahout are used for analyzing large datasets
    • Techniques like discriminant analysis and cluster analysis help in data analysis
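Principal Component Analysis, the dimensionality-reduction technique from the quiz above, can be sketched directly with an SVD. This is a minimal NumPy illustration on synthetic data (the example data and function name are assumptions for demonstration):

```python
import numpy as np

def pca(X: np.ndarray, k: int) -> np.ndarray:
    """Project X (n samples x d features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                          # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # coordinates in top-k subspace

rng = np.random.default_rng(0)
# 100 points that vary mostly along one direction, plus a tiny noise feature
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t, 0.01 * rng.normal(size=(100, 1))])

Z = pca(X, 1)
print(Z.shape)  # (100, 1)
```

Because the first two features are perfectly correlated, a single component captures nearly all of the variance, which is exactly the situation where dimensionality reduction simplifies later analysis.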

    Natural Language Processing (NLP)

    • NLP focuses on enabling computers to understand human language
    • Regular expressions are used for pattern matching within text
    • N-grams are sequences of N words used for language modeling
    • Language models predict probabilities of word sequences
    • Part-of-speech tagging labels words based on their grammatical role
    • Named entity recognition identifies named entities (people, organizations, locations)
    • Syntactic and semantic parsing recover the grammatical structure and meaning of text
    • Morphological analysis examines word structure
    • Vector space models represent text as vectors
    • Bag-of-words models capture word frequencies
    • Term frequency–inverse document frequency (TF-IDF) weights words by their importance to a document relative to the corpus
    • Word vector representations (Word2Vec, GloVe, FastText, BERT) capture semantic relationships
    • Topic modeling identifies topics within text collections
    • Recurrent neural networks (RNNs) process sequential text data
    • Long short-term memory (LSTM) networks manage long-range dependencies in sequences
    • Encoder-decoder architecture is used in machine translation
    • Attention mechanism enhances contextual understanding
    • Transformer networks process information in parallel
    • Text classification and sentiment analysis determine themes, opinions
    • Neural machine translation translates text between languages
    • Question answering systems understand user queries
    • Text summarization condenses text into shorter versions
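The bag-of-words and TF-IDF ideas from the notes (and the quiz question above) can be sketched from scratch. The toy corpus is a made-up example; this uses the basic `count × log(N/df)` weighting for illustration, while libraries often apply smoothed variants:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Bag-of-words: per-document word counts, ignoring word order entirely.
bows = [Counter(d.split()) for d in docs]

# Document frequency: in how many documents each word appears.
N = len(docs)
df = Counter(w for bow in bows for w in bow)

def tfidf(bow):
    # Weight words frequent in this document but rare across the corpus.
    return {w: c * math.log(N / df[w]) for w, c in bow.items()}

weights = tfidf(bows[0])
# "the" appears in two documents, "cat" in only one, so "cat" outweighs "the".
```

This shows why bag-of-words alone overvalues common function words, and why the inverse-document-frequency factor is needed to surface distinctive terms.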

    Description

    This quiz covers the fundamentals of Big Data, including its characteristics and the types of data, such as structured, unstructured, and semi-structured formats. Additionally, it explores data stream management systems and the challenges associated with processing continuous data. Test your knowledge on key concepts in managing and extracting value from large datasets.
