Machine Learning: Handling Large Data

Questions and Answers

What is a k-itemset in the context of frequent itemset generation?

A k-itemset is a subset of size k of the set of all items.

Which algorithms are used for frequent itemset generation? Select all that apply.

Apriori

In Apriori, all subsets of a frequent itemset must also be frequent.

True

In FP-Growth, the frequent itemsets are stored in a FP-Tree, eliminating the need to generate __________.

candidates

Match the clustering algorithms with their descriptions:

Hierarchical Clustering = Can be Agglomerative or Divisive
K-Means = Requires defining the number of clusters and centroid positions
DBSCAN = Density-based method suitable for non-spherical clusters
OPTICS = Refinement of DBSCAN based on Reachability Distance

What is one of the main challenges addressed when handling large amounts of data?

Summarizing data to reduce its size

What are characteristics of algorithms designed for handling lots of data? (Select all that apply)

Must be parallelizable

Bonferroni Correction is used to address the issue of too many hypotheses with large datasets.

True

Principal Component Analysis (PCA) aims to transform data into a lower dimensional space while preserving most of the __________ of the data.

variance

Match the following metrics with their descriptions:

Support = Measure of how frequently two items appear together
Confidence = Probability that one item appears given the presence of another item
Lift = Indicates the strength of association between two items
Conviction = Probability that one item appears without the presence of another item

What is the purpose of collaborative filtering?

Recommend items to users based on previous user interactions.

Which method of collaborative filtering involves recommending items to a user x that are liked by users similar to x?

User-user

How can the issue of many missing values in collaborative filtering be addressed?

All of the above

Latent factors in recommendation systems are underlying features of the data.

True

Regularization penalizes the prediction based on the size of the fitted elements to compensate for ______________.

overfitting

Match the following graph mining terms with their descriptions:

Graph Partitioning = Aims to maximize within-group connections and minimize between-group connections
Spectral Graph Partitioning = Utilizes matricial representations and eigenvalues to find optimal partitioning
Overlapping Communities = Modeling communities based on nodes and memberships

Study Notes

Handling Lots of Data

  • Most models perform well given enough data, so the choice of model can be guided by the amount of data available.
  • To manage large amounts of data, it's essential to:
    • Summarize data to reduce its amount
    • Use algorithms that can handle large data efficiently
  • Algorithms for handling large data should:
    • Have low complexity
    • Be partitionable and parallelizable
    • Not require all data at once (e.g., MapReduce)

Too Many Hypotheses

  • With large amounts of data, noise can be amplified, and patterns may arise where they don't exist.
  • Bonferroni Correction is used to address this issue.
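
A minimal sketch of applying the Bonferroni Correction, assuming a significance level of 0.05 and a handful of made-up p-values: each individual test is held to alpha divided by the number of hypotheses.

```python
# Minimal sketch of the Bonferroni Correction: with m hypotheses tested at
# overall significance level alpha, each individual test is held to alpha / m.
alpha = 0.05
p_values = [0.001, 0.012, 0.049, 0.20]  # hypothetical p-values, one per hypothesis
m = len(p_values)

corrected_threshold = alpha / m
for i, p in enumerate(p_values):
    verdict = "reject H0" if p < corrected_threshold else "fail to reject H0"
    print(f"test {i}: p={p:.3f}, threshold={corrected_threshold:.4f} -> {verdict}")
```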

Summarization

  • Operations that take a long time, such as:
    • Counting
    • Summing
    • Finding maxima
  • These operations require going through all the data, making them challenging.
  • Solutions involve:
    • Reducing the number of points
    • Understanding distributions
    • Clustering
    • Plotting data (but can be hard with many features)

One-Pass Strategy

  • Adapt algorithms to require only a single pass through the data.
  • Example: Pearson's correlation coefficient can be computed in a single pass using running sums (see the sketch below).
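
A minimal sketch of such a one-pass computation, assuming the data arrives as a stream of (x, y) pairs: the correlation is derived from running sums accumulated in a single pass.

```python
import math

def one_pass_pearson(pairs):
    """Pearson correlation computed from running sums in a single pass."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:          # pairs can be any iterable/stream of (x, y)
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    return num / den

print(one_pass_pearson([(1, 2), (2, 4.1), (3, 5.9), (4, 8.2)]))  # close to 1
```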

Fastest Possible Algorithms

  • Opt for algorithms with low time complexity (O(N^p), where p is small).

Sampling

  • Sampling is useful when:
    • More data doesn't improve the model
    • Computational resources are limited
  • Sampling should be:
    • Unbiased
    • Representative of the whole population
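
A minimal sketch of drawing an unbiased, representative sample with NumPy; the dataset and sample size are made up, and uniform sampling without replacement is just one possible scheme.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = rng.normal(size=(1_000_000, 5))      # hypothetical large dataset

sample_size = 10_000
idx = rng.choice(len(data), size=sample_size, replace=False)  # uniform, unbiased
sample = data[idx]

# Quick sanity check that the sample is representative of the population
print(data.mean(axis=0))
print(sample.mean(axis=0))
```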

Sparse Matrices

  • Many columns with very few filled values.
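
A minimal sketch of storing such a matrix in compressed sparse row (CSR) form with SciPy, so only the few filled values are kept in memory; the matrix shown is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1000, 1000))
dense[0, 3] = 1.0
dense[42, 999] = 5.0                  # only a handful of filled values

sparse = csr_matrix(dense)            # stores only the non-zero entries
print(sparse.nnz)                     # 2 non-zero values
print(sparse.data, sparse.indices)    # values and their column indices
```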

Handling Lots of Columns

  • Dimensionality Reduction is essential to:
    • Find data representations that fit in fewer variables
    • Alternative to feature selection, which can be infeasible for data mining

PCA (Principal Component Analysis)

  • Assume the combination of features is linear.
  • Identify a set of projection vectors (eigenvectors) to transform variables into a lower-dimensional space while preserving most of the variance.
  • Steps:
    1. Compute the covariance matrix
    2. Compute eigenvectors
    3. Select the principal components
  • Kernel PCA is used when the combination of features is not linear.
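
A minimal sketch of the three PCA steps with NumPy (covariance matrix, eigenvectors, selection of principal components); X is a made-up data matrix with samples in rows and features in columns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # hypothetical data: 200 samples, 5 features

# 1. Center the data and compute the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Compute eigenvalues/eigenvectors (eigh: covariance matrices are symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Select the principal components (largest eigenvalues = most variance)
order = np.argsort(eigvals)[::-1]
k = 2
components = eigvecs[:, order[:k]]

X_reduced = Xc @ components                   # project into the lower-dimensional space
explained = eigvals[order[:k]].sum() / eigvals.sum()
print(X_reduced.shape, f"variance preserved: {explained:.2%}")
```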

SVD (Singular Value Decomposition)

  • Decompose the data matrix into three matrices that can reconstruct the original data.
  • SVD is efficient for sparse matrices and high-dimensional data when the number of factors can be selected beforehand.
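
A minimal sketch of a truncated SVD with NumPy, assuming the number of factors k is chosen beforehand: the three matrices are kept only up to rank k and used to reconstruct the data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))                 # hypothetical data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5                                          # number of factors chosen beforehand
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k reconstruction

error = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"relative reconstruction error with k={k}: {error:.3f}")
```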

CUR

  • Produces sparser matrices that are more economical to store and process.
  • A compromise relative to SVD: SVD produces better results but is less efficient.

Map Reduce

  • Divide the task into two phases:
    1. Map: generate a sequence of key-value pairs from the input data
    2. Reduce: merge all values in the list associated with the same key
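
A minimal single-machine sketch of the two phases using a word count, the classic MapReduce example; in a real system the key-value pairs would be partitioned across many workers.

```python
from collections import defaultdict

documents = ["the cat sat", "the cat ran", "a dog ran"]   # hypothetical input

# Map phase: each document produces a sequence of (key, value) pairs
def map_phase(doc):
    for word in doc.split():
        yield (word, 1)

# Shuffle: group values by key
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

# Reduce phase: merge all values associated with the same key
def reduce_phase(key, values):
    return key, sum(values)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)   # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 2, 'a': 1, 'dog': 1}
```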

Handling Texts and Words

  • TF.IDF (Term Frequency-Inverse Document Frequency) measures the importance of a word in a document relative to a collection of documents.
  • TF.IDF = TF * IDF, where TF is the term frequency and IDF is the inverse document frequency.
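
A minimal sketch of TF.IDF computed by hand on made-up documents, assuming TF is the raw count of the term in the document and IDF = log(N / number of documents containing the term); other weighting variants exist.

```python
import math
from collections import Counter

docs = [
    "machine learning with large data",
    "large data needs fast algorithms",
    "learning algorithms for data mining",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = Counter(doc_tokens)[term]                  # term frequency in this document
    df = sum(1 for d in tokenized if term in d)     # documents containing the term
    idf = math.log(N / df) if df else 0.0           # inverse document frequency
    return tf * idf

print(tf_idf("data", tokenized[0]))       # appears everywhere -> score 0
print(tf_idf("machine", tokenized[0]))    # rare term -> higher score
```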

Supervised Learning for Data Mining

  • Model Validation is challenging due to the large amounts of data.
  • Good models should be:
    • Simple
    • Stable
    • Fast to learn and predict
    • Updateable
  • Naive Bayes and Linear Regression/Multiple Linear Regression are commonly used algorithms.
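
A minimal sketch of the "fast, updateable" criteria using scikit-learn's Naive Bayes (a library choice not made in the notes): partial_fit updates the model batch by batch without keeping all data in memory; the data and labels are synthetic.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
model = GaussianNB()
classes = np.array([0, 1])

# Train incrementally, one batch at a time, without keeping all data in memory
for _ in range(10):
    X_batch = rng.normal(size=(1000, 4))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)   # synthetic labels
    model.partial_fit(X_batch, y_batch, classes=classes)

X_new = rng.normal(size=(5, 4))
print(model.predict(X_new))
```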

Semi-Supervised Learning

  • Association Rule Learning is used when there is no known y, and the information is in the form of transactions.

Use-Case Scenarios

  • Market Basket Analysis
  • Fraud Detection
  • Identification of High-Risk Patients
  • Classification
  • Dimensionality Reduction
  • Spam Filters

Metrics of Rule Interestingness

  • Support
  • Confidence
  • Lift
  • Conviction
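
A minimal sketch computing the four metrics for a rule X → Y from made-up transactions, using the usual definitions: support = P(X ∪ Y), confidence = P(Y | X), lift = confidence / P(Y), conviction = (1 − P(Y)) / (1 − confidence).

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def rule_metrics(X, Y, transactions):
    n = len(transactions)
    count_x = sum(1 for t in transactions if X <= t)
    count_y = sum(1 for t in transactions if Y <= t)
    count_xy = sum(1 for t in transactions if (X | Y) <= t)

    support = count_xy / n
    confidence = count_xy / count_x
    lift = confidence / (count_y / n)
    conviction = (1 - count_y / n) / (1 - confidence) if confidence < 1 else float("inf")
    return support, confidence, lift, conviction

print(rule_metrics({"bread"}, {"milk"}, transactions))
```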

Frequent Itemset Generation

  • Algorithms: Brute Force, Apriori, FP-Growth, and ECLAT
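
A minimal sketch of the Apriori idea on made-up transactions: frequent itemsets are grown level by level, and a candidate is only counted if all of its subsets are already frequent.

```python
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    frequent = {s for s in items if support(s) >= min_support}
    all_frequent = set(frequent)
    k = 1
    while frequent:
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

transactions = [frozenset(t) for t in
                [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
print(apriori(transactions, min_support=0.6))
```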

Closed and Maximal Itemsets

  • Closed itemsets have no superset with the same support.
  • Maximal itemsets have no frequent superset.

Searching for Similar Items / Exact Matches

  • Hashing: map immutable objects to an integer.
  • Shingling: represent each item as 0s and 1s according to the presence or absence of a feature (e.g., a k-shingle).
  • Minhashing: apply random permutations to the rows of this 0/1 matrix; each item's signature is the first row, under each permutation, in which it has a 1.
  • Locality-Sensitive Hashing (LSH): hash the same item several times, so that similar items are more likely to be hashed to the same bucket.
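
A minimal sketch of shingling plus minhashing on two made-up documents: each of several random hash functions contributes the minimum hash over the shingle set, approximating a row permutation, and the fraction of matching signature positions estimates Jaccard similarity (the property LSH then exploits by hashing signature bands into buckets).

```python
import random

def shingles(text, k=3):
    """Reduce a document to the set of its k-character shingles."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=50, seed=0):
    """One min value per random hash function approximates a row permutation."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(s) ^ m for s in shingle_set) for m in masks]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "handling large amounts of data"
doc2 = "handling large volumes of data"
sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
print(estimated_jaccard(sig1, sig2))   # high value for similar documents
```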

Clustering - Unsupervised Learning

  • Grouping items into classes of similar objects.
  • Clustering is hard due to the distribution of data, algorithm used, and unknown number of clusters.
  • May be used for:
    • Gaining insight into the distribution of data
    • Preprocessing for classification
    • Outlier detection
  • Requirements:
    • Similarity metric
    • Distance metric

Clustering Algorithms

  • Hierarchical Clustering:
    • Can be Agglomerative or Divisive
    • Linkage (inter-cluster distance) criteria: Single, Complete, Average, Mean, and Ward
    • Advantages: can use any distance function, average and Ward are good compromises
    • Limitations: not scalable, requires a distance matrix
  • K-Means:
    • Define a number of clusters
    • Choose initial centroids
    • Associate objects with the closest centroid
    • Recalculate centroid locations
    • Repeat until convergence
    • Advantages: scalable, updateable
    • Limitations: sensitive to noise and outliers, not stable
  • DBSCAN:
    • Density-Based Methods
    • Clusters are dense regions separated by low-density regions
    • Find core objects with dense neighborhoods
    • Parameters: ε (neighborhood radius) and Min Pts (minimum number of samples)
    • Algorithm: compute core objects, then assign non-core objects to the closest core object
    • Sensitive to its parameters, and the best parameters are hard to determine
    • Advantages: resistant to noise, can handle clusters of different shapes and sizes
    • Limitations: can't handle varying densities
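
A minimal NumPy sketch of the K-Means loop described above (choose initial centroids, assign each object to the closest centroid, recompute centroids, repeat until convergence); the two-cluster data is synthetic.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(n_iter):
        # Associate each object with the closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recalculate centroid locations
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # convergence
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = k_means(X, k=2)
print(centroids)
```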

OPTICS (Refinement of DBSCAN)

  • Based on the idea that ε is related to MinPts
  • Uses Reachability Distance, max(core_distance(p), D(p, q)), to merge the two parameters into one
  • Algorithm:
    1. Compute reachability plot
    2. Go to the first object in the reachability plot, if it is a core, create a new cluster
    3. Repeat step 2 until all points of the reachability plot have been analyzed
  • Advantages:
    • Has just one parameter
  • Disadvantages:
    • Higher computation requirements
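
A minimal sketch using scikit-learn's OPTICS implementation (a library choice not made in the notes): min_samples is the main parameter, and the reachability values taken in processing order form the reachability plot.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.8, (50, 2))])      # clusters of different densities

clust = OPTICS(min_samples=5)                     # the single main parameter
clust.fit(X)

# The reachability plot: reachability of each point in processing order;
# "valleys" in this sequence correspond to clusters.
print(clust.reachability_[clust.ordering_][:10])
print(np.unique(clust.labels_))                   # -1 marks noise points
```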

Evaluating Clustering

  • Intrinsic methods (without ground truth)
    • Evaluate how well the clusters are separated and how compact they are
    • Elbow method: look at the k for which there is an elbow on the graph
    • Silhouette coefficient [-1, 1]: measures how similar an object is to its own cluster compared to other clusters
  • Extrinsic methods (with ground truth)
    • Compare the clusters obtained against verifiable ground-truth labels
    • Homogeneity: measures how pure the clusters are
    • Completeness: measures if the items of the same class are in the same cluster
    • Rag Bag: putting a heterogeneous or outlier item into a 'rag bag' (miscellaneous) cluster scores higher than putting it into an otherwise pure cluster
    • Small Cluster Preservation: splitting a large category into pieces is preferable to splitting a small category
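
A minimal sketch of both evaluation styles with scikit-learn: the silhouette coefficient needs no ground truth, while homogeneity and completeness compare the clustering against known labels; data and labels are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, homogeneity_score, completeness_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)            # synthetic ground truth

# Intrinsic: try several k and compare silhouette scores (elbow-style inspection)
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))

# Extrinsic: compare the k=2 clustering with the ground truth
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(homogeneity_score(y_true, labels), completeness_score(y_true, labels))
```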

Problems and Major Methodologies in High-Dimensional Clustering

  • Traditional distance metrics might be ineffective in many dimensions
  • Subspace clustering: search in smaller feature spaces
  • Biclustering
  • Dimensionality reduction

Recommender Systems

  • Key problems:
    • Collecting data from the utility matrix
    • Extrapolating unknown ratings from the known ones
    • Evaluating the extrapolation
  • Types of recommender systems:
    • Content-based
    • Collaborative Filtering
    • Latent factor-based

Content-Based Recommender Systems

  • Recommend items to a customer similar to previous items rated highly by the customer
  • Create an item profile: a set of features
  • Check for each user the items they like - user profile: the same set of features
  • See what items are similar to the items the user likes and recommend them

Collaborative Filtering

  • User-user: Recommend items to a user that are liked by users similar to them
  • Item-item: Recommend items similar to the ones a user likes
  • Finding similar items is hard due to many missing values
  • Solutions:
    • Center the data, subtracting the row average
    • Replace NaNs with 0
    • Use baseline estimation to deal with bias
    • Use dimensionality reduction or clustering to address the cold start problem
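
A minimal sketch of the user-user approach with the first two fixes above: centre each user's ratings by subtracting the row average, replace the remaining NaNs with 0, and compare users with cosine similarity; the utility matrix is made up.

```python
import numpy as np

# Hypothetical utility matrix: rows = users, columns = items, NaN = unrated
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 4.0, 1.0],
    [1.0, 2.0, np.nan, 5.0],
])

# Center each row by its mean over rated items, then replace NaNs with 0
row_means = np.nanmean(R, axis=1, keepdims=True)
C = np.where(np.isnan(R), 0.0, R - row_means)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Similarity of user 0 to the others: high for user 1, negative for user 2
print(cosine(C[0], C[1]), cosine(C[0], C[2]))
```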

Latent Factor-Based Recommender Systems

  • Factorize the items × users utility matrix into a matrix product R ≈ Q·Pᵀ and optimize the number of latent factors k
  • Latent factors are underlying features of the data
  • Apply SVD to guarantee the minimum reconstruction error and minimize RMSE
  • Problem: with many NaNs the error can only be minimized over the known (training) ratings, which can cause overfitting
  • Solution: regularization to penalize the prediction based on the size of the fitted elements

Gradient Descent for Latent Factors

  • Starts with a random solution (random P and Q)
  • Computes partial derivatives and updates the solution according to a learning rate η
  • Repeats until convergence
  • Stochastic Gradient Descent: each rating evaluated separately, many more evaluations, but each one is faster
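
A minimal sketch of stochastic gradient descent for latent factors with regularization: each known rating updates its row of P and Q using the error, a learning rate η, and a penalty λ on the size of the fitted elements; the utility matrix, k, and hyperparameters are made up.

```python
import numpy as np

# Hypothetical utility matrix; 0 marks a missing rating
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items = R.shape
k, eta, lam, epochs = 2, 0.01, 0.1, 2000      # latent factors, learning rate, regularization

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))  # start from a random solution
Q = rng.normal(scale=0.1, size=(n_items, k))

known = [(u, i) for u in range(n_users) for i in range(n_items) if R[u, i] > 0]

for _ in range(epochs):
    for u, i in known:                         # one rating at a time (stochastic)
        err = R[u, i] - P[u] @ Q[i]
        # Gradient step with L2 regularization on the fitted elements
        P[u] += eta * (err * Q[i] - lam * P[u])
        Q[i] += eta * (err * P[u] - lam * Q[i])

rmse = np.sqrt(np.mean([(R[u, i] - P[u] @ Q[i]) ** 2 for u, i in known]))
print(f"training RMSE: {rmse:.3f}")
print(np.round(P @ Q.T, 2))                    # reconstruction, including missing entries
```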

Evaluating Recommender Systems

  • Test with a test data set that contains a sample of rows and a sample of columns
  • Evaluate predictions on non-binary data using RMSE
  • Evaluate predictions on binary data using accuracy, recall, precision, F1, and ROC curve analysis
