Machine Learning: Handling Large Data

Questions and Answers

What is a k-itemset in the context of frequent itemset generation?

A k-itemset is a subset of size k of the set of all items.

Which algorithms are used for frequent itemset generation? Select all that apply.

Apriori

In Apriori, all subsets of a frequent itemset must also be frequent.

True

In FP-Growth, the frequent itemsets are stored in a FP-Tree, eliminating the need to generate __________.

candidates

Match the clustering algorithms with their descriptions:

Hierarchical Clustering = Can be Agglomerative or Divisive
K-Means = Requires defining the number of clusters and centroid positions
DBSCAN = Density-based method suitable for non-spherical clusters
OPTICS = Refinement of DBSCAN based on Reachability Distance

What is one of the main challenges addressed when handling large amounts of data?

Summarizing data to reduce its size

What are characteristics of algorithms designed for handling lots of data? (Select all that apply)

Must be parallelizable

Bonferroni Correction is used to address the issue of too many hypotheses with large datasets.

True

Principal Component Analysis (PCA) aims to transform data into a lower dimensional space while preserving most of the __________ of the data.

variance

Match the following metrics with their descriptions:

Support = Measure of how frequently two items appear together
Confidence = Probability that one item appears given the presence of another item
Lift = Indicates the strength of association between two items
Conviction = Probability that one item appears without the presence of another item

What is the purpose of collaborative filtering?

Recommend items to users based on previous user interactions.

Which method of collaborative filtering involves recommending items to a user x that are liked by users similar to x?

User-user

How can the issue of many missing values in collaborative filtering be addressed?

All of the above

Latent factors in recommendation systems are underlying features of the data.

True

Regularization penalizes the prediction based on the size of the fitted elements to compensate for ______________.

overfitting

Match the following graph mining terms with their descriptions:

Graph Partitioning = Aims to maximize within-group connections and minimize between-group connections
Spectral Graph Partitioning = Utilizes matricial representations and eigenvalues to find optimal partitioning
Overlapping Communities = Modeling communities based on nodes and memberships

Study Notes

Handling Lots of Data

  • Most models perform well given enough data, so the choice of model can be guided by the amount of data available.
  • To manage large amounts of data, it's essential to:
    • Summarize data to reduce its amount
    • Use algorithms that can handle large data efficiently
  • Algorithms for handling large data should:
    • Have low complexity
    • Be partitionable and parallelizable
    • Not require all data at once (e.g., MapReduce)

Too Many Hypotheses

  • With large amounts of data, noise can be amplified, and patterns may arise where they don't exist.
  • Bonferroni Correction is used to address this issue.
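
A minimal sketch of applying the Bonferroni Correction, assuming a significance level of 0.05 and a handful of made-up p-values: each individual test is held to alpha divided by the number of hypotheses.

```python
# Minimal sketch of the Bonferroni Correction: with m hypotheses tested at
# overall significance level alpha, each individual test is held to alpha / m.
alpha = 0.05
p_values = [0.001, 0.012, 0.049, 0.20]  # hypothetical p-values, one per hypothesis
m = len(p_values)

corrected_threshold = alpha / m
for i, p in enumerate(p_values):
    verdict = "reject H0" if p < corrected_threshold else "fail to reject H0"
    print(f"test {i}: p={p:.3f}, threshold={corrected_threshold:.4f} -> {verdict}")
```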

Summarization

  • Operations that take a long time, such as:
    • Counting
    • Summing
    • Finding maxima
  • These operations require going through all the data, making them challenging.
  • Solutions involve:
    • Reducing the number of points
    • Understanding distributions
    • Clustering
    • Plotting data (but can be hard with many features)

One-Pass Strategy

  • Adapt algorithms to require only a single pass through the data.
  • Example: Pearson's correlation coefficient can be computed in a single pass using running sums (see the sketch below).
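
A minimal sketch of such a one-pass computation, assuming the data arrives as a stream of (x, y) pairs: the correlation is derived from running sums accumulated in a single pass.

```python
import math

def one_pass_pearson(pairs):
    """Pearson correlation computed from running sums in a single pass."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:          # pairs can be any iterable/stream of (x, y)
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    return num / den

print(one_pass_pearson([(1, 2), (2, 4.1), (3, 5.9), (4, 8.2)]))  # close to 1
```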

Fastest Possible Algorithms

  • Opt for algorithms with low time complexity (O(N^p), where p is small).

Sampling

  • Sampling is useful when:
    • More data doesn't improve the model
    • Computational resources are limited
  • Sampling should be:
    • Unbiased
    • Representative of the whole population
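
A minimal sketch of drawing an unbiased, representative sample with NumPy; the dataset and sample size are made up, and uniform sampling without replacement is just one possible scheme.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = rng.normal(size=(1_000_000, 5))      # hypothetical large dataset

sample_size = 10_000
idx = rng.choice(len(data), size=sample_size, replace=False)  # uniform, unbiased
sample = data[idx]

# Quick sanity check that the sample is representative of the population
print(data.mean(axis=0))
print(sample.mean(axis=0))
```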

Sparse Matrices

  • Many columns with very few filled values.
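
A minimal sketch of storing such a matrix in compressed sparse row (CSR) form with SciPy, so only the few filled values are kept in memory; the matrix shown is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1000, 1000))
dense[0, 3] = 1.0
dense[42, 999] = 5.0                  # only a handful of filled values

sparse = csr_matrix(dense)            # stores only the non-zero entries
print(sparse.nnz)                     # 2 non-zero values
print(sparse.data, sparse.indices)    # values and their column indices
```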

Handling Lots of Columns

  • Dimensionality Reduction is essential to:
    • Find data representations that fit in fewer variables
    • Alternative to feature selection, which can be infeasible for data mining

PCA (Principal Component Analysis)

  • Assume the combination of features is linear.
  • Identify a set of projection vectors (eigenvectors) to transform variables into a lower-dimensional space while preserving most of the variance.
  • Steps:
    1. Compute the covariance matrix
    2. Compute eigenvectors
    3. Select the principal components
  • Kernel PCA is used when the combination of features is not linear.
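
A minimal sketch of the three PCA steps with NumPy (covariance matrix, eigenvectors, selection of principal components); X is a made-up data matrix with samples in rows and features in columns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # hypothetical data: 200 samples, 5 features

# 1. Center the data and compute the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Compute eigenvalues/eigenvectors (eigh: covariance matrices are symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Select the principal components (largest eigenvalues = most variance)
order = np.argsort(eigvals)[::-1]
k = 2
components = eigvecs[:, order[:k]]

X_reduced = Xc @ components                   # project into the lower-dimensional space
explained = eigvals[order[:k]].sum() / eigvals.sum()
print(X_reduced.shape, f"variance preserved: {explained:.2%}")
```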

SVD (Singular Value Decomposition)

  • Decompose the data matrix into three matrices that can reconstruct the original data.
  • SVD is efficient for sparse matrices and high-dimensional data when the number of factors can be selected beforehand.
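
A minimal sketch of a truncated SVD with NumPy, assuming the number of factors k is chosen beforehand: the three matrices are kept only up to rank k and used to reconstruct the data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))                 # hypothetical data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5                                          # number of factors chosen beforehand
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k reconstruction

error = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"relative reconstruction error with k={k}: {error:.3f}")
```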

CUR

  • Produces sparser matrices that are more economical to store and process.
  • A compromise relative to SVD: SVD produces better results but is less efficient.

Map Reduce

  • Divide the task into two phases:
    1. Map: generate a sequence of key-value pairs from the input data
    2. Reduce: merge all values in the list associated with the same key
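
A minimal single-machine sketch of the two phases using a word count, the classic MapReduce example; in a real system the key-value pairs would be partitioned across many workers.

```python
from collections import defaultdict

documents = ["the cat sat", "the cat ran", "a dog ran"]   # hypothetical input

# Map phase: each document produces a sequence of (key, value) pairs
def map_phase(doc):
    for word in doc.split():
        yield (word, 1)

# Shuffle: group values by key
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

# Reduce phase: merge all values associated with the same key
def reduce_phase(key, values):
    return key, sum(values)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)   # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 2, 'a': 1, 'dog': 1}
```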

Handling Texts and Words

  • TF.IDF (Term Frequency-Inverse Document Frequency) measures the importance of a word in a document relative to a collection of documents.
  • TF.IDF = TF * IDF, where TF is the term frequency and IDF is the inverse document frequency.
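
A minimal sketch of TF.IDF computed by hand on made-up documents, assuming TF is the raw count of the term in the document and IDF = log(N / number of documents containing the term); other weighting variants exist.

```python
import math
from collections import Counter

docs = [
    "machine learning with large data",
    "large data needs fast algorithms",
    "learning algorithms for data mining",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = Counter(doc_tokens)[term]                  # term frequency in this document
    df = sum(1 for d in tokenized if term in d)     # documents containing the term
    idf = math.log(N / df) if df else 0.0           # inverse document frequency
    return tf * idf

print(tf_idf("data", tokenized[0]))       # appears everywhere -> score 0
print(tf_idf("machine", tokenized[0]))    # rare term -> higher score
```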

Supervised Learning for Data Mining

  • Model Validation is challenging due to the large amounts of data.
  • Good models should be:
    • Simple
    • Stable
    • Fast to learn and predict
    • Updateable
  • Naive Bayes and Linear Regression/Multiple Linear Regression are commonly used algorithms.
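
A minimal sketch of the "fast, updateable" criteria using scikit-learn's Naive Bayes (a library choice not made in the notes): partial_fit updates the model batch by batch without keeping all data in memory; the data and labels are synthetic.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
model = GaussianNB()
classes = np.array([0, 1])

# Train incrementally, one batch at a time, without keeping all data in memory
for _ in range(10):
    X_batch = rng.normal(size=(1000, 4))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)   # synthetic labels
    model.partial_fit(X_batch, y_batch, classes=classes)

X_new = rng.normal(size=(5, 4))
print(model.predict(X_new))
```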

Semi-Supervised Learning

  • Association Rule Learning is used when there is no known y, and the information is in the form of transactions.

Use-Case Scenarios

  • Market Basket Analysis
  • Fraud Detection
  • Identification of High-Risk Patients
  • Classification
  • Dimensionality Reduction
  • Spam Filters

Metrics of Rule Interestingness

  • Support
  • Confidence
  • Lift
  • Conviction
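
A minimal sketch computing the four metrics for a rule X → Y from made-up transactions, using the usual definitions: support = P(X ∪ Y), confidence = P(Y | X), lift = confidence / P(Y), conviction = (1 − P(Y)) / (1 − confidence).

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def rule_metrics(X, Y, transactions):
    n = len(transactions)
    count_x = sum(1 for t in transactions if X <= t)
    count_y = sum(1 for t in transactions if Y <= t)
    count_xy = sum(1 for t in transactions if (X | Y) <= t)

    support = count_xy / n
    confidence = count_xy / count_x
    lift = confidence / (count_y / n)
    conviction = (1 - count_y / n) / (1 - confidence) if confidence < 1 else float("inf")
    return support, confidence, lift, conviction

print(rule_metrics({"bread"}, {"milk"}, transactions))
```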

Frequent Itemset Generation

  • Algorithms: Brute Force, Apriori, FP-Growth, and ECLAT
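
A minimal sketch of the Apriori idea on made-up transactions: frequent itemsets are grown level by level, and a candidate is only counted if all of its subsets are already frequent.

```python
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    frequent = {s for s in items if support(s) >= min_support}
    all_frequent = set(frequent)
    k = 1
    while frequent:
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

transactions = [frozenset(t) for t in
                [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
print(apriori(transactions, min_support=0.6))
```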

Closed and Maximal Itemsets

  • Closed itemsets have no superset with the same support.
  • Maximal itemsets have no frequent superset.

Searching for Similar Items / Exact Matches

  • Hashing: map immutable objects to an integer.
  • Shingling: represent each item as 0s and 1s according to the presence or absence of a feature (e.g., a k-shingle).
  • Minhashing: apply random permutations to the rows of this 0/1 matrix; each item's signature is the first row, under each permutation, in which it has a 1.
  • Locality-Sensitive Hashing (LSH): hash the same item several times, so that similar items are more likely to be hashed to the same bucket.
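
A minimal sketch of shingling plus minhashing on two made-up documents: each of several random hash functions contributes the minimum hash over the shingle set, approximating a row permutation, and the fraction of matching signature positions estimates Jaccard similarity (the property LSH then exploits by hashing signature bands into buckets).

```python
import random

def shingles(text, k=3):
    """Reduce a document to the set of its k-character shingles."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=50, seed=0):
    """One min value per random hash function approximates a row permutation."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(s) ^ m for s in shingle_set) for m in masks]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "handling large amounts of data"
doc2 = "handling large volumes of data"
sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
print(estimated_jaccard(sig1, sig2))   # high value for similar documents
```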

Clustering - Unsupervised Learning

  • Grouping items into classes of similar objects.
  • Clustering is hard due to the distribution of data, algorithm used, and unknown number of clusters.
  • May be used for:
    • Gaining insight into the distribution of data
    • Preprocessing for classification
    • Outlier detection
  • Requirements:
    • Similarity metric
    • Distance metric

Clustering Algorithms

  • Hierarchical Clustering:
    • Can be Agglomerative or Divisive
    • Linkage (inter-cluster distance) criteria: Single, Complete, Average, Mean, and Ward
    • Advantages: can use any distance function, average and Ward are good compromises
    • Limitations: not scalable, requires a distance matrix
  • K-Means:
    • Define a number of clusters
    • Choose initial centroids
    • Associate objects with the closest centroid
    • Recalculate centroid locations
    • Repeat until convergence
    • Advantages: scalable, updateable
    • Limitations: sensitive to noise and outliers, not stable
  • DBSCAN:
    • Density-Based Methods
    • Clusters are dense regions separated by low-density regions
    • Find core objects with dense neighborhoods
    • Parameters: ε (neighborhood radius) and Min Pts (minimum number of samples)
    • Algorithm: compute core objects, then assign non-core objects to the closest core object
    • Sensitive to its parameters, and the best parameters are hard to determine
    • Advantages: resistant to noise, can handle clusters of different shapes and sizes
    • Limitations: can't handle varying densities
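
A minimal NumPy sketch of the K-Means loop described above (choose initial centroids, assign each object to the closest centroid, recompute centroids, repeat until convergence); the two-cluster data is synthetic.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(n_iter):
        # Associate each object with the closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recalculate centroid locations
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # convergence
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = k_means(X, k=2)
print(centroids)
```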

OPTICS (Refinement of DBSCAN)

  • Based on the idea that ε is related to MinPts
  • Uses Reachability Distance, max(core_distance(p), D(p, q)), to merge the two parameters into one
  • Algorithm:
    1. Compute reachability plot
    2. Go to the first object in the reachability plot, if it is a core, create a new cluster
    3. Repeat step 2 until all points of the reachability plot have been analyzed
  • Advantages:
    • Has just one parameter
  • Disadvantages:
    • Higher computation requirements
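
A minimal sketch using scikit-learn's OPTICS implementation (a library choice not made in the notes): min_samples is the main parameter, and the reachability values taken in processing order form the reachability plot.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.8, (50, 2))])      # clusters of different densities

clust = OPTICS(min_samples=5)                     # the single main parameter
clust.fit(X)

# The reachability plot: reachability of each point in processing order;
# "valleys" in this sequence correspond to clusters.
print(clust.reachability_[clust.ordering_][:10])
print(np.unique(clust.labels_))                   # -1 marks noise points
```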

Evaluating Clustering

  • Intrinsic methods (without ground truth)
    • Evaluate how well the clusters are separated and how compact they are
    • Elbow method: look at the k for which there is an elbow on the graph
    • Silhouette coefficient [-1, 1]: measures how similar an object is to its own cluster compared to other clusters
  • Extrinsic methods (with ground truth)
    • Compare the clusters obtained against verifiable ground-truth labels
    • Homogeneity: measures how pure the clusters are
    • Completeness: measures if the items of the same class are in the same cluster
    • Rag Bag: putting a heterogeneous or outlier item into a 'rag bag' (miscellaneous) cluster scores higher than putting it into an otherwise pure cluster
    • Small Cluster Preservation: splitting a large category into pieces is preferable to splitting a small category
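
A minimal sketch of both evaluation styles with scikit-learn: the silhouette coefficient needs no ground truth, while homogeneity and completeness compare the clustering against known labels; data and labels are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, homogeneity_score, completeness_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)            # synthetic ground truth

# Intrinsic: try several k and compare silhouette scores (elbow-style inspection)
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))

# Extrinsic: compare the k=2 clustering with the ground truth
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(homogeneity_score(y_true, labels), completeness_score(y_true, labels))
```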

Problems and Major Methodologies in High-Dimensional Clustering

  • Traditional distance metrics might be ineffective in many dimensions
  • Subspace clustering: search in smaller feature spaces
  • Biclustering
  • Dimensionality reduction

Recommender Systems

  • Key problems:
    • Collecting data from the utility matrix
    • Extrapolating unknown ratings from the known ones
    • Evaluating the extrapolation
  • Types of recommender systems:
    • Content-based
    • Collaborative Filtering
    • Latent factor-based

Content-Based Recommender Systems

  • Recommend items to a customer similar to previous items rated highly by the customer
  • Create an item profile: a set of features
  • Check for each user the items they like - user profile: the same set of features
  • See what items are similar to the items the user likes and recommend them

Collaborative Filtering

  • User-user: Recommend items to a user that are liked by users similar to them
  • Item-item: Recommend items similar to the ones a user likes
  • Finding similar items is hard due to many missing values
  • Solutions:
    • Center the data, subtracting the row average
    • Replace NaNs with 0
    • Use baseline estimation to deal with bias
    • Use dimensionality reduction or clustering to address the cold start problem
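
A minimal sketch of the user-user approach with the first two fixes above: centre each user's ratings by subtracting the row average, replace the remaining NaNs with 0, and compare users with cosine similarity; the utility matrix is made up.

```python
import numpy as np

# Hypothetical utility matrix: rows = users, columns = items, NaN = unrated
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 4.0, 1.0],
    [1.0, 2.0, np.nan, 5.0],
])

# Center each row by its mean over rated items, then replace NaNs with 0
row_means = np.nanmean(R, axis=1, keepdims=True)
C = np.where(np.isnan(R), 0.0, R - row_means)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Similarity of user 0 to the others: high for user 1, negative for user 2
print(cosine(C[0], C[1]), cosine(C[0], C[2]))
```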

Latent Factor-Based Recommender Systems

  • Factorize the items × users utility matrix into a matrix product R ≈ Q·Pᵀ and optimize the number of latent factors k
  • Latent factors are underlying features of the data
  • Apply SVD to guarantee the minimum reconstruction error and minimize RMSE
  • Problem: with many NaNs the error can only be minimized over the known (training) ratings, which can cause overfitting
  • Solution: regularization to penalize the prediction based on the size of the fitted elements

Gradient Descent for Latent Factors

  • Starts with a random solution (random P and Q)
  • Computes partial derivatives and updates the solution according to a learning rate η
  • Repeats until convergence
  • Stochastic Gradient Descent: each rating evaluated separately, many more evaluations, but each one is faster
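
A minimal sketch of stochastic gradient descent for latent factors with regularization: each known rating updates its row of P and Q using the error, a learning rate η, and a penalty λ on the size of the fitted elements; the utility matrix, k, and hyperparameters are made up.

```python
import numpy as np

# Hypothetical utility matrix; 0 marks a missing rating
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items = R.shape
k, eta, lam, epochs = 2, 0.01, 0.1, 2000      # latent factors, learning rate, regularization

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))  # start from a random solution
Q = rng.normal(scale=0.1, size=(n_items, k))

known = [(u, i) for u in range(n_users) for i in range(n_items) if R[u, i] > 0]

for _ in range(epochs):
    for u, i in known:                         # one rating at a time (stochastic)
        err = R[u, i] - P[u] @ Q[i]
        # Gradient step with L2 regularization on the fitted elements
        P[u] += eta * (err * Q[i] - lam * P[u])
        Q[i] += eta * (err * P[u] - lam * Q[i])

rmse = np.sqrt(np.mean([(R[u, i] - P[u] @ Q[i]) ** 2 for u, i in known]))
print(f"training RMSE: {rmse:.3f}")
print(np.round(P @ Q.T, 2))                    # reconstruction, including missing entries
```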

Evaluating Recommender Systems

  • Test with a test data set that contains a sample of rows and a sample of columns
  • Evaluate predictions on non-binary data using RMSE
  • Evaluate predictions on binary data using accuracy, recall, precision, F1, and ROC curve analysis
