Big data: Finding Similar Items

Podcast

Play an AI-generated podcast conversation about this lesson

Download our mobile app to listen on the go

Get App

Questions and Answers

Which of the following is the MOST significant challenge when finding similar items within a large dataset?

The time it takes to access the data.
The difficulty in defining what constitutes a 'similar' item.
The complexity of comparing all possible combinations of item sets. (correct)
The computational resources needed to store the item characteristics.

In the context of finding similar items, what is the primary purpose of collaborative filtering?

To classify web pages based on similar words.
To identify duplicate web pages.
To suggest items to customers based on the preferences of similar users. (correct)
To resolve entity differences across various applications.

Which of the following scenarios would Jaccard similarity be MOST applicable to?

Measuring the number of edits required to make two strings identical.
Determining the distance between two points on a map.
Analyzing the sentiment expressed in customer reviews.
Identifying the degree of overlap in the word usage on two web pages. (correct)

Why might character-by-character matching be insufficient for identifying similar documents?

It fails when documents have minor variations. (D)

Signup and view all the answers

For document similarity, a Jaccard similarity of 50% or above is typically sought. What does this indicate?

The documents are considered highly similar. (D)

Signup and view all the answers

What is the role of shingles in document similarity detection?

To convert documents into sets of items for comparison. (A)

Signup and view all the answers

Which of the following statements BEST describes the purpose of minhashing?

To reduce the size of shingle sets while preserving Jaccard similarity estimates. (C)

Signup and view all the answers

What is the function of locality-sensitive hashing (LSH) in finding similar documents?

To filter out pairs of documents unlikely to be similar, and reduce the number of comparisons. (B)

Signup and view all the answers

In the context of locality-sensitive hashing, why are hash functions used?

To categorize document sets into buckets. (C)

Signup and view all the answers

How does Jaccard distance relate to Jaccard similarity?

It is the complement of the Jaccard similarity (1 - Jaccard Similarity). (B)

Signup and view all the answers

When is cosine distance MOST appropriate as a distance measure?

When the direction of vectors is important. (B)

Signup and view all the answers

In what scenario is Edit Distance MOST useful?

Assessing how many edits are required to make 2 strings identical. (D)

Signup and view all the answers

What distinguishes Hamming distance from other distance measures?

It is mainly applicable to Boolean vectors. (B)

Signup and view all the answers

In textual analysis, what's the MAIN goal of Structured Information Extraction?

Generating structured, storable data from unstructured text. (C)

Signup and view all the answers

Which task does Sentiment Analysis aim to accomplish?

Determining the subjective opinion or attitude expressed in text. (B)

Signup and view all the answers

What is the primary application of Audio-Video analytics?

Extracting useful information from speech and video data. (B)

Signup and view all the answers

What is a key challenge in social media data analysis?

The noisy, unstructured and user-generated nature of the data. (B)

Signup and view all the answers

What is the main purpose of Predictive Analysis?

Forecasting future trends based on past and present data. (D)

Signup and view all the answers

Which of the following BEST describes Supervised Learning?

Learning with pre-classified or labeled data. (D)

Signup and view all the answers

How does Unsupervised Learning differ from Supervised Learning?

It generates its own set of classes and models from the data. (D)

Signup and view all the answers

Flashcards

Locality-sensitive hashing

A technique to find possible pairs of documents that should be checked for similarity.

Jaccard Similarity

A measure of the number of elements common to both sets divided by the total number of elements in both sets

Shingling

Converting documents into sets of small substrings to find similar texts.

Minhashing

Technique to reduce the size of sets of shingles while preserving Jaccard similarity.

Signup and view all the flashcards

Edit Distance

The minimum number of single-character edits required to change one string into the other.

Signup and view all the flashcards

Cosine Distance

Distance between vectors, determined by the angle between them.

Signup and view all the flashcards

Collaborative Filtering

Classifying data into groups by finding similarities between customer profiles to provide recommendations.

Signup and view all the flashcards

Euclidean Distance

Distance computed as the square root of the sum of the squared differences between corresponding points.

Signup and view all the flashcards

Supervised Learning

A type of machine learning where the algorithm learns from labeled data.

Signup and view all the flashcards

Unsupervised Learning

A type of machine learning where the algorithm identifies patterns without labeled data.

Signup and view all the flashcards

Big Data Analytics

A set of processes used to analyze big data and extract useful information.

Signup and view all the flashcards

Hamming Distance

The difference between corresponding components of two vectors, especially Boolean vectors.

Signup and view all the flashcards

Study Notes

Techniques to find useful information from sources of big data are presented
Finding similar item sets in large datasets, measures to calculate distances between data objects, and document similarities using shingles is covered
Techniques for big data analysis are introduced

Learning Objectives

Understand different techniques for identifying similar items
Learn how collaborative filtering works
Use shingling to find similar documents
Learn techniques to measure distances between 2 objects
Define supervised and unsupervised learning

Finding Similar Items

Locating similar items is straightforward with small document sets
Major problem is working at scale

Problem Statement

Task is to identify similar item sets within a large collection (millions/billions)
Goal is to do this without comparing all possible combinations
Use the principle that similar item sets share many common subsets

Applications of Finding Similar Items

Classify webpages using similar words to categorize content
Group similar customer purchases and feedback to provide purchase recommendations
Entity resolution: determining if a single person appears across multiple applications
Detect mirror websites or plagiarized content among millions of webpages

Challenges of Finding Similar Items

Directly comparing documents/webpages involves comparing character/word sequences
Computationally expensive as the number of pairs to compare grows quadratically
- 10 documents/web pages requires comparing 45 pairs
- n documents/webpages requires comparing n*(n-1)/2 pairs
- 1 million necessitates 5x10^11 comparisons, making it hard to find similar documents without efficient methods

Finding Similar Sets

Similarity is gauged among sets for finding textual similarity between documents
This enables collaborative filtering to identify customer groups

Jaccard Similarity

Defined as the ratio of the cardinality of the intersection of sets A and B to the cardinality of their union
- JSA,B = |A∩B| / |A∪B|
Value ranges from 0 to 1
Example:
- A={a, b, c, d, e} and B={c, d, e, f, g}
- Jaccard similarity is |{c, d, e}| / |{a, b, c, d, e, f, g}| = 3/7

Document Similarity

Finding almost identical documents, news items, webpages, or reports can be achieved using the Jaccard similarity
Also known as character-level similarity
Useful algorithm with different types of techniques
Can be resolved the identification of similar words and sometimes the meaing of those words
Character-by-character matching identifies two identical documents
Mostly similar scenarios
- Plagiarism may have small segments changed or reordered
- Mirror websites may differ due to local changes
- Course material versions vary slightly across websites
- News reports append additional info across different newspapers

Collaborative Filtering

Recommends items to e-commerce users based on past purchases and preferences by tracking their purchases and likes
Users are grouped with those who exhibit similar behaviors
Recommendations are products that other group members have purchased and liked

Set Similarity in Collaborative Filtering

Jaccard Similarity measures similarity between customers who purchase/like similar products
E-commerce sites have huge datasets
Customers who purchase the same books may also purchase other similar books
Document similarity typically looks for a Jaccard Similarity of 50% or higher
Collaborative filtering considers values as low as 10-25%
General collaborative filtering uses binary data easily translated into sets
Likert scales turn customer ratings into bags with multiple instances of each item
- Customer A rates product A at 2 and product B at 3: BagCust1 = {a, a, b, b, b}
- Customer B rates product A at 1 and product B at 4: BagCust2 = {a, b, b, b, b}
- JS A,B = |{a, b, b, b}| / |{a, a, b, b, b, b}| = 4/6 = 2/3
Set similarity is an importnat consideration to find the smilarity between two sets
Goal is to convert a document into small string subsets to compare similarity

Finding Similar Documents

Jaccard Similarity applies to sets of document similarity
Document similarity can be checked using the following process:
- Create sets of items using shingles
- Apply minhashing to create document signatures to test similarity, which reduces number of shingles for testing
- Perform locality sense hashing to find document pairs that should be tested for similarity rather than comparing all possible pairs

Shingles

Goal is to represent a document as a set of items for finding lexicographically similar documents
Break documents into substrings of 3-7 characters because many substrings match across common sentences despite slight variances
k-shingle definition
- Any substring of size k within the document
- Documents have many shingles, which may appear once or more
- Example: “bit-by-bit” with k=3 yields {“bit”, “it-”, “t-b”, “-by”, “by-”, “y-b”, “-bi”}

Implementation Issues for Shingles

White Space: A common solution is to condense multiple whitespaces down to a single instance to deal with impact of white spaces
Shingle Size: Small shingles (k=1) are too common and large shingles will not be able to distinguish similar items
- Most documents share almost all alphabets when k=1
- k=5-9 is an ideal size for small to large documents
Shingle Count: The number of distinct shingles in a language with a character set of size n and shingle size k is equal to nk, which is substantial
- Using 9-shingles on English documents, given the English alphabet and one space character
- Max shingles equals 27^9 9-shingles, and each shingle is 9 bytes long
Data Reduction
- Employing hash functions instead of shingles
- 9-byte substring maps to an integral number hash bucket
- Integer size of 4 bytes turns possible 27^9 9-shingles → 2^32, which is much smaller
- Use bucket # as shingle to cut the shingle size from 9 to 4 bytes
Word-Based Shingles
- Can be more effective at finding news articles

Minhashing

Goal is to reduce the set of shingles and still be able to accommodate the shingles in the memory of a computer, so as to find the similarity of the documents
Process involved converting the set of shingles to a document signature using a large number of permutations of rows of the matrix
Process
- Use a visual representation of shingles using a matrix
- Actual representation closer to the representation of a sparse matrix

Locality Sensitive Hashing

LSH uses hash functions to find possible pairs of documents for similarity checking
Signature matrices can grow very large because of millions of documents and hundreds of thousands of functions
Technique is as follows
- Divide: Set of the signature matrix into horizontal bands to reduce data sizes in each band
- Hash: Use a separate hash function to hash the column of a horizontal band into a hash bucket
  - If two sets are similar, there is a very high probability that at least one of their horizontal band will be hashed by some hash function to the same bucket
- Check: Perform the similarity checking on the pairs of sets collected in each bucket by noting that the probability of hashing dissimilar sets to the same bucket is low

Distance Measures

Distance expresses separation in space and helps discern similarity using algorithms like Jaccard
Space defined as points with distance(x, y) being the distance between points Relevant characteristics:
- Always positive. Zero only if x and y are the same points
- Symmetrical distance(x, y) is the same as distance(y, x,).
- Shortest distance distance(x,y) ≤ distance(x,t) + distance(t, y)

Euclidean Distance

With Euclidean space, a point is an n-dimensional vector with distance between points x(x1,x2, x3,...xn) and y(y1,y2, y3, ...yn)
Computing the distance of two n-dimensional points use this equation
- distance(x, y) = [{[(vi – xi)|}m]^1/m i=1
For m=1 this becomes, the Manhattan Distance
- manhattendistance(x, y) = [(yi – xi)| i=1
For m=2 then
- distance(x, y) = [(yi-xi)^2] i=1
These confirm the properties.

Jaccard Distance

Using set A and set B, then the formula is:
- JaccardDistance(A, B) = 1 – JSA,B = 1 − |𝐴 ∩ 𝐵| |𝐴 ∪ 𝐵|
Jaccard Similarity varies from 0 to 1, therefore, the value of Jaccard distance will be from 1 to 0. Value with 0 means value A ∩ B will be the same as A U B, if set A and set B identical
Symmetrical operations and thus Jaccard distance is also symmetrical

Cosine Distance

Spaces computed for directions of vector components
Cosine distance is the angle between those points
Use dot product and magnitude of the vectors to compute the cosine distance between two vectors
The range of cosine distance is evaluated between 0 and 180 degrees only

Edit Distance

Determines distance between strings
Steps
- Consider two strings - string A consisting of n characters a1, a2,...an and a string B of m characters b1, b2,...bm. Find a sub-sequence, which is the longest and has the same character sequence in the two strings
- Sub sequence is ls Compute the edit distance using the formula: editdistance(A,B) = n + m − 2 * ls

Hamming Distance

In digital design and error detection, Hamming distance is used most
Defined between two vectors, especially Boolean to express a vectors' differences

Other Big Data Techniques

Big data analytics produce useful information for organizational decision-making
Textual Analysis, Audio-Video Analysis, Social Media Data Analysis, Predictive Analysis
Supervised and Unsupervised Learning

Textual Analysis

Information extraction from unstructured textual data so it's stored for the long term and it's reprocessed
Meaningful summarization of single or multiple documents by extracting important portions, and using semantic abstraction-based summaries
Question-Answering to create automated query-answering systems with a processed question, keywords, semantics
Such analysis is used to determine the opinion or perception of feedback about a product or service by developing techniques to perform sentiment analysis at the sentence level or aspect level

Audio-Video Analysis

Extracts useful information from speech data or video data
Dealing with data is to transcribe speech data using automated speech recognition software
Video analysis performs checks for security breaches using closed-circuit television (CCTV) data

Supervised learning vs Unsupervised learning

Supervised learning makes it easier to generate models using already available knowledge; Spam detection
Unsupervised learning produces new model sets

Studying That Suits You

Use AI to generate personalized quizzes and flashcards to suit your learning preferences.

Big data: Finding Similar Items

Choose a study mode

Podcast

Questions and Answers

Which of the following is the MOST significant challenge when finding similar items within a large dataset?

In the context of finding similar items, what is the primary purpose of collaborative filtering?

Which of the following scenarios would Jaccard similarity be MOST applicable to?

Why might character-by-character matching be insufficient for identifying similar documents?

For document similarity, a Jaccard similarity of 50% or above is typically sought. What does this indicate?

What is the role of shingles in document similarity detection?

Which of the following statements BEST describes the purpose of minhashing?

What is the function of locality-sensitive hashing (LSH) in finding similar documents?

In the context of locality-sensitive hashing, why are hash functions used?

How does Jaccard distance relate to Jaccard similarity?

When is cosine distance MOST appropriate as a distance measure?

In what scenario is Edit Distance MOST useful?

What distinguishes Hamming distance from other distance measures?

In textual analysis, what's the MAIN goal of Structured Information Extraction?

Which task does Sentiment Analysis aim to accomplish?

What is the primary application of Audio-Video analytics?

What is a key challenge in social media data analysis?

What is the main purpose of Predictive Analysis?

Which of the following BEST describes Supervised Learning?

How does Unsupervised Learning differ from Supervised Learning?

Flashcards

Locality-sensitive hashing

Jaccard Similarity

Shingling

Minhashing

Edit Distance

Cosine Distance

Collaborative Filtering

Euclidean Distance

Supervised Learning

Unsupervised Learning

Big Data Analytics

Hamming Distance

Study Notes

Learning Objectives

Finding Similar Items

Problem Statement

Applications of Finding Similar Items

Challenges of Finding Similar Items

Finding Similar Sets

Jaccard Similarity

Document Similarity

Collaborative Filtering

Set Similarity in Collaborative Filtering

Finding Similar Documents

Shingles

Implementation Issues for Shingles

Minhashing

Locality Sensitive Hashing

Distance Measures

Euclidean Distance

Jaccard Distance

Cosine Distance

Edit Distance

Hamming Distance

Other Big Data Techniques

Textual Analysis

Audio-Video Analysis

Supervised learning vs Unsupervised learning

Studying That Suits You

Related Documents

More Like This

Collaborative Filtering and Recommendation Systems

Recommendation Systems Overview

Information Systems in Organizations Day 3: Collaborative Filtering Re...

Multi-Agent Systems: Recommendation Systems