Big data: Finding Similar Items

Choose a study mode

Play Quiz
Study Flashcards
Spaced Repetition
Chat to Lesson

Podcast

Play an AI-generated podcast conversation about this lesson
Download our mobile app to listen on the go
Get App

Questions and Answers

Which of the following is the MOST significant challenge when finding similar items within a large dataset?

  • The time it takes to access the data.
  • The difficulty in defining what constitutes a 'similar' item.
  • The complexity of comparing all possible combinations of item sets. (correct)
  • The computational resources needed to store the item characteristics.

In the context of finding similar items, what is the primary purpose of collaborative filtering?

  • To classify web pages based on similar words.
  • To identify duplicate web pages.
  • To suggest items to customers based on the preferences of similar users. (correct)
  • To resolve entity differences across various applications.

Which of the following scenarios would Jaccard similarity be MOST applicable to?

  • Measuring the number of edits required to make two strings identical.
  • Determining the distance between two points on a map.
  • Analyzing the sentiment expressed in customer reviews.
  • Identifying the degree of overlap in the word usage on two web pages. (correct)

Why might character-by-character matching be insufficient for identifying similar documents?

<p>It fails when documents have minor variations. (D)</p> Signup and view all the answers

For document similarity, a Jaccard similarity of 50% or above is typically sought. What does this indicate?

<p>The documents are considered highly similar. (D)</p> Signup and view all the answers

What is the role of shingles in document similarity detection?

<p>To convert documents into sets of items for comparison. (A)</p> Signup and view all the answers

Which of the following statements BEST describes the purpose of minhashing?

<p>To reduce the size of shingle sets while preserving Jaccard similarity estimates. (C)</p> Signup and view all the answers

What is the function of locality-sensitive hashing (LSH) in finding similar documents?

<p>To filter out pairs of documents unlikely to be similar, and reduce the number of comparisons. (B)</p> Signup and view all the answers

In the context of locality-sensitive hashing, why are hash functions used?

<p>To categorize document sets into buckets. (C)</p> Signup and view all the answers

How does Jaccard distance relate to Jaccard similarity?

<p>It is the complement of the Jaccard similarity (1 - Jaccard Similarity). (B)</p> Signup and view all the answers

When is cosine distance MOST appropriate as a distance measure?

<p>When the direction of vectors is important. (B)</p> Signup and view all the answers

In what scenario is Edit Distance MOST useful?

<p>Assessing how many edits are required to make 2 strings identical. (D)</p> Signup and view all the answers

What distinguishes Hamming distance from other distance measures?

<p>It is mainly applicable to Boolean vectors. (B)</p> Signup and view all the answers

In textual analysis, what's the MAIN goal of Structured Information Extraction?

<p>Generating structured, storable data from unstructured text. (C)</p> Signup and view all the answers

Which task does Sentiment Analysis aim to accomplish?

<p>Determining the subjective opinion or attitude expressed in text. (B)</p> Signup and view all the answers

What is the primary application of Audio-Video analytics?

<p>Extracting useful information from speech and video data. (B)</p> Signup and view all the answers

What is a key challenge in social media data analysis?

<p>The noisy, unstructured and user-generated nature of the data. (B)</p> Signup and view all the answers

What is the main purpose of Predictive Analysis?

<p>Forecasting future trends based on past and present data. (D)</p> Signup and view all the answers

Which of the following BEST describes Supervised Learning?

<p>Learning with pre-classified or labeled data. (D)</p> Signup and view all the answers

How does Unsupervised Learning differ from Supervised Learning?

<p>It generates its own set of classes and models from the data. (D)</p> Signup and view all the answers

Flashcards

Locality-sensitive hashing

A technique to find possible pairs of documents that should be checked for similarity.

Jaccard Similarity

A measure of the number of elements common to both sets divided by the total number of elements in both sets

Shingling

Converting documents into sets of small substrings to find similar texts.

Minhashing

Technique to reduce the size of sets of shingles while preserving Jaccard similarity.

Signup and view all the flashcards

Edit Distance

The minimum number of single-character edits required to change one string into the other.

Signup and view all the flashcards

Cosine Distance

Distance between vectors, determined by the angle between them.

Signup and view all the flashcards

Collaborative Filtering

Classifying data into groups by finding similarities between customer profiles to provide recommendations.

Signup and view all the flashcards

Euclidean Distance

Distance computed as the square root of the sum of the squared differences between corresponding points.

Signup and view all the flashcards

Supervised Learning

A type of machine learning where the algorithm learns from labeled data.

Signup and view all the flashcards

Unsupervised Learning

A type of machine learning where the algorithm identifies patterns without labeled data.

Signup and view all the flashcards

Big Data Analytics

A set of processes used to analyze big data and extract useful information.

Signup and view all the flashcards

Hamming Distance

The difference between corresponding components of two vectors, especially Boolean vectors.

Signup and view all the flashcards

Study Notes

  • Techniques to find useful information from sources of big data are presented
  • Finding similar item sets in large datasets, measures to calculate distances between data objects, and document similarities using shingles is covered
  • Techniques for big data analysis are introduced

Learning Objectives

  • Understand different techniques for identifying similar items
  • Learn how collaborative filtering works
  • Use shingling to find similar documents
  • Learn techniques to measure distances between 2 objects
  • Define supervised and unsupervised learning

Finding Similar Items

  • Locating similar items is straightforward with small document sets
  • Major problem is working at scale

Problem Statement

  • Task is to identify similar item sets within a large collection (millions/billions)
  • Goal is to do this without comparing all possible combinations
  • Use the principle that similar item sets share many common subsets

Applications of Finding Similar Items

  • Classify webpages using similar words to categorize content
  • Group similar customer purchases and feedback to provide purchase recommendations
  • Entity resolution: determining if a single person appears across multiple applications
  • Detect mirror websites or plagiarized content among millions of webpages

Challenges of Finding Similar Items

  • Directly comparing documents/webpages involves comparing character/word sequences
  • Computationally expensive as the number of pairs to compare grows quadratically
    • 10 documents/web pages requires comparing 45 pairs
    • n documents/webpages requires comparing n*(n-1)/2 pairs
    • 1 million necessitates 5x10^11 comparisons, making it hard to find similar documents without efficient methods

Finding Similar Sets

  • Similarity is gauged among sets for finding textual similarity between documents
  • This enables collaborative filtering to identify customer groups

Jaccard Similarity

  • Defined as the ratio of the cardinality of the intersection of sets A and B to the cardinality of their union
    • JSA,B = |A∩B| / |A∪B|
  • Value ranges from 0 to 1
  • Example:
    • A={a, b, c, d, e} and B={c, d, e, f, g}
    • Jaccard similarity is |{c, d, e}| / |{a, b, c, d, e, f, g}| = 3/7

Document Similarity

  • Finding almost identical documents, news items, webpages, or reports can be achieved using the Jaccard similarity
  • Also known as character-level similarity
  • Useful algorithm with different types of techniques
  • Can be resolved the identification of similar words and sometimes the meaing of those words
  • Character-by-character matching identifies two identical documents
  • Mostly similar scenarios
    • Plagiarism may have small segments changed or reordered
    • Mirror websites may differ due to local changes
    • Course material versions vary slightly across websites
    • News reports append additional info across different newspapers

Collaborative Filtering

  • Recommends items to e-commerce users based on past purchases and preferences by tracking their purchases and likes
  • Users are grouped with those who exhibit similar behaviors
  • Recommendations are products that other group members have purchased and liked

Set Similarity in Collaborative Filtering

  • Jaccard Similarity measures similarity between customers who purchase/like similar products
  • E-commerce sites have huge datasets
  • Customers who purchase the same books may also purchase other similar books
  • Document similarity typically looks for a Jaccard Similarity of 50% or higher
  • Collaborative filtering considers values as low as 10-25%
  • General collaborative filtering uses binary data easily translated into sets
  • Likert scales turn customer ratings into bags with multiple instances of each item
    • Customer A rates product A at 2 and product B at 3: BagCust1 = {a, a, b, b, b}
    • Customer B rates product A at 1 and product B at 4: BagCust2 = {a, b, b, b, b}
    • JS A,B = |{a, b, b, b}| / |{a, a, b, b, b, b}| = 4/6 = 2/3
  • Set similarity is an importnat consideration to find the smilarity between two sets
  • Goal is to convert a document into small string subsets to compare similarity

Finding Similar Documents

  • Jaccard Similarity applies to sets of document similarity
  • Document similarity can be checked using the following process:
    • Create sets of items using shingles
    • Apply minhashing to create document signatures to test similarity, which reduces number of shingles for testing
    • Perform locality sense hashing to find document pairs that should be tested for similarity rather than comparing all possible pairs

Shingles

  • Goal is to represent a document as a set of items for finding lexicographically similar documents
  • Break documents into substrings of 3-7 characters because many substrings match across common sentences despite slight variances
  • k-shingle definition
    • Any substring of size k within the document
    • Documents have many shingles, which may appear once or more
    • Example: “bit-by-bit” with k=3 yields {“bit”, “it-”, “t-b”, “-by”, “by-”, “y-b”, “-bi”}

Implementation Issues for Shingles

  • White Space: A common solution is to condense multiple whitespaces down to a single instance to deal with impact of white spaces
  • Shingle Size: Small shingles (k=1) are too common and large shingles will not be able to distinguish similar items
    • Most documents share almost all alphabets when k=1
    • k=5-9 is an ideal size for small to large documents
  • Shingle Count: The number of distinct shingles in a language with a character set of size n and shingle size k is equal to nk, which is substantial
    • Using 9-shingles on English documents, given the English alphabet and one space character
    • Max shingles equals 27^9 9-shingles, and each shingle is 9 bytes long
  • Data Reduction
    • Employing hash functions instead of shingles
    • 9-byte substring maps to an integral number hash bucket
    • Integer size of 4 bytes turns possible 27^9 9-shingles → 2^32, which is much smaller
    • Use bucket # as shingle to cut the shingle size from 9 to 4 bytes
  • Word-Based Shingles
    • Can be more effective at finding news articles

Minhashing

  • Goal is to reduce the set of shingles and still be able to accommodate the shingles in the memory of a computer, so as to find the similarity of the documents
  • Process involved converting the set of shingles to a document signature using a large number of permutations of rows of the matrix
  • Process
    • Use a visual representation of shingles using a matrix
    • Actual representation closer to the representation of a sparse matrix

Locality Sensitive Hashing

  • LSH uses hash functions to find possible pairs of documents for similarity checking
  • Signature matrices can grow very large because of millions of documents and hundreds of thousands of functions
  • Technique is as follows
    • Divide: Set of the signature matrix into horizontal bands to reduce data sizes in each band
    • Hash: Use a separate hash function to hash the column of a horizontal band into a hash bucket
      • If two sets are similar, there is a very high probability that at least one of their horizontal band will be hashed by some hash function to the same bucket
    • Check: Perform the similarity checking on the pairs of sets collected in each bucket by noting that the probability of hashing dissimilar sets to the same bucket is low

Distance Measures

  • Distance expresses separation in space and helps discern similarity using algorithms like Jaccard
  • Space defined as points with distance(x, y) being the distance between points Relevant characteristics:
    • Always positive. Zero only if x and y are the same points
    • Symmetrical distance(x, y) is the same as distance(y, x,).
    • Shortest distance distance(x,y) ≤ distance(x,t) + distance(t, y)

Euclidean Distance

  • With Euclidean space, a point is an n-dimensional vector with distance between points x(x1,x2, x3,...xn) and y(y1,y2, y3, ...yn)
  • Computing the distance of two n-dimensional points use this equation
    • distance(x, y) = [{[(vi – xi)|}m]^1/m i=1
  • For m=1 this becomes, the Manhattan Distance
    • manhattendistance(x, y) = [(yi – xi)| i=1
  • For m=2 then
    • distance(x, y) = [(yi-xi)^2] i=1
  • These confirm the properties.

Jaccard Distance

  • Using set A and set B, then the formula is:
    • JaccardDistance(A, B) = 1 – JSA,B = 1 − |𝐴 ∩ 𝐵| |𝐴 ∪ 𝐵|
  • Jaccard Similarity varies from 0 to 1, therefore, the value of Jaccard distance will be from 1 to 0. Value with 0 means value A ∩ B will be the same as A U B, if set A and set B identical
  • Symmetrical operations and thus Jaccard distance is also symmetrical

Cosine Distance

  • Spaces computed for directions of vector components
  • Cosine distance is the angle between those points
  • Use dot product and magnitude of the vectors to compute the cosine distance between two vectors
  • The range of cosine distance is evaluated between 0 and 180 degrees only

Edit Distance

  • Determines distance between strings
  • Steps
    • Consider two strings - string A consisting of n characters a1, a2,...an and a string B of m characters b1, b2,...bm. Find a sub-sequence, which is the longest and has the same character sequence in the two strings
    • Sub sequence is ls Compute the edit distance using the formula: editdistance(A,B) = n + m − 2 * ls

Hamming Distance

  • In digital design and error detection, Hamming distance is used most
  • Defined between two vectors, especially Boolean to express a vectors' differences

Other Big Data Techniques

  • Big data analytics produce useful information for organizational decision-making
  • Textual Analysis, Audio-Video Analysis, Social Media Data Analysis, Predictive Analysis
  • Supervised and Unsupervised Learning

Textual Analysis

  • Information extraction from unstructured textual data so it's stored for the long term and it's reprocessed
  • Meaningful summarization of single or multiple documents by extracting important portions, and using semantic abstraction-based summaries
  • Question-Answering to create automated query-answering systems with a processed question, keywords, semantics
  • Such analysis is used to determine the opinion or perception of feedback about a product or service by developing techniques to perform sentiment analysis at the sentence level or aspect level

Audio-Video Analysis

  • Extracts useful information from speech data or video data
  • Dealing with data is to transcribe speech data using automated speech recognition software
  • Video analysis performs checks for security breaches using closed-circuit television (CCTV) data

Supervised learning vs Unsupervised learning

  • Supervised learning makes it easier to generate models using already available knowledge; Spam detection
  • Unsupervised learning produces new model sets

Studying That Suits You

Use AI to generate personalized quizzes and flashcards to suit your learning preferences.

Quiz Team

Related Documents

More Like This

Recommender Systems Quiz
40 questions

Recommender Systems Quiz

TopQualityBlessing7037 avatar
TopQualityBlessing7037
Collaborative Filtering Techniques
38 questions

Collaborative Filtering Techniques

SpectacularFeministArt2813 avatar
SpectacularFeministArt2813
Use Quizgecko on...
Browser
Browser