Questions and Answers
Which of the following is the MOST significant challenge when finding similar items within a large dataset?
- The time it takes to access the data.
- The difficulty in defining what constitutes a 'similar' item.
- The complexity of comparing all possible combinations of item sets. (correct)
- The computational resources needed to store the item characteristics.
In the context of finding similar items, what is the primary purpose of collaborative filtering?
- To classify web pages based on similar words.
- To identify duplicate web pages.
- To suggest items to customers based on the preferences of similar users. (correct)
- To resolve entity differences across various applications.
Which of the following scenarios would Jaccard similarity be MOST applicable to?
- Measuring the number of edits required to make two strings identical.
- Determining the distance between two points on a map.
- Analyzing the sentiment expressed in customer reviews.
- Identifying the degree of overlap in the word usage on two web pages. (correct)
Why might character-by-character matching be insufficient for identifying similar documents?
For document similarity, a Jaccard similarity of 50% or above is typically sought. What does this indicate?
What is the role of shingles in document similarity detection?
Which of the following statements BEST describes the purpose of minhashing?
What is the function of locality-sensitive hashing (LSH) in finding similar documents?
In the context of locality-sensitive hashing, why are hash functions used?
How does Jaccard distance relate to Jaccard similarity?
When is cosine distance MOST appropriate as a distance measure?
In what scenario is Edit Distance MOST useful?
What distinguishes Hamming distance from other distance measures?
In textual analysis, what's the MAIN goal of Structured Information Extraction?
Which task does Sentiment Analysis aim to accomplish?
What is the primary application of Audio-Video analytics?
What is a key challenge in social media data analysis?
What is the main purpose of Predictive Analysis?
Which of the following BEST describes Supervised Learning?
How does Unsupervised Learning differ from Supervised Learning?
Flashcards
Locality-sensitive hashing
A technique to find possible pairs of documents that should be checked for similarity.
Jaccard Similarity
A measure of the number of elements common to both sets divided by the number of elements in the union of the two sets.
Shingling
Converting documents into sets of small substrings to find similar texts.
Minhashing
A technique that converts a document's set of shingles into a much shorter signature (based on permutations of the rows of the shingle matrix) while preserving similarity.
Edit Distance
The distance between two strings, computed as n + m − 2 · ls, where ls is the length of their longest common subsequence.
Cosine Distance
The angle (between 0 and 180 degrees) between two vectors, computed from their dot product and magnitudes.
Collaborative Filtering
Recommending items to users based on the purchases and preferences of other users with similar behavior.
Euclidean Distance
The distance between two n-dimensional points: the square root of the sum of the squared coordinate differences.
Supervised Learning
Learning that generates models from already available, labeled knowledge (e.g., spam detection).
Unsupervised Learning
Learning that produces new models or groupings from data without labeled examples.
Big Data Analytics
Techniques that produce useful information from big data sources to support organizational decision-making.
Hamming Distance
The number of positions in which two vectors, especially Boolean vectors, differ.
Study Notes
- Techniques to find useful information from sources of big data are presented
- Finding similar item sets in large datasets, measures to calculate distances between data objects, and document similarity using shingles are covered
- Techniques for big data analysis are introduced
Learning Objectives
- Understand different techniques for identifying similar items
- Learn how collaborative filtering works
- Use shingling to find similar documents
- Learn techniques to measure distances between 2 objects
- Define supervised and unsupervised learning
Finding Similar Items
- Locating similar items is straightforward with small document sets
- Major problem is working at scale
Problem Statement
- Task is to identify similar item sets within a large collection (millions/billions)
- Goal is to do this without comparing all possible combinations
- Use the principle that similar item sets share many common subsets
Applications of Finding Similar Items
- Classify webpages using similar words to categorize content
- Group similar customer purchases and feedback to provide purchase recommendations
- Entity resolution: determining if a single person appears across multiple applications
- Detect mirror websites or plagiarized content among millions of webpages
Challenges of Finding Similar Items
- Directly comparing documents/webpages involves comparing character/word sequences
- Computationally expensive as the number of pairs to compare grows quadratically
- 10 documents/web pages requires comparing 45 pairs
- n documents/webpages requires comparing n*(n-1)/2 pairs
- 1 million necessitates 5x10^11 comparisons, making it hard to find similar documents without efficient methods
Finding Similar Sets
- Similarity is gauged among sets for finding textual similarity between documents
- This enables collaborative filtering to identify customer groups
Jaccard Similarity
- Defined as the ratio of the cardinality of the intersection of sets A and B to the cardinality of their union
- JS(A, B) = |A ∩ B| / |A ∪ B|
- Value ranges from 0 to 1
- Example:
- A={a, b, c, d, e} and B={c, d, e, f, g}
- Jaccard similarity is |{c, d, e}| / |{a, b, c, d, e, f, g}| = 3/7
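A minimal Python sketch of this set-based calculation (the function name is illustrative):

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

A = {"a", "b", "c", "d", "e"}
B = {"c", "d", "e", "f", "g"}
print(jaccard_similarity(A, B))  # 3/7 ≈ 0.4286
```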
Document Similarity
- Finding almost identical documents, news items, webpages, or reports can be achieved using the Jaccard similarity
- Also known as character-level similarity
- A useful approach that can be combined with different types of techniques
- Can also involve identifying similar words and, sometimes, the meaning of those words
- Character-by-character matching only identifies documents that are exactly identical
- Most practical scenarios involve documents that are mostly, but not exactly, similar:
- Plagiarism may have small segments changed or reordered
- Mirror websites may differ due to local changes
- Course material versions vary slightly across websites
- News reports append additional info across different newspapers
Collaborative Filtering
- Recommends items to e-commerce users by tracking their past purchases, preferences, and likes
- Users are grouped with those who exhibit similar behaviors
- Recommendations are products that other group members have purchased and liked
Set Similarity in Collaborative Filtering
- Jaccard Similarity measures similarity between customers who purchase/like similar products
- E-commerce sites have huge datasets
- Customers who purchase the same books may also purchase other similar books
- Document similarity typically looks for a Jaccard Similarity of 50% or higher
- Collaborative filtering considers values as low as 10-25%
- General collaborative filtering uses binary data easily translated into sets
- Likert scales turn customer ratings into bags (multisets) with multiple instances of each item (a small sketch of this bag-based calculation follows this list)
- Customer A rates product A at 2 and product B at 3: BagCust1 = {a, a, b, b, b}
- Customer B rates product A at 1 and product B at 4: BagCust2 = {a, b, b, b, b}
- JS(BagCust1, BagCust2) = |{a, b, b, b}| / |{a, a, b, b, b, b}| = 4/6 = 2/3
- Set similarity is an important consideration for finding the similarity between two sets
- Goal is to convert a document into small string subsets to compare similarity
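One common way to extend Jaccard similarity to bags uses minimum counts for the intersection and maximum counts for the union; a small sketch consistent with the 4/6 example above (names are illustrative):

```python
from collections import Counter

def bag_jaccard(x: Counter, y: Counter) -> float:
    """Bag (multiset) Jaccard: sum of minimum counts over sum of maximum counts."""
    keys = set(x) | set(y)
    intersection = sum(min(x[k], y[k]) for k in keys)
    union = sum(max(x[k], y[k]) for k in keys)
    return intersection / union if union else 0.0

# Customer ratings turned into bags, as in the example above.
cust1 = Counter({"a": 2, "b": 3})  # product a rated 2, product b rated 3
cust2 = Counter({"a": 1, "b": 4})  # product a rated 1, product b rated 4
print(bag_jaccard(cust1, cust2))   # 4/6 ≈ 0.667
```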
Finding Similar Documents
- Jaccard Similarity applies to sets of document similarity
- Document similarity can be checked using the following process:
- Create sets of items using shingles
- Apply minhashing to create document signatures, which reduces the number of shingles to test for similarity
- Perform locality-sensitive hashing to find document pairs that should be tested for similarity, rather than comparing all possible pairs
Shingles
- Goal is to represent a document as a set of items for finding lexicographically similar documents
- Break documents into substrings of 3-7 characters because many substrings match across common sentences despite slight variances
- k-shingle definition
- Any substring of size k within the document
- Documents have many shingles, which may appear once or more
- Example: “bit-by-bit” with k=3 yields {“bit”, “it-”, “t-b”, “-by”, “by-”, “y-b”, “-bi”}
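A short sketch of extracting k-shingles, reproducing the example above:

```python
def k_shingles(text: str, k: int = 3) -> set:
    """Return the set of all k-character substrings (k-shingles) of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(k_shingles("bit-by-bit", 3))
# {'bit', 'it-', 't-b', '-by', 'by-', 'y-b', '-bi'}
```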
Implementation Issues for Shingles
- White Space: a common solution is to condense runs of whitespace into a single space so that spacing differences do not affect the shingles
- Shingle Size: small shingles (e.g., k=1) are too common, while overly large shingles fail to recognize similar items
- Most documents share almost all characters of the alphabet when k=1
- k=5-9 is an ideal size range for small to large documents
- Shingle Count: the number of distinct shingles in a language with a character set of size n and shingle size k is n^k, which is substantial
- Using 9-shingles on English documents, given the English alphabet and one space character
- The maximum number of shingles is 27^9, and each shingle is 9 bytes long
- Data Reduction (see the hashing sketch after this list)
- Employ hash functions instead of storing the shingles themselves
- Each 9-byte substring maps to an integer hash bucket
- Using a 4-byte integer reduces the space of possible values from 27^9 9-shingles to 2^32, which is much smaller
- Using the bucket number as the shingle cuts the shingle size from 9 bytes to 4 bytes
- Word-Based Shingles
- Can be more effective at finding news articles
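A hedged sketch of the data-reduction idea: each 9-character shingle is hashed to a 4-byte integer bucket. zlib's CRC32 is used here only as one convenient 32-bit hash; any well-behaved hash function would serve the same purpose.

```python
import zlib

def hashed_shingles(text: str, k: int = 9) -> set:
    """Map each k-character shingle to a 4-byte integer bucket with a hash function."""
    return {zlib.crc32(text[i:i + k].encode("utf-8"))
            for i in range(len(text) - k + 1)}
```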
Minhashing
- Goal is to reduce the set of shingles so that it fits in a computer's memory while still allowing document similarity to be estimated
- The process converts the set of shingles into a document signature using a large number of permutations of the rows of the shingle matrix
- Process
- Represent the shingles of all documents as a matrix (conceptually, one row per shingle and one column per document)
- The actual representation is closer to that of a sparse matrix
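A minimal sketch of minhashing, assuming the usual shortcut of approximating row permutations with independent hash functions rather than permuting an explicit matrix:

```python
import random

def minhash_signature(shingles: set, num_hashes: int = 100, seed: int = 1) -> list:
    """Build a minhash signature: for each of num_hashes hash functions,
    keep the minimum hash value over all shingles in the set."""
    assert shingles, "the document must have at least one shingle"
    p = (1 << 61) - 1  # a large prime for the hash functions h(x) = (a*x + b) mod p
    rng = random.Random(seed)
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    rows = [hash(s) for s in shingles]  # integer row ids, consistent within one run
    return [min((a * r + b) % p for r in rows) for a, b in params]

# The fraction of positions where two signatures agree estimates the
# Jaccard similarity of the underlying shingle sets.
```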
Locality Sensitive Hashing
- LSH uses hash functions to find possible pairs of documents for similarity checking
- Signature matrices can grow very large because of millions of documents and hundreds of thousands of functions
- Technique is as follows
- Divide: split the signature matrix into horizontal bands to reduce the amount of data handled in each band
- Hash: use a separate hash function to hash each column of a horizontal band into a hash bucket
- If two sets are similar, there is a very high probability that at least one of their horizontal bands will be hashed by some hash function to the same bucket
- Check: perform similarity checking on the pairs of sets collected in each bucket, noting that the probability of hashing dissimilar sets to the same bucket is low
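A simplified sketch of the banding technique, assuming every document already has a minhash signature of length bands × rows; the band's tuple of values stands in for a separate per-band hash function:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures: dict, bands: int, rows: int) -> set:
    """Band the signature matrix: documents whose values in a band land in the
    same bucket become candidate pairs; only these pairs are checked in full."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])  # bucket key for this band
            buckets[key].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates
```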
Distance Measures
- Distance expresses separation in space and helps discern similarity using algorithms like Jaccard
- Space defined as points with distance(x, y) being the distance between points
Relevant characteristics:
- Always positive; zero only if x and y are the same point
- Symmetrical: distance(x, y) is the same as distance(y, x)
- Triangle inequality (shortest path): distance(x, y) ≤ distance(x, t) + distance(t, y)
Euclidean Distance
- In Euclidean space, a point is an n-dimensional vector, and the distance is defined between points x = (x1, x2, x3, ..., xn) and y = (y1, y2, y3, ..., yn)
- The distance between two n-dimensional points is computed with this equation (the general Lm form):
- distance(x, y) = (Σ_{i=1..n} |yi − xi|^m)^(1/m)
- For m=1 this becomes the Manhattan distance:
- manhattandistance(x, y) = Σ_{i=1..n} |yi − xi|
- For m=2 this becomes the Euclidean distance:
- distance(x, y) = (Σ_{i=1..n} (yi − xi)^2)^(1/2)
- These distances satisfy the properties above.
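A small sketch of the general formula, covering both the Manhattan (m=1) and Euclidean (m=2) cases:

```python
def minkowski_distance(x, y, m: int = 2) -> float:
    """L_m distance: (sum over i of |y_i - x_i|^m) ** (1/m).
    m=1 gives the Manhattan distance, m=2 the Euclidean distance."""
    return sum(abs(yi - xi) ** m for xi, yi in zip(x, y)) ** (1 / m)

print(minkowski_distance((0, 0), (3, 4), m=2))  # 5.0  (Euclidean)
print(minkowski_distance((0, 0), (3, 4), m=1))  # 7.0  (Manhattan)
```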
Jaccard Distance
- Using set A and set B, then the formula is:
- JaccardDistance(A, B) = 1 − JS(A, B) = 1 − |A ∩ B| / |A ∪ B|
- Since Jaccard similarity ranges from 0 to 1, Jaccard distance ranges from 1 down to 0; a distance of 0 means A ∩ B is the same as A ∪ B, i.e., sets A and B are identical
- Set intersection and union are symmetrical operations, so Jaccard distance is also symmetrical
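A one-function sketch of Jaccard distance on sets, using the same example sets as before:

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance = 1 - |A ∩ B| / |A ∪ B|."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

print(jaccard_distance({"a", "b", "c", "d", "e"}, {"c", "d", "e", "f", "g"}))
# 1 - 3/7 ≈ 0.571
```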
Cosine Distance
- Used in spaces where points are vectors and the direction of the vector components matters
- Cosine distance is the angle between the two vectors
- Use the dot product and the magnitudes of the vectors to compute the cosine distance between them
- Cosine distance ranges between 0 and 180 degrees
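A sketch that computes cosine distance as an angle in degrees, as described above:

```python
import math

def cosine_distance_degrees(x, y) -> float:
    """Angle between two non-zero vectors, in degrees (0 to 180)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    magnitude = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    cosine = max(-1.0, min(1.0, dot / magnitude))  # clamp for floating-point safety
    return math.degrees(math.acos(cosine))

print(cosine_distance_degrees((1, 0), (0, 1)))  # 90.0
```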
Edit Distance
- Determines distance between strings
- Steps
- Consider two strings: string A consisting of n characters a1, a2, ..., an and string B of m characters b1, b2, ..., bm. Find the longest subsequence that appears, in the same order, in both strings
- If ls is the length of this longest common subsequence, the edit distance is: editdistance(A, B) = n + m − 2 · ls
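A sketch of this definition, computing the longest common subsequence with standard dynamic programming:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of strings a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

def edit_distance(a: str, b: str) -> int:
    """Edit distance as defined in the notes: n + m - 2 * ls."""
    return len(a) + len(b) - 2 * lcs_length(a, b)

print(edit_distance("abcde", "acdeg"))  # ls = 4 ("acde"), so 5 + 5 - 8 = 2
```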
Hamming Distance
- Hamming distance is used most often in digital design and error detection
- It is defined between two vectors, especially Boolean vectors, as the number of positions in which they differ
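A minimal sketch of Hamming distance between two equal-length vectors:

```python
def hamming_distance(x, y) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```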
Other Big Data Techniques
- Big data analytics produce useful information for organizational decision-making
- Textual Analysis, Audio-Video Analysis, Social Media Data Analysis, Predictive Analysis
- Supervised and Unsupervised Learning
Textual Analysis
- Extraction of structured information from unstructured textual data so that it can be stored long term and reprocessed later
- Meaningful summarization of single or multiple documents by extracting important portions, and using semantic abstraction-based summaries
- Question-Answering to create automated query-answering systems with a processed question, keywords, semantics
- Sentiment analysis determines the opinion or perception expressed in feedback about a product or service, using techniques that work at the sentence level or the aspect level
Audio-Video Analysis
- Extracts useful information from speech data or video data
- A common approach is to transcribe speech data using automated speech recognition software
- Video analysis performs checks for security breaches using closed-circuit television (CCTV) data
Supervised learning vs Unsupervised learning
- Supervised learning makes it easier to generate models using already available labeled knowledge (e.g., spam detection)
- Unsupervised learning produces new models by finding structure in data without labeled examples