TF-IDF
Motivation

We have text, i.e., a sequence of characters. Many algorithms expect a vector from ℝᵈ as input.
➜ We need to convert text to vectors.

Text = bytes = vector? ASCII: A=65, B=66, C=67, D=68, E=69, …
Example = [69, 120, 97, 109, 112, 108, 101]?

This does not work well at all! Algorithms assume that for every dimension i: xᵢ ≈ yᵢ if x and y are similar. But documents can be similar, yet differ in almost every byte position:

H e   s a i d :   T h i s   i s   a n   e x a m p l e .
S h e   s a i d :   T h i s   i s   a n   e x a m p l e .

Compared position by position, the second sentence is shifted by one character, so the two strings disagree in almost every byte.

➜ We want a vector space with meaningful positions.
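A minimal sketch (not from the slides) that makes this concrete: it compares the two example sentences position by position and counts how many byte positions agree; the variable names are illustrative only.

```python
# Compare two similar sentences byte position by byte position.
a = "He said: This is an example."
b = "She said: This is an example."

# Pad the shorter string so both have equal length, then count matching positions.
n = max(len(a), len(b))
a, b = a.ljust(n), b.ljust(n)
matches = sum(ca == cb for ca, cb in zip(a, b))
print(f"{matches} of {n} positions agree")  # almost none: the extra 'S' shifts everything
```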
Bag of Words – Vectorizing Text

Separate the documents into words, discard word order:

Birds of a feather flock together.
→ a bird feather flock of together

It is the early bird that gets the worm.
→ bird early get is it that the ×2 worm

But the second mouse gets the cheese.
→ but cheese get mouse second the ×2

Early to bed and early to rise, makes a man healthy, wealthy and wise.
→ a and ×2 bed early ×2 healthy make man rise to ×2 wealthy wise

For better results, normalize words: birds → bird, gets → get. Discard punctuation such as “.” and “,”.

Document-Term Matrix

Collecting the word counts yields one dimension per distinct term (26 dimensions here) and one count vector per document:

Term       Doc 1  Doc 2  Doc 3  Doc 4
a            1      0      0      1
and          0      0      0      2
bed          0      0      0      1
bird         1      1      0      0
but          0      0      1      0
cheese       0      0      1      0
early        0      1      0      2
feather      1      0      0      0
flock        1      0      0      0
get          0      1      1      0
healthy      0      0      0      1
is           0      1      0      0
it           0      1      0      0
make         0      0      0      1
man          0      0      0      1
mouse        0      0      1      0
of           1      0      0      0
rise         0      0      0      1
second       0      0      1      0
that         0      1      0      0
the          0      2      2      0
to           0      0      0      2
together     1      0      0      0
wealthy      0      0      0      1
wise         0      0      0      1
worm         0      1      0      0

Note: for a large document collection, we will have thousands of dimensions! Similar documents should now have similar vectors.
➜ How can we measure similarity?

Term Frequency, Document Frequency, Collection Frequency

Three different frequencies are commonly used:
▶ term frequency tf(t, d): the number of occurrences of word t in document d
▶ document frequency df(t): the number of documents which contain word t at least once
▶ collection frequency cf(t) (rare): the number of occurrences of word t in all documents (total)

These frequencies are usually not normalized to probabilities, but kept as integer counts.

TF-IDF: Term Frequency × Inverse Document Frequency

Words such as a, and, but, is, it, of, that, the, to are not helpful for differentiating documents. Remove them (→ stopword removal), or assign them a low weight. TF-IDF (term frequency × inverse document frequency) is a popular solution to this.¹

The inverse document frequency (IDF) is defined as

idf(t) = log(N / df(t)),  where N is the number of documents.

This weight looks similar to Shannon information, but the theoretical justification is difficult [Robe04; RoSp76; Spär72; Spär73; SpWaRo00a; SpWaRo00b].

¹ TF-IDF is also written as “tf.idf” and “tf×idf”. The “-” is not a minus, but a hyphen; we always use multiplication.

TF-IDF Variations

There exist many variations of TF-IDF, usually written in SMART notation [SaBu88; Salt71; SiSaBu95]; see [MaRaSc08] for the standard codes. Each variant combines a term frequency component, a document frequency component, and a document normalization:
▶ term frequency: n (natural), l (logarithm), a (augmented), b (boolean), L (log mean), d (double log), as well as BM25, three-valued, log1p, and sqrt variants
▶ document frequency: n (no), t (idf), p (prob idf), 1 (idf smooth), and the BM25 weighting
▶ normalization: n (none), c (cosine), u (pivoted unique), b (byte size), and Manhattan normalization

Apache Lucene and Xapian use the BM25 variant [RoZa09] as default nowadays.

TF and IDF Variations Visualized
[Figure: plots of the term frequency and inverse document frequency variants.]
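To make these steps concrete, here is a small Python sketch (not part of the original slides) that tokenizes the four example sentences, builds the sparse document-term counts, and computes idf(t) = log(N / df(t)). The tiny normalize table only imitates the birds → bird, gets → get normalization from the slide and is not a real stemmer.

```python
import math
import re
from collections import Counter

docs = [
    "Birds of a feather flock together.",
    "It is the early bird that gets the worm.",
    "But the second mouse gets the cheese.",
    "Early to bed and early to rise, makes a man healthy, wealthy and wise.",
]

# Crude word normalization, as on the slide: birds -> bird, gets -> get, makes -> make.
normalize = {"birds": "bird", "gets": "get", "makes": "make"}

def tokenize(text):
    words = re.findall(r"[a-z]+", text.lower())   # split into words, discard punctuation
    return [normalize.get(w, w) for w in words]

# Term frequencies tf(t, d) per document: the document-term matrix in sparse form.
tf = [Counter(tokenize(d)) for d in docs]

# Document frequency df(t): in how many documents does t occur at least once?
df = Counter(t for counts in tf for t in counts)

# Inverse document frequency idf(t) = log(N / df(t)).
N = len(docs)
idf = {t: math.log(N / df[t]) for t in df}

# TF-IDF weight of one term in one document, e.g. "early" in document 4:
print(tf[3]["early"] * idf["early"])   # 2 * log(4/2) ≈ 1.386
```

For real use, libraries such as scikit-learn's TfidfVectorizer implement this pipeline (by default with a smoothed idf and cosine/l2 normalization).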
TF-IDF Document Vectors

Document vectors are derived from the document-term matrix by using the term frequency, inverse document frequency, and normalization functions:
1. each term frequency is transformed using the selected term frequency function,
2. each transformed value is weighted with the document frequency term,
3. the entire resulting vector is scaled with the normalization factor.

Examples: For the SMART nnn combination, the document-term matrix remains unchanged. For the SMART ltc combination (supposedly the standard choice), we compute w_t = (1 + log tf(t, d)) · log(N / df(t)) for each term t, then scale this vector to unit length by multiplying it with 1/√(Σ_t w_t²).
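As a follow-up sketch (again not from the slides), this is how the ltc weighting of a single document vector might look in Python; the toy tf and df numbers are a subset of the fourth example document and are for illustration only.

```python
import math

# SMART "ltc": logarithmic term frequency, idf weighting, cosine normalization.
# Toy inputs: raw term frequencies of one document and document frequencies
# over a collection of N = 4 documents (a subset of the slide example).
N = 4
tf = {"early": 2, "to": 2, "bed": 1, "rise": 1, "wise": 1}
df = {"early": 2, "to": 1, "bed": 1, "rise": 1, "wise": 1}

# Steps 1 and 2:  l: 1 + log(tf),  t: multiply with idf = log(N / df)
w = {t: (1 + math.log(f)) * math.log(N / df[t]) for t, f in tf.items()}

# Step 3:  c: cosine normalization, i.e., scale the vector to unit Euclidean length
length = math.sqrt(sum(x * x for x in w.values()))
ltc = {t: x / length for t, x in w.items()}

print(ltc)
print(sum(x * x for x in ltc.values()))  # ~1.0, i.e., a unit-length vector
```

Because of the cosine normalization, the dot product of two such vectors directly yields their cosine similarity.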