From Text to Numbers PDF
Document Details
UNC Charlotte
2024
Dr. Zhao
Summary
This presentation covers text mining, specifically how to turn textual documents into numerical term counts and weights. It walks through per-document calculations such as Boolean and frequency counts, then introduces term frequency-inverse document frequency (TF-IDF).
Full Transcript
From Text to Numbers
Basic calculation per document:
- Boolean counting (0-1) of terms
- Frequency counting of terms
- Information theoretic counting of terms (logarithm of frequency counts)
Adjusting for document size and corpus size:
- Term weights: Tf-idf
- Others: entropy weights (Shannon information theory)
11/27/2024 Dr. Zhao, UNCC Fall 2024

Issues with Simple Frequency
Issues:
- Longer documents will tend to have higher term counts
- Terms that appear frequently across the corpus aren’t as important
Tf-idf:
- Normalize documents based on their length
- Penalize terms that occur frequently across the corpus

Tf-idf
Notations:
- d: document
- t: term
- N: number of documents in the corpus
- $tf_{t,d}$: the number of times that term t occurs in document d
- $df_t$: the number of documents that contain term t
Term-document matrix (documents as rows, term variables as columns):
              Term 1    Term 2    …    Term m
Document 1
…                       Matrix
Document n

Term Frequency TF
- Term frequency $tf_{t,d}$: the number of times that term t occurs in document d
- $TF_{t,d}$: the proportion of the count of term t in document d
- I: the number of distinct terms in document d

Inverse Document Frequency
- $df_t$: the number of documents in the corpus that contain term t
- Inverse Document Frequency (IDF): $\mathrm{IDF}_t = \log\left(\frac{N}{df_t}\right)$
- $0 < \mathrm{IDF}_t$
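
The quantities defined on these slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's own implementation. The toy corpus and all function names are hypothetical. The proportion is read as $TF_{t,d} = tf_{t,d} / \sum_{i=1}^{I} tf_{i,d}$, which matches the slide's definitions of $tf_{t,d}$ and I. The log count uses log(1 + tf), one common choice, since the slides only say "logarithm of frequency counts". The logarithm is natural because the slides give no base. The final weight is the usual product of TF and IDF, which the transcript does not reach before it cuts off.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the slides); tokens are split on whitespace.
corpus = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]

# One Counter per document: docs[d][t] is the raw count tf_{t,d}.
docs = [Counter(text.split()) for text in corpus]
N = len(docs)  # number of documents in the corpus


def boolean_count(tf):
    """Boolean (0-1) counting: 1 if the term occurs at all, else 0."""
    return 1 if tf > 0 else 0


def log_count(tf):
    """Log-scaled count. Assumption: log(1 + tf), so that tf = 0 maps to 0."""
    return math.log(1 + tf)


def term_proportion(term, doc):
    """TF_{t,d}: count of `term` divided by the total count of all terms in `doc`."""
    total = sum(doc.values())
    return doc[term] / total


def document_frequency(term):
    """df_t: number of documents that contain `term`."""
    return sum(1 for doc in docs if term in doc)


def idf(term):
    """IDF_t = log(N / df_t). Only defined for terms that occur in the corpus."""
    return math.log(N / document_frequency(term))


def tf_idf(term, doc):
    """Standard tf-idf weight: length-normalized TF times IDF."""
    return term_proportion(term, doc) * idf(term)


# Print the weight of every term in the first document.
for term in sorted(docs[0]):
    print(term, round(tf_idf(term, docs[0]), 4))
```

In this toy corpus, "the" has the highest raw count in the first document, but it also appears in two of the three documents, so its IDF is lower than that of "cat", which appears in only one. Dividing by the document's total term count is what keeps longer documents from dominating purely through size.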