Questions and Answers
Which of the following steps is NOT a major component in inverted index construction?
- Indexing the documents by term occurrence.
- Performing sentiment analysis on tokens. (correct)
- Tokenizing the text.
- Collecting documents to be indexed.
What is the primary function of tokenization in text processing?
- Converting documents into a byte sequence.
- Determining the language of a document.
- Building equivalence classes of tokens.
- Splitting a character stream into tokens. (correct)
What aspect of text is addressed by linguistic preprocessing?
- Determining the vocabulary of terms.
- Converting byte sequences into character sequences.
- Building equivalence classes of tokens. (correct)
- Implementing postings lists.
What is the initial stage in converting digital documents for indexing?
Why is decoding character sequences more complex than simply reading ASCII text?
What is a common method for handling decoding document formats and character encodings?
In the context of document indexing, what does it mean to 'determine the document format'?
How do modern Unicode representation concepts affect the order of characters in files?
What is primarily implied by the issue of 'indexing granularity'?
What is a potential drawback of using very small document units for indexing?
How can large document units negatively impact information retrieval?
What is a 'token' in the context of text processing?
What is the purpose of token normalization?
Which of the following is a language-specific issue in tokenization?
What is one approach to handling hyphenated words in tokenization?
What potential issue arises from splitting tokens on whitespace?
What is the primary challenge in word segmentation for languages like Chinese?
What is the purpose of using a compound splitter module for languages like German?
What is a 'stop word' in the context of information retrieval?
What is a potential downside of using stop words?
What is the main goal of normalization in information retrieval?
What is 'case-folding' and how can it improve search results?
What is 'truecasing'?
What is the purpose of 'stemming' in text processing?
How does 'lemmatization' differ from 'stemming'?
What is the Porter algorithm primarily designed for?
What is a potential drawback of aggressive normalization techniques like stemming?
What is a 'biword index', and what is its purpose?
What is the primary advantage of using a positional index?
Flashcards
Tokenization
Process of chopping character streams into tokens
Linguistic preprocessing
Deals with building equivalence classes of tokens
Token (Type/Token distinction)
An instance of a sequence of characters in some particular document
Type (Type/Token distinction)
The class of all tokens containing the same character sequence
Term
A normalized type that is included in the IR system's dictionary
Stop words
Extremely common words excluded from the vocabulary because they have little value in selecting documents
Token Normalization
Canonicalizing tokens so that matches occur despite superficial differences in the character sequences
Case-folding
Reducing all letters to lower case
Lemmatization
Using vocabulary and morphological analysis to reduce a word to its dictionary form (lemma)
Skip Pointers
Shortcut pointers in a postings list that allow skipping over postings during list intersection
Phrase Queries
Queries that require documents to contain an exact sequence of words, such as "stanford university"
Biword Indexes
Indexes that treat every pair of consecutive words in a document as a vocabulary term, to support phrase queries
Positional indexes
Indexes that store, for each term, the positions at which it occurs in each document, supporting phrase and proximity queries
Combination schemes
Hybrid approaches that index popular phrases as biwords and fall back on a positional index for other phrase queries
Study Notes
- The chapter covers how to define a document's basic unit, determine its character sequence, and examines tokenization and linguistic preprocessing
- Tokenization and linguistic preprocessing determine the vocabulary of terms used
- Indexing itself is covered in Chapters 1 and 4
- The chapter also considers the implementation of postings lists, an extended postings list data structure for faster querying, and the handling of phrase and proximity queries in extended Boolean models and on the web
Document Delineation and Character Sequence Decoding:
- Digital documents for indexing are typically bytes in a file or on a web server
- The first step converts this byte sequence into a linear sequence of characters
- Plain English text in ASCII encoding makes this conversion trivial
- Character sequences can be encoded by various single-byte or multibyte schemes like Unicode UTF-8
- Determining correct encoding can be regarded as a machine learning classification problem, often handled by heuristics, user selection, or provided document metadata
- After encoding determination, the byte sequence is decoded to a character sequence
- Choice of encoding may be saved as evidence of the document's language
- The characters may need decoding from binary representations like Microsoft Word DOC files or compressed formats such as zip files
- Determining the document format requires an appropriate decoder
- Additional decoding may be needed for plain text documents
- XML documents require decoding of character entities, such as &amp;amp; being decoded to the correct character &amp;
- The textual part of a document may need extraction from other non-processed material
- Commercial products support a broad range of document types/encodings and are often solved by licensing a software library that handles decoding
- Some writing systems, like Arabic, have two-dimensional and mixed-order characteristics, questioning the idea of text as a linear sequence of characters
- Despite writing system conventions, an underlying sequence of sounds represented gives an essentially linear structure in the digital representation
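The decoding steps above can be sketched in a few lines of Python. This is a minimal sketch, assuming a simple heuristic fallback chain (honor a UTF-8 byte-order mark, try strict UTF-8, fall back to Latin-1); a production system would instead consult document metadata or a trained encoding classifier, as the notes describe:

```python
import html

def decode_document(raw: bytes) -> str:
    """Convert a raw byte sequence into a character sequence.

    Heuristic sketch: honor a UTF-8 BOM if present, try strict UTF-8,
    then fall back to Latin-1 (which never fails on any byte sequence).
    """
    if raw.startswith(b"\xef\xbb\xbf"):      # UTF-8 byte-order mark
        return raw[3:].decode("utf-8")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def decode_entities(text: str) -> str:
    """Resolve character entities such as &amp; to the character &."""
    return html.unescape(text)

print(decode_entities(decode_document(b"Tom &amp; Jerry")))
```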
Choosing a Document Unit
- The document unit is determined for indexing
- Documents are often assumed to be fixed units, like each file in a folder
- Email messages in a Unix (mbox-format) email file can be regarded as separate documents
- Email messages with attachments may have the email and each attachment as separate documents
- ZIP files attached to emails may have each contained file regarded as a separate document
- Web software may split single documents (e.g., PowerPoint or LaTeX) into separate HTML pages for each slide/subsection
- Multiple files can be combined into a single document
- Indexing granularity is important for very long documents
- Indexing an entire book as a document can lead to irrelevant search results
- Indexing each chapter or paragraph as a mini-document improves relevance and ease of finding relevant passages
- Treating individual sentences as mini-documents is possible
- There is a precision/recall tradeoff
- Units that are too small may cause important passages to be missed due to terms being distributed over multiple mini-documents
- Units that are too large may lead to spurious matches and difficulty in finding relevant information
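One concrete way to index at a finer granularity is to split each file into paragraph-level mini-documents, each keeping an identifier back to its source so results can be aggregated later. A minimal sketch; the blank-line paragraph separator and the id format are illustrative assumptions:

```python
def split_into_minidocs(doc_id: str, text: str):
    """Split one large document into paragraph-level mini-documents.

    Each mini-document carries a composite id (source doc id plus a
    paragraph number) so retrieved passages can be traced back to,
    or aggregated into, the original file.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(f"{doc_id}#p{i}", p) for i, p in enumerate(paragraphs, start=1)]

minidocs = split_into_minidocs("book1", "First paragraph.\n\nSecond paragraph.")
print(minidocs)
```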
Determining the Vocabulary of Terms
- Problems with large document units can be alleviated by use of explicit or implicit proximity search
- Index granularity and indexing documents at multiple levels of granularity appear prominently in XML retrieval
- An information retrieval (IR) system should offer choices of granularity
- The person deploying the system must understand the document collection, users, their needs, and usage patterns to make informed choices
- An appropriate document unit size should be chosen with a way to divide or aggregate files
Tokenization
- Tokenization is the task of chopping a given character sequence and defined document unit into pieces called tokens, while discarding certain characters like punctuation
- Tokens are loosely referred to as terms or words
- A token is an instance of a sequence of characters in a specific document as a useful semantic unit for processing
- A type is the class of all tokens containing the same character sequence
- A term is a normalized type included in the IR system's dictionary
- The set of index terms could be entirely distinct from the tokens (e.g., semantic identifiers in a taxonomy)
- In modern IR systems, index terms are strongly related to the tokens in the document, derived from them through normalization processes
- If a document to be indexed is "to sleep perchance to dream", there are five tokens, but only four types because "to" appears twice
- If "to" is omitted (as a stop word), only three terms remain: sleep, perchance, and dream
Major Tokenization Questions:
- Determining correct tokens involves considerations beyond simple whitespace and punctuation removal
- English has tricky cases, such as apostrophes for possession and contractions
- Splitting on all nonalphanumeric characters produces poor tokens (e.g., O'Neill and isn't are broken apart)
- These choices determine which Boolean queries will match
- Documents and query text must be tokenized in exactly the same way
- Tokenization is language-specific, so language identification, which is very effective even on short character sequences, is an early step
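A quick comparison illustrates why splitting on every nonalphanumeric character is a poor heuristic. The sample sentence and regex here are illustrative:

```python
import re

text = "O'Neill said C++ isn't state-of-the-art"

# Naive: split on every nonalphanumeric character
naive = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

# Whitespace split keeps C++, O'Neill, and the hyphenated compound intact
# (though in general it retains attached punctuation)
whitespace = text.split()

print(naive)       # ['O', 'Neill', 'said', 'C', 'isn', 't', 'state', 'of', 'the', 'art']
print(whitespace)  # ["O'Neill", 'said', 'C++', "isn't", 'state-of-the-art']
```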
Language-Specific Tokens
- Some unusual tokens should still be recognized, such as C++ and aircraft names like B-52
- Computer technology has introduced new token types such as email addresses, URLs, IP addresses, and tracking numbers
- Omitting such tokens saves index space but restricts what users can search for
- English uses hyphens for splitting vowels (co-education), joining nouns (Hewlett-Packard), and word grouping (hold-him-back-and-drag-him-away); handling them automatically, e.g. as a classification problem, is complex
- Splitting on whitespace can break up names (San Francisco), foreign phrases, and common compounds, and may cause bad retrieval, since whitespace does not always mark a true token boundary
Segmentation
- Word segmentation is needed for East Asian languages such as Chinese, which are written without spaces between words
- Users can be trained to include hyphens in queries where it matters
- Each language presents its own unique tokenization issues
Dropping Common Terms: Stop Words
- Extremely common words are excluded from the vocabulary entirely
- A stop list is built by sorting terms by collection frequency (total occurrences across the collection) and hand-selecting the most frequent ones
- Removing stop words can harm some query types, such as phrase queries consisting largely of stop words ("to be or not to be")
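The collection-frequency approach to building a stop-list candidate set can be sketched as follows; the toy corpus and the top-3 cutoff are illustrative:

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog and the cat",
    "a dog on a mat",
]

# Collection frequency: total occurrences of each term across all documents
freq = Counter(token for doc in docs for token in doc.split())

# Take the most frequent terms as stop-word candidates for hand review
stop_candidates = [term for term, _ in freq.most_common(3)]
print(stop_candidates)
```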
Equivalence Classing of Terms
- Equivalence classing compensates for superficial differences between the token forms in queries and documents
- Normalization canonicalizes tokens so that matches occur despite such differences; the standard approach is to implicitly create equivalence classes
- Mapping rules that remove characters such as hyphens and periods are easy to apply, but rules that would add characters are much harder
- An alternative is to maintain relations between unnormalized tokens, which can extend to hand-constructed lists of synonyms
- The usual implementation is to index unnormalized tokens and maintain a query expansion list of vocabulary entries to consider for each query term
- Expansion can relate non-identical tokens and can be asymmetric (a query for one form retrieves another, but not vice versa), as with case variants
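A minimal normalization step combining case-folding with deletion of periods and hyphens; the specific character mappings are an illustrative choice, not a complete scheme:

```python
def normalize(token: str) -> str:
    """Map a token to its equivalence-class representative by
    case-folding and deleting periods and hyphens, so that
    U.S.A. / USA and anti-discriminatory / antidiscriminatory
    fall into the same class."""
    return token.lower().replace(".", "").replace("-", "")

print(normalize("U.S.A."))               # usa
print(normalize("Anti-discriminatory"))  # antidiscriminatory
```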