Text Normalization and Inflection Quiz
19 Questions

Questions and Answers

Text normalization involves replacing rare words with common synonyms.

True (A)

In text normalization, the term 'nite' would be replaced with ______ to standardize the language.

night

Which of the following is NOT a benefit of text normalization?

  • Reduces vocabulary variation
  • Improves spell checking accuracy
  • Increases the complexity of natural language processing (correct)
  • Facilitates data analysis

Match the following text normalization techniques with their corresponding examples:

Accent removal = café → cafe
Word numeral transformation = twenty three → 23
Contraction substitution = I'm → I am
Abbreviation normalization = btw → by the way

    The 'lemma' is the basic word form, also known as the citation form.

True (A)

    What is the difference between 'types' and 'tokens' in text analysis?

Types refer to the number of distinct words in a corpus, while tokens refer to the total number of words including repetitions and punctuation.

    A ______ is a unit of lexical meaning that underlies a set of related words.

lexeme

    What is the process of changing the form of a word to express different grammatical categories called?

Inflection (B)

    Subwords are always meaning-bearing units like morphemes.

False (B)

    Match the following terms with their definitions:

Lemma = The basic word form, used in a dictionary entry.
Wordform = The full inflected or derived form of a word.
Lexeme = A set of lexical forms with the same stem, part-of-speech, and meaning.

    What is the term used to describe the inflection of verbs?

conjugation

    The goal of text preprocessing is to reduce the ______ in text and bring it closer to a predefined standard.

randomness

    Phonetic normalization is a standard step in all NLP applications.

False (B)

    Which of the following is NOT a standard text normalization technique?

Phonetic Transcription (C)

    Match the following text normalization techniques with their descriptions:

Lowercasing = Converting all text to lowercase letters.
Stemming = Reducing words to their root form.
Lemmatization = Reducing words to their dictionary form.
Stopwords Removal = Removing common words like "the," "a," and "is."

    What is the main purpose of removing stopwords in text normalization?

Dimensionality reduction

    Text normalization is only relevant for speech applications.

False (B)

    What are some examples of non-standard characters that might be removed during text normalization?

Hashtags, emojis, special characters

    Which of the following is a potential consequence of lowercasing text in NLP applications?

Loss of information about proper nouns (C)

    Study Notes

    Text Preprocessing

    • Text preprocessing is a standard first step in any natural language processing (NLP) application.
    • It normalizes text formats into a more suitable form, such as canonical form.
    • It brings text into a standard format for a specific task.
    • For speech applications, this includes phonetic normalization and phonetic transcriptions.
    • The goal is to reduce randomness in text and bring it closer to a predefined standard.
    • This reduces the amount of various formats for the same word and reduces the dimensionality of the problem.

    Definitions

    • Lemma: A set of lexical forms with the same stem, part-of-speech, and word sense. It's the basic word form (citation form), typically the dictionary entry format (singular nominative for nouns, infinitive of verbs, positive adjective).
    • Wordform: The full inflected or derived form of a word.
    • Lexeme: A unit of lexical meaning underlying a set of words related through inflection. It's the basic abstract unit of meaning, roughly corresponding to a set of forms taken by a single root word (lemma).
    • Types (Word Types): The number of distinct words in a corpus.
    • Tokens: The total count of running words (including repetitions and punctuation).
    • Vocabulary Size: The number of different words in a corpus.
• Subwords: Arbitrary substrings or meaning-bearing units like morphemes (-est, -er). A morpheme is the smallest meaningful unit in a language (e.g., unlikeliest contains the morphemes un-, likely, and -est).
• Inflection (or inflexion): The process of word formation in which a word is modified to express different grammatical categories (tense, case, aspect, number, gender, mood, animacy, definiteness). Verb inflection is called conjugation; noun and adjective inflection is called declension.
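The type/token distinction above is easy to make concrete in code. The following is a minimal sketch (the regex tokenizer and the sample sentence are illustrative, not a standard):

```python
import re

def type_token_stats(text):
    """Count tokens (running words and punctuation) and types (distinct tokens)."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())  # words plus punctuation marks
    return len(tokens), len(set(tokens))

# "the" occurs twice, so there is one more token than there are types
n_tokens, n_types = type_token_stats("The cat sat on the mat.")
```

Whether punctuation and case-folded duplicates count toward either figure is a design choice; here punctuation is counted and case is ignored.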

    Definitions

    • Sentence: A linguistic expression. In traditional grammar, it is typically defined as a string of words expressing a complete thought or as a subject-predicate unit. In non-functional linguistics, it is typically a maximal unit of syntactic structure, such as a constituent.
    • Utterance: A spoken block segmented by natural pauses in speech. It correlates with a sentence but does not need to be a sentence.

    Text Normalization

    • Noise Elimination: Removing redundant word formats, punctuation, accents.
• Normalization of word formats: Converting variant forms of the same word to a single canonical form.
    • Tokenization: Breaking text into segments (words, sentences).
    • Stemming: Reducing words to their morphological root.
    • Lemmatization: Reducing inflectional forms of a word to a base form.
• Stopwords Removal: Discarding common words that carry little meaning (e.g., "the," "a," "is").
• Dimensionality reduction: A by-product of the techniques above; shrinking the number of distinct word forms reduces the dimensionality of the problem.
• Additional techniques: Phonetic normalization, spelling correction, and normalization of non-standard words, acronyms, and slang.

    Normalization of Word Forms

    • Lowercasing: Converting text to lowercase.
    • Removing Non-Standard Characters: Removing special characters, emojis, or hashtags.
• Normalization of specific word types: Acronyms and other special formats such as names, relevant for tasks like named-entity recognition (NER) and machine translation (MT).

    • Removal of duplicate whitespaces and punctuation: Eliminating duplicate spaces and unnecessary punctuation marks.
    • Accent Removal: Removing accents from foreign words.
    • Contraction Substitution: Transforming contractions into their full forms (e.g., "I'm" to "I am").
    • Number and value substitution: Converting words (e.g. twenty-three) to numeric values.
    • Abbreviation normalization: Changing abbreviations to their full form (e.g., "U.S.A" to "United States").
    • Normalization of date and numeric formats: Adjusting formats for social security numbers, dates, and other normalized data formats.
    • Spell correction: Correcting misspelled words; very important with open inputs (e.g. tweets, emails, IMs).
    • Synonym substitution: Replacing rare words with common synonyms.
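Several of the techniques listed above compose naturally into a small pipeline. A hedged sketch in Python follows; the contraction table is a tiny illustrative subset, and accent removal uses Unicode NFKD decomposition:

```python
import re
import unicodedata

# Illustrative subset only; a real table would be much larger.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot"}

def normalize(text):
    text = text.lower()                                    # lowercasing
    text = unicodedata.normalize("NFKD", text)             # accent removal:
    text = "".join(c for c in text if not unicodedata.combining(c))
    for contraction, full in CONTRACTIONS.items():         # contraction substitution
        text = text.replace(contraction, full)
    text = re.sub(r"\s+", " ", text).strip()               # duplicate whitespace removal
    return text

normalize("I'm  at the  café")   # -> "i am at the cafe"
```

Order matters: lowercasing first lets the contraction table stay lowercase-only, and whitespace collapsing runs last so substitutions cannot reintroduce double spaces.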

    Social Media Text Normalization

• Abbreviations: Shortened forms like "nite," "gr8," "lol."
• Misspellings: Words commonly misspelt in social media.
• Omitted Punctuation: Missing apostrophes and punctuation (e.g., "im," "dont").
    • Slang: Social media slang terms.
    • Wordplays: Words with altered forms (e.g., "soooooo great").
    • Disguised Vulgarities: Coded obscenities (e.g., "sh1t").
    • Emoticons: :-) for smiling faces, <3 for hearts.
• Informal Transliteration: Language-specific transliteration differences.
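A common first pass over social-media text combines a lookup table for slang and abbreviations with a rule for elongated wordplays. The table below is a toy example, not a real resource:

```python
import re

# Toy lookup table; real slang dictionaries contain thousands of entries.
SLANG = {"nite": "night", "gr8": "great", "btw": "by the way", "lol": "laughing out loud"}

def normalize_social(text):
    out = []
    for tok in text.lower().split():
        # Squash wordplay elongations: "soooooo" -> "soo"
        tok = re.sub(r"(.)\1{2,}", r"\1\1", tok)
        out.append(SLANG.get(tok, tok))   # dictionary lookup for slang/abbreviations
    return " ".join(out)

normalize_social("gr8 nite btw")   # -> "great night by the way"
```

Squashing repeats to exactly two characters (rather than one) preserves legitimate doubled letters such as "too" and "soon".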

    Tokenization

    • Tokenization is breaking unstructured text into smaller meaningful fragments, called tokens.
• These include words, phrases, or sentences, creating more granular data.
• Tokens are defined using separators (e.g., whitespace), but what counts as a token depends on the language (e.g., Japanese and Chinese do not separate words with spaces) and on whether multi-word units (e.g., idioms, bigrams) should be kept together.
    • Consider whether to split special collocations (e.g., "raining cats and dogs").
• Issues in Tokenization:
• Apostrophes and possessives (e.g., Finland's capital → Finland? Finlands? Finland's?).
• Compound words with internal punctuation (e.g., Hewlett-Packard).
• Lowercasing (e.g., whether lowercased variants should map to a single token).
• Splitting hyphenated compounds (e.g., "state-of-the-art").
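A simple regex tokenizer illustrates some of these decisions: hyphenated compounds and internal apostrophes stay in one token, while other punctuation is split off. This is a sketch of one possible policy, not a production tokenizer:

```python
import re

def tokenize(text):
    """Keep hyphenated compounds and apostrophe forms as single tokens;
    split all other punctuation into separate tokens."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

tokenize("Hewlett-Packard's state-of-the-art printer, really?")
```

The opposite policy (splitting on every hyphen and apostrophe) is equally defensible; the point is that the tokenizer, not the data, decides.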

    Tokenization - Sentences

• Segmentation based on punctuation: exclamation marks (!), question marks (?), and periods (.) are used to split sentences.
• Sentence boundary recognition must handle abbreviations (e.g., Inc., Dr.) and numerals (e.g., .02%, 4.3), where a period does not end the sentence.
• Sentence-segmented input is required for training many neural network tasks.

Subword Tokenization

    • Byte-Pair Encoding (BPE): Break down rare words into smaller sub-words.
    • WordPiece: Another approach similar to BPE in concept.
    • Unigram: Another sub-word splitting approach.
• SentencePiece: Tokenizes raw text directly, without a pre-tokenization step, using BPE or the unigram method internally.

    Byte-Pair Encoding (BPE)

    • A word segmentation algorithm typically used in natural language processing (NLP).
    • It begins with a vocabulary of characters and incrementally creates new n-grams.

    Byte-Pair Encoding (BPE)

    • A compression algorithm that is now used as a form of word segmentation.
    • It analyses the input text to determine the most frequently occurring byte pairs.
    • Replace frequent occurrences of byte pairs with new bytes.
    • How it Works:
    • Start with all characters as vocabulary.
• Repeatedly merge the most frequent adjacent pair of symbols into a new symbol (n-gram).

    Byte-Pair Encoding

    • Aims at achieving a vocabulary that has common words as a single token, while breaking down rare words into sub-word tokens.
• Character n-grams: In NLP, the "bytes" are characters, so merges produce character n-grams.
• Word segmentation: BPE can be viewed as bottom-up clustering of frequent character sequences.
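The merge loop described above fits in a few dozen lines. This toy implementation learns merge operations from a word list; the corpus and merge count are illustrative:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges from a list of words; returns the merge pairs in order."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the pair with the merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(word[i]); i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

bpe_merges(["low", "low", "lower", "newest", "newest"], 3)
```

Common words end up as single symbols after a few merges, while rare words remain split into subword pieces, which is exactly the stated goal.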


    WordPiece/Sentencepiece Model

• A variant of the WordPiece model is used in Google's NMT system.
• Works directly from raw text (does not assume spaces split words).
• Whitespace is retained as a special token ("▁").
• Uses subword sampling to improve robustness and accuracy.
• WordPiece: The variant used in BERT; it splits words into subwords as needed.

    WordPiece

• A greedy algorithm that merges the pair that maximizes the likelihood of the training data, rather than the raw pair count used by BPE.
• Otherwise similar in spirit to BPE and the unigram method.

    SentencePiece

    • Purely data-driven. Doesn't need any language-dependent logic.
    • Supported by multiple word splitting algorithms, including BPE and unigram language model methods.
    • Uses subword regularization.
    • Fast & lightweight with 50,000 sentences/sec and ~6 MB memory footprint.
    • Can be used to easily generate vocabulary IDs from raw text sentences.

    SentencePiece

• Treats raw text as input.
• Uses the BPE or unigram algorithm to build the vocabulary.
• Whitespace is handled as a special token: the character "▁" is included in the vocabulary.

    NFKC-based Normalization

    • Normalization of characters (e.g., Japanese full-width to half-width), making processing easier.

    Stemming

• Removing suffixes from words, reducing them to a root word.
• For example, "flying" becomes "fly."
• Overstemming: Different words are incorrectly stemmed to the same root, a false positive. For example, "universal," "universe," and "university" are all stemmed to "univers."
• Sometimes too much of the word is removed, and the stem becomes nonsensical, losing its meaning.

    Stemming

    • Reduces words to their stems.
    • Stemming is a crude affix chopping process.
    • Language-dependent (e.g., automate(s), automatic, automation all reducing to automat).
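The "crude affix chopping" described above can be sketched in a few lines. The suffix list here is illustrative; real stemmers like Porter's use ordered rule sets with measure conditions, and this toy version both over- and under-stems:

```python
# Illustrative suffix list, longest first so "ation" wins over "s".
SUFFIXES = ["ation", "ing", "es", "ed", "er", "ly", "s"]

def crude_stem(word):
    """Chop the first matching suffix, keeping at least a 3-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

[crude_stem(w) for w in ["flying", "automates", "cat"]]
```

The minimum-stem-length guard is a cheap defense against chopping a word down to a meaningless fragment, one of the failure modes noted above.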

    Stemming

    • UnderStemming: When two words should be the same stem but are not.
    • Example: "alumnus", "alumni", and "alumnae" should all stem to "alumn".
    • OverStemming: When stemming incorrectly stems words to the same stem, when they are different.
• Example: stemming "universes," "universities," and "universe" to "univers."

    Lemmatization

    • Reducing inflected or variant forms of a word to its base form (lemma).
    • Example: "am", "are", and "is" all become "be"; "cars," "cars'," "car's," and "cars'" to "car."
    • More complex than stemming.
    • Crucial for accuracy in NLP, especially for text comparison, keyword extraction, and corpus analysis.
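The examples above can be captured by the simplest possible lemmatizer, a lookup table; real lemmatizers combine a dictionary with morphological analysis, and the table here is a toy:

```python
# Toy lemma table covering only the examples above.
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "car's": "car", "cars'": "car"}

def lemmatize(token):
    """Return the lemma if known, otherwise the lowercased token itself."""
    token = token.lower()
    return LEMMAS.get(token, token)

[lemmatize(w) for w in ["am", "Is", "cars'"]]   # -> ['be', 'be', 'car']
```

Falling back to the surface form for unknown words is what makes lemmatization "more complex than stemming" in practice: coverage, not the lookup, is the hard part.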

    Stopwords Removal

    • Filtering out common words that carry little meaning (e.g., "the," "a," "is").
    • Stopwords depend on the task and domain.
    • Sometimes, removing rare terms also improves accuracy but is not always necessary or helpful.
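Once a stopword list is chosen, the filtering itself is a one-liner; the list below is a tiny illustrative one, since as noted above stopword lists are task- and domain-dependent:

```python
# Tiny illustrative stopword list; real lists are task- and domain-specific.
STOPWORDS = {"the", "a", "is", "on", "of", "and"}

def remove_stopwords(tokens):
    """Drop tokens whose lowercased form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

remove_stopwords(["The", "cat", "is", "on", "the", "mat"])   # -> ['cat', 'mat']
```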


    Description

    Test your knowledge on text normalization and grammatical inflection with this quiz. You'll encounter questions about synonyms, lexical meaning, and various techniques used in text analysis. Perfect for students and enthusiasts of linguistics and language processing.
