Text Normalization and Inflection Quiz
19 Questions

Questions and Answers

Text normalization involves replacing rare words with common synonyms.

True (A)

In text normalization, the term 'nite' would be replaced with ______ to standardize the language.

night

Which of the following is NOT a benefit of text normalization?

  • Reduces vocabulary variation
  • Improves spell checking accuracy
  • Increases the complexity of natural language processing (correct)
  • Facilitates data analysis

Match the following text normalization techniques with their corresponding examples:

Accent removal = café → cafe
Word numeral transformation = twenty three → 23
Contraction substitution = I'm → I am
Abbreviation normalization = btw → by the way

    The 'lemma' is the basic word form, also known as the citation form.

True (A)

    What is the difference between 'types' and 'tokens' in text analysis?

Types refer to the number of distinct words in a corpus, while tokens refer to the total number of words including repetitions and punctuation.

    A ______ is a unit of lexical meaning that underlies a set of related words.

lexeme

    What is the process of changing the form of a word to express different grammatical categories called?

Inflection (B)

    Subwords are always meaning-bearing units like morphemes.

False (B)

    Match the following terms with their definitions:

Lemma = The basic word form, used in a dictionary entry.
Wordform = The full inflected or derived form of a word.
Lexeme = A set of lexical forms with the same stem, part-of-speech, and meaning.

    What is the term used to describe the inflection of verbs?

conjugation

    The goal of text preprocessing is to reduce the ______ in text and bring it closer to a predefined standard.

randomness

    Phonetic normalization is a standard step in all NLP applications.

False (B)

    Which of the following is NOT a standard text normalization technique?

Phonetic Transcription (C)

    Match the following text normalization techniques with their descriptions:

Lowercasing = Converting all text to lowercase letters.
Stemming = Reducing words to their root form.
Lemmatization = Reducing words to their dictionary form.
Stopwords Removal = Removing common words like "the," "a," and "is."

    What is the main purpose of removing stopwords in text normalization?

Dimensionality reduction

    Text normalization is only relevant for speech applications.

False (B)

    What are some examples of non-standard characters that might be removed during text normalization?

Hashtags, emojis, special characters

    Which of the following is a potential consequence of lowercasing text in NLP applications?

Loss of information about proper nouns (C)

    Study Notes

    Text Preprocessing

    • Text preprocessing is a standard first step in any natural language processing (NLP) application.
    • It normalizes text formats into a more suitable form, such as canonical form.
    • It brings text into a standard format for a specific task.
    • For speech applications, this includes phonetic normalization and phonetic transcriptions.
    • The goal is to reduce randomness in text and bring it closer to a predefined standard.
    • This reduces the amount of various formats for the same word and reduces the dimensionality of the problem.

    Definitions

    • Lemma: A set of lexical forms with the same stem, part-of-speech, and word sense. It's the basic word form (citation form), typically the dictionary entry format (singular nominative for nouns, infinitive of verbs, positive adjective).
    • Wordform: The full inflected or derived form of a word.
    • Lexeme: A unit of lexical meaning underlying a set of words related through inflection. It's the basic abstract unit of meaning, roughly corresponding to a set of forms taken by a single root word (lemma).
    • Types (Word Types): The number of distinct words in a corpus.
    • Tokens: The total count of running words (including repetitions and punctuation).
    • Vocabulary Size: The number of different words in a corpus.
• Subwords: Arbitrary substrings or meaning-bearing units like morphemes (-est, -er). A morpheme is the smallest meaningful unit in a language (e.g., unlikeliest contains the morphemes un-, likely, and -est).
• Inflection (or inflexion): The process of word formation in which a word is modified to express different grammatical categories (tense, case, aspect, number, gender, mood, animacy, definiteness). Verb inflection is called conjugation; noun and adjective inflection is called declension.
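The type/token distinction above is easy to make concrete in code. The following is a minimal sketch (the regex tokenizer and the sample sentence are illustrative, not a standard):

```python
import re

def type_token_stats(text):
    """Count tokens (running words and punctuation) and types (distinct tokens)."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())  # words plus punctuation marks
    return len(tokens), len(set(tokens))

# "the" occurs twice, so there is one more token than there are types
n_tokens, n_types = type_token_stats("The cat sat on the mat.")
```

Whether punctuation and case-folded duplicates count toward either figure is a design choice; here punctuation is counted and case is ignored.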

    Definitions

    • Sentence: A linguistic expression. In traditional grammar, it is typically defined as a string of words expressing a complete thought or as a subject-predicate unit. In non-functional linguistics, it is typically a maximal unit of syntactic structure, such as a constituent.
    • Utterance: A spoken block segmented by natural pauses in speech. It correlates with a sentence but does not need to be a sentence.

    Text Normalization

    • Noise Elimination: Removing redundant word formats, punctuation, accents.
• Normalization of word formats: Converting variant forms of the same word to a single canonical form.
    • Tokenization: Breaking text into segments (words, sentences).
    • Stemming: Reducing words to their morphological root.
    • Lemmatization: Reducing inflectional forms of a word to a base form.
• Stopwords Removal: Discarding common words that carry little meaning (e.g., "the," "a," "is").
• Dimensionality reduction: A by-product of the techniques above; shrinking the number of distinct word forms reduces the dimensionality of the problem.
• Additional techniques: Phonetic normalization, spelling correction, and normalization of non-standard words, acronyms, and slang.

    Normalization of Word Forms

    • Lowercasing: Converting text to lowercase.
    • Removing Non-Standard Characters: Removing special characters, emojis, or hashtags.
• Normalization of specific word types: Acronyms and other special formats such as names, relevant for tasks like named-entity recognition (NER) and machine translation (MT).

    • Removal of duplicate whitespaces and punctuation: Eliminating duplicate spaces and unnecessary punctuation marks.
    • Accent Removal: Removing accents from foreign words.
    • Contraction Substitution: Transforming contractions into their full forms (e.g., "I'm" to "I am").
    • Number and value substitution: Converting words (e.g. twenty-three) to numeric values.
    • Abbreviation normalization: Changing abbreviations to their full form (e.g., "U.S.A" to "United States").
    • Normalization of date and numeric formats: Adjusting formats for social security numbers, dates, and other normalized data formats.
    • Spell correction: Correcting misspelled words; very important with open inputs (e.g. tweets, emails, IMs).
    • Synonym substitution: Replacing rare words with common synonyms.
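Several of the techniques listed above compose naturally into a small pipeline. A hedged sketch in Python follows; the contraction table is a tiny illustrative subset, and accent removal uses Unicode NFKD decomposition:

```python
import re
import unicodedata

# Illustrative subset only; a real table would be much larger.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "can't": "cannot"}

def normalize(text):
    text = text.lower()                                    # lowercasing
    text = unicodedata.normalize("NFKD", text)             # accent removal:
    text = "".join(c for c in text if not unicodedata.combining(c))
    for contraction, full in CONTRACTIONS.items():         # contraction substitution
        text = text.replace(contraction, full)
    text = re.sub(r"\s+", " ", text).strip()               # duplicate whitespace removal
    return text

normalize("I'm  at the  café")   # -> "i am at the cafe"
```

Order matters: lowercasing first lets the contraction table stay lowercase-only, and whitespace collapsing runs last so substitutions cannot reintroduce double spaces.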

    Social Media Text Normalization

• Abbreviations: Shortened forms like "nite," "gr8," "lol."
• Misspellings: Words commonly misspelt in social media.
• Omitted Punctuation: Missing apostrophes and punctuation (e.g., "im," "dont").
    • Slang: Social media slang terms.
    • Wordplays: Words with altered forms (e.g., "soooooo great").
    • Disguised Vulgarities: Coded obscenities (e.g., "sh1t").
    • Emoticons: :-) for smiling faces, <3 for hearts.
• Informal Transliteration: Language-specific transliteration differences.
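A common first pass over social-media text combines a lookup table for slang and abbreviations with a rule for elongated wordplays. The table below is a toy example, not a real resource:

```python
import re

# Toy lookup table; real slang dictionaries contain thousands of entries.
SLANG = {"nite": "night", "gr8": "great", "btw": "by the way", "lol": "laughing out loud"}

def normalize_social(text):
    out = []
    for tok in text.lower().split():
        # Squash wordplay elongations: "soooooo" -> "soo"
        tok = re.sub(r"(.)\1{2,}", r"\1\1", tok)
        out.append(SLANG.get(tok, tok))   # dictionary lookup for slang/abbreviations
    return " ".join(out)

normalize_social("gr8 nite btw")   # -> "great night by the way"
```

Squashing repeats to exactly two characters (rather than one) preserves legitimate doubled letters such as "too" and "soon".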

    Tokenization

    • Tokenization is breaking unstructured text into smaller meaningful fragments, called tokens.
• These include words, phrases, or sentences, creating more granular data.
• Tokens are defined using separators (e.g., whitespace), but what counts as a token depends on the language (e.g., Japanese and Chinese do not separate words with spaces) and on whether multi-word units (e.g., idioms, bigrams) should be kept together.
    • Consider whether to split special collocations (e.g., "raining cats and dogs").
• Issues in Tokenization:
• Apostrophes and possessives (e.g., Finland's capital → Finland? Finlands? Finland's?).
• Compound words with internal punctuation (e.g., Hewlett-Packard).
• Lowercasing (e.g., whether lowercased variants should map to a single token).
• Splitting hyphenated compounds (e.g., "state-of-the-art").
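A simple regex tokenizer illustrates some of these decisions: hyphenated compounds and internal apostrophes stay in one token, while other punctuation is split off. This is a sketch of one possible policy, not a production tokenizer:

```python
import re

def tokenize(text):
    """Keep hyphenated compounds and apostrophe forms as single tokens;
    split all other punctuation into separate tokens."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

tokenize("Hewlett-Packard's state-of-the-art printer, really?")
```

The opposite policy (splitting on every hyphen and apostrophe) is equally defensible; the point is that the tokenizer, not the data, decides.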

    Tokenization - Sentences

• Segmentation based on punctuation: exclamation marks (!), question marks (?), and periods (.) are used to split sentences.
• Sentence boundary recognition must handle abbreviations (e.g., Inc., Dr.) and numerals (e.g., .02%, 4.3), where a period does not end the sentence.
• Sentence-segmented input is required for training many neural network tasks.

Subword Tokenization

    • Byte-Pair Encoding (BPE): Break down rare words into smaller sub-words.
    • WordPiece: Another approach similar to BPE in concept.
    • Unigram: Another sub-word splitting approach.
• SentencePiece: Tokenizes raw text directly, without a pre-tokenization step, using BPE or the unigram method internally.

    Byte-Pair Encoding (BPE)

    • A word segmentation algorithm typically used in natural language processing (NLP).
    • It begins with a vocabulary of characters and incrementally creates new n-grams.

    Byte-Pair Encoding (BPE)

    • A compression algorithm that is now used as a form of word segmentation.
    • It analyses the input text to determine the most frequently occurring byte pairs.
    • Replace frequent occurrences of byte pairs with new bytes.
    • How it Works:
    • Start with all characters as vocabulary.
• Repeatedly merge the most frequent adjacent pair of symbols into a new symbol (n-gram).

    Byte-Pair Encoding

    • Aims at achieving a vocabulary that has common words as a single token, while breaking down rare words into sub-word tokens.
• Character n-grams: In NLP, the "bytes" are characters, so merges produce character n-grams.
• Word segmentation: BPE can be viewed as bottom-up clustering of frequent character sequences.
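The merge loop described above fits in a few dozen lines. This toy implementation learns merge operations from a word list; the corpus and merge count are illustrative:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges from a list of words; returns the merge pairs in order."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the pair with the merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(word[i]); i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

bpe_merges(["low", "low", "lower", "newest", "newest"], 3)
```

Common words end up as single symbols after a few merges, while rare words remain split into subword pieces, which is exactly the stated goal.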


    WordPiece/Sentencepiece Model

• A variant of the WordPiece model is used in Google's NMT system.
• Works directly from raw text (does not assume spaces split words).
• Whitespace is retained as a special token ("▁").
• Uses subword sampling to improve robustness and accuracy.
• WordPiece: The variant used in BERT; it splits words into subwords as needed.

    WordPiece

• A greedy algorithm that merges the pair that maximizes the likelihood of the training data, rather than the raw pair count used by BPE.
• Otherwise similar in spirit to BPE and the unigram method.

    SentencePiece

    • Purely data-driven. Doesn't need any language-dependent logic.
    • Supported by multiple word splitting algorithms, including BPE and unigram language model methods.
    • Uses subword regularization.
    • Fast & lightweight with 50,000 sentences/sec and ~6 MB memory footprint.
    • Can be used to easily generate vocabulary IDs from raw text sentences.

    SentencePiece

• Treats raw text as input.
• Uses the BPE or unigram algorithm to build the vocabulary.
• Whitespace is handled as a special token: the character "▁" is included in the vocabulary.

    NFKC-based Normalization

    • Normalization of characters (e.g., Japanese full-width to half-width), making processing easier.

    Stemming

• Removing suffixes from words, reducing them to a root word.
• For example, "flying" becomes "fly."
• Overstemming: Different words are incorrectly stemmed to the same root, a false positive. For example, "universal," "universe," and "university" are all stemmed to "univers."
• Sometimes too much of the word is removed, and the stem becomes nonsensical, losing its meaning.

    Stemming

    • Reduces words to their stems.
    • Stemming is a crude affix chopping process.
    • Language-dependent (e.g., automate(s), automatic, automation all reducing to automat).
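The "crude affix chopping" described above can be sketched in a few lines. The suffix list here is illustrative; real stemmers like Porter's use ordered rule sets with measure conditions, and this toy version both over- and under-stems:

```python
# Illustrative suffix list, longest first so "ation" wins over "s".
SUFFIXES = ["ation", "ing", "es", "ed", "er", "ly", "s"]

def crude_stem(word):
    """Chop the first matching suffix, keeping at least a 3-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

[crude_stem(w) for w in ["flying", "automates", "cat"]]
```

The minimum-stem-length guard is a cheap defense against chopping a word down to a meaningless fragment, one of the failure modes noted above.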

    Stemming

    • UnderStemming: When two words should be the same stem but are not.
    • Example: "alumnus", "alumni", and "alumnae" should all stem to "alumn".
    • OverStemming: When stemming incorrectly stems words to the same stem, when they are different.
• Example: stemming "universes," "universities," and "universe" to "univers."

    Lemmatization

    • Reducing inflected or variant forms of a word to its base form (lemma).
    • Example: "am", "are", and "is" all become "be"; "cars," "cars'," "car's," and "cars'" to "car."
    • More complex than stemming.
    • Crucial for accuracy in NLP, especially for text comparison, keyword extraction, and corpus analysis.
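The examples above can be captured by the simplest possible lemmatizer, a lookup table; real lemmatizers combine a dictionary with morphological analysis, and the table here is a toy:

```python
# Toy lemma table covering only the examples above.
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "car's": "car", "cars'": "car"}

def lemmatize(token):
    """Return the lemma if known, otherwise the lowercased token itself."""
    token = token.lower()
    return LEMMAS.get(token, token)

[lemmatize(w) for w in ["am", "Is", "cars'"]]   # -> ['be', 'be', 'car']
```

Falling back to the surface form for unknown words is what makes lemmatization "more complex than stemming" in practice: coverage, not the lookup, is the hard part.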

    Stopwords Removal

    • Filtering out common words that carry little meaning (e.g., "the," "a," "is").
    • Stopwords depend on the task and domain.
    • Sometimes, removing rare terms also improves accuracy but is not always necessary or helpful.
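Once a stopword list is chosen, the filtering itself is a one-liner; the list below is a tiny illustrative one, since as noted above stopword lists are task- and domain-dependent:

```python
# Tiny illustrative stopword list; real lists are task- and domain-specific.
STOPWORDS = {"the", "a", "is", "on", "of", "and"}

def remove_stopwords(tokens):
    """Drop tokens whose lowercased form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

remove_stopwords(["The", "cat", "is", "on", "the", "mat"])   # -> ['cat', 'mat']
```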


    Description

    Test your knowledge on text normalization and grammatical inflection with this quiz. You'll encounter questions about synonyms, lexical meaning, and various techniques used in text analysis. Perfect for students and enthusiasts of linguistics and language processing.
