Text Preprocessing Techniques in NLP PDF

Text Preprocessing
Sanda Martinčić-Ipšić, Full professor

Definitions
- Lemma: a set of lexical forms having the same stem, the same major part of speech, and the same word sense.
  - The lemma (or citation form) is the basic word form, chosen by convention as the canonical form of a lexeme.
  - It is the word's entry in the dictionary: nominative singular for nouns, infinitive for verbs, positive degree for adjectives.
- Wordform: the full inflected or derived form of the word.
- Lexeme (/ˈlɛksiːm/): a unit of lexical meaning that underlies a set of words related through inflection.
  - A basic abstract unit of meaning and a unit of morphological analysis in linguistics; it roughly corresponds to the set of forms taken by a single root word, the lemma.

Definitions
- Types (word types): the number of distinct words in a corpus. If the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
- Tokens: the total number N of running words, counting repetitions and punctuation. For example, "the cat saw the dog" has 5 tokens and 4 types.
- Subwords: can be arbitrary substrings or meaning-bearing units such as the morphemes -est or -er.
  - A morpheme is the smallest meaning-bearing unit of a language; "unlikeliest" has the morphemes un-, likely, and -est.

Definitions
- Inflection (or inflexion): a process of word formation in which a word is modified to express grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and definiteness.
- The inflection of verbs is called conjugation; the inflection of nouns, adjectives, adverbs, pronouns, determiners, participles, prepositions and postpositions, numerals, articles, etc. is called declension.
- Inflection expresses grammatical categories through affixation: prefix, suffix, infix, circumfix, and transfix.
- Flective word forms: all the word forms that a word takes.

Definitions
- Sentence: a linguistic expression. In traditional grammar, it is typically defined as a string of words that expresses a complete thought, or as a unit consisting of a subject and predicate. In non-functional linguistics it is typically defined as a maximal unit of syntactic structure such as a constituent.
- Utterance vs. sentence: an utterance is a block of speech segmented at natural pauses; it is the spoken correlate of a sentence but does not need to be a sentence.

Text preprocessing
- The standard first step in any NLP application.
- Normalizes the format of text: brings it into a suitable, standard (canonical) form for the task.
- Speech applications additionally need phonetic normalization and phonetic transcription.
- Goal:
  - reduce the randomness in text, bringing it closer to a predefined “standard”
  - reduce the number of different forms of the same word
  - reduce the dimensionality of the problem

Text normalization
- Noise elimination
- Normalizing word formats
- Tokenization (segmenting words, sentences)
- Stemming
- Lemmatization
- Stopword removal (more a dimensionality reduction technique than a normalization technique)
- Plus phonetic normalization, spelling correction, non-standard words, acronyms, slang normalization, etc.

Normalization of word forms
- Lowercasing
  - Usually the first step is to transform the text into lowercase: Canada, Hrvatska / canada, hrvatska / CANADA, HRVATSKA → canada, hrvatska
  - But for some tasks the initial capital letter is useful (named entity recognition, NER; machine translation, MT):
    - surname or common word: Smith vs. smith (blacksmith)
    - very important for extracting names and locations (NER)
  - Acronyms: lowercasing can turn an acronym into a regular word (US → us)
- Removing non-standard characters
  - Removal or substitution of special characters and emojis (e.g. remove hashtags).

Normalization of word forms
- Removal of duplicate whitespace and punctuation.
- Accent removal: useful if your data includes diacritical marks from ‘foreign’ languages; it helps reduce errors related to encoding.
- Substitution of contractions (very common in English; e.g. ‘I’m’ → ‘I am’).
- Transformation of word numerals into numbers (e.g. ‘twenty three’ → ‘23’).
- Substitution of values with their type (e.g. ‘$50’ → ‘MONEY’).
- Acronym normalization (e.g. ‘US’ → ‘United States’ / ‘U.S.A’) and abbreviation normalization (e.g. ‘btw’ → ‘by the way’).
- Normalization of date formats, social security numbers, and other data that have a standard format.

Normalization of word forms
- Spell correction: a word can be misspelled in a practically unlimited number of ways, so correcting spelling reduces vocabulary variation. This is very important when dealing with open user input such as tweets, IMs, and emails.
- Substitution of rare words with more common synonyms.
Source: https://towardsdatascience.com/text-normalization-7ecc8e084e31

Social media text normalization
- Abbreviations: nite (night), gr8 (great), sayin (saying), lol (laugh out loud), iirc (if I remember correctly), hard2tell (hard to tell)
- Misspelling: wouls (would), rediculous (ridiculous)
- Omitted punctuation: im (I'm), dont (don't)
- Slang: that was well mint (that was well good)
- Wordplay: that was soooooo great (that was so great)

Social media text normalization
- Disguised vulgarities: sh1t, f**k
- Emoticons: :) for smiling face,
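The social media cases above (abbreviations, misspellings, omitted punctuation, elongated words) could be handled, very roughly, with a lookup table plus one regular expression. The sketch below is only an illustration: the `SLANG` table and the function names are hypothetical, the table holds just the examples from the slides, and a real system would need a much larger lexicon and a proper tokenizer.

```python
import re

# Hypothetical lookup table built only from the slide examples.
SLANG = {
    "nite": "night",
    "gr8": "great",
    "sayin": "saying",
    "lol": "laugh out loud",
    "iirc": "if I remember correctly",
    "hard2tell": "hard to tell",
    "wouls": "would",
    "rediculous": "ridiculous",
    "im": "I'm",
    "dont": "don't",
}


def squeeze_elongation(token: str) -> str:
    """Collapse runs of three or more identical characters to a single one.

    This is a crude heuristic: it cannot restore words that legitimately
    contain a double letter (an elongated "cool" would lose an "o").
    """
    return re.sub(r"(.)\1{2,}", r"\1", token)


def normalize_social(text: str) -> str:
    """Whitespace-tokenize, squeeze elongations, then apply the lookup table."""
    out = []
    for token in text.split():
        token = squeeze_elongation(token)
        # Tokens with attached punctuation ("gr8!") will not match the table.
        out.append(SLANG.get(token.lower(), token))
    return " ".join(out)


print(normalize_social("that was soooooo gr8"))  # that was so great
```

A dictionary lookup cannot decide context-dependent cases such as the slang reading of "mint"; those need more than token-level substitution.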

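Several of the word-form normalization steps listed earlier (lowercasing, accent removal, contraction substitution, replacing values with their type, removing duplicate whitespace and punctuation) can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the `CONTRACTIONS` table, the `MONEY` pattern, and the order of the steps are illustrative choices, not something the slides prescribe.

```python
import re
import unicodedata

# Hypothetical contraction table; real English needs many more entries.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks.

    Letters without a canonical decomposition (for example "đ") are left as-is.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text: str) -> str:
    text = text.replace("\u2019", "'")                  # curly apostrophe -> straight
    text = text.lower()                                 # lowercasing
    text = strip_accents(text)                          # accent removal
    text = re.sub(r"\$\d+(?:\.\d+)?", "MONEY", text)    # value -> type
    for short, full in CONTRACTIONS.items():            # contraction substitution
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    text = re.sub(r"([!?.,])\1+", r"\1", text)          # duplicate punctuation
    text = re.sub(r"\s+", " ", text).strip()            # duplicate whitespace
    return text


print(strip_accents("Martinčić-Ipšić"))                 # Martincic-Ipsic
print(normalize("I\u2019m  in   Hrvatska!!! It costs $50"))
# i am in hrvatska! it costs MONEY
```

Lowercasing runs before the `MONEY` substitution so the placeholder stays uppercase. As the slides note, unconditional lowercasing also turns "US" into "us", so a task such as NER may need to skip or restrict that step.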