Tokenization and Lexical Units

Tags

tokenization, lexical units, natural language processing, text analysis

Summary

This document introduces tokenization and lexical units, beginning with sentence splitting and the use of regular expressions for that task, and includes an in-browser regular-expression playground. It is likely part of a course on natural language processing (NLP).

Full Transcript


Tokenization and Lexical Units

Introduction: Sentence Splitting

Many analysis techniques work at the sentence level. How do we split a text into sentences?

Naive approach: split using the regular expression [.:?!]\s

This fails easily:

- Abbreviations: D. Trump and B. Obama …
- Ordinal numbers in some languages: 1. Bundesliga in German, meaning "1st Bundesliga"
- Unusual proper names: Microsoft.NET
- Quoted speech: Donald Trump called Mitch McConnell “A dumb son of a b—! A stone cold loser!” Then …
- Missing space (due to typos, preprocessing errors, …): … example sentence.Headline.Next sentence … (Wikipedia)

Regular Expression Playground

In-browser Python; it may take a moment to load:

    import re
    print(re.split(r"[.:?!]\s", "This is an example. Questions? "))
    print(re.split(r"(?
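The second print call above is cut off in the transcript; it presumably demonstrated a look-around variant of the naive pattern. As a minimal sketch, using a pattern of my own choosing rather than the one from the slide, negative lookbehinds can suppress splits after single-letter abbreviations and ordinal digits, two of the failure cases listed above:

    import re

    text = "D. Trump met B. Obama. They talked. 1. Bundesliga? Yes!"

    # Naive split: also breaks after the abbreviations "D." and "B." and after "1."
    print(re.split(r"[.:?!]\s", text))
    # ['D', 'Trump met B', 'Obama', 'They talked', '1', 'Bundesliga', 'Yes!']

    # Hypothetical improvement (not the pattern from the slide): negative lookbehinds
    # refuse to split when the punctuation is preceded by a single capital letter or a digit.
    print(re.split(r"(?<![A-Z])(?<!\d)[.:?!]\s", text))
    # ['D. Trump met B. Obama', 'They talked', '1. Bundesliga', 'Yes!']

Even this still fails on the quoted-speech and missing-space cases from the list above.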
