Tokenization and Lexical Units PDF
Introduction: Sentence Splitting

Many analysis techniques work at the sentence level. How do we split a text into sentences?

Naive approach: split using the regular expression [.:?!]\s

This fails easily:

- Abbreviations: D. Trump and B. Obama …
- Ordinal numbers in some languages: 1. Bundesliga in German, meaning 1st Bundesliga
- Unusual proper names: Microsoft.NET
- Quoted speech: Donald Trump called Mitch McConnell “A dumb son of a b—! A stone cold loser!” Then …
- Missing space (due to typos, preprocessing errors, …): … example sentence.Headline.Next sentence …

Regular Expression Playground

In-browser Python; it may take a moment to load:

```python
import re

print(re.split(r"[.:?!]\s", "This is an example. Questions? "))
print(re.split(r"(?
```
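The second re.split call is cut off in the transcript, so the pattern the slides intended is not recoverable. A minimal sketch of one plausible refinement, assuming it adds a negative lookbehind so that punctuation directly following a capital letter (as in the abbreviations "D." and "B.") does not trigger a split:

```python
import re

text = "D. Trump met B. Obama. This is an example. Questions? Yes!"

# Naive split: also breaks after the abbreviations "D." and "B."
print(re.split(r"[.:?!]\s", text))
# ['D', 'Trump met B', 'Obama', 'This is an example', 'Questions', 'Yes!']

# Assumed refinement (not from the slides): the negative lookbehind
# (?<![A-Z]) skips punctuation that directly follows a capital letter.
print(re.split(r"(?<![A-Z])[.:?!]\s", text))
# ['D. Trump met B. Obama', 'This is an example', 'Questions', 'Yes!']
```

Even this refinement still fails on the other cases listed above (Microsoft.NET, missing spaces after punctuation), which is why practical sentence splitters rely on abbreviation lists or trained models rather than a single regular expression.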