2-Tokenization PDF
Summary
This document discusses sentence splitting using regular expressions in Python. It highlights common issues with simple splitting methods, such as handling abbreviations, numbers, and quoted speech. It also covers subtopics like tokenization and provides code examples.
Full Transcript
Introduction: Sentence Splitting

Many analysis techniques work at the sentence level.
Naive approach: split using the regular expression [.:?!]\s (🔗 Wikipedia)
This fails easily:
▶ Abbreviations: D. Trump and B. Obama …
▶ Ordinal numbers in some languages: 1. Bundesliga in German, meaning 1st Bundesliga
▶ Unusual proper names: Microsoft.NET
▶ Quoted speech: Donald Trump called Mitch McConnell “A dumb son of a b—! A stone cold loser!” Then …
▶ Missing space (due to typos, preprocessing errors, …): … example sentence.Headline.Next sentence …

3.1 Regular Expression Playground

In-browser Python, may take a moment to load:
import re
print(re.split(r"[.:?!]\s", "This is an example. Questions? "))
print(re.split(r"(?
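The failure cases listed in this section can be reproduced with a short sketch. It uses only the naive pattern [.:?!]\s quoted in the text. The helper name `naive_split`, the constant `NAIVE`, and the full example sentences are illustrative assumptions, not part of the original slides.

```python
import re

# The naive pattern from the text: one of . : ? ! followed by a whitespace character.
NAIVE = re.compile(r"[.:?!]\s")


def naive_split(text):
    """Split on the naive pattern. The matched punctuation and space are dropped."""
    return NAIVE.split(text)


# Hypothetical example sentences built around the cases named in the text.
examples = {
    "baseline": "This is an example. Questions? ",
    "abbreviation": "D. Trump and B. Obama met.",
    "ordinal": "Er spielt in der 1. Bundesliga.",
    "missing space": "An example sentence.Headline.Next sentence.",
}

for name, text in examples.items():
    print(f"{name}: {naive_split(text)}")

# baseline: ['This is an example', 'Questions', '']
# abbreviation: ['D', 'Trump and B', 'Obama met.']
# ordinal: ['Er spielt in der 1', 'Bundesliga.']
# missing space: ['An example sentence.Headline.Next sentence.']
```

The abbreviation and ordinal cases are split too often, while the missing-space case is not split at all, because no whitespace follows the periods. The baseline also shows two side effects of `re.split`: the delimiter is consumed, and a trailing match leaves an empty string at the end of the list.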