
2 - Foundations.pdf


Full Transcript


Tokenization and Lexical Units

Introduction: Sentence Splitting

Many analysis techniques work at a sentence level. How do we split a text into sentences?

Naive approach: split using the regular expression [.:?!]\s

This fails easily:

  Abbreviations: D. Trump and B. Obama …
  Ordinal numbers in some languages: 1. Bundesliga in German, meaning 1st Bundesliga
  Unusual proper names: Microsoft.NET
  Quoted speech: Donald Trump called Mitch McConnell “A dumb son of a b—! A stone cold loser!” Then … (Wikipedia)
  Missing space (due to typos, preprocessing errors, …): … example sentence.Headline.Next sentence …

Regular Expression Playground

In-browser Python, may take a moment to load:

    import re
    print(re.split(r"[.:?!]\s", "This is an example. Questions? "))
    print(re.split(r"(?
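
The failure cases listed above can be reproduced with a short script. This is a minimal sketch, not part of the original slides: the helper name `naive_split` and the three test sentences are illustrative assumptions built from the slide's examples, and only the pattern `[.:?!]\s` comes from the text.

```python
import re

# Naive boundary from the text: one of . : ? ! followed by a whitespace character.
NAIVE_BOUNDARY = re.compile(r"[.:?!]\s")


def naive_split(text):
    """Split text at every match of the naive boundary (hypothetical helper)."""
    return NAIVE_BOUNDARY.split(text)


# Abbreviations are wrongly treated as sentence ends.
abbreviations = naive_split("D. Trump and B. Obama met.")

# German ordinal "1." is wrongly treated as a sentence end.
ordinal = naive_split("Er spielt in der 1. Bundesliga.")

# With the space missing, no boundary is found at all.
missing_space = naive_split("An example sentence.Headline.Next sentence.")

print(abbreviations)  # ['D', 'Trump and B', 'Obama met.']
print(ordinal)        # ['Er spielt in der 1', 'Bundesliga.']
print(missing_space)  # ['An example sentence.Headline.Next sentence.']
```

The first two cases over-split, the third under-splits. Both kinds of error come from the same cause: the pattern looks only at one punctuation character and the character after it, with no knowledge of abbreviations, ordinals, or missing whitespace.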
