2-Tokenization PDF

Document Details

Uploaded by ThrillingTuba

Summary

This document discusses sentence splitting using regular expressions in Python. It highlights common issues with simple splitting methods, such as handling abbreviations, numbers, and quoted speech. It also covers subtopics like tokenization and provides code examples.

Full Transcript

Introduction: Sentence Splitting

Many analysis techniques work at a sentence level.

Naive approach: split using the regular expression [.:?!]\s

This fails easily:

▶ Abbreviations: D. Trump and B. Obama …
▶ Ordinal numbers in some languages: 1. Bundesliga in German, meaning 1st Bundesliga
▶ Unusual proper names: Microsoft.NET
▶ Quoted speech: Donald Trump called Mitch McConnell “A dumb son of a b—! A stone cold loser!” Then …
▶ Missing space (due to typos, preprocessing errors, …): … example sentence.Headline.Next sentence …

3.1 Regular Expression Playground

In-browser Python, may take a moment to load:

import re
print(re.split(r"[.:?!]\s", "This is an example. Questions? "))
print(re.split(r"(?
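The failure cases listed above can be reproduced with a short script. This is a minimal sketch, not the code from the original slides: the helper names naive_split and lookbehind_split and the sample sentences are assumptions made for illustration. The lookbehind variant is likewise an assumption; it only shows one common way to keep the punctuation attached to each sentence and does not fix any of the failure cases.

```python
import re

# The naive pattern from the text: sentence punctuation followed by one
# whitespace character.
NAIVE_PATTERN = r"[.:?!]\s"


def naive_split(text):
    """Split with the naive pattern; the punctuation and space are consumed."""
    return re.split(NAIVE_PATTERN, text)


def lookbehind_split(text):
    """Hypothetical variant: split on whitespace that follows punctuation.

    The lookbehind keeps the punctuation in the preceding piece. It still
    suffers from the same abbreviation and ordinal problems.
    """
    return re.split(r"(?<=[.:?!])\s", text)


if __name__ == "__main__":
    samples = [
        "This is an example. Questions? ",             # works, trailing ''
        "D. Trump and B. Obama met.",                  # abbreviations break
        "Er spielt in der 1. Bundesliga.",             # German ordinal breaks
        "An example sentence.Headline.Next sentence",  # missing space: no split
    ]
    for sample in samples:
        print(naive_split(sample))
    print(lookbehind_split(samples[0]))
```

With the naive pattern, the abbreviation sample is cut after "D" and "B", the ordinal sample is cut after "1", and the sample with missing spaces is not split at all, which matches the failure modes described in the list.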
