Tokenization and Lexical Units PDF
Introduction: Sentence Splitting

Many analysis techniques work at the sentence level. How do we split a text into sentences?

Naive approach: split using the regular expression [.:?!]\s

This fails easily:

- Abbreviations: D. Trump and B. Obama …
- Ordinal numbers in some languages: 1. Bundesliga in German, meaning 1st Bundesliga
- Unusual proper names: Microsoft.NET
- Quoted speech: Donald Trump called Mitch McConnell “A dumb son of a b—! A stone cold loser!” Then …
- Missing space (due to typos, preprocessing errors, …): … example sentence.Headline.Next sentence …

Regular Expression Playground

In-browser Python; it may take a moment to load:

```python
import re

print(re.split(r"[.:?!]\s", "This is an example. Questions? "))
print(re.split(r"(?
```
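The second re.split call is cut off in the transcript, so the pattern the slides intended is not recoverable. A minimal sketch of one plausible refinement, assuming it adds a negative lookbehind so that punctuation directly following a capital letter (as in the abbreviations "D." and "B.") does not trigger a split:

```python
import re

text = "D. Trump met B. Obama. This is an example. Questions? Yes!"

# Naive split: also breaks after the abbreviations "D." and "B."
print(re.split(r"[.:?!]\s", text))
# ['D', 'Trump met B', 'Obama', 'This is an example', 'Questions', 'Yes!']

# Assumed refinement (not from the slides): the negative lookbehind
# (?<![A-Z]) skips punctuation that directly follows a capital letter.
print(re.split(r"(?<![A-Z])[.:?!]\s", text))
# ['D. Trump met B. Obama', 'This is an example', 'Questions', 'Yes!']
```

Even this refinement still fails on the other cases listed above (Microsoft.NET, missing spaces after punctuation), which is why practical sentence splitters rely on abbreviation lists or trained models rather than a single regular expression.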