Python Regular Expressions

Questions and Answers

Which metacharacter matches any single character except a newline?

  • *
  • . (correct)
  • ^
  • $

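The re.match() function searches for a pattern anywhere in the string, returning the first match found.

False. re.match() only matches at the beginning of the string; re.search() scans the whole string for the first match.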

What is the purpose of using (?:...) in a regular expression?

It groups a part of a pattern without capturing it.

The $ metacharacter in regular expressions matches the ______ of the string.

end

Match the following special sequences with their descriptions:

  • \d = Matches any decimal digit (0-9)
  • \s = Matches any whitespace character
  • \w = Matches any word character (alphanumeric and underscore)
  • \D = Matches any character that is not a decimal digit

What does the re.sub() function do?

Replaces all occurrences of a pattern in a string with a replacement string.

By default, quantifiers in regular expressions are non-greedy.

False

What is the purpose of the re.compile() function in the re module?

It compiles a regular expression pattern into a regular expression object.

The flag re.IGNORECASE (or re.I) is used to perform ______ matching.

case-insensitive

Which of the following regular expression patterns can be used to extract email addresses from a given text?

\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

Flashcards

Regular Expressions (Regex)

Sequences of characters that define a search pattern, used for pattern matching, searching, and manipulation.

Regex Metacharacter: .

Matches any single character except a newline.

Regex Repetition: *

Matches zero or more occurrences of the preceding character or group.

Regex Character Classes: []

Defines a character class, matching any single character within the brackets.

Regex Special Sequence: \d

Matches any decimal digit (0-9).

Regex Grouping: ()

Used to group parts of a pattern and capture the matched text.

Regex Alternation: |

Acts as an 'or' operator, matching either the expression before or after the |.

re.search(pattern, string)

Searches the string for the first occurrence of the pattern.

re.sub(pattern, replacement, string)

Replaces all occurrences of the pattern in the string with the replacement string.

re.compile(pattern)

Compiles a regular expression pattern into a regular expression object.

Study Notes

  • Regular expressions (regex) are sequences of characters that define a search pattern.
  • They are used for pattern matching within strings, text searching, and text manipulation.
  • The re module in Python provides support for regular expressions.

Basic Patterns

  • Literal characters match themselves. For instance, the pattern a matches the character 'a'; re.search() finds its first occurrence in a string.
  • Metacharacters have special meanings: ., ^, $, *, +, ?, {}, [], \, |, and ().
  • . matches any single character except a newline (unless re.DOTALL is set).
  • ^ matches the beginning of the string.
  • $ matches the end of the string, or just before a newline at the very end of the string.
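
A minimal sketch of the basic patterns above, using a two-line sample string chosen only for illustration:

```python
import re

text = "cat\ncot"

# A literal matches itself; re.search() reports the first occurrence.
first_a = re.search(r"a", text).start()

# "." matches any single character except a newline.
dot_hits = re.findall(r"c.t", text)
across_newline = re.search(r"t.c", text)  # only "\n" sits between "t" and "c"

# "^" anchors at the start of the string, "$" at the end.
starts_with_cat = bool(re.search(r"^cat", text))
ends_with_cot = bool(re.search(r"cot$", text))
starts_with_cot = bool(re.search(r"^cot", text))  # "cot" starts a line, not the string

print(first_a)         # 1
print(dot_hits)        # ['cat', 'cot']
print(across_newline)  # None
print(starts_with_cat, ends_with_cot, starts_with_cot)  # True True False
```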

Repetition

  • * matches zero or more occurrences of the preceding character or group.
  • + matches one or more occurrences of the preceding character or group.
  • ? matches zero or one occurrence of the preceding character or group.
  • {m,n} matches between m and n occurrences (inclusive) of the preceding character or group; {m} matches exactly m occurrences.
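
The quantifiers can be compared side by side. In this sketch the sample strings and the helper name matching are illustrative assumptions; re.fullmatch() is used so that the whole string has to fit the pattern.

```python
import re

samples = ["ac", "abc", "abbc", "abbbbc"]

def matching(pattern):
    """Return the samples that the pattern matches in full (hypothetical helper)."""
    return [s for s in samples if re.fullmatch(pattern, s)]

zero_or_more = matching(r"ab*c")      # any number of b's, including none
one_or_more = matching(r"ab+c")       # at least one b
zero_or_one = matching(r"ab?c")       # at most one b
two_to_three = matching(r"ab{2,3}c")  # two or three b's

print(zero_or_more)  # ['ac', 'abc', 'abbc', 'abbbbc']
print(one_or_more)   # ['abc', 'abbc', 'abbbbc']
print(zero_or_one)   # ['ac', 'abc']
print(two_to_three)  # ['abbc']
```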

Character Classes

  • [] defines a character class, matching any single character within the brackets. For example, [abc] matches either 'a', 'b', or 'c'.
  • Ranges can be specified within character classes. [a-z] matches any lowercase letter and [0-9] matches any digit.
  • [^...] matches any character not in the character class. [^abc] matches any character except 'a', 'b', or 'c'.
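
A short sketch of character classes, ranges, and negation; the sample text is made up for the example:

```python
import re

text = "cab 42 Zed"

one_of_abc = re.findall(r"[abc]", text)     # single characters a, b, or c
lower_runs = re.findall(r"[a-z]+", text)    # runs of lowercase letters
digit_runs = re.findall(r"[0-9]+", text)    # runs of digits
not_abc_sp = re.findall(r"[^abc ]", text)   # anything except a, b, c, or space

print(one_of_abc)  # ['c', 'a', 'b']
print(lower_runs)  # ['cab', 'ed']
print(digit_runs)  # ['42']
print(not_abc_sp)  # ['4', '2', 'Z', 'e', 'd']
```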

Special Sequences

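  • \d matches any decimal digit. For str patterns in Python 3 this includes Unicode digits; use [0-9] or the re.ASCII flag to match only 0-9.
  • \D matches any character that is not a decimal digit.
  • \s matches any whitespace character (space, tab, newline, carriage return, form feed, vertical tab).
  • \S matches any non-whitespace character.
  • \w matches any word character (letters, digits, and underscore).
  • \W matches any non-word character.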
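
The special sequences can be tried on one sample string. This is only a sketch; the text is an arbitrary example containing a tab.

```python
import re

text = "Order 66, item_7\tready"

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of word characters (underscore included)
chunks = re.findall(r"\S+", text)     # runs of non-whitespace
non_word = re.findall(r"\W", text)    # single non-word characters

print(digits)    # ['66', '7']
print(words)     # ['Order', '66', 'item_7', 'ready']
print(chunks)    # ['Order', '66,', 'item_7', 'ready']
print(non_word)  # [' ', ',', ' ', '\t']
```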

Grouping and Capturing

  • () groups together parts of a pattern. It also captures the matched text, which can be retrieved later.
  • Captured groups are numbered starting from 1.
  • (?:...) groups a part of a pattern without capturing it.
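
As an illustration, the following sketch captures the year and month of a date but leaves the day in a non-capturing group, so it gets no group number. The date string is an invented example.

```python
import re

m = re.search(r"(\d{4})-(\d{2})-(?:\d{2})", "Released on 2024-03-15.")

whole = m.group(0)     # the entire match
year = m.group(1)      # first capturing group
month = m.group(2)     # second capturing group
captured = m.groups()  # only capturing groups appear here

print(whole)     # 2024-03-15
print(year)      # 2024
print(month)     # 03
print(captured)  # ('2024', '03')
```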

Alternation

  • | acts as an "or" operator, matching either the expression before or after the |. a|b matches either 'a' or 'b'.
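
A small sketch of alternation. Because | applies to everything on either side of it, a group is usually needed to limit its scope; the third call shows what happens without one.

```python
import re

pets = re.findall(r"cat|dog", "cat, dog, bird")
grouped = re.findall(r"gr(?:a|e)y", "gray and grey")  # alternation limited to one letter
ungrouped = re.findall(r"gra|ey", "gray and grey")    # reads as "gra" OR "ey"

print(pets)       # ['cat', 'dog']
print(grouped)    # ['gray', 'grey']
print(ungrouped)  # ['gra', 'ey']
```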

Escaping

  • \ is used to escape metacharacters, allowing you to match them literally. \$ matches a literal dollar sign.
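
A sketch of escaping in practice. The price string is an invented example, and re.escape() is a standard-library helper (not mentioned above) that escapes the metacharacters in a string for you.

```python
import re

# Escape "$" and "." so they are matched literally.
price = re.search(r"\$\d+\.\d{2}", "Total: $19.99").group()

# re.escape() builds a pattern that matches arbitrary text literally.
literal = "1+1=2"
found = re.search(re.escape(literal), "so 1+1=2 holds") is not None
unescaped = re.search(literal, "so 1+1=2 holds") is not None  # "+" acts as a quantifier here

print(price)      # $19.99
print(found)      # True
print(unescaped)  # False
```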

re Module Functions

  • re.search(pattern, string) searches the string for the first occurrence of the pattern. Returns a match object if found, otherwise None.
  • re.match(pattern, string) tries to match the pattern at the beginning of the string. Returns a match object if found, otherwise None.
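  • re.findall(pattern, string) finds all non-overlapping matches of the pattern in the string and returns them as a list of strings. If the pattern contains capturing groups, the list holds the captured text instead of the whole match (tuples when there is more than one group).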
  • re.finditer(pattern, string) finds all non-overlapping matches of the pattern in the string and returns them as an iterator of match objects.
  • re.sub(pattern, replacement, string) replaces all occurrences of the pattern in the string with the replacement string.
  • re.split(pattern, string) splits the string by the occurrences of the pattern.
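
The functions above can be compared on a single input. This is a sketch with a made-up key=value string, not a complete tour of each function's options.

```python
import re

text = "apple=1, banana=22, cherry=333"

first_number = re.search(r"\d+", text).group()   # first match anywhere
at_start = re.match(r"\d+", text)                # None: the string starts with a letter
leading_word = re.match(r"[a-z]+", text).group() # match() succeeds at position 0
numbers = re.findall(r"\d+", text)               # list of strings
pairs = re.findall(r"(\w+)=(\d+)", text)         # groups present: list of tuples
spans = [m.span() for m in re.finditer(r"\d+", text)]  # match objects give positions
masked = re.sub(r"\d+", "N", text)               # replace every match
fields = re.split(r",\s*", text)                 # split on comma plus optional spaces

print(first_number)  # 1
print(at_start)      # None
print(leading_word)  # apple
print(numbers)       # ['1', '22', '333']
print(pairs)         # [('apple', '1'), ('banana', '22'), ('cherry', '333')]
print(spans)         # [(6, 7), (16, 18), (27, 30)]
print(masked)        # apple=N, banana=N, cherry=N
print(fields)        # ['apple=1', 'banana=22', 'cherry=333']
```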

Match Objects

  • Match objects are returned by re.search() and re.match() when a match is found.
  • match.group(n) returns the nth captured group. match.group(0) returns the entire match.
  • match.groups() returns a tuple containing all captured groups.
  • match.start(n) returns the starting position of the nth group.
  • match.end(n) returns the ending position of the nth group.
  • match.span(n) returns a tuple containing the starting and ending positions of the nth group.
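
A brief sketch of the match object methods. The pattern is a deliberately simplified address matcher and the input string is invented for the example.

```python
import re

m = re.search(r"(\w+)@(\w+)\.com", "Contact: alice@example.com")

whole = m.group(0)          # entire match
user = m.group(1)           # first captured group
both = m.groups()           # all captured groups
user_start = m.start(1)     # index where group 1 begins
user_end = m.end(1)         # index just past group 1
domain_span = m.span(2)     # (start, end) of group 2
whole_span = m.span()       # with no argument, refers to the entire match

print(whole)                 # alice@example.com
print(user)                  # alice
print(both)                  # ('alice', 'example')
print(user_start, user_end)  # 9 14
print(domain_span)           # (15, 22)
print(whole_span)            # (9, 26)
```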

Regular Expression Compilation

  • re.compile(pattern) compiles a regular expression pattern into a regular expression object.
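  • Compiling a regex can improve performance if the pattern is used many times, and it gives the pattern a reusable name. The gain is often modest, because the re module also caches recently used patterns.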
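
A minimal sketch of compiling once and reusing the resulting object; the name word_re and the sample lines are illustrative.

```python
import re

# Compile once, then call methods on the pattern object.
word_re = re.compile(r"\b\w+\b")

lines = ["first line", "second line here"]
word_counts = [len(word_re.findall(line)) for line in lines]

print(word_counts)  # [2, 3]
```

The compiled object offers the same operations as the module-level functions (search, match, findall, finditer, sub, split) without passing the pattern each time.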

Flags

  • Flags modify the behavior of regular expressions.
  • re.IGNORECASE or re.I performs case-insensitive matching.
  • re.DOTALL or re.S makes the dot (.) match any character, including newlines.
  • re.MULTILINE or re.M makes ^ match the beginning of each line and $ match the end of each line.
  • re.VERBOSE or re.X allows for more readable regular expressions: whitespace in the pattern is ignored and # comments are allowed.
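
The flags can be sketched as follows. The sample text and the 3-digit, 4-digit number format are assumptions made for the example; flags are combined with |.

```python
import re

text = "Hello\nhello world"

any_case = re.findall(r"hello", text, re.IGNORECASE)
string_start_only = re.findall(r"^hello", text, re.I)    # "^" matches at the string start only
each_line = re.findall(r"^hello", text, re.I | re.M)     # "^" matches at every line start
dot_default = re.search(r"Hello.hello", text)            # "." does not match "\n"
dot_all = re.search(r"Hello.hello", text, re.DOTALL).group()

# re.VERBOSE: whitespace in the pattern is ignored and comments are allowed.
number_re = re.compile(r"""
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
""", re.VERBOSE)
number = number_re.search("call 555-1234").group()

print(any_case)           # ['Hello', 'hello']
print(string_start_only)  # ['Hello']
print(each_line)          # ['Hello', 'hello']
print(dot_default)        # None
print(repr(dot_all))      # 'Hello\nhello'
print(number)             # 555-1234
```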

Text Processing

  • Text processing involves manipulating and analyzing text data.
  • Regular expressions are a powerful tool for text processing tasks, including data extraction, data cleaning, data validation, and text transformation.

Common Text Processing Tasks with Regex

  • Extracting email addresses from a document.
  • Validating phone numbers.
  • Removing HTML tags from a web page.
  • Replacing multiple spaces with a single space.
  • Standardizing date formats.
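
Two of these tasks can be sketched with re.sub(). The date example assumes the input uses month/day/year order and rewrites it to year-month-day; the helper name to_iso and both sample strings are hypothetical.

```python
import re

# Replace runs of two or more spaces with a single space.
collapsed = re.sub(r" {2,}", " ", "too   many    spaces")

def to_iso(match):
    """Rewrite one M/D/YYYY match as YYYY-MM-DD (assumes month comes first)."""
    month, day, year = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"

# re.sub() accepts a function that receives each match object.
standardized = re.sub(r"(\d{1,2})/(\d{1,2})/(\d{4})", to_iso, "Due 3/7/2024")

print(collapsed)     # too many spaces
print(standardized)  # Due 2024-03-07
```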

Example: Extracting Email Addresses

  • Pattern: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
    • \b: Word boundary.
    • [A-Za-z0-9._%+-]+: One or more alphanumeric characters, dots, underscores, percent signs, plus signs, or hyphens.
    • @: The at symbol.
    • [A-Za-z0-9.-]+: One or more alphanumeric characters, dots, or hyphens.
    • \.: A literal dot.
    • [A-Za-z]{2,}: Two or more letters (top-level domain). Inside brackets a | is a literal pipe character, not alternation, so it is left out of the class.
    • \b: Word boundary.
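
A sketch of the extraction with re.findall(). The final character class is written as [A-Za-z], since a | inside brackets would match a literal pipe; the sample text and the name EMAIL_RE are assumptions. The pattern is a practical approximation and does not cover every valid address.

```python
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

text = "Write to ann.lee@example.com or bob_42@mail.example.org, not to ann@localhost."
emails = EMAIL_RE.findall(text)

print(emails)  # ['ann.lee@example.com', 'bob_42@mail.example.org']
```

The last address is skipped because nothing after the @ ends in a dot followed by at least two letters.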

Example: Validating Phone Numbers (US Format)

  • Pattern: ^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$
    • ^: Start of the string.
    • (\+\d{1,2}\s)?: Optional international code.
    • \(?: Optional opening parenthesis.
    • \d{3}: Three digits (area code).
    • \)?: Optional closing parenthesis.
    • [\s.-]?: Optional whitespace, dot, or hyphen.
    • \d{3}: Three digits.
    • [\s.-]?: Optional whitespace, dot, or hyphen.
    • \d{4}: Four digits.
    • $: End of the string.
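
The validation pattern can be exercised against a few candidates. This is a sketch; the candidate numbers and the name PHONE_RE are made up for illustration.

```python
import re

PHONE_RE = re.compile(r"^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")

candidates = [
    "555-123-4567",
    "(555) 123-4567",
    "+1 555.123.4567",
    "5551234567",
    "555-1234",        # too few digits
    "555-123-45678",   # too many digits
]
valid = [c for c in candidates if PHONE_RE.match(c)]

print(valid)
# ['555-123-4567', '(555) 123-4567', '+1 555.123.4567', '5551234567']
```

Because each parenthesis is optional on its own, the pattern also accepts an unbalanced form such as "(555 123-4567". A stricter validator would need an alternation between the parenthesized and bare area code.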

Example: Removing HTML Tags

  • Pattern: <.*?>
    • <: Matches the opening angle bracket.
    • .*?: Matches any character (.) zero or more times (*?) non-greedily.
    • >: Matches the closing angle bracket.
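
A minimal sketch of tag removal with re.sub(), using an invented HTML fragment. This works for simple markup; it is not a general HTML parser and can misfire on comments, scripts, or attribute values that contain ">".

```python
import re

html = "<p>Hello, <b>world</b>!</p>"

# Replace every tag-like span with the empty string.
plain = re.sub(r"<.*?>", "", html)

print(plain)  # Hello, world!
```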

Non-Greedy vs. Greedy Matching

  • By default, quantifiers like * and + are greedy; they will match as much text as possible.
  • Adding a ? after a quantifier makes it non-greedy, which will match as little text as possible.
  • Given the string <h1>Title</h1>, the greedy pattern <.*> would match the entire string, while the non-greedy pattern <.*?> would match <h1> and </h1> separately.
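
The difference is easy to confirm with re.findall() on the same string:

```python
import re

s = "<h1>Title</h1>"

greedy = re.findall(r"<.*>", s)   # ".*" runs to the last ">" in the string
lazy = re.findall(r"<.*?>", s)    # ".*?" stops at the first ">" it can

print(greedy)  # ['<h1>Title</h1>']
print(lazy)    # ['<h1>', '</h1>']
```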

Lookarounds

  • Lookarounds are zero-width assertions that match a position in the string based on what precedes or follows it, without including those characters in the match.
  • Positive lookahead (?=...): Matches if the subpattern ... matches at the current position.
  • Negative lookahead (?!...): Matches if the subpattern ... does not match at the current position.
  • Positive lookbehind (?<=...): Matches if the subpattern ... matches before the current position.
  • Negative lookbehind (?<!...): Matches if the subpattern ... does not match before the current position.

Example: Lookaround

  • To find words that are followed by a specific word, use a positive lookahead: \b\w+(?=\sfollowed)
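  • To find words that are not followed by a specific word, use a negative lookahead after a closing word boundary: \b\w+\b(?!\snotfollowed). Without the second \b, \w+ can backtrack and match a truncated word (for example "wor" from "word notfollowed").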
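
A sketch of the four lookarounds, with "cost" standing in for the specific word and sample strings invented for the example. The negative lookahead uses a closing \b; the variant without it is included to show how backtracking lets \w+ match a truncated word.

```python
import re

text = "total cost and unit cost and price list"

# Positive lookahead: words followed by " cost" (the lookahead text is not consumed).
before_cost = re.findall(r"\b\w+(?=\scost)", text)

# Negative lookahead: whole words NOT followed by " cost".
not_before_cost = re.findall(r"\b\w+\b(?!\scost)", text)

# Without the closing \b, \w+ backtracks to a shorter prefix that passes the check.
truncated = re.findall(r"\b\w+(?!\scost)", text)

# Lookbehinds check what comes before the current position.
after_dollar = re.findall(r"(?<=\$)\d+", "$40 for 3 items")
not_after_dollar = re.findall(r"(?<!\$)\b\d+", "$40 for 3 items")

print(before_cost)       # ['total', 'unit']
print(not_before_cost)   # ['cost', 'and', 'cost', 'and', 'price', 'list']
print(truncated)         # ['tota', 'cost', 'and', 'uni', 'cost', 'and', 'price', 'list']
print(after_dollar)      # ['40']
print(not_after_dollar)  # ['3']
```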

Best Practices

  • Compile regular expressions for reuse to improve performance.
  • Use raw strings (r"...") to define regular expressions, which prevents backslashes from being interpreted as escape sequences.
  • Comment complex regular expressions to explain their functionality.
  • Test regular expressions thoroughly with a variety of inputs.
  • Be aware of the performance implications of complex regular expressions.
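
The raw-string advice can be illustrated with \b, which means a backspace character in an ordinary Python string but a word boundary in a regex. A small sketch:

```python
import re

plain_len = len("\b")   # one character: a backspace (\x08)
raw_len = len(r"\b")    # two characters: a backslash and "b"

# Without a raw string the regex engine receives backspace characters.
without_raw = re.search("\bcat\b", "a cat")
with_raw = re.search(r"\bcat\b", "a cat").group()

print(plain_len, raw_len)  # 1 2
print(without_raw)         # None
print(with_raw)            # cat
```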
