Python RegEx: String Matching

Questions and Answers

Which of the following is the most accurate description of a regular expression (regex)?

  • A module used for complex mathematical calculations.
  • A sequence of characters that defines a search pattern. (correct)
  • A method within the string module to find substrings.
  • A specific data type in Python used for pattern matching.

Which Python module is primarily used for working with regular expressions?

  • regex
  • string
  • pattern
  • re (correct)

Which of the following functions from the re module returns a match object if a pattern is found anywhere in a string?

  • findall()
  • search() (correct)
  • sub()
  • split()

What does the re.search() function return if it cannot find a match for the specified regex pattern in the given string?

  • None (correct)

If result = re.search('123', 'foo123bar'), what would result.span() evaluate to?

  • (3, 6) (correct)

Which of the following metacharacters, when placed inside square brackets [], negates the character class?

  • ^ (correct)

What is the function of the '.' metacharacter in regular expressions?

  • Matches any character except a newline. (correct)

Which special sequence matches any digit character?

  • \d (correct)

What does the special sequence \b do in a regular expression?

  • Matches a word boundary. (correct)

Which of the following is true regarding the use of raw strings in regular expressions?

  • They prevent backslashes from being interpreted as escape sequences. (correct)

What does the quantifier * do in a regular expression?

  • Matches zero or more occurrences of the preceding character or group. (correct)

Given the regex a+, which of the following strings would not be a match?

  • '' (correct)

Given the regex foo[0-9]{3}bar, which of the following strings would match?

  • 'foo123bar' (correct)

What is the purpose of grouping constructs in regular expressions?

  • To break up the regex into subexpressions and capture parts of the matched string. (correct)

If you have a match object m, and you want to retrieve all captured groups as a tuple, which method should you use?

  • m.groups() (correct)

What do backreferences in regular expressions allow you to do?

  • Refer back to a previously captured group in the same regex. (correct)

What is the purpose of a non-capturing group in regular expressions, denoted by (?:regex)?

  • It groups the regex but does not capture the matching portion for later retrieval. (correct)

What is a key benefit of using a non-capturing group (?:...) in a regular expression?

  • It improves performance by not storing the matched group. (correct)

What does a lookahead assertion do in a regular expression?

  • Checks what follows the current position without consuming it. (correct)

In a regular expression, what is the key characteristic of lookahead and lookbehind assertions?

  • They are zero-width assertions and do not consume characters. (correct)

What is the difference between a positive lookahead (?=...) and a negative lookahead (?!...) assertion?

  • A positive lookahead asserts that the pattern must follow, while a negative lookahead asserts that the pattern must not follow. (correct)

Which of the following is the correct usage for the 'or' operator in regular expressions?

  • | (correct)

If you want to match either 'cat' or 'dog' in a string, what regex pattern would you use?

  • (cat|dog) (correct)

Which function from the re module would you use to find all non-overlapping matches of a pattern in a string, returning them as a list?

  • findall() (correct)

Which re module function is best suited for replacing occurrences of a pattern with a replacement string?

  • sub() (correct)

What is the purpose of flags in re.search() and other regular expression functions?

  • To modify how the regular expression is interpreted. (correct)

Which flag makes alphabetic character matching case-insensitive?

  • re.I (correct)

What is the effect of the re.MULTILINE flag?

  • It causes the '^' and '$' anchors to match at the beginning and end of each line. (correct)

Which flag causes the dot (.) metacharacter to match newline characters as well?

  • re.DOTALL (correct)

What is the primary purpose of the re.VERBOSE flag?

  • To allow for whitespace and comments within the regular expression. (correct)

What is the output of bool(re.search('abc', 'def'))?

  • False (correct)

If you have result = re.search('[0-9]{2}', 'abc12def'), what would result.group() return?

  • '12' (correct)

What is the output of the following code?

import re
s = 'hello world'
result = re.search('^hello', s)
print(bool(result))

  • True (correct)

What is the output of the following code?

import re
s = 'hello world'
result = re.search('world$', s)
print(result.group())

  • world (correct)

What does re.search(r'\d+', 'abc 123 def').group() return?

  • '123' (correct)

What will the following code print?

import re
text = "The cat in the hat."
result = re.search(r"t.e", text, re.IGNORECASE)
print(result.group())

  • The (correct)

What will the following code print?

import re
text = "apple, banana, cherry"
result = re.split(r",\s*", text)
print(result)

  • ['apple', 'banana', 'cherry'] (correct)

What will be the output of the code?

import re
text = '123abc456def'
new_text = re.sub(r'\d+', '#', text)
print(new_text)

  • #abc#def (correct)

What is the output of the following Python code snippet?

import re
string = 'hello123world'
pattern = r'(\D+)(\d+)(\D+)'
match = re.search(pattern, string)
print(match.groups())

  • ('hello', '123', 'world') (correct)

What does the regular expression r'\btest\b' match?

  • Only the whole word 'test'. (correct)

Which of the following regular expressions correctly matches a string that starts with 'foo', followed by any number of digits, and ends with 'bar'?

  • r'^foo\d*bar$' (correct)

Flashcards

Regular Expression (RegEx)

A sequence of characters that forms a complex string-matching pattern.

re Module in Python

A module in Python used for working with regular expressions.

re.search()

Returns a Match object if there is a match anywhere in the string.

Falsy Values

Values that evaluate to False in a Boolean context. Examples: empty lists [], empty tuples (), and zero of any numeric type.

Truthy Values

Values that are considered True in a Boolean context. Example: non-empty lists.

Metacharacters

Special characters in regex that have unique interpretations by the regex matching engine.

[0-9] in Regex

Matches any single decimal digit character, which means any character between '0' and '9', inclusive.

Dot (.) Metacharacter

Matches any single character except a newline.

Special Sequence (Regex)

A \ followed by one of the characters that represents a special meaning.

\w in Regex

Matches any alphanumeric word character, like uppercase and lowercase letters, digits, and the underscore

\W in Regex

Matches any character that isn't a word character.

\d in Regex

Matches any decimal digit character.

\D in Regex

Matches any character that isn't a decimal digit.

\B in Regex

Anchors a match to a location that isn't a word boundary.

\b in Regex

Anchors a match to a word boundary.

Raw string

A string literal prefixed with r, in which backslashes are not interpreted as escape characters.

* Quantifier

Matches zero or more repetitions of the preceding regex.

+ Quantifier

Matches one or more repetitions of the preceding regex.

? Quantifier

Matches zero or one repetition of the preceding regex.

{m} Quantifier

Matches exactly m repetitions of the preceding regex.

Grouping Constructs

Break up a regex into subexpressions, or groups, each of which forms a single syntactic entity. Additional metacharacters apply to the entire group as a unit.

m.groups()

A tuple containing all the captured groups from a regex match.

m.group()

Called with no arguments, returns a string containing the entire match.

Backreferences

Matches the contents of a previously captured group.

re.IGNORECASE

Makes matching of alphabetic characters case-insensitive.

findall()

Returns a list containing all matches

split()

Returns a list where the string has been split at each match

sub()

Replaces one or many matches with a string

re.MULTILINE

Causes the ^ and $ anchors to match at the beginning and end of each line, not only at the beginning and end of the whole string.

re.DOTALL

Causes the dot (.) metacharacter to match any character, including a newline.

Study Notes

  • A regular expression (regex) is a sequence of characters that defines a string-matching pattern
  • The re module is used in Python for regular expressions

Key Resources

  • https://docs.python.org/3/library/re.html#re.IGNORECASE
  • https://realpython.com/regex-python/
  • https://www.programiz.com/python-programming/regex
  • https://www.w3schools.com/python/python_regex.asp

Matching a Substring

  • Given a string object s, you can check in Python code whether s contains a specific substring, such as '123'
  • The in operator tests whether a string contains a specified substring
  • The .find() and .index() methods locate the position of a substring such as '123'
  • .find() and .index() are built-in methods of str objects, not functions in the string module
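A minimal sketch of these substring checks, assuming s is 'foo123bar' (a string borrowed from the quiz above). The in operator is included for comparison with .find() and .index().

```python
s = "foo123bar"

# The in operator only answers "is the substring there?"
contains_123 = "123" in s

# .find() returns the index of the first occurrence, or -1 if absent
found_at = s.find("123")
missing = s.find("999")

# .index() behaves like .find() but raises ValueError when absent
try:
    s.index("999")
    index_raised = False
except ValueError:
    index_raised = True

print(contains_123, found_at, missing, index_raised)
```

All of these compare characters literally, which is why they cannot express a pattern such as "any three digits".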

Python's Regular Expressions

  • Simple character-by-character comparisons work in many situations but are not always sufficient for complex string matching
  • Regular expressions can identify a sequence of three consecutive decimal digits within strings like 'foo123bar', 'foo456bar', '234baz', and 'qux678'
  • Python's regular expressions (regexes) handle this kind of complex matching problem

The ‘re’ Module

  • A built-in package in Python called re is used when working with Regular Expressions
  • Import the module with import re before searching a string for regex matches

Key Functions

  • findall() returns a list containing all matches
  • search() returns a Match object if there is a match anywhere in the string
  • split() returns a list where the string has been split at each match
  • sub() replaces one or many matches with a string
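The four functions can be compared side by side. This is a small sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "apple, banana, cherry 42 and 7"

# findall(): every non-overlapping match, as a list of strings
numbers = re.findall(r"\d+", text)

# search(): a Match object for the first match, or None
first = re.search(r"\d+", text)

# split(): the string cut at each match of the pattern
parts = re.split(r",\s*", "apple, banana, cherry")

# sub(): each match replaced by the replacement string
masked = re.sub(r"\d+", "#", text)

print(numbers, first.group(), parts, masked)
```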

The re.search() Function

  • These notes focus on the re.search() function for regex matching
  • re.search(<regex>, <string>) scans <string> for the first location where <regex> matches
  • re.search() returns a match object if a match is found; otherwise, it returns None

Examining the Returned Match Object

  • In this example, the search pattern is 123 and the search string is s ('foo123bar'); the returned match object is shown
  • Because the call succeeded, it returned a match object rather than None
  • span=(3, 6) shows the indices in the string where the match was found
  • These are the same indices you would use in slice notation: s[3:6] is '123'
  • match='123' shows the characters from the string that matched
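A short sketch of inspecting the match object, assuming s = 'foo123bar' as in the quiz above. The exact text printed for a match object varies slightly between Python versions.

```python
import re

s = "foo123bar"
m = re.search("123", s)

# Printing the object shows the span and the matched text, e.g.
# <re.Match object; span=(3, 6), match='123'>
print(m)

# The span is a (start, end) pair usable as slice bounds
start, end = m.span()
sliced = s[start:end]
print(sliced)
```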

Truthiness of Match Objects

  • A match object is "truthy", enabling its use in a Boolean context such as a conditional statement
  • A value such as False is considered Falsy
  • A value such as True is considered Truthy
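Because a match object is truthy and None is falsy, the result of re.search() can be tested directly in an if statement. The helper name describe below is hypothetical, used only for this sketch.

```python
import re

def describe(pattern, string):
    """Return a short message saying whether pattern occurs in string."""
    # A Match object is truthy; None (no match) is falsy
    if re.search(pattern, string):
        return "match"
    return "no match"

found = describe("123", "foo123bar")
not_found = describe("123", "foobar")
print(found, not_found)
```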

Truthy and Falsy Values according to Python Documentation

  • By default, an object is considered True
  • Non-empty sequences and/or collections (lists, tuples, strings, dictionaries, sets) are True
  • Numeric values that are not 0 are True

Falsy

  • Empty lists, such as []
  • Empty tuples, such as ()
  • Empty dictionaries, such as {}
  • Empty sets, such as set()
  • Empty strings, such as ""
  • Empty ranges, such as range(0)
  • Zero of any numeric type:
    • Integer: 0
    • Float: 0.0
    • Complex: 0j
  • Constants: None and False

Metacharacters

  • The real power of regex matching in Python comes from special characters in the pattern known as metacharacters
  • The regex matching engine interprets metacharacters specially, which greatly expands search capabilities
  • A character class can determine whether a string contains any sequence of three consecutive decimal digits
  • A character class is a set of characters enclosed in square brackets ([])
  • A character class matches any single character that belongs to the class

Commonly Used Metacharacters

  • [0-9] matches any single decimal digit character, any character between '0' and '9'
  • The expression [0-9][0-9][0-9] matches a string containing any sequence of three decimal digit characters
  • s matches because it contains three consecutive decimal digit characters, '123'
  • Examples of matching other numbers in strings include '465' in "foo465"
  • A string that does not contain three consecutive digits will not match
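To make this concrete, the sketch below runs the three-digit pattern over the strings mentioned earlier in these notes plus one made-up string, '12foo34', that contains no run of three digits.

```python
import re

pattern = "[0-9][0-9][0-9]"
samples = ["foo123bar", "foo456bar", "234baz", "qux678", "12foo34"]

# Map each sample to True/False depending on whether the pattern is found
results = {s: bool(re.search(pattern, s)) for s in samples}

for s, matched in results.items():
    print(s, matched)
```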

Obtaining Match Details

  • A match object provides methods that give details about the match
  • .start() returns the index where the match begins
  • .end() returns the index just past the end of the match (not included in the match)
  • .span() returns a tuple of the starting and ending (not included) indexes
  • .group() returns the matched substring
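A brief sketch of these accessor methods, using the 'foo465' string mentioned above.

```python
import re

m = re.search("[0-9][0-9][0-9]", "foo465")

begin = m.start()    # index of the first matched character
finish = m.end()     # index just past the last matched character
bounds = m.span()    # (start, end) as a tuple
matched = m.group()  # the matched substring

print(begin, finish, bounds, matched)
```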

Metacharacters Listed

Character(s)   Meaning
.              Matches any single character except newline
^              Anchors a match at the start of a string; complements a character class
$              Anchors a match at the end of a string
*              Matches zero or more repetitions
+              Matches one or more repetitions
?              Matches zero or one repetition; specifies the non-greedy versions of *, +, and ?; introduces a lookahead or lookbehind assertion; creates a named group
{}             Matches an explicitly specified number of repetitions
\              Escapes a metacharacter of its special meaning; introduces a special character class
[]             Specifies a character class
|              Designates alternation
()             Creates a group
: # = !        Designate a specialized group
<>             Creates a named group

Enumerating Characters

  • Characters contained in square brackets ([]) represent a character class
  • This is an enumerated set of characters to match from
  • A character class metacharacter sequence will match any single character that is contained in the class
  • The characters to match can be listed individually

Representing Ranges

  • A regex pattern can include a range of characters separated by a hyphen (-)
  • This setup matches any single character that falls within that specified range
  • For instance, the character class [a-z] matches any lowercase letter from 'a' to 'z', inclusive
  • [0-9] matches any digit character and [0-9a-fA-F] matches any hexadecimal digit character
  • re.search() scans the search string from left to right and stops as soon as it finds a match for the regex pattern, so the returned match is always the leftmost one
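A small sketch of ranges and of the leftmost-match behaviour. The sample strings are made up for illustration.

```python
import re

# [a-z] skips the uppercase letters and finds the first lowercase one
lower = re.search("[a-z]", "FOObar").group()

# [0-9a-fA-F] matches any single hexadecimal digit
hex_digit = re.search("[0-9a-fA-F]", "--- a0 ---").group()

# Several candidates exist, but only the leftmost match is returned
leftmost = re.search("[0-9][0-9]", "12 34 56")

print(lower, hex_digit, leftmost.group(), leftmost.span())
```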

Complementation

  • You can complement a character class by specifying ^ as the first character to match any character that is not in the set
  • [^0-9] matches any character that isn’t a digit
  • If a ^ character appears in a character class somewhere other than first, then the ^ has no special meaning and matches a literal '^'
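The two roles of ^ inside a character class can be sketched as follows. The sample strings are illustrative assumptions.

```python
import re

# ^ first: complemented class, so this finds the first non-digit
non_digit = re.search("[^0-9]", "12345foo").group()

# ^ not first: it is an ordinary character in the set {#, :, ^}
literal_caret = re.search("[#:^]", "foo^bar:baz#qux").group()

print(non_digit, literal_caret)
```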

Other Usage Cases for Metacharacters

  • To match a literal hyphen or closing square bracket inside a character class, place it where it cannot be read as syntax, or escape it with a backslash (\)

Hyphen Usage

  • A literal hyphen can be included in a character class in three ways
    • Place it as the first character
    • Place it as the last character
    • Escape it with a backslash (\) so that it does not specify a range of characters

Square Brackets

  • A literal ] can be included in a character class in two ways
    • Place it as the first character
    • Escape it with a backslash (\)

Other Regex Metacharacters

  • Other regex metacharacters lose their special meaning inside a character class

Dot metacharacter

  • Matches any single character except a newline
  • As a regex, foo.bar means the characters 'foo', then any character except newline, then the characters 'bar'
    • In the string 'fooxbar', the . metacharacter matches the character 'x'
  • The match fails when the character in that position is a newline
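A minimal sketch of the dot metacharacter with the foo.bar pattern, tried against three made-up strings.

```python
import re

# 'x' sits between foo and bar, so the dot matches it
with_char = re.search("foo.bar", "fooxbar")

# Nothing sits between foo and bar, so there is no match
no_char = re.search("foo.bar", "foobar")

# By default the dot does not match a newline
with_newline = re.search("foo.bar", "foo\nbar")

print(with_char, no_char, with_newline)
```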

Special Sequences in RegEx

  • A special sequence is a backslash (\) followed by one of the characters in the list below
  • They each have special meanings

List of Special Character Sequences

Character   Example    Description
\A          "\AThe"    Returns a match if the specified characters are at the beginning of the string
\b          r"\bain"   Returns a match where the specified characters are at the beginning or end of a word
\B          r"\Bain"   Returns a match where the specified characters are present, but NOT at the beginning or end of a word
\d          "\d"       Returns a match where the string contains a digit (0-9)
\D          "\D"       Returns a match where the string contains a character that is not a digit
\s          "\s"       Returns a match where the string contains a whitespace character
\S          "\S"       Returns a match where the string contains a character that is not whitespace
\w          "\w"       Returns a match where the string contains a word character (letters a to z and A to Z, digits 0-9, and the underscore _)
\W          "\W"       Returns a match where the string contains a character that is not a word character

The r in r"\bain" and r"\Bain" is not a regex command; it is a string prefix that makes the literal a raw string.

Commonly Used Special Sequences

\w and \W

Feature                   \w                          \W
Matches                   Any word character          Any non-word character
Definition                [a-zA-Z0-9_]                [^a-zA-Z0-9_]
Type of word characters   Uppercase and lowercase     Not applicable
Characteristics           Letters, digits, and "_"    Symbols (e.g. # @ % & * ( ) .)

\d and \D

Feature      \d                  \D
Matches      Any decimal digit   Any character that isn't a decimal digit
Definition   [0-9]               [^0-9]
Examples     0 through 9         Letters and symbols such as "Q", "?", "@", "{", "=", "^", "~"

\s and \S

Feature                              \s     \S
Matches a newline                    Yes    No
Matches tabs and spaces              Yes    No
Is the opposite of the other class   Yes    Yes
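The character-class sequences can be tried out with re.findall(). The sample string is made up and contains a letter, an underscore, a digit, a space, a symbol, and a tab.

```python
import re

text = "a_1 #\tb"

word_chars = re.findall(r"\w", text)      # letters, digits, underscore
non_word_chars = re.findall(r"\W", text)  # everything else
digits = re.findall(r"\d", text)          # decimal digits only
whitespace = re.findall(r"\s", text)      # space, tab, newline, ...

print(word_chars, non_word_chars, digits, whitespace)
```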

Anchors and Boundaries

  • Anchors are zero-width matches: they match a position in the search string rather than a character
  • ^ and \A anchor a match to the beginning of the search string
  • $ and \Z anchor a match to the end of the search string
  • \b anchors a match to a word boundary, and \B to a position that is not a word boundary
  • Anchors do not consume any part of the search string
  • An anchor specifies a particular location in the search string where a match must occur
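A hedged sketch of the start and end anchors, using made-up strings. The last two lines show a detail of Python's re module: $ also matches just before a trailing newline, while \Z matches only at the very end of the string.

```python
import re

at_start = re.search("^foo", "foobar")       # foo is at the start
not_at_start = re.search("^foo", "barfoo")   # foo is not at the start
at_end = re.search(r"bar\Z", "foobar")       # bar is at the end

# An anchor alone matches an empty, zero-width position
empty_span = re.search("^", "foo").span()

# $ tolerates one trailing newline; \Z does not
dollar_newline = re.search("bar$", "foobar\n")
z_newline = re.search(r"bar\Z", "foobar\n")

print(at_start, not_at_start, at_end, empty_span, dollar_newline, z_newline)
```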

Boundaries with Raw Strings

  • \b (and \B) should be written in raw strings in Python, because in an ordinary string literal '\b' is the backspace character
  • A raw string literal begins with the prefix letter r or R, which makes Python leave backslashes uninterpreted

\b Special Sequence

  • Anchors a match to a word boundary
  • Asserts that the current position is at the beginning or end of a word
  • A word is a sequence of alphanumeric characters or underscores ([A-Za-z0-9_])

Using \b at Both Ends

  • Placing \b at both ends of a pattern matches it only when it appears in the search string as a whole word

Important side note

Write patterns containing \b as raw strings, such as r'\bfoo\b', so that \b reaches the regex engine as a word boundary.

\B: the Opposite of \b

  • Asserts that the current position is not at the start or end of a word
  • In the string "barfoobaz", the pattern r'\Bfoo\B' matches because foo is not at a word boundary on either side
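The word-boundary sequences can be sketched as follows. The strings are illustrative, and the last line shows what happens when the r prefix is forgotten.

```python
import re

# \b before bar: bar must start a word
starts_word = re.search(r"\bbar", "foo bar")
inside_word = re.search(r"\bbar", "foobar")

# \b on both sides: bar must be a whole word
whole_word = re.search(r"\bbar\b", "foo bar baz")

# \B on both sides: foo must be strictly inside a word
not_boundary = re.search(r"\Bfoo\B", "barfoobaz")

# Without the r prefix, "\b" is a backspace character, not a boundary
no_raw = re.search("\bbar", "foo bar")

print(starts_word, inside_word, whole_word, not_boundary, no_raw)
```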

"Escaping Metacharacters"

  • Backslash (\)
  • Removes the special meaning of a metacharacter

"First Example: Escaping a Dot"

  • An unescaped dot (.) acts as a wildcard
  • It matches any character except a newline, while \. matches only a literal dot

"Second Example: Escaping a Backslash"

  • To match a literal backslash in the search string, escape it with another backslash (\\)

To Escape a Backslash

  • Use a raw string for the pattern, such as r'\\'; without one, the pattern has to be written as '\\\\'
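A small sketch of escaping, with made-up strings: first a dot, then a backslash, written both as a raw string and as an ordinary string.

```python
import re

# Unescaped dot: wildcard, so 'x' is accepted
wildcard = re.search("foo.bar", "fooxbar")

# Escaped dot: only a literal '.' is accepted
dot_miss = re.search(r"foo\.bar", "fooxbar")
dot_hit = re.search(r"foo\.bar", "foo.bar")

# s holds the seven characters f o o \ b a r
s = r"foo\bar"

# The regex \\ matches one literal backslash
raw_pattern = re.search(r"\\", s)
plain_pattern = re.search("\\\\", s)  # same regex without the r prefix

print(wildcard, dot_miss, dot_hit, raw_pattern.span(), plain_pattern.span())
```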

Quantifiers

  • A quantifier metacharacter follows a portion of a regex and indicates how many times that portion must occur for the match to succeed
  • * matches zero or more repetitions
  • + matches one or more repetitions
  • ? matches zero or one repetition
  • The quantifiers *, +, and ? are greedy; each has a lazy (non-greedy) version: *?, +?, and ??
  • In a pattern such as <.+>, the quantifier determines where the match ends
  • The greedy form extends to the last > in the string, while the lazy form <.+?> stops at the first >
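The difference between greedy and lazy quantifiers shows up when the closing character occurs more than once. This sketch assumes a made-up string containing three bracketed items.

```python
import re

text = "%<foo> <bar> <baz>%"

# Greedy: .+ grabs as much as possible, up to the last '>'
greedy = re.search("<.+>", text).group()

# Lazy: .+? grabs as little as possible, up to the first '>'
lazy = re.search("<.+?>", text).group()

print(greedy)
print(lazy)
```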

Summary of the Lazy and Greedy Versions

                  Greedy                    Lazy
Match length      Longest possible match    Shortest possible match
Example pattern   <.+>                      <.+?>
What about 0?

Additional Quantifiers

  • {m} matches exactly m repetitions of the preceding regex
  • {m,n} matches any number of repetitions of the preceding regex from m to n, inclusive
  • m and n are non-negative integers
  • If m is omitted, it defaults to 0; if n is omitted, there is no upper limit

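Greedy and Lazy Versions of {m,n}

  • The greedy version, {m,n}, produces the longest possible match
  • a{3,5} applied to 'aaaaaaaa' produces 'aaaaa'
  • The lazy version, {m,n}?, produces the shortest possible match
  • a{3,5}? applied to 'aaaaaaaa' produces 'aaa'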
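A brief sketch of the curly-brace quantifiers. The x---x strings are made up; the string of eight a's is the one from the notes above.

```python
import re

# {3}: exactly three hyphens between the x's
exact_hit = re.search("x-{3}x", "x---x")
exact_miss = re.search("x-{3}x", "x--x")

# {3,5}: greedy, takes as many as allowed (five)
greedy = re.search("a{3,5}", "aaaaaaaa").group()

# {3,5}?: lazy, takes as few as allowed (three)
lazy = re.search("a{3,5}?", "aaaaaaaa").group()

print(exact_hit, exact_miss, greedy, lazy)
```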

Terms

"Grouping Constructs"

  • Break up a regex into parts

"Subexpression"

  • A group represents a single syntactic unit

"Additional Metacharacters"

  • Apply to the entire group as a unit
  • The text matched by a group is captured and can be retrieved later
  • A group can be contained in another group

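Regex Terms and Applications

Situation                                       What to do
A word boundary (\b) is needed in the pattern   Write the pattern as a raw string
Part of a regex must act as a unit              Create a group; additional metacharacters then apply to the group as a unit

"Groups and Syntax"

  • In (bar)+, the + quantifier applies to the whole group, so the regex matches one or more occurrences of the string 'bar'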
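A short sketch of grouping. The first two searches contrast a quantifier applied to a group with one applied to a single character (the strings are made up); the third reuses the three-group pattern from the quiz above.

```python
import re

# + applies to the whole group (bar)
grouped = re.search("(bar)+", "foo barbarbar baz")

# Without parentheses, + applies only to the final 'r'
ungrouped = re.search("bar+", "foo barrr baz")

# Each pair of parentheses captures its own part of the match
m = re.search(r"(\D+)(\d+)(\D+)", "hello123world")

print(grouped.group(), grouped.groups())
print(ungrouped.group())
print(m.groups())
```

When a quantified group repeats, groups() reports only the text from its last repetition, which is why grouped.groups() holds a single 'bar'.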
