Python regular expression

A regular expression is a special sequence of characters that lets you easily check whether a string matches a pattern.
Python has included the re module since version 1.5; it provides Perl-style regular expression patterns.
The re module gives the Python language full regular expression functionality.
The compile function generates a regular expression object from a pattern string and optional flags. This object has a set of methods for regular expression matching and replacement.
This chapter focuses on the regular expression functions most commonly used in Python.

re.match function

re.match attempts to match a pattern at the beginning of the string. If the pattern does not match at the starting position, match() returns None.
Function syntax:

re.match(pattern, string, flags=0)

Function parameters:

Parameter   Description
pattern     The regular expression to match.
string      The string to search.
flags       Optional flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.

If the match succeeds, re.match returns a match object; otherwise it returns None.

Match object method   Description
group(num=0)          Returns the string matched by the entire expression or, given a group number, by that group. group() also accepts several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()              Returns a tuple containing the strings of all groups, from group 1 up to however many groups the pattern contains.
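
As a minimal sketch of the two methods described above, the snippet below calls group() with no argument, with several group numbers, and then groups(). The pattern and the sample string are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical sample text and pattern, used only to illustrate the methods.
m = re.match(r'(\w+)-(\d+)', 'item-42 in stock')

whole = m.group()        # same as m.group(0): the entire match
pair = m.group(1, 2)     # several group numbers at once give a tuple
all_groups = m.groups()  # every captured group, starting from group 1

print(whole)       # item-42
print(pair)        # ('item', '42')
print(all_groups)  # ('item', '42')
```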

example

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re
print(re.match('www', 'www.mysite.com').span()) # Match at the starting position
print(re.match('com', 'www.mysite.com')) # Does not match at the starting position

result

(0, 3)
None

example

#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

# re.M | re.I combines multi-line mode with case-insensitive matching
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

The above example execution results are as follows:

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter

re.search method

re.search scans the entire string and returns the first successful match.
Function syntax:

re.search(pattern, string, flags=0)

Function parameters:

Parameter   Description
pattern     The regular expression to match.
string      The string to search.
flags       Optional flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.

If the search succeeds, re.search returns a match object; otherwise it returns None.
We can use the match object's group(num) or groups() methods to get the matched text.

Match object method   Description
group(num=0)          Returns the string matched by the entire expression or, given a group number, by that group. group() also accepts several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups()              Returns a tuple containing the strings of all groups, from group 1 up to however many groups the pattern contains.

example

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re
print(re.search('www', 'www.mysite.com').span()) # Match at the starting position
print(re.search('com', 'www.mysite.com').span()) # Match away from the starting position

result

(0, 3)
(11, 14)

example

#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

# re.M | re.I combines multi-line mode with case-insensitive matching
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if searchObj:
    print("searchObj.group() : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")

The above example execution results are as follows:

searchObj.group() :  Cats are smarter than dogs
searchObj.group(1) :  Cats
searchObj.group(2) :  smarter

re.match only matches at the beginning of the string: if the start of the string does not match the regular expression, the match fails and the function returns None. re.search, by contrast, scans the entire string until it finds a match.

example

#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

# re.match only tries the start of the string, so 'dogs' is not found
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

# re.search scans the whole string and finds 'dogs' at the end
matchObj = re.search(r'dogs', line, re.M | re.I)
if matchObj:
    print("search --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

The above example execution results are as follows:

No match!!
search --> matchObj.group() :  dogs

Search and replace

Python's re module provides re.sub for replacing matches in strings.
Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

  • pattern : The regular expression pattern string.
  • repl : The replacement string; it can also be a function.
  • string : The original string to search and replace in.
  • count : The maximum number of substitutions to perform. The default, 0, replaces all matches.
  • flags : Optional flags that control how the pattern is matched.
#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re

phone = "2004-959-559 # This is a phone number"

# Remove the Python-style comment from the string
num = re.sub(r'#.*$', "", phone)
print("phone number is: ", num)

# Remove every non-digit character, including the hyphens
num = re.sub(r'\D', "", phone)
print("phone number is : ", num)

The above example execution results are as follows:

phone number is:  2004-959-559 
phone number is :  2004959559
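
The count parameter described above can be sketched as follows. This is a minimal illustration that reuses the digits of the phone number as sample data; the variable names are assumptions made for the example.

```python
import re

text = "2004-959-559"

# count=0 (the default) replaces every match.
all_replaced = re.sub(r'-', ' ', text)

# count=1 stops after the first replacement.
first_only = re.sub(r'-', ' ', text, count=1)

print(all_replaced)  # 2004 959 559
print(first_only)    # 2004 959-559
```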

The repl argument is a function

The following example multiplies each matched number in the string by 2:
example

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import re

# Multiply the matched number by 2; re.sub calls this once per match
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)

s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))

The execution output is:

A46G8HFD1134

re.compile function

The compile function compiles a regular expression and generates a regular expression (pattern) object, whose match() and search() methods can then be used.
The syntax is:

re.compile(pattern[, flags])

Parameters:

  • pattern : A regular expression in the form of a string.
  • flags : Optional; selects the matching mode, such as ignoring case or multi-line mode. The available flags are:
    1. re.I ignores case.
    2. re.L makes \w, \W, \b and \B depend on the current locale. In Python 3 it can only be used with bytes patterns.
    3. re.M enables multi-line mode, in which ^ and $ also match at the start and end of each line.
    4. re.S makes . match any character, including a newline (without it, . does not match a newline).
    5. re.U makes \w, \W, \b, \B, \d, \D, \s and \S follow the Unicode character properties database. In Python 3 this is already the default for str patterns.
    6. re.X improves readability by ignoring whitespace in the pattern and any comment after #.
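
The effect of the most common flags can be sketched as below. The sample strings and variable names are assumptions made for illustration, and re.L and re.U are left out because they depend on the locale and on the pattern type.

```python
import re

text = "Hello\nWORLD"

# re.I: case-insensitive matching.
ci = re.search(r'world', text, re.I).group()

# re.M: ^ and $ also match at the start and end of each line.
line_starts = re.findall(r'^\w+', text, re.M)

# re.S: '.' also matches the newline character.
without_s = re.match(r'Hello.WORLD', text)
with_s = re.match(r'Hello.WORLD', text, re.S).group()

# re.X: whitespace and # comments inside the pattern are ignored.
verbose = re.compile(r'''
    (\d{4})   # four digits
    -         # a literal hyphen
    (\d{3})   # three digits
''', re.X)
parts = verbose.match('2004-959-559').groups()

print(ci)           # WORLD
print(line_starts)  # ['Hello', 'WORLD']
print(without_s)    # None
print(repr(with_s)) # 'Hello\nWORLD'
print(parts)        # ('2004', '959')
```

Flags can be combined with the | operator, as in re.M | re.I.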

example

>>> import re
>>> pattern = re.compile(r'\d+') # Matches one or more digits
>>> m = pattern.match('one12twothree34four') # Tries to match at the start of the string: no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10) # Starts matching at the 'e' (index 2): no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10) # Starts matching at the '1' (index 3): matches
>>> print(m) # Returns a Match object
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)
'12'
>>> m.start(0)
3
>>> m.end(0)
5
>>> m.span(0)
(3, 5)

As shown above, a successful match returns a Match object, where:

  • The group([group1, ...]) method returns the strings matched by one or more groups. To get the entire matched substring, use group() or group(0).
  • The start([group]) method returns the starting position, within the whole string, of the substring matched by the group (the index of the substring's first character). The parameter defaults to 0.
  • The end([group]) method returns the end position, within the whole string, of the substring matched by the group (the index of the substring's last character, plus 1). The parameter defaults to 0.
  • The span([group]) method returns (start(group), end(group)).

example

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I) # re.I means ignore case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m) # The match succeeds and returns a Match object
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0) # Returns the entire matched substring
'Hello World'
>>> m.span(0) # Returns the index span of the entire matched substring
(0, 11)
>>> m.group(1) # Returns the substring matched by the first group
'Hello'
>>> m.span(1) # Returns the index span of the substring matched by the first group
(0, 5)
>>> m.group(2) # Returns the substring matched by the second group
'World'
>>> m.span(2) # Returns the index span of the substring matched by the second group
(6, 11)
>>> m.groups() # Equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3) # There is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

findall

findall finds all substrings of the string that match the regular expression and returns them as a list. If no match is found, it returns an empty list.
Note: match and search return at most one match, whereas findall returns all matches.
The syntax, as a method of a compiled pattern object, is:

findall(string[, pos[, endpos]])

Parameters:

  • string : The string to be searched.
  • pos : Optional; the position in the string at which to start searching. The default is 0.
  • endpos : Optional; the position in the string at which to stop searching. The default is the length of the string.

Find all the numbers in the string:
# -*- coding:UTF8 -*-

import re

pattern = re.compile(r'\d+') # Find numbers
result1 = pattern.findall('runoob 123 google 456')
result2 = pattern.findall('run88oob123google456', 0, 10) # Search only positions 0 to 9

print(result1)
print(result2)

Output result:

['123', '456']
['88', '12']
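
One detail worth knowing when using findall is that the shape of the result depends on the capturing groups in the pattern. The sketch below reuses the sample string from the example above; the three patterns are assumptions chosen for illustration.

```python
import re

text = 'runoob 123 google 456'

# No capturing group: a list of whole matches.
no_group = re.findall(r'[a-z]+ \d+', text)

# One capturing group: a list of that group's strings.
one_group = re.findall(r'[a-z]+ (\d+)', text)

# Several capturing groups: a list of tuples, one tuple per match.
many_groups = re.findall(r'([a-z]+) (\d+)', text)

print(no_group)     # ['runoob 123', 'google 456']
print(one_group)    # ['123', '456']
print(many_groups)  # [('runoob', '123'), ('google', '456')]
```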

re.finditer

Like findall, re.finditer finds all substrings of the string that match the regular expression, but it returns them as an iterator of match objects.

re.finditer(pattern, string, flags=0)

Function parameters:

Parameter   Description
pattern     The regular expression to match.
string      The string to search.
flags       Optional flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.

example

# -*- coding: UTF-8 -*-

import re

it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())

Output result:

12 
32
43
3
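
Because finditer yields match objects rather than plain strings, each result also carries its position. A minimal sketch, reusing the string from the example above (the variable name found is an assumption):

```python
import re

# Collect each matched number together with its (start, end) span.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', '12a32bc43jf3')]

for text, span in found:
    print(text, span)
# 12 (0, 2)
# 32 (3, 5)
# 43 (7, 9)
# 3 (11, 12)
```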

re.split

The split function splits the string into a list at each substring that matches the pattern. It is used in the following form:

re.split(pattern, string[, maxsplit=0, flags=0])

Function parameters:

Parameter   Description
pattern     The regular expression to match.
string      The string to split.
maxsplit    The maximum number of splits; maxsplit=1 splits once. The default, 0, places no limit on the number of splits.
flags       Optional flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.
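
A minimal sketch of re.split follows. The sample strings are hypothetical; the last call shows that when the pattern contains a capturing group, the separators are kept in the result.

```python
import re

text = 'runoob, google; taobao  baidu'

# Split on any run of commas, semicolons or whitespace.
parts = re.split(r'[,;\s]+', text)

# maxsplit=1 performs a single split and leaves the rest untouched.
limited = re.split(r'[,;\s]+', text, maxsplit=1)

# A capturing group in the pattern keeps the separators in the list.
with_separators = re.split(r'(\d)', 'a1b2c')

print(parts)            # ['runoob', 'google', 'taobao', 'baidu']
print(limited)          # ['runoob', 'google; taobao  baidu']
print(with_separators)  # ['a', '1', 'b', '2', 'c']
```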

Regular expression pattern

A pattern string uses a special syntax to represent a regular expression:
Letters and digits represent themselves: a letter or digit in a regular expression pattern matches the same character in the string.
Some letters and digits take on a different meaning when preceded by a backslash (for example \d or \1).
Many punctuation characters have special meanings; they match themselves only when escaped with a backslash.
The backslash itself needs to be escaped with a backslash.
Since regular expressions usually contain backslashes, it is best to write them as raw strings. In a raw string the backslash is kept as-is, so r'\t' is equivalent to '\\t', and the pattern element still matches the corresponding special character (here, a tab).
The following table lists the special elements of the regular expression pattern syntax. If you supply optional flags along with a pattern, the meaning of some pattern elements changes.
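
The difference between raw and ordinary strings can be sketched as below. The sample strings are assumptions; the \b case is included because it is where forgetting the r prefix most often goes wrong.

```python
import re

# '\t' is a single tab character; r'\t' is two characters: a backslash and 't'.
plain_len = len('\t')
raw_len = len(r'\t')
same = (r'\t' == '\\t')

# Both forms match a tab, because the regex engine itself understands \t.
tab_plain = bool(re.search('\t', 'a\tb'))
tab_raw = bool(re.search(r'\t', 'a\tb'))

# For \b the two differ: in an ordinary string '\b' is a backspace character,
# while in a raw string it reaches the regex engine as a word boundary.
no_prefix = re.findall('\bcat\b', 'a cat sat')
raw_prefix = re.findall(r'\bcat\b', 'a cat sat')

print(plain_len, raw_len, same)  # 1 2 True
print(tab_plain, tab_raw)        # True True
print(no_prefix)                 # []
print(raw_prefix)                # ['cat']
```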

Pattern       Description
^             Matches the start of the string (with re.M, also the start of each line).
$             Matches the end of the string (with re.M, also the end of each line).
.             Matches any character except a newline. With the re.DOTALL flag, it matches any character, including a newline.
[…]           A set of characters: [amk] matches 'a', 'm' or 'k'.
[^…]          Any character not in the set: [^abc] matches any character other than a, b or c.
re*           Matches zero or more repetitions of the preceding expression (greedy).
re+           Matches one or more repetitions of the preceding expression (greedy).
re?           Matches zero or one occurrence of the preceding expression (greedy). Appending ? to a quantifier (*?, +?, ??) makes it non-greedy.
re{n}         Matches exactly n repetitions of the preceding expression. For example, o{2} cannot match the "o" in "Bob" but matches the two o's in "food".
re{n,}        Matches n or more repetitions of the preceding expression. For example, o{2,} cannot match the "o" in "Bob" but matches all the o's in "foooood". o{1,} is equivalent to o+, and o{0,} is equivalent to o*.
re{n,m}       Matches from n to m repetitions of the preceding expression, as many as possible (greedy).
(re)          Matches the expression in parentheses and captures it as a group.
(?imx)        Turns on the i, m or x flags. In Python these inline flags apply to the whole pattern and belong at its start.
(?-imx)       Turns off the i, m or x flags in Perl-style syntax. Python's re module accepts only the scoped form (?-imx: re).
(?: re)       Like (…), but does not capture a group.
(?imx: re)    Turns on the i, m or x flags for the expression in parentheses only.
(?-imx: re)   Turns off the i, m or x flags for the expression in parentheses only.
(?#…)         Comment; its content is ignored.
(?= re)       Positive lookahead assertion. Succeeds if re matches at the current position, without consuming any characters; the rest of the pattern then continues matching from that same position.
(?! re)       Negative lookahead assertion. The opposite of positive lookahead: succeeds when re cannot match at the current position of the string.
(?> re)       Atomic (independent) group: once matched, it is not backtracked into. Supported by Python's re module from version 3.11.
\w            Matches a word character: a letter, a digit or an underscore.
\W            Matches any character that is not a letter, digit or underscore.
\s            Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v].
\S            Matches any non-whitespace character.
\d            Matches any digit; for ASCII text, equivalent to [0-9].
\D            Matches any non-digit.
\A            Matches only at the start of the string.
\Z            Matches only at the end of the string. (In Perl, \Z also matches just before a trailing newline; in Python it does not.)
\z            Perl-style anchor for the absolute end of the string. In Python, use \Z; older versions of the re module reject \z.
\G            In Perl, matches the position where the last match finished. Not supported by Python's re module.
\b            Matches a word boundary, that is, the position between a word character and a non-word character (or the edge of the string). For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B            Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\n, \t, etc.  Match a newline, a tab, and so on.
\1…\9         Matches the content of the nth captured group.
\10           Matches the content of the 10th captured group. In Python, an octal character code is written with a leading 0 or with three octal digits instead.
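
A few of the elements in the table are easier to grasp with a small experiment. The sketch below covers greedy versus non-greedy quantifiers, lookahead assertions and backreferences; all sample strings and variable names are assumptions chosen for illustration.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r'<.+>', html)
# Non-greedy: .+? stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r'<.+?>', html)

# Positive lookahead: a word followed by '.com'; '.com' itself is not consumed.
followed = re.findall(r'\w+(?=\.com)', 'www.mysite.com')
# Negative lookahead: a 'q' that is not followed by 'u'.
not_followed = re.findall(r'q(?!u)', 'Iraq queue')

# Backreference: \1 must repeat exactly what group 1 matched.
repeated = re.search(r'\b(\w+) \1\b', 'this is is a test').group()

print(greedy)        # ['<b>bold</b> and <i>italic</i>']
print(lazy)          # ['<b>', '</b>', '<i>', '</i>']
print(followed)      # ['mysite']
print(not_followed)  # ['q']
print(repeated)      # is is
```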

Regular expression examples

Character matching

Example   Description
python    Matches "python".

Character class

Example       Description
[Pp]ython     Matches "Python" or "python".
rub[ye]       Matches "ruby" or "rube".
[aeiou]       Matches any one of the letters in the brackets.
[0-9]         Matches any digit; the same as [0123456789].
[a-z]         Matches any lowercase letter.
[A-Z]         Matches any uppercase letter.
[a-zA-Z0-9]   Matches any letter or digit.
[^aeiou]      Matches any character other than a, e, i, o or u.
[^0-9]        Matches any character other than a digit.
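
The character classes above can be tried out directly with re.findall. This is a small sketch; the sample strings and variable names are hypothetical.

```python
import re

either_case = re.findall(r'[Pp]ython', 'Python and python')
ruby_or_rube = re.findall(r'rub[ye]', 'ruby rube rubx')
vowels = re.findall(r'[aeiou]', 'regex')
non_digits = re.findall(r'[^0-9]', 'a1b2')
alnum_runs = re.findall(r'[a-zA-Z0-9]+', 'ab-12_Cd')

print(either_case)   # ['Python', 'python']
print(ruby_or_rube)  # ['ruby', 'rube']
print(vowels)        # ['e', 'e']
print(non_digits)    # ['a', 'b']
print(alnum_runs)    # ['ab', '12', 'Cd']
```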

Special character class

Example   Description
.         Matches any single character except "\n". To match any character including '\n', use the re.DOTALL flag or a pattern like '[\s\S]'; inside brackets, a dot only matches a literal dot.
\d        Matches a digit character. For ASCII text, equivalent to [0-9].
\D        Matches a non-digit character. For ASCII text, equivalent to [^0-9].
\s        Matches any whitespace character, including spaces, tabs, form feeds, and more. For ASCII text, equivalent to [ \f\n\r\t\v].
\S        Matches any non-whitespace character. For ASCII text, equivalent to [^ \f\n\r\t\v].
\w        Matches any word character, including the underscore. For ASCII text, equivalent to '[A-Za-z0-9_]'.
\W        Matches any non-word character. For ASCII text, equivalent to '[^A-Za-z0-9_]'.
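
A short sketch of the special character classes, including two ways of making a pattern match across a newline. The sample strings and variable names are assumptions for illustration.

```python
import re

text = 'line1\nline2'

# By default '.' stops at the newline.
dot_default = re.match(r'.+', text).group()
# With re.DOTALL, '.' also matches the newline.
dot_all = re.match(r'.+', text, re.DOTALL).group()
# [\s\S] matches any character at all, newline included, without a flag.
any_char = re.match(r'[\s\S]+', text).group()
# Inside brackets a dot is literal: [.\n] matches only '.' or a newline.
literal_dot = re.findall(r'[.\n]', 'a.b\nc')

digits = re.findall(r'\d', 'a1 b2')
words = re.findall(r'\w+', 'foo_bar baz!')
spaces = re.findall(r'\s', 'a b\tc')

print(repr(dot_default))  # 'line1'
print(repr(dot_all))      # 'line1\nline2'
print(repr(any_char))     # 'line1\nline2'
print(literal_dot)        # ['.', '\n']
print(digits)             # ['1', '2']
print(words)              # ['foo_bar', 'baz']
print(spaces)             # [' ', '\t']
```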
  • Author of this article:zfish
  • Link to this article: archives/86e5f9e1.html
  • Copyright Notice: All articles in this blog, except for special statements, please indicate the source!