Python Concise Tutorial

Python - Regular Expressions

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are popularly known as regex or regexp.

Usually, such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

Large-scale text processing in data science projects requires manipulating textual data. Many programming languages, including Python, support regular expressions. Python's standard library provides the re module for this purpose.

Since the patterns passed to the functions of the re module are usually written as raw strings, let us first understand what raw strings are.

Raw Strings

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. Python, on the other hand, uses the same character as the escape character in string literals. To avoid doubling every backslash, patterns are usually written in Python's raw string notation.

A string becomes a raw string if it is prefixed with r or R before the quotation marks. Hence 'Hello' is a normal string, whereas r'Hello' is a raw string.

>>> normal="Hello"
>>> print (normal)
Hello
>>> raw=r"Hello"
>>> print (raw)
Hello

In normal circumstances, there is no difference between the two. However, when an escape sequence is embedded in the string, the normal string interprets it, whereas the raw string leaves it unprocessed.

>>> normal="Hello\nWorld"
>>> print (normal)
Hello
World
>>> raw=r"Hello\nWorld"
>>> print (raw)
Hello\nWorld

In the above example, when the normal string is printed, the escape sequence '\n' is processed and introduces a newline. In the raw string, the 'r' prefix keeps the backslash and the 'n' as two ordinary characters, so they are printed as they are.
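The difference matters most when a pattern itself contains backslashes. The following is a minimal sketch; the variable names and the sample path are only illustrative.

```python
import re

# "\n" is one character (a newline); r"\n" is two: a backslash and 'n'.
normal = "\n"
raw = r"\n"
print(len(normal), len(raw))

# To match one literal backslash, the regex itself must be \\ .
# Without a raw string, every backslash has to be doubled once more.
path = "C:\\temp"        # the text C:\temp
plain_pattern = "\\\\"   # normal string: four backslashes in the source
raw_pattern = r"\\"      # raw string: two backslashes in the source
print(plain_pattern == raw_pattern)
print(re.findall(raw_pattern, path))
```

Both spellings produce the same pattern; the raw string is simply easier to read.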

Metacharacters

Most letters and characters simply match themselves. However, some characters are special metacharacters and do not match themselves. Metacharacters have a special meaning, similar to * in a wildcard.

Here is the complete list of metacharacters −

. ^ $ * + ? { } [ ] \ | ( )

The square bracket symbols [ and ] indicate a set of characters that you wish to match. Characters can be listed individually, or as a range of characters separated by a '-'.

'\' is an escaping metacharacter. When followed by various characters, it forms various special sequences. If you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
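Character sets and escaping can be tried out with a short sketch like the one below. It uses re.findall(), which is described later in this chapter; the sample text is made up for illustration.

```python
import re

text = "Order [A7] costs 45 dollars"

# [0-9] is a range; it matches any single digit.
digits = re.findall(r"[0-9]", text)
print(digits)

# [aeiou] lists the characters individually.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)

# \[ and \] match literal square brackets instead of starting a set.
bracketed = re.findall(r"\[[A-Z0-9]+\]", text)
print(bracketed)
```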

The predefined sets of characters represented by such special sequences beginning with '\' (for example \d, \s and \w) are listed under "Special Character Classes" later in this chapter.

Python's re module provides useful functions for finding a match, searching for a pattern, substituting a matched string with another string, and so on.

The re.match() Function

This function attempts to match the RE pattern at the start of the string, with optional flags. Following is the syntax for this function −

re.match(pattern, string, flags=0)

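Here is the description of the parameters −

pattern − The regular expression to be matched.
string − The string that is checked for the pattern at its beginning.
flags − Optional modifiers, combined with bitwise OR (|). They are listed under "Regular Expression Modifiers: Option Flags" later in this chapter.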

The re.match() function returns a match object on success and None on failure. A match object contains information about the match: where it starts and ends, the substring it matched, and so on.

The match object's start() method returns the starting position of the match in the string, and end() returns the position just after its last character.

If the pattern is not found, re.match() returns None instead of a match object.

We use the group(num) or groups() method of the match object to get the matched expression.

Example

import re
line = "Cats are smarter than dogs"
matchObj = re.match( r'Cats', line)
print (matchObj.start(), matchObj.end())
print ("matchObj.group() : ", matchObj.group())

It will produce the following output −

0 4
matchObj.group() : Cats
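The group(num) and groups() methods become more useful once the pattern contains parentheses. The sketch below is only an illustration; the pattern with two capturing groups is an assumption added here and is not part of the example above.

```python
import re

line = "Cats are smarter than dogs"

# Two capturing groups: the first word and the third word.
m = re.match(r"(\w+) are (\w+)", line)
if m:
    print(m.group())    # the whole match
    print(m.group(1))   # text captured by the first group
    print(m.group(2))   # text captured by the second group
    print(m.groups())   # all captured groups as a tuple

# match() only looks at the start of the string, so this returns None.
missing = re.match(r"dogs", line)
print(missing)
```

Checking the result with if before calling group() avoids an AttributeError when the pattern does not match.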

The re.search() Function

This function searches for the first occurrence of the RE pattern within the string, with optional flags. Following is the syntax for this function −

re.search(pattern, string, flags=0)

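Here is the description of the parameters −

pattern − The regular expression to be matched.
string − The string that is searched for the pattern.
flags − Optional modifiers, combined with bitwise OR (|). They are listed under "Regular Expression Modifiers: Option Flags" later in this chapter.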

The re.search() function returns a match object on success and None on failure. We use the group(num) or groups() method of the match object to get the matched expression.

Example

import re
line = "Cats are smarter than dogs"
matchObj = re.search( r'than', line)
print (matchObj.start(), matchObj.end())
print ("matchObj.group() : ", matchObj.group())

It will produce the following output −

17 21
matchObj.group() : than

Matching Vs Searching

Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).

Example

import re
line = "Cats are smarter than dogs"
matchObj = re.match( r'dogs', line, re.M|re.I)
if matchObj:
   print ("match --> matchObj.group() : ", matchObj.group())
else:
   print ("No match!!")
searchObj = re.search( r'dogs', line, re.M|re.I)
if searchObj:
   print ("search --> searchObj.group() : ", searchObj.group())
else:
   print ("Nothing found!!")

When the above code is executed, it produces the following output −

No match!!
search --> searchObj.group() : dogs
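The re.M flag used above does not change this behaviour of match(). The following sketch, with a made-up two-line string, suggests that match() stays tied to the very start of the string, while search() combined with ^ can find the start of a later line.

```python
import re

text = "Cats are smart\ndogs are loyal"

# re.M lets ^ match after each newline, but match() is still
# anchored to position 0 of the whole string.
m = re.match(r"dogs", text, re.M)
s = re.search(r"^dogs", text, re.M)

print(m)
print(s.group(), s.start())
```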

The re.findall() Function

The findall() function returns all non-overlapping matches of pattern in string as a list of strings, or as a list of tuples if the pattern contains more than one group. The string is scanned left to right, and matches are returned in the order found. Empty matches are included in the result.

Syntax

re.findall(pattern, string, flags=0)

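Parameters

pattern − The regular expression to be matched.
string − The string that is searched for the pattern.
flags − Optional modifiers, combined with bitwise OR (|).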

Example

import re
string="Simple is better than complex."
obj=re.findall(r"ple", string)
print (obj)

It will produce the following output −

['ple', 'ple']

The following code tries to obtain the list of words in a sentence with the help of the findall() function. Because \w* also matches zero characters, the result contains an empty string wherever no word character follows.

import re
string="Simple is better than complex."
obj=re.findall(r"\w*", string)
print (obj)

It will produce the following output −

['Simple', '', 'is', '', 'better', '', 'than', '', 'complex', '', '']
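If only the words are wanted, one option is to require at least one word character with \w+. The sketch below also shows the list-of-tuples result that findall() gives when the pattern has more than one group; the key=value string is an invented example.

```python
import re

string = "Simple is better than complex."

# + needs at least one word character, so no empty matches appear.
words = re.findall(r"\w+", string)
print(words)

# With two groups, each match is returned as a tuple of the groups.
pairs = re.findall(r"(\w+)=(\d+)", "a=1, b=22")
print(pairs)
```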

The re.sub() Function

One of the most important functions of the re module is sub().

Syntax

re.sub(pattern, repl, string, count=0, flags=0)

This function replaces occurrences of the RE pattern in string with repl. All occurrences are replaced unless count is provided, in which case at most count replacements are made. The function returns the modified string.

Example

import re
phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print ("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print ("Phone Num : ", num)

It will produce the following output −

Phone Num : 2004-959-559
Phone Num : 2004959559

Example

The following example uses the sub() function to substitute all occurrences of is with was −

import re
string="Simple is better than complex. Complex is better than complicated."
obj=re.sub(r'is', r'was',string)
print (obj)

It will produce the following output −

Simple was better than complex. Complex was better than complicated.
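The pattern r'is' works here only because the sentence has no other word containing "is". The sketch below uses a different, invented sentence to show how \b can restrict the replacement to whole words, and how the count argument limits the number of replacements.

```python
import re

string = "This is his thesis."

# Plain 'is' also matches inside other words.
everywhere = re.sub(r"is", "was", string)
print(everywhere)

# \b restricts the match to the standalone word.
whole_word = re.sub(r"\bis\b", "was", string)
print(whole_word)

# count=1 replaces only the first occurrence.
first_only = re.sub(r"is", "was", string, count=1)
print(first_only)
```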

The re.compile() Function

The compile() function compiles a regular expression pattern into a regular expression object, which can be used for matching with its match(), search() and other methods.

Syntax

re.compile(pattern, flags=0)

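Flags

The available flags are described under "Regular Expression Modifiers: Option Flags" later in this chapter.

The sequence −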

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to −

result = re.match(pattern, string)

However, using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression is used several times in a single program.

Example

import re
string="Simple is better than complex. Complex is better than complicated."
pattern=re.compile(r'is')
obj=pattern.match(string)   # None, because 'is' is not at the start of the string
obj=pattern.search(string)  # first occurrence anywhere in the string
print (obj.start(), obj.end())

obj=pattern.findall(string)
print (obj)

obj=pattern.sub(r'was', string)
print (obj)

It will produce the following output −

7 9
['is', 'is']
Simple was better than complex. Complex was better than complicated.

The re.finditer() Function

This function returns an iterator yielding match objects over all non-overlapping matches of the RE pattern in string.

Syntax

re.finditer(pattern, string, flags=0)

Example

import re
string="Simple is better than complex. Complex is better than complicated."
pattern=re.compile(r'is')
iterator = pattern.finditer(string)

for match in iterator:
   print(match.span())

It will produce the following output −

(7, 9)
(39, 41)

Use Cases of Python Regex

Finding all Adverbs

findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if a writer wanted to find all of the adverbs in some text, they might use findall() in the following manner −

import re
text = "He was carefully disguised but captured quickly by police."
obj = re.findall(r"\w+ly\b", text)
print (obj)

It will produce the following output −

['carefully', 'quickly']
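When the position of each adverb is needed as well, finditer() can be used instead of findall(), because it yields match objects rather than plain strings. A minimal sketch on the same text follows.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Each match object carries both the matched text and its position.
found = [(m.start(), m.end(), m.group())
         for m in re.finditer(r"\w+ly\b", text)]

for start, end, word in found:
    print(start, end, word)
```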

Finding words starting with vowels

import re
text = 'Errors should never pass silently. Unless explicitly silenced.'
obj=re.findall(r'\b[aeiouAEIOU]\w+', text)
print (obj)

It will produce the following output −

['Errors', 'Unless', 'explicitly']

Regular Expression Modifiers: Option Flags

The functions of the re module accept optional modifiers that control various aspects of matching. The modifiers are passed as the optional flags argument. You can provide multiple modifiers by combining them with bitwise OR (|), as shown previously. The available modifiers are −

Sr.No.  Modifier  Description
1       re.I      Performs case-insensitive matching.
2       re.L      Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B). In Python 3 it can be used only with bytes patterns.
3       re.M      Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
4       re.S      Makes a period (dot) match any character, including a newline.
5       re.U      Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B. It is the default for str patterns in Python 3.
6       re.X      Permits more readable ("verbose") regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.
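The flags can be tried out with a short sketch such as the following. The sample text and the phone-number pattern are assumptions made up for this illustration.

```python
import re

text = "first line\nSECOND line"

# re.I | re.M: ignore case and let ^ match at the start of each line.
hits = re.findall(r"^second", text, re.I | re.M)
print(hits)

# re.S lets the dot match the newline as well.
spanning = re.search(r"first.*line", text, re.S).group()
print(repr(spanning))

# re.X allows whitespace and comments inside the pattern.
phone = re.compile(r"""
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
""", re.X)
parts = phone.search("call 555-1234").groups()
print(parts)
```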

Regular Expression Patterns

Except for the control characters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a control character by preceding it with a backslash.

The following table lists the regular expression syntax that is available in Python −

Sr.No.  Pattern       Description
1       ^             Matches beginning of line.
2       $             Matches end of line.
3       .             Matches any single character except newline. Using the s option (re.S) allows it to match newline as well.
4       [...]         Matches any single character in brackets.
5       [^...]        Matches any single character not in brackets.
6       re*           Matches 0 or more occurrences of preceding expression.
7       re+           Matches 1 or more occurrences of preceding expression.
8       re?           Matches 0 or 1 occurrence of preceding expression.
9       re{n}         Matches exactly n occurrences of preceding expression.
10      re{n,}        Matches n or more occurrences of preceding expression.
11      re{n,m}       Matches at least n and at most m occurrences of preceding expression.
12      a|b           Matches either a or b.
13      (re)          Groups regular expressions and remembers matched text.
14      (?imx)        Turns on the i, m, or x options for the whole regular expression. In current Python versions it must appear at the start of the pattern.
15      (?-imx)       Turns off the i, m, or x options in some other regex engines. Python's re module accepts this only in the scoped form (?-imx: re).
16      (?: re)       Groups regular expressions without remembering matched text.
17      (?imx: re)    Turns on i, m, or x options only within the parentheses.
18      (?-imx: re)   Turns off i, m, or x options only within the parentheses.
19      (?#...)       Comment.
20      (?= re)       Positive lookahead: requires that re matches at this position, without consuming any characters.
21      (?! re)       Negative lookahead: requires that re does not match at this position, without consuming any characters.
22      (?> re)       Matches independent pattern without backtracking (atomic group, available from Python 3.11).
23      \w            Matches word characters.
24      \W            Matches nonword characters.
25      \s            Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v].
26      \S            Matches nonwhitespace.
27      \d            Matches digits. For ASCII text, equivalent to [0-9].
28      \D            Matches nondigits.
29      \A            Matches beginning of string.
30      \Z            Matches only at the end of the string.
31      \z            Matches end of string in Perl-style engines. Python versions before 3.14 reject it, so use \Z instead.
32      \G            Matches point where last match finished in Perl-style engines. Not supported by Python's re module.
33      \b            Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
34      \B            Matches nonword boundaries.
35      \n, \t, etc.  Matches newlines, carriage returns, tabs, etc.
36      \1...\9       Matches nth grouped subexpression.

Regular Expression Examples

Literal characters

Sr.No.  Example  Description
1       python   Match "python".

Character classes

Sr.No.  Example       Description
1       [Pp]ython     Match "Python" or "python"
2       rub[ye]       Match "ruby" or "rube"
3       [aeiou]       Match any one lowercase vowel
4       [0-9]         Match any digit; same as [0123456789]
5       [a-z]         Match any lowercase ASCII letter
6       [A-Z]         Match any uppercase ASCII letter
7       [a-zA-Z0-9]   Match any of the above
8       [^aeiou]      Match anything other than a lowercase vowel
9       [^0-9]        Match anything other than a digit

Special Character Classes

Sr.No.  Example  Description
1       .        Match any character except newline
2       \d       Match a digit: [0-9]
3       \D       Match a nondigit: [^0-9]
4       \s       Match a whitespace character: [ \t\r\n\f]
5       \S       Match nonwhitespace: [^ \t\r\n\f]
6       \w       Match a single word character: [A-Za-z0-9_]
7       \W       Match a nonword character: [^A-Za-z0-9_]

Repetition Cases

Sr.No.  Example   Description
1       ruby?     Match "rub" or "ruby": the y is optional
2       ruby*     Match "rub" plus 0 or more ys
3       ruby+     Match "rub" plus 1 or more ys
4       \d{3}     Match exactly 3 digits
5       \d{3,}    Match 3 or more digits
6       \d{3,5}   Match 3, 4, or 5 digits

Nongreedy repetition

A ? placed after a repetition operator makes it nongreedy, so that it matches the smallest number of repetitions −

Sr.No.  Example  Description
1       <.*>     Greedy repetition: matches "<python>perl>"
2       <.*?>    Nongreedy: matches "<python>" in "<python>perl>"
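The difference between the greedy and nongreedy forms can be checked with a minimal sketch on the string "<python>perl>".

```python
import re

text = "<python>perl>"

# Greedy: .* takes as much as possible, up to the last '>'.
greedy = re.search(r"<.*>", text).group()

# Nongreedy: .*? stops at the first '>' that completes the match.
lazy = re.search(r"<.*?>", text).group()

print(greedy)
print(lazy)
```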

Grouping with Parentheses

Sr.No.  Example             Description
1       \D\d+               No group: + repeats \d
2       (\D\d)+             Grouped: + repeats \D\d pair
3       ([Pp]ython(, )?)+   Match "Python", "Python, python, python", etc.

Backreferences

A backreference matches a previously matched group again −

Sr.No.  Example              Description
1       ([Pp])ython&\1ails   Match python&pails or Python&Pails
2       (['"])[^\1]*\1       Single or double-quoted string. \1 matches whatever the 1st group matched. \2 matches whatever the 2nd group matched, etc. In Python, \1 inside [...] is not a backreference, so (['"]).*?\1 is the more reliable form.
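A short sketch of backreferences in Python follows. The quoted-string pattern uses the nongreedy form (['"])(.*?)\1, which is an assumption chosen here for illustration, and the sample strings are made up.

```python
import re

# \1 must repeat exactly what the first group matched (P or p).
pattern = re.compile(r"([Pp])ython&\1ails")
results = [bool(pattern.fullmatch(s))
           for s in ["python&pails", "Python&Pails", "Python&pails"]]
print(results)

# The closing quote has to be the same character as the opening one.
text = "say \"hi\" or 'bye'"
quoted = re.findall(r"""(['"])(.*?)\1""", text)
print(quoted)
```

Because the second pattern has two groups, findall() returns tuples of the quote character and the quoted text.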

Alternatives

Sr.No.  Example       Description
1       python|perl   Match "python" or "perl"
2       rub(y|le)     Match "ruby" or "ruble"
3       Python(!+

Anchors

Anchors specify the position at which a match must occur.

Sr.No.  Example       Description
1       ^Python       Match "Python" at the start of a string or internal line
2       Python$       Match "Python" at the end of a string or line
3       \APython      Match "Python" at the start of a string
4       Python\Z      Match "Python" at the end of a string
5       \bPython\b    Match "Python" at a word boundary
6       \brub\B       \B is nonword boundary: match "rub" in "rube" and "ruby" but not alone
7       Python(?=!)   Match "Python", if followed by an exclamation point.
8       Python(?!!)   Match "Python", if not followed by an exclamation point.
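The lookahead and word-boundary anchors can be explored with a small sketch like the one below; the sample strings are invented for illustration.

```python
import re

text = "Python! Python? Python"

# (?=!) requires a following '!', but does not include it in the match.
followed = [m.start() for m in re.finditer(r"Python(?=!)", text)]
matched_text = re.search(r"Python(?=!)", text).group()

# (?!!) accepts only occurrences that are not followed by '!'.
not_followed = [m.start() for m in re.finditer(r"Python(?!!)", text)]

# \brub\B matches "rub" only when another word character follows it.
inside_words = re.findall(r"\brub\B", "rub ruby rube")

print(followed, matched_text)
print(not_followed)
print(inside_words)
```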

Special Syntax with Parentheses

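Sr.No.  Example        Description
1       R(?#comment)   Matches "R". All the rest is a comment
2       R(?i)uby       Case-insensitive while matching "uby" in engines that allow flags in the middle of a pattern. Current Python versions reject this form, so use the next one.
3       R(?i:uby)      Case-insensitive only while matching "uby"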

4

*rub(?:y