Perl: A Concise Tutorial

Perl - Regular Expressions

A regular expression is a string of characters that defines the pattern or patterns you are looking for. The syntax of regular expressions in Perl is very similar to what you will find in other programs that support regular expressions, such as sed, grep, and awk.

The basic method for applying a regular expression is to use the pattern binding operators =~ and !~. The first operator is a test and assignment operator; the second is its negation and is true when the pattern does not match.

There are three regular expression operators within Perl.

  1. Match Regular Expression - m//

  2. Substitute Regular Expression - s///

  3. Transliterate Regular Expression - tr///

The forward slashes in each case act as delimiters for the regular expression (regex) that you are specifying. If you are more comfortable with another delimiter, you can use it in place of the forward slash.

The Match Operator

The match operator, m//, is used to match a string or statement against a regular expression. For example, to match the character sequence "foo" against the scalar $bar, you might use a statement like this:

#!/usr/bin/perl

$bar = "This is foo and again foo";
if ($bar =~ /foo/) {
   print "First time is matching\n";
} else {
   print "First time is not matching\n";
}

$bar = "foo";
if ($bar =~ /foo/) {
   print "Second time is matching\n";
} else {
   print "Second time is not matching\n";
}

When the above program is executed, it produces the following result:

First time is matching
Second time is matching

The m// operator actually works in the same fashion as the q// operator series: you can use any combination of naturally matching characters to act as delimiters for the expression. For example, m{}, m(), and m<> are all valid. So the above example can be rewritten as follows:

#!/usr/bin/perl

$bar = "This is foo and again foo";
if ($bar =~ m[foo]) {
   print "First time is matching\n";
} else {
   print "First time is not matching\n";
}

$bar = "foo";
if ($bar =~ m{foo}) {
   print "Second time is matching\n";
} else {
   print "Second time is not matching\n";
}

You can omit the m from m// if the delimiters are forward slashes, but for all other delimiters you must use the m prefix.

Note that the entire match expression, that is, the expression on the left of =~ together with the match operator, returns true (in scalar context) if the expression matches; with !~ the result is inverted. Therefore the statement:

$true = ($foo =~ m/foo/);

sets $true to 1 if $foo matches the regex, or to the empty string (which is false) if the match fails. In list context, the match returns the contents of any grouped expressions. For example, when extracting the hours, minutes, and seconds from a time string, we can use:

my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);
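
The list-context behaviour can be sketched as follows. This is a minimal illustration, and parse_time is a hypothetical helper name that does not appear in the original text. It relies on the fact that a failed match in list context returns an empty list, which makes it easy to detect bad input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_time is a hypothetical helper used only for illustration.
sub parse_time {
   my ($time) = @_;
   # In list context the match returns the captured groups,
   # or an empty list when the pattern does not match.
   my @parts = ($time =~ m/(\d+):(\d+):(\d+)/);
   return @parts;
}

my @ok  = parse_time("12:05:30");
my @bad = parse_time("noon");

print "Parsed: @ok\n";
print "Fields on failure: ", scalar(@bad), "\n";
```

Checking the number of returned fields before using them avoids working with undefined values when the input does not look like a time.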

Match Operator Modifiers

The match operator supports its own set of modifiers. The /g modifier allows for global matching. The /i modifier makes the match case insensitive. Here is the complete list of modifiers:

  1. i - Makes the match case insensitive.
  2. m - Specifies that if the string has newline or carriage return characters, the ^ and $ operators will now match against a newline boundary, instead of a string boundary.
  3. o - Evaluates the expression only once.
  4. s - Allows use of . to match a newline character.
  5. x - Allows you to use white space in the expression for clarity.
  6. g - Globally finds all matches.
  7. cg - Allows the search to continue even after a global match fails.
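
A short sketch of how several of these modifiers behave in practice. The sample strings are made up for illustration and are not from the original text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Perl is fun\nperl is terse";

# /i ignores case; /g in list context returns every match.
my @hits = ($text =~ /perl/gi);

# /m lets ^ match at the start of each line, not only the start of the string.
my @line_starts = ($text =~ /^(\w+)/gm);

# /s lets . match the newline, so the match can span both lines.
my $spans = ($text =~ /fun.perl/s) ? "yes" : "no";

# /x ignores white space inside the pattern, which helps readability.
my ($hour, $minute) = ("09:45" =~ / (\d+) : (\d+) /x);

print "Hits: @hits\n";
print "Line starts: @line_starts\n";
print "Spans lines: $spans\n";
print "Hour $hour, minute $minute\n";
```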

Matching Only Once

There is also a variant of the match operator that matches only once: the m?PATTERN? operator. It is basically identical to the m// operator, except that it matches only once between calls to reset. Older versions of Perl also accepted the bare form ?PATTERN?; since Perl 5.22 the m prefix is required.

For example, you can use this to get the first and last matching elements within a list:

#!/usr/bin/perl

@list = qw/food foosball subeo footnote terfoot canic footbridge/;

foreach (@list) {
   $first = $1 if m?(foo.*)?;
   $last = $1 if /(foo.*)/;
}
print "First: $first, Last: $last\n";

When the above program is executed, it produces the following result:

First: food, Last: footbridge

Regular Expression Variables

Regular expression variables include $+, which contains whatever the last grouping match matched; $&, which contains the entire matched string; $`, which contains everything before the matched string; and $', which contains everything after the matched string. The following code demonstrates the result:

#!/usr/bin/perl

$string = "The food is in the salad bar";
$string =~ m/foo/;
print "Before: $`\n";
print "Matched: $&\n";
print "After: $'\n";

When the above program is executed, it produces the following result:

Before: The
Matched: foo
After: d is in the salad bar
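
The $+ variable is not exercised by the program above. A minimal sketch of where it is useful, using made-up input lines: when a pattern has alternative groups and only one of them takes part in the match, $+ holds the text of whichever group matched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @seen;
for my $line ("Version: 1.2", "Revision: 42") {
   # Only one of the two groups participates in each match;
   # $+ holds the text of the highest-numbered group that matched.
   if ($line =~ /Version: (.*)|Revision: (.*)/) {
      push @seen, $+;
   }
}

print "Captured: @seen\n";
```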

The Substitution Operator

The substitution operator, s///, is really just an extension of the match operator that allows you to replace the matched text with some new text. The basic form of the operator is:

s/PATTERN/REPLACEMENT/;

The PATTERN is the regular expression for the text that we are looking for. The REPLACEMENT is a specification for the text or expression that we want to use to replace the found text. For example, we can replace the first occurrence of cat with dog using the following substitution:

#!/usr/bin/perl

$string = "The cat sat on the mat";
$string =~ s/cat/dog/;

print "$string\n";

When the above program is executed, it produces the following result:

The dog sat on the mat

Substitution Operator Modifiers

Here is the list of all the modifiers used with the substitution operator.

  1. i - Makes the match case insensitive.
  2. m - Specifies that if the string has newline or carriage return characters, the ^ and $ operators will now match against a newline boundary, instead of a string boundary.
  3. o - Evaluates the expression only once.
  4. s - Allows use of . to match a newline character.
  5. x - Allows you to use white space in the expression for clarity.
  6. g - Replaces all occurrences of the found expression with the replacement text.
  7. e - Evaluates the replacement as if it were a Perl statement, and uses its return value as the replacement text.
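
A brief sketch of the /g, /i, and /e modifiers, using sample strings invented for illustration. It also shows that s/// returns the number of substitutions it made.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "The cat sat on the mat with another Cat";

# Without /g only the first occurrence is replaced.
(my $first_only = $string) =~ s/cat/dog/;

# With /g and /i every occurrence is replaced, regardless of case.
my $all   = $string;
my $count = ($all =~ s/cat/dog/gi);

# With /e the replacement is evaluated as Perl code.
(my $doubled = "3 apples and 5 pears") =~ s/(\d+)/$1 * 2/ge;

print "$first_only\n";
print "$all ($count substitutions)\n";
print "$doubled\n";
```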

The Translation Operator

Translation is similar, but not identical, to substitution: unlike substitution, translation (or transliteration) does not use regular expressions to search for the values it replaces. The translation operators are:

tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds

The translation replaces all occurrences of the characters in SEARCHLIST with the corresponding characters in REPLACEMENTLIST. For example, using the "The cat sat on the mat" string we have been using in this chapter:

#!/usr/bin/perl

$string = 'The cat sat on the mat';
$string =~ tr/a/o/;

print "$string\n";

When the above program is executed, it produces the following result:

The cot sot on the mot

Standard Perl ranges can also be used, allowing you to specify ranges of characters either by letter or by numerical value. To change the case of a string, you might use the following syntax in place of the uc function.

$string =~ tr/a-z/A-Z/;
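
A small sketch of two related uses of tr, with a sample string chosen for illustration: the operator returns the number of characters it matched, so with an empty REPLACEMENTLIST it can count characters without changing the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "The cat sat on the mat";

# With an empty REPLACEMENTLIST and no /d modifier, tr leaves the
# string unchanged and simply returns how many characters matched.
my $vowels = ($string =~ tr/aeiou//);

# Work on a copy so the original string is preserved.
(my $upper = $string) =~ tr/a-z/A-Z/;

print "Vowels: $vowels\n";
print "Upper: $upper\n";
```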

Translation Operator Modifiers

Following is the list of modifiers related to translation.

  1. c - Complements SEARCHLIST.
  2. d - Deletes found but unreplaced characters.
  3. s - Squashes duplicate replaced characters.

The /d modifier deletes the characters matching SEARCHLIST that do not have a corresponding entry in REPLACEMENTLIST. For example:

#!/usr/bin/perl

$string = 'the cat sat on the mat.';
$string =~ tr/a-z/b/d;

print "$string\n";

When the above program is executed, it produces the following result:

b b   b.

The last modifier, /s, squashes runs of identical replaced characters down to a single character, so:

#!/usr/bin/perl

$string = 'food';
$string =~ tr/a-z/a-z/s;

print "$string\n";

When the above program is executed, it produces the following result:

fod
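
The /c modifier is not shown above. A minimal sketch, using a made-up phone number and spacing example: combining /c with /d deletes every character that is not in SEARCHLIST, and /s with an empty REPLACEMENTLIST squashes runs of a character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# /c complements the SEARCHLIST and /d deletes the matches,
# so everything that is not a digit is removed.
(my $digits = "(555) 010-9999") =~ tr/0-9//cd;

# /s squashes each run of spaces down to a single space.
(my $squashed = "a   b    c") =~ tr/ //s;

print "$digits\n";
print "$squashed\n";
```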

More Complex Regular Expressions

You don't just have to match on fixed strings. In fact, you can match on just about anything you could dream of by using more complex regular expressions. Here's a quick cheat sheet.

The following table lists the regular expression syntax that is available in Perl.

  1. ^ - Matches the beginning of the string (or of each line with the m modifier).
  2. $ - Matches the end of the string (or of each line with the m modifier).
  3. . - Matches any single character except newline. Using the s modifier allows it to match newline as well.
  4. [...] - Matches any single character in brackets.
  5. [^...] - Matches any single character not in brackets.
  6. * - Matches 0 or more occurrences of the preceding expression.
  7. + - Matches 1 or more occurrences of the preceding expression.
  8. ? - Matches 0 or 1 occurrence of the preceding expression.
  9. {n} - Matches exactly n occurrences of the preceding expression.
  10. {n,} - Matches n or more occurrences of the preceding expression.
  11. {n,m} - Matches at least n and at most m occurrences of the preceding expression.
  12. a|b - Matches either a or b.
  13. \w - Matches word characters.
  14. \W - Matches nonword characters.
  15. \s - Matches whitespace. Equivalent to [ \t\n\r\f].
  16. \S - Matches nonwhitespace.
  17. \d - Matches digits. Equivalent to [0-9].
  18. \D - Matches nondigits.
  19. \A - Matches the beginning of the string.
  20. \Z - Matches the end of the string. If a trailing newline exists, it matches just before the newline.
  21. \z - Matches the end of the string.
  22. \G - Matches the point where the last match finished.
  23. \b - Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
  24. \B - Matches nonword boundaries.
  25. \n, \t, etc. - Matches newlines, tabs, etc.
  26. \1...\9 - Matches the nth grouped subexpression.
  27. \10 - Matches the tenth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.
  28. [aeiou] - Matches a single character in the given set.

The ^ metacharacter matches the beginning of the string and the $ metasymbol matches the end of the string. Here are some brief examples.

# nothing in the string (start and end are adjacent)
/^$/

# three digits, each followed by a whitespace
# character (eg "3 4 5 ")
/(\d\s){3}/

# matches a string in which every
# odd-numbered letter is a (eg "abacadaf")
/(a.)+/

# string starts with one or more digits
/^\d+/

# string that ends with one or more digits
/\d+$/
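
The patterns above can be tried out with a small driver. This is only a sketch, and the sample strings are chosen for illustration; each string is expected to match the pattern it is paired with.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each entry pairs a sample string with one of the patterns above.
my @tests = (
   [ '',          qr/^$/        ],
   [ '3 4 5 ',    qr/(\d\s){3}/ ],
   [ 'abacadaf',  qr/(a.)+/     ],
   [ '42 apples', qr/^\d+/      ],
   [ 'room 101',  qr/\d+$/      ],
);

my $matched = 0;
for my $test (@tests) {
   my ($sample, $pattern) = @$test;
   my $ok = ($sample =~ $pattern) ? 1 : 0;
   $matched += $ok;
   printf "%-12s %s\n", "'$sample'", $ok ? "matches" : "does not match";
}
print "$matched of ", scalar(@tests), " samples matched\n";
```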

Let's have a look at another example.

#!/usr/bin/perl

$string = "Cats go Catatonic\nWhen given Catnip";
($start) = ($string =~ /\A(.*?) /);
@lines = $string =~ /^(.*?) /gm;
print "First word: $start\n","Line starts: @lines\n";

When the above program is executed, it produces the following result:

First word: Cats
Line starts: Cats When

Matching Boundaries

The \b assertion matches at any word boundary, as defined by the difference between the \w class and the \W class. Because \w includes the characters for a word, and \W the opposite, this normally means the start or the termination of a word. The \B assertion matches any position that is not a word boundary. For example:

/\bcat\b/ # Matches 'the cat sat' but not 'catatonic' or 'polecat'
/\Bcat\B/ # Matches 'verification' but not 'the cat on the mat'
/\bcat\B/ # Matches 'catatonic' but not 'polecat'
/\Bcat\b/ # Matches 'polecat' but not 'catatonic'
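
A small sketch that checks each of the four boundary patterns against a few sample words, so the behaviour can be verified directly. The word list is chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pair a printable label with each compiled pattern.
my @patterns = (
   [ '\bcat\b', qr/\bcat\b/ ],
   [ '\Bcat\B', qr/\Bcat\B/ ],
   [ '\bcat\B', qr/\bcat\B/ ],
   [ '\Bcat\b', qr/\Bcat\b/ ],
);

my %matched;
for my $word (qw(cat catatonic polecat verification)) {
   # Collect the labels of every pattern that matches this word.
   my @names = map { $_->[0] } grep { $word =~ $_->[1] } @patterns;
   $matched{$word} = "@names";
   print "$word: @names\n";
}
```

Each sample word matches exactly one of the four patterns, depending on whether "cat" starts the word, ends it, stands alone, or sits in the middle.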

Selecting Alternatives

The | character is just like the standard or bitwise OR within Perl. It specifies alternate matches within a regular expression or group. For example, to match "cat" or "dog" in an expression, you might use this:

if ($string =~ /cat|dog/)

You can group individual elements of an expression together in order to support complex matches. Searching for two people's names could be achieved with two separate tests, like this:

if (($string =~ /Martin Brown/) ||  ($string =~ /Sharon Brown/))

This could be written more compactly as follows:

if ($string =~ /(Martin|Sharon) Brown/)
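
A sketch of why the parentheses matter, using made-up names. Without grouping, the alternation splits the whole pattern, so /Martin|Sharon Brown/ means "Martin" or "Sharon Brown". The (?:...) form groups without capturing, which keeps the numbered variables free for the parts you want to extract.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "Sharon Brown";

# Capturing group: $1 receives the first name.
my ($first) = ($string =~ /(Martin|Sharon) Brown/);

# Non-capturing group: the first capture is now the surname.
my ($surname) = ($string =~ /(?:Martin|Sharon) (Brown)/);

# Without grouping, "Martin" alone is enough to match.
my $loose  = ("Martin Smith" =~ /Martin|Sharon Brown/)     ? "match" : "no match";
my $strict = ("Martin Smith" =~ /(?:Martin|Sharon) Brown/) ? "match" : "no match";

print "First name: $first, surname: $surname\n";
print "Ungrouped: $loose, grouped: $strict\n";
```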

Grouping Matching

From a regular-expression point of view, there is no difference between the following two expressions, except, perhaps, that the former is slightly clearer.

$string =~ /(\S+)\s+(\S+)/;

and

$string =~ /\S+\s+\S+/;

However, the benefit of grouping is that it allows us to extract a sequence from a regular expression. Groupings are returned as a list in the order in which they appear in the original. For example, in the following fragment we have pulled out the hours, minutes, and seconds from a string.

my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);

As well as this direct method, matched groups are also available in the special variables $1, $2, and so on, where the number is the position of the group within the regular expression. We could therefore rewrite the preceding example as follows:

#!/usr/bin/perl

$time = "12:05:30";

$time =~ m/(\d+):(\d+):(\d+)/;
my ($hours, $minutes, $seconds) = ($1, $2, $3);

print "Hours : $hours, Minutes: $minutes, Second: $seconds\n";

When the above program is executed, it produces the following result:

Hours : 12, Minutes: 05, Second: 30

When groups are used in substitution expressions, the same $1, $2 syntax can be used in the replacement text. Thus, we could reformat a date string using this:

#!/usr/bin/perl

$date = '03/26/1999';
$date =~ s#(\d+)/(\d+)/(\d+)#$3/$1/$2#;

print "$date\n";

When the above program is executed, it produces the following result:

1999/03/26

The \G Assertion

The \G assertion allows you to continue searching from the point where the last match occurred. For example, in the following code, we have used \G so that we can search to the correct position and then extract some information, without having to create a more complex, single regular expression:

#!/usr/bin/perl

$string = "The time is: 12:31:02 on 4/12/00";

$string =~ /:\s+/g;
($time) = ($string =~ /\G(\d+:\d+:\d+)/);
$string =~ /.+\s+/g;
($date) = ($string =~ m{\G(\d+/\d+/\d+)});

print "Time: $time, Date: $date\n";

When the above program is executed, it produces the following result:

Time: 12:31:02, Date: 4/12/00

The \G assertion is actually just the metasymbol equivalent of the pos function, so between regular expression calls you can continue to use pos, and even modify the value of pos (and therefore \G) by using pos as an lvalue subroutine.
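
A minimal sketch of reading and assigning pos, using a sample string invented for illustration. A scalar-context /g match advances pos, and assigning to pos moves the point at which \G anchors.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "one two three";

# A scalar-context /g match leaves pos just after the match.
my $found       = ($string =~ /\w+/g);
my $after_first = pos($string);

# Assigning to pos moves the \G anchor; offset 8 is the "t" of "three".
pos($string) = 8;
my ($word) = ($string =~ /\G(\w+)/);

print "pos after first match: $after_first\n";
print "word at new position: $word\n";
```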

Regular-expression Examples

Literal Characters

  1. Perl - Matches "Perl".

Character Classes

  1. [Pp]ython - Matches "Python" or "python"
  2. rub[ye] - Matches "ruby" or "rube"
  3. [aeiou] - Matches any one lowercase vowel
  4. [0-9] - Matches any digit; same as [0123456789]
  5. [a-z] - Matches any lowercase ASCII letter
  6. [A-Z] - Matches any uppercase ASCII letter
  7. [a-zA-Z0-9] - Matches any of the above
  8. [^aeiou] - Matches anything other than a lowercase vowel
  9. [^0-9] - Matches anything other than a digit

Special Character Classes

  1. . - Matches any character except newline
  2. \d - Matches a digit: [0-9]
  3. \D - Matches a nondigit: [^0-9]
  4. \s - Matches a whitespace character: [ \t\r\n\f]
  5. \S - Matches nonwhitespace: [^ \t\r\n\f]
  6. \w - Matches a single word character: [A-Za-z0-9_]
  7. \W - Matches a nonword character: [^A-Za-z0-9_]

Repetition Cases

  1. ruby? - Matches "rub" or "ruby": the y is optional
  2. ruby* - Matches "rub" plus 0 or more ys
  3. ruby+ - Matches "rub" plus 1 or more ys
  4. \d{3} - Matches exactly 3 digits
  5. \d{3,} - Matches 3 or more digits
  6. \d{3,5} - Matches 3, 4, or 5 digits

Nongreedy Repetition

This matches the smallest number of repetitions:

  1. <.*> - Greedy repetition: matches "<python>perl>"
  2. <.*?> - Nongreedy: matches "<python>" in "<python>perl>"
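
The difference between the two forms can be checked with a short sketch that applies both patterns to the same sample string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "<python>perl>";

# Greedy: .* takes as much as it can, so the match runs to the last ">".
my ($greedy) = ($text =~ /(<.*>)/);

# Nongreedy: .*? takes as little as it can, so the match stops at the first ">".
my ($lazy) = ($text =~ /(<.*?>)/);

print "Greedy: $greedy\n";
print "Nongreedy: $lazy\n";
```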

Grouping with Parentheses

  1. \D\d+ - No group: + repeats \d
  2. (\D\d)+ - Grouped: + repeats \D\d pair
  3. ([Pp]ython(, )?)+ - Matches "Python", "Python, python, python", etc.

Backreferences

This matches a previously matched group again:

  1. ([Pp])ython&\1ails - Matches python&pails or Python&Pails
  2. (['"])[^\1]*\1 - Single or double-quoted string. \1 matches whatever the 1st group matched. \2 matches whatever the 2nd group matched, etc.
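
A sketch of a backreference in action. The pattern here is a simplified variant written for illustration, not the exact entry from the table: it captures the opening quote, lazily captures the contents, and then requires the same quote character again via \1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group 1 is the opening quote, group 2 the quoted text;
# \1 insists that the closing quote is the same character.
my $quoted = qr/(['"])(.*?)\1/;

my @results;
for my $sample (q{a 'single' one}, q{a "double" one}, q{a "broken' one}) {
   if ($sample =~ $quoted) {
      push @results, $2;
   } else {
      push @results, "(no match)";
   }
}

print join(", ", @results), "\n";
```

The third sample opens with a double quote and closes with a single quote, so the backreference prevents a match.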

Alternatives

  1. python|perl - Matches "python" or "perl"
  2. rub(y|le) - Matches "ruby" or "ruble"
  3. Python(!+

Anchors

These specify match positions.

  1. ^Python - Matches "Python" at the start of a string or internal line
  2. Python$ - Matches "Python" at the end of a string or line
  3. \APython - Matches "Python" at the start of a string
  4. Python\Z - Matches "Python" at the end of a string
  5. \bPython\b - Matches "Python" at a word boundary
  6. \brub\B - \B is nonword boundary: matches "rub" in "rube" and "ruby" but not alone
  7. Python(?=!) - Matches "Python", if followed by an exclamation point
  8. Python(?!!) - Matches "Python", if not followed by an exclamation point
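
A short sketch of the two lookahead forms, using a made-up sample string. Because a lookahead only tests what follows and does not consume it, the exclamation point survives the substitution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Python! Python?";

# Positive lookahead: replace "Python" only when "!" follows.
(my $excited = $text) =~ s/Python(?=!)/Perl/g;

# Negative lookahead: replace "Python" only when "!" does not follow.
(my $calm = $text) =~ s/Python(?!!)/Perl/g;

print "$excited\n";
print "$calm\n";
```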

Special Syntax with Parentheses

  1. R(?#comment) - Matches "R". All the rest is a comment
  2. R(?i)uby - Case-insensitive while matching "uby"
  3. R(?i:uby) - Same as above

4

*rub(?:y