Postgresql 中文操作指南
9.7. Pattern Matching #
PostgreSQL 提供了三种单独的方法用于模式匹配:传统的 SQL LIKE 运算符、较新的 SIMILAR TO 运算符(在 SQL:1999 中添加)以及 POSIX 风格的正则表达式。除了基本的“此字符串是否与此模式匹配?”运算符外,还有可用于提取或替换匹配子字符串以及在匹配位置拆分字符串的函数。
There are three separate approaches to pattern matching provided by PostgreSQL: the traditional SQL LIKE operator, the more recent SIMILAR TO operator (added in SQL:1999), and POSIX-style regular expressions. Aside from the basic “does this string match this pattern?” operators, functions are available to extract or replace matching substrings and to split a string at matching locations.
Tip
如果您有超出此范围的模式匹配需求,请考虑用 Perl 或 Tcl 编写一个用户自定义函数。
If you have pattern matching needs that go beyond this, consider writing a user-defined function in Perl or Tcl.
Caution
虽然大多数正则表达式搜索都可以非常快速地执行,但可以设计出处理时间和内存任意增加的正则表达式。对接受来自敌对来源的正则表达式搜索模式要保持警惕。如果必须这样做,建议实施语句超时。
While most regular-expression searches can be executed very quickly, regular expressions can be contrived that take arbitrary amounts of time and memory to process. Be wary of accepting regular-expression search patterns from hostile sources. If you must do so, it is advisable to impose a statement timeout.
使用_SIMILAR TO_ 模式进行搜索具有相同安全隐患,因为_SIMILAR TO_提供了许多功能,与 POSIX 样式的正则表达式相同。
Searches using SIMILAR TO patterns have the same security hazards, since SIMILAR TO provides many of the same capabilities as POSIX-style regular expressions.
LIKE 搜索比其他两个选项简单得多,因此在具有可能敌意的模式来源时使用起来更安全。
LIKE searches, being much simpler than the other two options, are safer to use with possibly-hostile pattern sources.
这三种模式匹配运算符都不支持非确定性排序规则。如果需要,应将不同的排序规则应用于表达式以解决此限制。
The pattern matching operators of all three kinds do not support nondeterministic collations. If required, apply a different collation to the expression to work around this limitation.
9.7.1. LIKE #
string LIKE pattern [ESCAPE escape-character]
string NOT LIKE pattern [ESCAPE escape-character]
如果_string_与提供的_pattern_匹配,则_LIKE_ 表达式返回 true。(正如预期的那样,如果_LIKE_ 返回 true,则_NOT LIKE_ 表达式返回 false,反之亦然。等效表达式为 NOT (_string LIKE pattern_)。
The LIKE expression returns true if the string matches the supplied pattern. (As expected, the NOT LIKE expression returns false if LIKE returns true, and vice versa. An equivalent expression is NOT (_string LIKE pattern)_.)
如果 pattern 不包含百分号或下划线,则模式只表示字符串本身;在这种情况下, LIKE 的作用就像等式运算符。下划线 ( ) in _pattern 表示 (匹配) 任何单个字符;百分号 ( % ) 匹配任何 0 个或多个字符序列。
If pattern does not contain percent signs or underscores, then the pattern only represents the string itself; in that case LIKE acts like the equals operator. An underscore () in pattern stands for (matches) any single character; a percent sign (%_) matches any sequence of zero or more characters.
举例:
Some examples:
'abc' LIKE 'abc' true
'abc' LIKE 'a%' true
'abc' LIKE '_b_' true
'abc' LIKE 'c' false
LIKE 模式匹配始终涵盖整个字符串。因此,如果要匹配字符串中的任意序列,则该模式必须以百分号开头和结尾。
LIKE pattern matching always covers the entire string. Therefore, if it’s desired to match a sequence anywhere within a string, the pattern must start and end with a percent sign.
要匹配不匹配其他字符的文字下划线或百分号,pattern 中的相应字符必须前缀有转义字符。默认转义字符是反斜杠,但可以使用_ESCAPE_ 子句选择不同的转义字符。要匹配转义字符本身,请编写两个转义字符。
To match a literal underscore or percent sign without matching other characters, the respective character in pattern must be preceded by the escape character. The default escape character is the backslash but a different one can be selected by using the ESCAPE clause. To match the escape character itself, write two escape characters.
Note
如果您已关闭 standard_conforming_strings,则在字符串常量中编写的任何反斜杠都需要加倍。有关更多信息,请参阅 Section 4.1.2.1。
If you have standard_conforming_strings turned off, any backslashes you write in literal string constants will need to be doubled. See Section 4.1.2.1 for more information.
也可以通过编写_ESCAPE ''_ 来选择没有转义字符。这会有效禁用转义机制,从而无法关闭模式中下划线和百分号的特殊含义。
It’s also possible to select no escape character by writing ESCAPE ''. This effectively disables the escape mechanism, which makes it impossible to turn off the special meaning of underscore and percent signs in the pattern.
根据 SQL 标准,省略_ESCAPE_意味着没有转义字符(而不是默认为反斜杠),并且不允许长度为零的_ESCAPE_ 值。因此,PostgreSQL 在这方面的行为稍微有点非标。
According to the SQL standard, omitting ESCAPE means there is no escape character (rather than defaulting to a backslash), and a zero-length ESCAPE value is disallowed. PostgreSQL’s behavior in this regard is therefore slightly nonstandard.
可以用_ILIKE_ 关键字代替_LIKE_,以根据活动区域设置使匹配对大小写不敏感。这不在 SQL 标准中,而是 PostgreSQL 的扩展。
The key word ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension.
~~ 运算符等同于 LIKE,而 ~~* 对应于 ILIKE。还有 !~~ 和 !~~* 运算符,它们分别表示 NOT LIKE 和 NOT ILIKE。所有这些运算符都是 PostgreSQL 特有的。你可能在 EXPLAIN 输出和类似的地方看到这些运算符名称,因为解析器实际上将 LIKE 等翻译为这些运算符。
The operator ~~ is equivalent to LIKE, and ~~* corresponds to ILIKE. There are also !~~ and !~~* operators that represent NOT LIKE and NOT ILIKE, respectively. All of these operators are PostgreSQL-specific. You may see these operator names in EXPLAIN output and similar places, since the parser actually translates LIKE et al. to these operators.
在 PostgreSQL 语法中,LIKE、ILIKE、NOT LIKE 和 NOT ILIKE 短语通常被视为运算符;例如,它们可以用于 expression operator ANY (subquery) 构造中,尽管那里不能包含 ESCAPE 子句。在某些罕见的情况下,可能需要使用底层运算符名称。
The phrases LIKE, ILIKE, NOT LIKE, and NOT ILIKE are generally treated as operators in PostgreSQL syntax; for example they can be used in expression operator ANY (subquery) constructs, although an ESCAPE clause cannot be included there. In some obscure cases it may be necessary to use the underlying operator names instead.
另请参阅 starts-with 运算符 ^@ 和对应的 starts_with() 函数,它们在只需要匹配字符串开头的情况下很有用。
Also see the starts-with operator ^@ and the corresponding starts_with() function, which are useful in cases where simply matching the beginning of a string is needed.
9.7.2. SIMILAR TO Regular Expressions #
string SIMILAR TO pattern [ESCAPE escape-character]
string NOT SIMILAR TO pattern [ESCAPE escape-character]
SIMILAR TO 运算符根据其模式是否与给定的字符串匹配,返回 true 或 false。它类似于 LIKE,除了它使用 SQL 标准的正则表达式定义来解释模式。SQL 正则表达式是对 LIKE 表示法和通用 (POSIX) 正则表达式表示法的奇怪交叉。
The SIMILAR TO operator returns true or false depending on whether its pattern matches the given string. It is similar to LIKE, except that it interprets the pattern using the SQL standard’s definition of a regular expression. SQL regular expressions are a curious cross between LIKE notation and common (POSIX) regular expression notation.
与 LIKE 一样,SIMILAR TO 运算符仅在其模式与整个字符串匹配时才成功;这与常见的正则表达式行为不同,其中模式可以匹配字符串的任何部分。也像 LIKE 一样,SIMILAR TO 使用 _ 和 % 作为通配符,分别表示任何单个字符和任何字符串(它们可以与 POSIX 正则表达式中的 . 和 .* 相媲美)。
Like LIKE, the SIMILAR TO operator succeeds only if its pattern matches the entire string; this is unlike common regular expression behavior where the pattern can match any part of the string. Also like LIKE, SIMILAR TO uses _ and % as wildcard characters denoting any single character and any string, respectively (these are comparable to . and .* in POSIX regular expressions).
除了从 LIKE 借来的这些功能之外,SIMILAR TO 还支持从 POSIX 正则表达式中借来的这些模式匹配元字符:
In addition to these facilities borrowed from LIKE, SIMILAR TO supports these pattern-matching metacharacters borrowed from POSIX regular expressions:
请注意,句点 (.) 不是 SIMILAR TO 的元字符。
Notice that the period (.) is not a metacharacter for SIMILAR TO.
与 LIKE 一样,反斜杠会禁用这些元字符的任何特殊含义。可以使用 ESCAPE 指定不同的转义字符,或者通过编写 ESCAPE '' 来禁用转义功能。
As with LIKE, a backslash disables the special meaning of any of these metacharacters. A different escape character can be specified with ESCAPE, or the escape capability can be disabled by writing ESCAPE ''.
根据 SQL 标准,省略_ESCAPE_意味着没有转义字符(而不是默认为反斜杠),并且不允许长度为零的_ESCAPE_ 值。因此,PostgreSQL 在这方面的行为稍微有点非标。
According to the SQL standard, omitting ESCAPE means there is no escape character (rather than defaulting to a backslash), and a zero-length ESCAPE value is disallowed. PostgreSQL’s behavior in this regard is therefore slightly nonstandard.
另一个非标准扩展是,转义符后面跟着字母或数字会提供对 POSIX 正则表达式定义的转义序列的访问;请参阅下文的 Table 9.20、 Table 9.21 和 Table 9.22。
Another nonstandard extension is that following the escape character with a letter or digit provides access to the escape sequences defined for POSIX regular expressions; see Table 9.20, Table 9.21, and Table 9.22 below.
举例:
Some examples:
'abc' SIMILAR TO 'abc' true
'abc' SIMILAR TO 'a' false
'abc' SIMILAR TO '%(b|d)%' true
'abc' SIMILAR TO '(b|c)%' false
'-abc-' SIMILAR TO '%\mabc\M%' true
'xabcy' SIMILAR TO '%\mabc\M%' false
具有三个参数的 substring 函数提供提取与 SQL 正则表达式模式匹配的子字符串。可以按照标准 SQL 语法编写该函数:
The substring function with three parameters provides extraction of a substring that matches an SQL regular expression pattern. The function can be written according to standard SQL syntax:
substring(string similar pattern escape escape-character)
或使用现已过时的 SQL:1999 语法:
or using the now obsolete SQL:1999 syntax:
substring(string from pattern for escape-character)
或作为普通的三个参数函数:
or as a plain three-argument function:
substring(string, pattern, escape-character)
与 SIMILAR TO 一样,指定模式必须匹配整个数据字符串,否则函数将失败并返回 null。为了指示模式的哪一部分与匹配的数据子字符串相关,模式应包含转义字符后跟双引号 (") 的两个出现。当匹配成功时,将返回匹配模式中该部分文本的文本。
As with SIMILAR TO, the specified pattern must match the entire data string, or else the function fails and returns null. To indicate the part of the pattern for which the matching data sub-string is of interest, the pattern should contain two occurrences of the escape character followed by a double quote ("). The text matching the portion of the pattern between these separators is returned when the match is successful.
转义双引号分隔符实际上将 substring 的模式划分为三个独立的正则表达式;例如,三个部分中的任何部分中的竖线 (|) 都只影响该部分。此外,在关于有多少数据字符串与哪种模式匹配存在任何歧义时,这些正则表达式中的第一个和第三个被定义为匹配最小可能的文本量,而不是最大量。(在 POSIX 术语中,第一个和第三个正则表达式被迫成为非贪婪的。)
The escape-double-quote separators actually divide substring's pattern into three independent regular expressions; for example, a vertical bar (|) in any of the three sections affects only that section. Also, the first and third of these regular expressions are defined to match the smallest possible amount of text, not the largest, when there is any ambiguity about how much of the data string matches which pattern. (In POSIX parlance, the first and third regular expressions are forced to be non-greedy.)
作为对 SQL 标准的扩展,PostgreSQL 允许只有一个转义双引号分隔符,在这种情况下,第三个正则表达式被视为为空;或者没有分隔符,在这种情况下,第一个和第三个正则表达式被视为为空。
As an extension to the SQL standard, PostgreSQL allows there to be just one escape-double-quote separator, in which case the third regular expression is taken as empty; or no separators, in which case the first and third regular expressions are taken as empty.
一些示例,使用 #" 来分隔返回字符串:
Some examples, with #" delimiting the return string:
substring('foobar' similar '%#"o_b#"%' escape '#') oob
substring('foobar' similar '#"o_b#"%' escape '#') NULL
9.7.3. POSIX Regular Expressions #
Table 9.16 列出了用于使用 POSIX 正则表达式进行模式匹配的可用运算符。
Table 9.16 lists the available operators for pattern matching using POSIX regular expressions.
Table 9.16. Regular Expression Match Operators
Operator Description Example(s) |
text ~ text → boolean String matches regular expression, case sensitively 'thomas' ~ 't.*ma' → t |
text ~ text → boolean String matches regular expression, case-insensitively 'thomas' ~ 'T.*ma' → t |
text !~ text → boolean String does not match regular expression, case sensitively 'thomas' !~ 't.*max' → t |
text !~ text → boolean String does not match regular expression, case-insensitively 'thomas' !~ 'T.*ma' → f |
与 LIKE 和 SIMILAR TO 操作符相比,POSIX 正则表达式在模式匹配中提供了更强大的手段。像 egrep、sed 或 awk 这样的许多 Unix 工具使用的模式匹配语言与此处描述的语言类似。
POSIX regular expressions provide a more powerful means for pattern matching than the LIKE and SIMILAR TO operators. Many Unix tools such as egrep, sed, or awk use a pattern matching language that is similar to the one described here.
正则表达式是一个字符序列,是对字符串集的简要定义(一个 regular set)。字符串被认为与正则表达式匹配,如果它是正则表达式描述的正则集的成员。正则表达式与 LIKE 相同,模式字符与字符串字符完全匹配,除非它们是正则表达式语言中的特殊字符——但正则表达式使用与 LIKE 不同的特殊字符。与 LIKE 模式不同,正则表达式被允许与字符串中的任何位置匹配,除非正则表达式被明确地固定在字符串的开始或结尾。
A regular expression is a character sequence that is an abbreviated definition of a set of strings (a regular set). A string is said to match a regular expression if it is a member of the regular set described by the regular expression. As with LIKE, pattern characters match string characters exactly unless they are special characters in the regular expression language — but regular expressions use different special characters than LIKE does. Unlike LIKE patterns, a regular expression is allowed to match anywhere within a string, unless the regular expression is explicitly anchored to the beginning or end of the string.
举例:
Some examples:
'abcd' ~ 'bc' true
'abcd' ~ 'a.c' true — dot matches any character
'abcd' ~ 'a.*d' true — * repeats the preceding pattern item
'abcd' ~ '(b|x)' true — | means OR, parentheses group
'abcd' ~ '^a' true — ^ anchors to start of string
'abcd' ~ '^(b|c)' false — would match except for anchoring
POSIX 模式语言在下面有更详细的描述。
The POSIX pattern language is described in much greater detail below.
substring 函数有两个参数,substring(_string(从 pattern 开始),提供了对匹配 POSIX 正则表达式模式的子串的提取。如果不存在匹配项,它返回 null,否则返回与模式匹配的文本的第一部分。但是,如果模式包含任何圆括号,则返回与第一个有括号的子表达式相匹配的文本部分(左括号在前)。如果你想在括号中使用括号而不触发此异常,你可以在整个表达式周围加上括号。如果你需要在要提取的子表达式之前的模式中使用括号,请参阅下面描述的非捕获括号。
The substring function with two parameters, substring(_string from pattern)_, provides extraction of a substring that matches a POSIX regular expression pattern. It returns null if there is no match, otherwise the first portion of the text that matched the pattern. But if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned. You can put parentheses around the whole expression if you want to use parentheses within it without triggering this exception. If you need parentheses in the pattern before the subexpression you want to extract, see the non-capturing parentheses described below.
举例:
Some examples:
substring('foobar' from 'o.b') oob
substring('foobar' from 'o(.)b') o
regexp_count 函数计算 POSIX 正则表达式模式与字符串匹配的位置数。其语法是 regexp_count(string, pattern [, start [, flags ]]). 通常,在 string 中从字符串的开头搜索_pattern_,但如果提供了 start 参数,则从该字符索引开始搜索。flags 参数是一个可选的文本字符串,其中包含零个或多个单字母标记,用于更改函数的行为。例如,在 flags 中包含 i 可以指定不区分大小写的匹配。支持的标记在 Table 9.24 中进行了描述。
The regexp_count function counts the number of places where a POSIX regular expression pattern matches a string. It has the syntax regexp_count(string, pattern [, start [, flags ]]). pattern is searched for in string, normally from the beginning of the string, but if the start parameter is provided then beginning from that character index. The flags parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. For example, including i in flags specifies case-insensitive matching. Supported flags are described in Table 9.24.
举例:
Some examples:
regexp_count('ABCABCAXYaxy', 'A.') 3
regexp_count('ABCABCAXYaxy', 'A.', 1, 'i') 4
regexp_instr 函数返回 POSIX 正则表达式模式与字符串的第 N'次匹配的起始或结束位置,如果不存在此类匹配,则返回零。其语法是 regexp_instr(string, pattern [, start [, N [, endoption [, flags [, subexpr ]]]]]]). 通常,在 string 中从字符串的开头搜索_pattern_,但如果提供了 start 参数,则从该字符索引开始搜索。如果指定了 N,则会找到模式的第_N_'次匹配,否则会找到第一次匹配。如果省略 endoption 参数或指定为零,则函数将返回匹配的第一个字符位置。否则,endoption 必须为 1,并且函数将返回匹配后的字符位置。flags 参数是一个可选的文本字符串,其中包含零个或多个单字母标记,用于更改函数的行为。支持的标记在 Table 9.24 中进行了描述。对于包含括号子表达式的模式,subexpr 是一个整数,表示哪一个子表达式是感兴趣的:结果标识匹配该子表达式的子字符串的位置。子表达式按其左括号的顺序进行编号。当省略或将 subexpr 设置为零时,结果标识整个匹配位置,而不管是否存在括号子表达式。
The regexp_instr function returns the starting or ending position of the N'th match of a POSIX regular expression pattern to a string, or zero if there is no such match. It has the syntax regexp_instr(string, pattern [, start [, N [, endoption [, flags [, subexpr ]]]]]). pattern is searched for in string, normally from the beginning of the string, but if the start parameter is provided then beginning from that character index. If N is specified then the N'th match of the pattern is located, otherwise the first match is located. If the endoption parameter is omitted or specified as zero, the function returns the position of the first character of the match. Otherwise, endoption must be one, and the function returns the position of the character following the match. The flags parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. Supported flags are described in Table 9.24. For a pattern containing parenthesized subexpressions, subexpr is an integer indicating which subexpression is of interest: the result identifies the position of the substring matching that subexpression. Subexpressions are numbered in the order of their leading parentheses. When subexpr is omitted or zero, the result identifies the position of the whole match regardless of parenthesized subexpressions.
举例:
Some examples:
regexp_instr('number of your street, town zip, FR', '[^,]+', 1, 2)
23
regexp_instr('ABCDEFGHI', '(c..)(...)', 1, 1, 0, 'i', 2)
6
regexp_like 函数检查 POSIX 正则表达式模式的匹配是否发生在字符串中,返回布尔 true 或 false。它的语法是 regexp_like(string, pattern [, flags ])。flags 参数是一个可选项文本串,包含零个或多个单字母标识,以更改此函数的行为。 Table 9.24 中描述了受支持标识。如果没有指定标识,则此函数具有与 ~ 运算符相同的结果。如果仅指定 i 标识,则该函数具有与 ~* 运算符相同的结果。
The regexp_like function checks whether a match of a POSIX regular expression pattern occurs within a string, returning boolean true or false. It has the syntax regexp_like(string, pattern [, flags ]). The flags parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. Supported flags are described in Table 9.24. This function has the same results as the ~ operator if no flags are specified. If only the i flag is specified, it has the same results as the ~* operator.
举例:
Some examples:
regexp_like('Hello World', 'world') false
regexp_like('Hello World', 'world', 'i') true
regexp_match 函数返回 POSIX 正则表达式模式针对一个字符串的首次匹配内匹配子串的文本数组。它的语法是 regexp_match(string, pattern [, flags ])。如果没有匹配,则结果为 NULL。如果找到匹配,并且 pattern 不包含任何带括号的子表达式,则结果为包含匹配整个模式的子串的单元素文本数组。如果找到匹配,并且 pattern 包含带括号的子表达式,则结果是文本数组,其 n 元素是匹配 pattern 的 n 带括号的子表达式的子串(不计“非捕获”括号;有关详细信息,请参见下文)。flags 参数是一个可选项文本串,包含零个或多个单字母标识,以更改此函数的行为。 Table 9.24 中描述了受支持标识。
The regexp_match function returns a text array of matching substring(s) within the first match of a POSIX regular expression pattern to a string. It has the syntax regexp_match(string, pattern [, flags ]). If there is no match, the result is NULL. If a match is found, and the pattern contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern. If a match is found, and the pattern contains parenthesized subexpressions, then the result is a text array whose n'th element is the substring matching the n'th parenthesized subexpression of the pattern (not counting “non-capturing” parentheses; see below for details). The flags parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. Supported flags are described in Table 9.24.
举例:
Some examples:
SELECT regexp_match('foobarbequebaz', 'bar.*que');
regexp_match
--------------
{barbeque}
(1 row)
SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
regexp_match
--------------
{bar,beque}
(1 row)
Tip
在只需提取整个匹配子字符串或 NULL 表示不匹配的情况下,最佳方法是使用 regexp_substr()。但是,regexp_substr() 仅存在于 PostgreSQL 版本 15 及更高版本中。在使用较旧版本时,你可以提取 regexp_match() 结果中的第一个元素,例如:
In the common case where you just want the whole matching substring or NULL for no match, the best solution is to use regexp_substr(). However, regexp_substr() only exists in PostgreSQL version 15 and up. When working in older versions, you can extract the first element of regexp_match()'s result, for example:
SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1];
regexp_match
--------------
barbeque
(1 row)
regexp_matches 函数返回 POSIX 正则表达式模式针对字符串的匹配内匹配子串的文本数组的集合。它具有与 regexp_match 相同的语法。如果没有任何匹配,则此函数不返回行;如果有一个匹配并且没有给出 g 标识,则返回一行;如果存在 N 个匹配并且给出了 g 标识,则返回 N 行。每个返回的行都是文本数组,其中包含整个匹配子串或匹配 pattern 的带括号的子表达式的子串,就像上面为 regexp_match 所描述的那样。regexp_matches 接受 Table 9.24 中显示的所有标识,加上命令它返回所有匹配(不仅仅是第一个匹配)的 g 标识。
The regexp_matches function returns a set of text arrays of matching substring(s) within matches of a POSIX regular expression pattern to a string. It has the same syntax as regexp_match. This function returns no rows if there is no match, one row if there is a match and the g flag is not given, or N rows if there are N matches and the g flag is given. Each returned row is a text array containing the whole matched substring or the substrings matching parenthesized subexpressions of the pattern, just as described above for regexp_match. regexp_matches accepts all the flags shown in Table 9.24, plus the g flag which commands it to return all matches, not just the first one.
举例:
Some examples:
SELECT regexp_matches('foo', 'not there');
regexp_matches
----------------
(0 rows)
SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
regexp_matches
----------------
{bar,beque}
{bazil,barf}
(2 rows)
Tip
在大多数情况下,regexp_matches() 应与 g 标志一起使用,因为如果你只想获取第一个匹配,使用 regexp_match() 更容易且更高效。但是,regexp_match() 仅存在于 PostgreSQL 版本 10 及更高版本中。在使用较旧版本时,一个常见的技巧是将 regexp_matches() 调用放入子选择中,例如:
In most cases regexp_matches() should be used with the g flag, since if you only want the first match, it’s easier and more efficient to use regexp_match(). However, regexp_match() only exists in PostgreSQL version 10 and up. When working in older versions, a common trick is to place a regexp_matches() call in a sub-select, for example:
SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
如果存在匹配,则产生文本数组,否则 NULL,如同 regexp_match() 所做的一样。如果没有子选择,则对于没有匹配的表行,该查询完全不会产生任何输出,这通常不是期望的行为。
This produces a text array if there’s a match, or NULL if not, the same as regexp_match() would do. Without the sub-select, this query would produce no output at all for table rows without a match, which is typically not the desired behavior.
regexp_replace 函数提供替换符合 POSIX 正则表达式模式的子字符串的新文本。其语法为 regexp_replace ( source , pattern , replacement [, start [, N ]] [, flags ])。(请注意,必须同时指定 N 和 start ,但无论如何都可以指定 flags 。)如果没有与 pattern 匹配的项,则返回未更改 source 字符串。如果有匹配的项,则返回 source 字符串,其中 replacement 字符串替换了匹配的子字符串。 replacement 字符串可以包含 __n ,其中 n 为 1 至 9,以表示应插入与模式的 n 个带括号的子表达式匹配的源子字符串,并且可以包含 & 以指示应插入与整个模式相匹配的子字符串。如果您需要在替换文本中放置一个反斜杠文字,请写入 \\ 。通常从字符串的开头在 string 中搜索 pattern ,但如果提供了 start 参数,则从该字符索引位置开始搜索。默认情况下,只替换第一个与模式匹配的项。如果指定 N 且 N 大于 0,则替换第 N 个模式匹配项。如果指定 g 标志,或者指定 N 且为 0,则替换 start 位置或 start 位置之后的全部匹配项。( g 标志在指定 N 时被忽略。) flags 参数是可选文本字符串,内含零个或多个可更改函数行为的单字母标志。 Table 9.24 中介绍了受支持的标志(不包括 g )。
The regexp_replace function provides substitution of new text for substrings that match POSIX regular expression patterns. It has the syntax regexp_replace(source, pattern, replacement [, start [, N ]] [, flags ]). (Notice that N cannot be specified unless start is, but flags can be given in any case.) The source string is returned unchanged if there is no match to the pattern. If there is a match, the source string is returned with the replacement string substituted for the matching substring. The replacement string can contain _n, where _n is 1 through 9, to indicate that the source substring matching the n'th parenthesized subexpression of the pattern should be inserted, and it can contain \& to indicate that the substring matching the entire pattern should be inserted. Write \\ if you need to put a literal backslash in the replacement text. pattern is searched for in string, normally from the beginning of the string, but if the start parameter is provided then beginning from that character index. By default, only the first match of the pattern is replaced. If N is specified and is greater than zero, then the N'th match of the pattern is replaced. If the g flag is given, or if N is specified and is zero, then all matches at or after the start position are replaced. (The g flag is ignored when N is specified.) The flags parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. Supported flags (though not g) are described in Table 9.24.
举例:
Some examples:
regexp_replace('foobarbaz', 'b..', 'X')
fooXbaz
regexp_replace('foobarbaz', 'b..', 'X', 'g')
fooXX
regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g')
fooXarYXazY
regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 0, 'i')
X PXstgrXSQL fXnctXXn
regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 3, 'i')
A PostgrXSQL function
regexp_split_to_table 函数使用 POSIX 正则表达式模式作为分隔符来拆分字符串。它的语法是 regexp_split_to_table(string, pattern [, flags ])。如果没有 pattern 匹配,则函数返回 string。如果至少有一个匹配,则对于每个匹配,它返回从上次匹配的末尾(或字符串的开始)到本次匹配的开始的文本。当不再有匹配时,它返回从上次匹配的末尾到字符串末尾的文本。flags 参数是一个可选项文本串,包含零个或多个单字母标识,以更改此函数的行为。regexp_split_to_table 支持 Table 9.24 中描述的标识。
The regexp_split_to_table function splits a string using a POSIX regular expression pattern as a delimiter. It has the syntax regexp_split_to_table(string, pattern [, flags ]). If there is no match to the pattern, the function returns the string. If there is at least one match, for each match it returns the text from the end of the last match (or the beginning of the string) to the beginning of the match. When there are no more matches, it returns the text from the end of the last match to the end of the string. The flags parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. regexp_split_to_table supports the flags described in Table 9.24.
regexp_split_to_array 函数的行为与 regexp_split_to_table 相同,不同之处在于 regexp_split_to_array 将其结果作为 text 数组返回。它的语法是 regexp_split_to_array(string, pattern [, flags ])。这些参数与 regexp_split_to_table 的参数相同。
The regexp_split_to_array function behaves the same as regexp_split_to_table, except that regexp_split_to_array returns its result as an array of text. It has the syntax regexp_split_to_array(string, pattern [, flags ]). The parameters are the same as for regexp_split_to_table.
举例:
Some examples:
SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '\s+') AS foo;
foo
-------
the
quick
brown
fox
jumps
over
the
lazy
dog
(9 rows)
SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '\s+');
regexp_split_to_array
-----------------------------------------------
{the,quick,brown,fox,jumps,over,the,lazy,dog}
(1 row)
SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
foo
-----
t
h
e
q
u
i
c
k
b
r
o
w
n
f
o
x
(16 rows)
如最后一个示例所示,regexp split 函数忽略在字符串的开头或结尾或紧跟前一个匹配之后出现的零长度匹配。这与其他 regexp 函数实现的严格 regexp 匹配定义相反,但实际上通常是最方便的行为。Perl 等其他软件系统也使用类似的定义。
As the last example demonstrates, the regexp split functions ignore zero-length matches that occur at the start or end of the string or immediately after a previous match. This is contrary to the strict definition of regexp matching that is implemented by the other regexp functions, but is usually the most convenient behavior in practice. Other software systems such as Perl use similar definitions.
regexp_substr 函数返回匹配 POSIX 正则表达式模式的子串,或在没有匹配时返回 NULL。它的语法是 regexp_substr(string, pattern [, start [, N [, flags [, subexpr ]]]]).通常情况下, pattern 是从字符串的开头开始在 string 中搜索的,但是如果提供了 start 参数,则从此字符索引开始搜索。如果指定了 N,则返回模式的 N 匹配,否则返回第一个匹配。flags 参数是一个可选项文本串,包含零个或多个单字母标识,以更改此函数的行为。 Table 9.24 中描述了受支持标识。对于包含带括号的子表达式的模式,subexpr 是指定感兴趣子表达式的整数:结果是匹配该子表达式的子串。子表达式按其左括号的顺序进行编号。当省略 subexpr 或使其为零时,结果是整个匹配,而不管是否有带括号的子表达式。
The regexp_substr function returns the substring that matches a POSIX regular expression pattern, or NULL if there is no match. It has the syntax regexp_substr(string, pattern [, start [, N [, flags [, subexpr ]]]]). pattern is searched for in string, normally from the beginning of the string, but if the start parameter is provided then beginning from that character index. If N is specified then the N'th match of the pattern is returned, otherwise the first match is returned. The flags parameter is an optional text string containing zero or more single-letter flags that change the function’s behavior. Supported flags are described in Table 9.24. For a pattern containing parenthesized subexpressions, subexpr is an integer indicating which subexpression is of interest: the result is the substring matching that subexpression. Subexpressions are numbered in the order of their leading parentheses. When subexpr is omitted or zero, the result is the whole match regardless of parenthesized subexpressions.
举例:
Some examples:
regexp_substr('number of your street, town zip, FR', '[^,]+', 1, 2)
town zip
regexp_substr('ABCDEFGHI', '(c..)(...)', 1, 1, 'i', 2)
FGH
9.7.3.1. Regular Expression Details #
PostgreSQL 的正则表达式是使用 Henry Spencer 编写的软件包实现的。以下是正则表达式的描述,其中大部分直接从其手册中复制而来。
PostgreSQL’s regular expressions are implemented using a software package written by Henry Spencer. Much of the description of regular expressions below is copied verbatim from his manual.
根据 POSIX 1003.2 定义的正则表达式(RE)有两种形式:extended RE 或 ERE(大致相当于 egrep)和 basic RE 或 BRE(大致相当于 ed)。PostgreSQL 支持这两种形式,还实现了一些不在 POSIX 标准中的扩展,但由于在 Perl 和 Tcl 等编程语言中可用而被广泛使用。使用这些非 POSIX 扩展的 RE 在本文档中称为 advanced RE 或 ARE。ARE 几乎是 ERE 的精确超集,但 BRE 存在多个记法不兼容(以及限制更多)。我们首先描述 ARE 和 ERE 形式,注意仅适用于 ARE 的功能,然后描述 BRE 的不同之处。
Regular expressions (REs), as defined in POSIX 1003.2, come in two forms: extended REs or EREs (roughly those of egrep), and basic REs or BREs (roughly those of ed). PostgreSQL supports both forms, and also implements some extensions that are not in the POSIX standard, but have become widely used due to their availability in programming languages such as Perl and Tcl. REs using these non-POSIX extensions are called advanced REs or AREs in this documentation. AREs are almost an exact superset of EREs, but BREs have several notational incompatibilities (as well as being much more limited). We first describe the ARE and ERE forms, noting features that apply only to AREs, and then describe how BREs differ.
Note
PostgreSQL 最初总是假定正则表达式遵循 ARE 规则。但是,可以通过在 RE 模式前添加 _embedded option_来选择更有限的 ERE 或 BRE 规则,如 Section 9.7.3.4中所述。这对于与期望完全遵循 POSIX 1003.2 规则的应用程序兼容可能很有用。
PostgreSQL always initially presumes that a regular expression follows the ARE rules. However, the more limited ERE or BRE rules can be chosen by prepending an embedded option to the RE pattern, as described in Section 9.7.3.4. This can be useful for compatibility with applications that expect exactly the POSIX 1003.2 rules.
正则表达式定义为一个或多个 branches,以 | 分隔。它匹配与其中一个分支相匹配的任何内容。
A regular expression is defined as one or more branches, separated by |. It matches anything that matches one of the branches.
分支是零个或多个 quantified atoms 或 constraints,串联而成。它匹配第一个的匹配,后跟第二个的匹配,依此类推;空分支匹配空字符串。
A branch is zero or more quantified atoms or constraints, concatenated. It matches a match for the first, followed by a match for the second, etc.; an empty branch matches the empty string.
量化的原子是一个 atom,后面可能跟一个 quantifier。如果没有量词,则它匹配一个原子匹配。如果有量词,则它可以匹配原子的多个匹配。atom 可以是 Table 9.17 中显示的任何可能性之一。可能的量词及其含义在 Table 9.18 中显示。
A quantified atom is an atom possibly followed by a single quantifier. Without a quantifier, it matches a match for the atom. With a quantifier, it can match some number of matches of the atom. An atom can be any of the possibilities shown in Table 9.17. The possible quantifiers and their meanings are shown in Table 9.18.
constraint 匹配空字符串,但只有在满足特定条件时才匹配。可以在原子可以使用的位置使用约束,但该约束不能后跟量词。 Table 9.19 中显示了简单的约束;稍后将描述更多约束。
A constraint matches an empty string, but matches only when specific conditions are met. A constraint can be used where an atom could be used, except it cannot be followed by a quantifier. The simple constraints are shown in Table 9.19; some more constraints are described later.
Table 9.17. Regular Expression Atoms
Atom |
Description |
(re) |
(where re is any regular expression) matches a match for re, with the match noted for possible reporting |
(?:re) |
as above, but the match is not noted for reporting (a “non-capturing” set of parentheses) (AREs only) |
. |
matches any single character |
[chars] |
a bracket expression, matching any one of the chars (see Section 9.7.3.2 for more detail) |
__k |
(where k is a non-alphanumeric character) matches that character taken as an ordinary character, e.g., \\ matches a backslash character |
__c |
where c is alphanumeric (possibly followed by other characters) is an escape, see Section 9.7.3.3 (AREs only; in EREs and BREs, this matches c) |
{ |
when followed by a character other than a digit, matches the left-brace character {; when followed by a digit, it is the beginning of a bound (see below) |
x |
where x is a single character with no other significance, matches that character |
RE 不能以反斜杠 (\) 结尾。
An RE cannot end with a backslash (\).
Note
如果您已关闭 standard_conforming_strings,则在字符串常量中编写的任何反斜杠都需要加倍。有关更多信息,请参阅 Section 4.1.2.1。
If you have standard_conforming_strings turned off, any backslashes you write in literal string constants will need to be doubled. See Section 4.1.2.1 for more information.
Table 9.18. Regular Expression Quantifiers
Quantifier |
Matches |
* |
a sequence of 0 or more matches of the atom |
+ |
a sequence of 1 or more matches of the atom |
? |
a sequence of 0 or 1 matches of the atom |
{m} |
a sequence of exactly m matches of the atom |
{m,} |
a sequence of m or more matches of the atom |
{m,n} |
a sequence of m through n (inclusive) matches of the atom; m cannot exceed n |
*? |
non-greedy version of * |
+? |
non-greedy version of + |
?? |
non-greedy version of ? |
{m}? |
non-greedy version of {m} |
{m,}? |
non-greedy version of {m,} |
{m,n}? |
non-greedy version of {m,n} |
使用 {…} 的形式称为 bounds 。绑定中的数字 m 和 n 是无符号十进制整数,允许的值范围为 0 至 255(包括)。
The forms using {…} are known as bounds. The numbers m and n within a bound are unsigned decimal integers with permissible values from 0 to 255 inclusive.
Non-greedy 量词(仅在 ARE 中可用)与它们对应的普通 (greedy)对应项匹配的可能性相同,但优先选择最小的数字,而不是最大的数字。有关更多详细信息,请参见 Section 9.7.3.5。
Non-greedy quantifiers (available in AREs only) match the same possibilities as their corresponding normal (greedy) counterparts, but prefer the smallest number rather than the largest number of matches. See Section 9.7.3.5 for more detail.
Note
量词不能紧跟另一个量词,例如 ** 是无效的。量词不能位于表达式的开头或子表达式开头,也不能紧跟 ^ 或 |。
A quantifier cannot immediately follow another quantifier, e.g., ** is invalid. A quantifier cannot begin an expression or subexpression or follow ^ or |.
Table 9.19. Regular Expression Constraints
Constraint |
Description |
^ |
matches at the beginning of the string |
$ |
matches at the end of the string |
(?=re) |
positive lookahead matches at any point where a substring matching re begins (AREs only) |
(?!re) |
negative lookahead matches at any point where no substring matching re begins (AREs only) |
(?⇐re) |
positive lookbehind matches at any point where a substring matching re ends (AREs only) |
(?<!re) |
negative lookbehind matches at any point where no substring matching re ends (AREs only) |
先行和后续约束不能包含 back references(参见 Section 9.7.3.3),且它们内部的所有括号都视为非捕获。
Lookahead and lookbehind constraints cannot contain back references (see Section 9.7.3.3), and all parentheses within them are considered non-capturing.
9.7.3.2. Bracket Expressions #
bracket expression 是一个字符列表,由 [] 包括。它通常匹配列表中的任何单个字符(但请参见下文)。如果列表以 ^ 开头,它将匹配列表中其余的 not。如果列表中的两个字符由 - 分隔,则这表示整理序列中这两个字符(包括此二者)之间的字符的全部范围,,例如,ASCII 中的 [0-9] 匹配任何十进制数字。两个范围共享一个端点属于非法情况,例如 a-c-e。范围极度依赖排序序列,因此可移植程序应避免依赖于它们。
A bracket expression is a list of characters enclosed in []. It normally matches any single character from the list (but see below). If the list begins with ^, it matches any single character not from the rest of the list. If two characters in the list are separated by -, this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, e.g., [0-9] in ASCII matches any decimal digit. It is illegal for two ranges to share an endpoint, e.g., a-c-e. Ranges are very collating-sequence-dependent, so portable programs should avoid relying on them.
若要将文字 ] 包含在列表中,请将其设为第一个字符(在 ^ 之后,如果使用该字符)。若要包含文字 -,请将其设为第一个或最后一个字符,或范围的第二个端点。若要使用文字 - 作为范围的第一个端点,请将其包括在 [. 和 .] 中,使其成为排序元素(请参见下文)。除这些字符外,一些使用 [ 的组合(请参见下一段)和转义字符(仅限 ARE)外,所有其他特殊字符在括号表达式中都会失去其特殊意义。尤其是,按照 ERE 或 BRE 规则 \ 不是特殊字符,尽管在 ARE 中它很特殊(作为转义符)。
To include a literal ] in the list, make it the first character (after ^, if that is used). To include a literal -, make it the first or last character, or the second endpoint of a range. To use a literal - as the first endpoint of a range, enclose it in [. and .] to make it a collating element (see below). With the exception of these characters, some combinations using [ (see next paragraphs), and escapes (AREs only), all other special characters lose their special significance within a bracket expression. In particular, \ is not special when following ERE or BRE rules, though it is special (as introducing an escape) in AREs.
在括号表达式中,由 [. 和 .] 包含的排序元素(字符、整理为单个字符的多字符序列,或任一元素的排序序列名称)表示该排序元素字符的序列。该序列被视为括号表达式列表的单个元素。这样,包含多字符排序元素的括号表达式就可以匹配多个字符,例如,如果排序序列包括 ch 排序元素,则 RE [[.ch.]]*c 匹配 chchcc 的前五个字符。
Within a bracket expression, a collating element (a character, a multiple-character sequence that collates as if it were a single character, or a collating-sequence name for either) enclosed in [. and .] stands for the sequence of characters of that collating element. The sequence is treated as a single element of the bracket expression’s list. This allows a bracket expression containing a multiple-character collating element to match more than one character, e.g., if the collating sequence includes a ch collating element, then the RE [[.ch.]]*c matches the first five characters of chchcc.
Note
PostgreSQL 目前不支持多字符整理元素。此信息描述了可能的未来行为。
PostgreSQL currently does not support multi-character collating elements. This information describes possible future behavior.
在括号表达式中,由 [= 和 =] 包含的排序元素是一个 equivalence class,它代表所有相当于该排序元素的排序元素(包括本身)构成的字符序列。(如果没有其他等效的排序元素,则处理方式与包络分隔符为 [. 和 .] 的情况相同。)例如,如果 o 和 ^ 是等价类别的成员,则 [[=o=]]、[[=^=]] 和 [o^] 是同义词。等价类别不能是范围的端点。
Within a bracket expression, a collating element enclosed in [= and =] is an equivalence class, standing for the sequences of characters of all collating elements equivalent to that one, including itself. (If there are no other equivalent collating elements, the treatment is as if the enclosing delimiters were [. and .].) For example, if o and ^ are the members of an equivalence class, then [[=o=]], [[=^=]], and [o^] are all synonymous. An equivalence class cannot be an endpoint of a range.
在括号表达式中,由 [: 和 :] 包含的字符类名称代表属于该类的所有字符的列表。字符类不能用作范围的端点。POSIX 标准定义了以下字符类名称:alnum(字母和数字)、alpha(字母)、blank(空格和制表符)、cntrl(控制字符)、digit(数字)、graph(空格除外的可打印字符)、lower(小写字母)、print(可打印字符,包括空格)、punct(标点)、space(任何空白)、upper(大写字母)、xdigit(十六进制数字)。这些标准字符类的行为通常在 7 位 ASCII 集中的字符在各个平台上是一致的。某个特定的非 ASCII 字符是否被认为属于其中一个类,这取决于用于正则表达式函数或运算符的 collation(请参见 Section 24.2),或者默认情况下取决于数据库的 LC_CTYPE 区域设置(请参见 Section 24.1)。即使在同名的区域设置中,非 ASCII 字符的分类也可能因平台而异。(但是,C 区域设置绝不会将任何非 ASCII 字符视为属于其中任何一个类。)除了这些标准字符类之外,PostgreSQL 定义了 word 字符类,它与 alnum 相同,此外还包含下划线 (_) 字符,以及 ascii 字符类,它只包含 7 位 ASCII 集。
Within a bracket expression, the name of a character class enclosed in [: and :] stands for the list of all characters belonging to that class. A character class cannot be used as an endpoint of a range. The POSIX standard defines these character class names: alnum (letters and numeric digits), alpha (letters), blank (space and tab), cntrl (control characters), digit (numeric digits), graph (printable characters except space), lower (lower-case letters), print (printable characters including space), punct (punctuation), space (any white space), upper (upper-case letters), and xdigit (hexadecimal digits). The behavior of these standard character classes is generally consistent across platforms for characters in the 7-bit ASCII set. Whether a given non-ASCII character is considered to belong to one of these classes depends on the collation that is used for the regular-expression function or operator (see Section 24.2), or by default on the database’s LC_CTYPE locale setting (see Section 24.1). The classification of non-ASCII characters can vary across platforms even in similarly-named locales. (But the C locale never considers any non-ASCII characters to belong to any of these classes.) In addition to these standard character classes, PostgreSQL defines the word character class, which is the same as alnum plus the underscore (_) character, and the ascii character class, which contains exactly the 7-bit ASCII set.
括号表达式有两种特殊情况:括号表达式 [[:<:]] 和 [[:>:]] 是约束,分别匹配单词开头和结尾处的空字符串。单词定义为不以单词字符开头或结尾的单词字符序列。单词字符是属于 word 字符类的任何字符,即任何字母、数字或下划线。这是一个扩展,与 POSIX 1003.2 兼容但不被指定,并且应谨慎用于旨在移植到其他系统的软件中。下面描述的约束转义字符通常更可取;它们不更标准,但更容易键入。
There are two special cases of bracket expressions: the bracket expressions [[:<:]] and [[:>:]] are constraints, matching empty strings at the beginning and end of a word respectively. A word is defined as a sequence of word characters that is neither preceded nor followed by word characters. A word character is any character belonging to the word character class, that is, any letter, digit, or underscore. This is an extension, compatible with but not specified by POSIX 1003.2, and should be used with caution in software intended to be portable to other systems. The constraint escapes described below are usually preferable; they are no more standard, but are easier to type.
9.7.3.3. Regular Expression Escapes #
Escapes 是以 \ 之后跟字母数字字符开头的特殊序列。转义有若干种类:字符输入、类简写、约束转义和反向引用。一个 \ 后跟一个字母数字字符但不是一个有效转义对于 ARE 是非法的。在 ERE 中,没有转义:在括号表达式外部,一个 \ 后跟一个字母数字字符仅仅代表该字符作为普通字符,而在括号表达式内部,\ 是一个普通字符。(后者正是 ERE 和 ARE 之间的实际差异。)
Escapes are special sequences beginning with \ followed by an alphanumeric character. Escapes come in several varieties: character entry, class shorthands, constraint escapes, and back references. A \ followed by an alphanumeric character but not constituting a valid escape is illegal in AREs. In EREs, there are no escapes: outside a bracket expression, a \ followed by an alphanumeric character merely stands for that character as an ordinary character, and inside a bracket expression, \ is an ordinary character. (The latter is the one actual incompatibility between EREs and AREs.)
Character-entry escapes 的存在是为了更容易在 RE 中指定不可打印字符和其他不方便的字符。它们显示在 Table 9.20 中。
Character-entry escapes exist to make it easier to specify non-printing and other inconvenient characters in REs. They are shown in Table 9.20.
Class-shorthand escapes 提供了对某些常用字符类的简写。它们显示在 Table 9.21 中。
Class-shorthand escapes provide shorthands for certain commonly-used character classes. They are shown in Table 9.21.
constraint escape 是一个约束,它匹配在满足特定条件时形成的空字符串,用转义字符写成。它们显示在 Table 9.22 中。
A constraint escape is a constraint, matching the empty string if specific conditions are met, written as an escape. They are shown in Table 9.22.
back reference ( __n ) 匹配由数字 n 指定的前一个圆括号子表达式匹配的字符串(请参阅 Table 9.23 )。例如, ([bc])\1 匹配 bb 或 cc ,但不匹配 bc 或 cb 。子表达式必须完全位于正则表达式中的反向引用之前。子表达式按照其左括号的顺序进行编号。非捕获括号不定义子表达式。反向引用只考虑由引用的子表达式匹配的字符串字符,而不考虑其中包含的任何约束。例如, (^\d)\1 将匹配 22 。
A back reference (_n) matches the same string matched by the previous parenthesized subexpression specified by the number _n (see Table 9.23). For example, ([bc])\1 matches bb or cc but not bc or cb. The subexpression must entirely precede the back reference in the RE. Subexpressions are numbered in the order of their leading parentheses. Non-capturing parentheses do not define subexpressions. The back reference considers only the string characters matched by the referenced subexpression, not any constraints contained in it. For example, (^\d)\1 will match 22.
Table 9.20. Regular Expression Character-Entry Escapes
Escape |
Description |
\a |
alert (bell) character, as in C |
\b |
backspace, as in C |
\B |
synonym for backslash (\) to help reduce the need for backslash doubling |
\c__X |
(where X is any character) the character whose low-order 5 bits are the same as those of X, and whose other bits are all zero |
\e |
the character whose collating-sequence name is ESC, or failing that, the character with octal value 033 |
\f |
form feed, as in C |
\n |
newline, as in C |
\r |
carriage return, as in C |
\t |
horizontal tab, as in C |
\u__wxyz |
(where wxyz is exactly four hexadecimal digits) the character whose hexadecimal value is 0x__wxyz |
\U__stuvwxyz |
(where stuvwxyz is exactly eight hexadecimal digits) the character whose hexadecimal value is 0x__stuvwxyz |
\v |
vertical tab, as in C |
\x__hhh |
(where hhh is any sequence of hexadecimal digits) the character whose hexadecimal value is 0x__hhh (a single character no matter how many hexadecimal digits are used) |
\0 |
the character whose value is 0 (the null byte) |
__xy |
(where xy is exactly two octal digits, and is not a back reference) the character whose octal value is 0__xy |
__xyz |
(where xyz is exactly three octal digits, and is not a back reference) the character whose octal value is 0__xyz |
十六进制数字为 0-9、a-f 和 A-F。八进制数字为 0-7。
Hexadecimal digits are 0-9, a-f, and A-F. Octal digits are 0-7.
指定 ASCII 范围 (0-127) 之外的值的数字字符输入转义的含义依赖于数据库编码。当编码为 UTF-8 时,转义值等同于 Unicode 代码点,例如 \u1234 表示字符 U+1234。对于其他多字节编码,字符输入转义通常仅指定字符的字节值的串联。如果转义值与数据库编码中的任何合法字符都不对应,则不会引发错误,但它永远不会匹配任何数据。
Numeric character-entry escapes specifying values outside the ASCII range (0–127) have meanings dependent on the database encoding. When the encoding is UTF-8, escape values are equivalent to Unicode code points, for example \u1234 means the character U+1234. For other multibyte encodings, character-entry escapes usually just specify the concatenation of the byte values for the character. If the escape value does not correspond to any legal character in the database encoding, no error will be raised, but it will never match any data.
字符项转义始终被视为普通字符。例如,\135 在 ASCII 中是 ],但 \135 不终止括号表达式。
The character-entry escapes are always taken as ordinary characters. For example, \135 is ] in ASCII, but \135 does not terminate a bracket expression.
Table 9.21. Regular Expression Class-Shorthand Escapes
Escape |
Description |
\d |
|
\s |
|
\w |
|
\D |
matches any non-digit, like [^[:digit:]] |
\S |
matches any non-whitespace character, like [^[:space:]] |
\W |
matches any non-word character, like [^[:word:]] |
类简写转义字符在括号表达式中也能用,尽管上面显示的定义在该上下文中在语法上并不是完全有效的。例如,[a-c\d] 等同于 [a-c[:digit:]]。
The class-shorthand escapes also work within bracket expressions, although the definitions shown above are not quite syntactically valid in that context. For example, [a-c\d] is equivalent to [a-c[:digit:]].
Table 9.22. Regular Expression Constraint Escapes
Escape |
Description |
\A |
matches only at the beginning of the string (see Section 9.7.3.5 for how this differs from ^) |
\m |
matches only at the beginning of a word |
\M |
matches only at the end of a word |
\y |
matches only at the beginning or end of a word |
\Y |
matches only at a point that is not the beginning or end of a word |
\Z |
matches only at the end of the string (see Section 9.7.3.5 for how this differs from $) |
单词的定义如上文中的 [[:<:]] 和 [[:>:]] 中的规范。括号表达式中约束转义是非法的。
A word is defined as in the specification of [[:<:]] and [[:>:]] above. Constraint escapes are illegal within bracket expressions.
Table 9.23. Regular Expression Back References
Escape |
Description |
__m |
(where m is a nonzero digit) a back reference to the m'th subexpression |
__mnn |
(where m is a nonzero digit, and nn is some more digits, and the decimal value mnn is not greater than the number of closing capturing parentheses seen so far) a back reference to the mnn'th subexpression |
Note
八进制字符输入转义和反向引用之间存在固有模糊性,这是由以下启发式解决的,如上所述。前导零始终表示八进制转义。单个非零数字,后跟没有其他数字,始终视为反向引用。非零开头的多位数字序列在其后是适当子表达式(即该数字在反向引用的合法范围内)的情况下视为反向引用,否则视为八进制。
There is an inherent ambiguity between octal character-entry escapes and back references, which is resolved by the following heuristics, as hinted at above. A leading zero always indicates an octal escape. A single non-zero digit, not followed by another digit, is always taken as a back reference. A multi-digit sequence not starting with a zero is taken as a back reference if it comes after a suitable subexpression (i.e., the number is in the legal range for a back reference), and otherwise is taken as octal.
9.7.3.4. Regular Expression Metasyntax #
除了上面描述的主语法以外,还有某些特殊形式和各种语法工具可用。
In addition to the main syntax described above, there are some special forms and miscellaneous syntactic facilities available.
RE 可以从两个 director 前缀之一开始。如果 RE 以 *:, the rest of the RE is taken as an ARE. (This normally has no effect in PostgreSQL, since REs are assumed to be AREs; but it does have an effect if ERE or BRE mode had been specified by the flags parameter to a regex function.) If an RE begins with *= 开头,则 RE 的其余部分将视为文字字符串,其中所有字符都被视为常规字符。
An RE can begin with one of two special director prefixes. If an RE begins with :, the rest of the RE is taken as an ARE. (This normally has no effect in PostgreSQL, since REs are assumed to be AREs; but it does have an effect if ERE or BRE mode had been specified by the flags parameter to a regex function.) If an RE begins with =, the rest of the RE is taken to be a literal string, with all characters considered ordinary characters.
ARE 可以以 embedded options 开头:一个序列 (?_xyz)_ (其中 xyz 为一个或多个字母字符)指定影响正则表达式 (RE) 其余部分的选项。这些选项将覆盖任何先前确定的选项——特别是它们可以覆盖正则表达式运算符隐含的大小写敏感行为或正则表达式函数的 flags 参数。可用选项字母显示在 Table 9.24 中。请注意,这些相同的选项字母在正则表达式函数的 flags 参数中使用。
An ARE can begin with embedded options: a sequence (?xyz) (where xyz is one or more alphabetic characters) specifies options affecting the rest of the RE. These options override any previously determined options — in particular, they can override the case-sensitivity behavior implied by a regex operator, or the flags parameter to a regex function. The available option letters are shown in Table 9.24. Note that these same option letters are used in the flags parameters of regex functions.
Table 9.24. ARE Embedded-Option Letters
Option |
Description |
b |
rest of RE is a BRE |
c |
case-sensitive matching (overrides operator type) |
e |
rest of RE is an ERE |
i |
case-insensitive matching (see Section 9.7.3.5) (overrides operator type) |
m |
historical synonym for n |
n |
newline-sensitive matching (see Section 9.7.3.5) |
p |
partial newline-sensitive matching (see Section 9.7.3.5) |
q |
rest of RE is a literal (“quoted”) string, all ordinary characters |
s |
non-newline-sensitive matching (default) |
t |
tight syntax (default; see below) |
w |
inverse partial newline-sensitive (“weird”) matching (see Section 9.7.3.5) |
x |
expanded syntax (see below) |
嵌入式选项在序列结束时生效 )。它们只能出现在 ARE 的开头(如果有的话,在 ***: 之后)。
Embedded options take effect at the ) terminating the sequence. They can appear only at the start of an ARE (after the ***: director if any).
除了通常的 (tight) RE 语法(其中所有字符都重要),还有一个 expanded 语法,可以通过指定嵌入式 x 选项来获取。在展开的语法中,RE 中的空格字符将被忽略,并且 # 和以下换行符(或 RE 的末尾)之间的所有字符也将被忽略。这允许对复杂的 RE 进行段落和注释。基本规则有三个例外:
In addition to the usual (tight) RE syntax, in which all characters are significant, there is an expanded syntax, available by specifying the embedded x option. In the expanded syntax, white-space characters in the RE are ignored, as are all characters between a # and the following newline (or the end of the RE). This permits paragraphing and commenting a complex RE. There are three exceptions to that basic rule:
出于这个目的,空格字符是空白、制表符、换行符和属于 space 字符类的任何字符。
For this purpose, white-space characters are blank, tab, newline, and any character that belongs to the space character class.
最后,在 ARE 中,在方括号表达式外部,序列 (?#_ttt)_ (其中 ttt 是任何不包含 ) 的文本)是一个注释,完全忽略。同样,这在多字符符号的字符之间不被允许,如 (?: 。此类注释更多是一个历史文物,而不是一个有用的工具,不建议使用;改用扩展语法。
Finally, in an ARE, outside bracket expressions, the sequence (?#ttt) (where ttt is any text not containing a )) is a comment, completely ignored. Again, this is not allowed between the characters of multi-character symbols, like (?:. Such comments are more a historical artifact than a useful facility, and their use is deprecated; use the expanded syntax instead.
None 如果一个初始 ***= 导向已经指定将用户的输入视为一个文字字符串而不是 RE,这些元语法扩展的可用性将是有效的。
None of these metasyntax extensions is available if an initial ***= director has specified that the user’s input be treated as a literal string rather than as an RE.
9.7.3.5. Regular Expression Matching Rules #
如果 RE 识别給定字符串中超过一个子字符串,则 RE 匹配字符串中最先开始的子字符串。如果 RE 匹配以该点开始超过一个子字符串,则根据 RE 是 greedy 还是 non-greedy,将采用最长可能的匹配或最短可能的匹配。
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, either the longest possible match or the shortest possible match will be taken, depending on whether the RE is greedy or non-greedy.
RE 是贪婪的还是非贪婪的由以下规则确定:
Whether an RE is greedy or not is determined by the following rules:
上述规则不仅将贪婪属性与单独的量化原子关联,还将贪婪属性与包含量化原子的分支和整个 RE 关联。这意味着匹配是以分支或整个 RE 匹配最长或最短可能的子字符串的方式完成的 as a whole。一旦确定整个匹配的长度,根据该子表达式的贪婪属性确定与任何特定子表达式匹配的部分,RE 中较早开始的子表达式比较后开始的子表达式优先。
The above rules associate greediness attributes not only with individual quantified atoms, but with branches and entire REs that contain quantified atoms. What that means is that the matching is done in such a way that the branch, or whole RE, matches the longest or shortest possible substring as a whole. Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.
这是表示其含义的一个示例:
An example of what this means:
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
Result: 123
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
Result: 1
在第一个案例中,由于 Y* 是贪婪的,所以整个 RE 是贪婪的。它可以从 Y 开始匹配,并且匹配从那里开始的最长可能的字符串,即 Y123。输出是该字符串的圆括号部分,或 123。在第二个案例中,由于 Y*? 是非贪婪的,所以整个 RE 是非贪婪的。它可以从 Y 开始匹配,并且匹配从那里开始的最短可能的字符串,即 Y1。子表达式 [0-9]{1,3} 是贪婪的,但它不能改变对整体匹配长度的决策;因此它被迫仅匹配 1。
In the first case, the RE as a whole is greedy because Y* is greedy. It can match beginning at the Y, and it matches the longest possible string starting there, i.e., Y123. The output is the parenthesized part of that, or 123. In the second case, the RE as a whole is non-greedy because Y*? is non-greedy. It can match beginning at the Y, and it matches the shortest possible string starting there, i.e., Y1. The subexpression [0-9]{1,3} is greedy but it cannot change the decision as to the overall match length; so it is forced to match just 1.
总之,当 RE 同时包含贪婪和非贪婪子表达式时,根据分配给整个 RE 的属性,总匹配长度尽可能长或尽可能短。分配给子表达式的属性仅影响相对于彼此她们可以“吃掉”多少那个匹配。
In short, when an RE contains both greedy and non-greedy subexpressions, the total match length is either as long as possible or as short as possible, according to the attribute assigned to the whole RE. The attributes assigned to the subexpressions only affect how much of that match they are allowed to “eat” relative to each other.
定量词 {1,1} 和 {1,1}? 可分别用于强制子表达式或整个正则表达式的贪婪或非贪婪。这在你需要让整个正则表达式具有不同于从其元素推导出的贪婪属性时很有用。例如,假设我们尝试将包含一些数字的字符串分离为数字及之前和之后的各部分。我们可以尝试像这样:
The quantifiers {1,1} and {1,1}? can be used to force greediness or non-greediness, respectively, on a subexpression or a whole RE. This is useful when you need the whole RE to have a greediness attribute different from what’s deduced from its elements. As an example, suppose that we are trying to separate a string containing some digits into the digits and the parts before and after them. We might try to do that like this:
SELECT regexp_match('abc01234xyz', '(.*)(\d+)(.*)');
Result: {abc0123,4,xyz}
这不起作用:第一个 .* 是贪婪的,因此它会“吃掉”它所能吃掉的所有内容,让 \d+ 匹配在最后一个可能的位置,即最后一个数字。我们可以尝试通过使其为非贪婪来修复:
That didn’t work: the first .* is greedy so it “eats” as much as it can, leaving the \d+ to match at the last possible place, the last digit. We might try to fix that by making it non-greedy:
SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
Result: {abc,0,""}
这也不起作用,因为现在整个正则表达式都是非贪婪的,因此它会尽快结束总体匹配。我们可以强制整个正则表达式为贪婪来得到我们想要的东西:
That didn’t work either, because now the RE as a whole is non-greedy and so it ends the overall match as soon as possible. We can get what we want by forcing the RE as a whole to be greedy:
SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
Result: {abc,01234,xyz}
分开控制正则表达式的总体贪婪与它的各个组件的贪婪,可让我们能够高度灵活地处理可变长度模式。
Controlling the RE’s overall greediness separately from its components' greediness allows great flexibility in handling variable-length patterns.
决定什么是一个较长或较短的匹配时,匹配长度以字符衡量,不是按排序元素衡量。空字符串被认为比根本不匹配更长。例如:bb* 匹配 abbbc 的中间三个字符;(week|wee)(night|knights) 匹配 weeknights 的所有十个字符;当 (.).* is matched against abc the parenthesized subexpression matches all three characters; and when (a)* 与 bc 匹配时,整个正则表达式和带括号的子表达式都匹配空字符串。
When deciding what is a longer or shorter match, match lengths are measured in characters, not collating elements. An empty string is considered longer than no match at all. For example: bb* matches the three middle characters of abbbc; (week|wee)(night|knights) matches all ten characters of weeknights; when (.).* is matched against abc the parenthesized subexpression matches all three characters; and when (a)* is matched against bc both the whole RE and the parenthesized subexpression match an empty string.
如果指定不区分大小写的匹配,其效果就好像全部大小写区别都已从字母中消失一样。当作为普通字符出现在方括号表达式外的存在多个大小写的字母时,它将有效地转换为包含这两个大小写的方括号表达式,例如,x_变为[xX]。当其出现在方括号表达式内时,它的所有大小写对应项都将添加到方括号表达式中,例如,[x]变为[xX],而[x]变为[xX]_。
If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket expression, it is effectively transformed into a bracket expression containing both cases, e.g., x becomes [xX]. When it appears inside a bracket expression, all case counterparts of it are added to the bracket expression, e.g., [x] becomes [xX] and [^x] becomes [^xX].
如果指定了区分换行符的匹配,则_.和使用^的方括号表达式将绝不会匹配换行符(因此,除非 RE 明确包括换行符,否则匹配将不会越过行),且^和$将在换行符后和换行符前匹配空字符串,除了分别在字符串的开头和结尾进行匹配。但 ARE 转义字符\A_和_\Z_继续匹配字符串_only_的开头或结尾。此外,无论此模式如何,字符类别速记符号_\D_和_\W_都将匹配换行符。(在 PostgreSQL 14 之前,它们在区分换行符的模式中不匹配换行符。编写_[[:digit:]]或[[:word:]]_可获得旧行为。)
If newline-sensitive matching is specified, . and bracket expressions using ^ will never match the newline character (so that matches will not cross lines unless the RE explicitly includes a newline) and ^ and $ will match the empty string after and before a newline respectively, in addition to matching at beginning and end of string respectively. But the ARE escapes \A and \Z continue to match beginning or end of string only. Also, the character class shorthands \D and \W will match a newline regardless of this mode. (Before PostgreSQL 14, they did not match newlines when in newline-sensitive mode. Write [^[:digit:]] or [^[:word:]] to get the old behavior.)
如果指定部分区分换行符的匹配,这会影响 . 和使用 ^ 的括号表达式,与区分换行符的匹配一致,但不会影响 ^ 和 $。
If partial newline-sensitive matching is specified, this affects . and bracket expressions as with newline-sensitive matching, but not ^ and $.
如果指定逆部分区分换行符的匹配,这会影响 ^ 和 $,与区分换行符的匹配一致,但不会影响 . 和括号表达式。这不是很有用,但出于对称性的考虑还是提供了它。
If inverse partial newline-sensitive matching is specified, this affects ^ and $ as with newline-sensitive matching, but not . and bracket expressions. This isn’t very useful but is provided for symmetry.
9.7.3.6. Limits and Compatibility #
在此实现中,没有对正则表达式的长度施加任何特定的限制。但是,旨在高度可移植的程序不应采用长于 256 字节的正则表达式,因为兼容 POSIX 的实现可能会拒绝接受此类正则表达式。
No particular limit is imposed on the length of REs in this implementation. However, programs intended to be highly portable should not employ REs longer than 256 bytes, as a POSIX-compliant implementation can refuse to accept such REs.
ARE 的唯一实际与 POSIX ERE 不兼容的特性是 \ 在括号表达式中没有失去它的特殊含义。所有其他 ARE 特性都使用在 POSIX ERE 中是非法或具有未定义或未指定效果的语法;*** 指导器的语法同样也不属于 BRE 或 ERE 的 POSIX 语法。
The only feature of AREs that is actually incompatible with POSIX EREs is that \ does not lose its special significance inside bracket expressions. All other ARE features use syntax which is illegal or has undefined or unspecified effects in POSIX EREs; the *** syntax of directors likewise is outside the POSIX syntax for both BREs and EREs.
许多 ARE 扩展从 Perl 借用而来,但其中一些已被更改以清理它们,并且一些 Perl 扩展不存在。值得注意的不兼容性包括 \b、\B、对尾随换行符缺乏特殊处理、对受区分换行符匹配影响的事物增加了补集括号表达式、对前瞻/后顾约束中的括号和反向引用的限制,以及最长/最短匹配(而不是首次匹配)匹配语义。
Many of the ARE extensions are borrowed from Perl, but some have been changed to clean them up, and a few Perl extensions are not present. Incompatibilities of note include \b, \B, the lack of special treatment for a trailing newline, the addition of complemented bracket expressions to the things affected by newline-sensitive matching, the restrictions on parentheses and back references in lookahead/lookbehind constraints, and the longest/shortest-match (rather than first-match) matching semantics.
9.7.3.7. Basic Regular Expressions #
BRE 与 ERE 在几个方面存在差异。在 BRE 中,|、+_和?是普通字符,并且没有等效的功能。边界的分隔符为\{和\}_,而_{和}_本身是普通字符。嵌套子表达式的圆括号为_\(和\),而(和)本身是普通字符。^是普通字符,但 RE 的开头或带括号的子表达式的开头除外,$是普通字符,但 RE 的结尾或带括号的子表达式的结尾除外,而*是普通字符,如果它出现在 RE 的开头或带括号的子表达式的开头(可能在^开头之后)。最后,可以使用个位数反向引用,并且<_和_>_分别为_[[:<:]]和[[:>:]]_的同义词;在 BRE 中没有其他转义字符可用。
BREs differ from EREs in several respects. In BREs, |, +, and ? are ordinary characters and there is no equivalent for their functionality. The delimiters for bounds are \{ and \}, with { and } by themselves ordinary characters. The parentheses for nested subexpressions are \( and \), with ( and ) by themselves ordinary characters. ^ is an ordinary character except at the beginning of the RE or the beginning of a parenthesized subexpression, $ is an ordinary character except at the end of the RE or the end of a parenthesized subexpression, and * is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesized subexpression (after a possible leading ^). Finally, single-digit back references are available, and \< and \> are synonyms for [[:<:]] and [[:>:]] respectively; no other escapes are available in BREs.
9.7.3.8. Differences from SQL Standard and XQuery #
自 SQL:2008 起,SQL 标准包括根据 XQuery 正则表达式标准执行模式匹配的正则表达式运算符和函数:
Since SQL:2008, the SQL standard includes regular expression operators and functions that performs pattern matching according to the XQuery regular expression standard:
PostgreSQL 目前不实现这些运算符和函数。你可以在每种情况下获得近似相等的函数,如 Table 9.25中所示。(此表已省略了双方的各种可选从句。)
PostgreSQL does not currently implement these operators and functions. You can get approximately equivalent functionality in each case as shown in Table 9.25. (Various optional clauses on both sides have been omitted in this table.)
Table 9.25. Regular Expression Functions Equivalencies
SQL standard |
PostgreSQL |
string_ LIKE_REGEX _pattern |
regexp_like(_string, pattern)_ or string_ ~ _pattern |
OCCURRENCES_REGEX(_pattern IN string)_ |
regexp_count(_string, pattern)_ |
POSITION_REGEX(_pattern IN string)_ |
regexp_instr(_string, pattern)_ |
SUBSTRING_REGEX(_pattern IN string)_ |
regexp_substr(_string, pattern)_ |
TRANSLATE_REGEX(_pattern IN string WITH replacement)_ |
regexp_replace(_string, pattern, replacement)_ |
PostgreSQL 提供的正则表达式函数在许多其他 SQL 实现中也可以使用,而 SQL 标准函数的实现并不像它们那样广泛。正则表达式语法的一些细节可能因每个实现而异。
Regular expression functions similar to those provided by PostgreSQL are also available in a number of other SQL implementations, whereas the SQL-standard functions are not as widely implemented. Some of the details of the regular expression syntax will likely differ in each implementation.
SQL 标准运算符和函数使用 XQuery 正则表达式,它与上面描述的 ARE 语法非常接近。现有的基于 POSIX 的正则表达式特性与 XQuery 正则表达式之间的显著差别包括:
The SQL-standard operators and functions use XQuery regular expressions, which are quite close to the ARE syntax described above. Notable differences between the existing POSIX-based regular-expression feature and XQuery regular expressions include: