Postgresql 中文操作指南

24.2. Collation Support #

排序功能允许按列甚至按操作指定数据的排序顺序和字符分类行为。这缓解了以下限制:在创建数据库后,不能更改其 LC_COLLATELC_CTYPE 设置。

The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that the LC_COLLATE and LC_CTYPE settings of a database cannot be changed after its creation.

24.2.1. Concepts #

从概念上来讲,可并列数据类型的所有表达式都具有排序规则。(内置的可并列数据类型有 textvarcharchar 。用户定义的基本类型也可以标记为可并列,当然,可并列数据类型上的 [role="bare"]glossary.html#GLOSSARY-DOMAIN domain 也是可并列的。)如果表达式是列引用,则表达式的排序规则是该列已定义的排序规则。如果表达式是常量,则该排序规则是常量的数据类型的默认排序规则。如下所述,更复杂的表达式的排序规则来自其输入的排序规则。

Conceptually, every expression of a collatable data type has a collation. (The built-in collatable data types are text, varchar, and char. User-defined base types can also be marked collatable, and of course a [role="bare"]glossary.html#GLOSSARY-DOMAINdomain over a collatable data type is collatable.) If the expression is a column reference, the collation of the expression is the defined collation of the column. If the expression is a constant, the collation is the default collation of the data type of the constant. The collation of a more complex expression is derived from the collations of its inputs, as described below.

表达式的排序规则可以是“默认”排序规则,这意味着针对数据库定义的区域设置。表达式的排序规则也可能是不确定的。在这些情况下,排序操作和其他需要知道排序规则的操作将失败。

The collation of an expression can be the “default” collation, which means the locale settings defined for the database. It is also possible for an expression’s collation to be indeterminate. In such cases, ordering operations and other operations that need to know the collation will fail.

当数据库系统必须执行排序或字符分类时,它将使用输入表达式的排序规则。例如,这发生在 ORDER BY 子句和函数或运算符调用(如 <)中。要应用于 ORDER BY 子句的排序规则仅仅是排序键的排序规则。要应用于函数或运算符调用的排序规则源自参数,如下所述。除了比较运算符外,排序规则还被用于在大小写字母之间转换的函数(如 lowerupperinitcap)、模式匹配运算符以及 to_char 和相关函数。

When the database system has to perform an ordering or a character classification, it uses the collation of the input expression. This happens, for example, with ORDER BY clauses and function or operator calls such as <. The collation to apply for an ORDER BY clause is simply the collation of the sort key. The collation to apply for a function or operator call is derived from the arguments, as described below. In addition to comparison operators, collations are taken into account by functions that convert between lower and upper case letters, such as lower, upper, and initcap; by pattern matching operators; and by to_char and related functions.

对于函数或运算符调用,通过检查参数排序规则而派生的排序规则在运行时用于执行指定的操作。如果函数或运算符调用的结果是可排序的数据类型,则排序规则在解析时还用作函数或运算符表达式的定义的排序规则,以防有需要了解其排序规则的周围表达式。

For a function or operator call, the collation that is derived by examining the argument collations is used at run time for performing the specified operation. If the result of the function or operator call is of a collatable data type, the collation is also used at parse time as the defined collation of the function or operator expression, in case there is a surrounding expression that requires knowledge of its collation.

表达式的 collation derivation 可以是隐式的或显式的。当一个表达式中出现多个不同的排序规则时,这种区别会影响排序规则的组合方式。当使用 COLLATE 子句时,就会发生显式的排序规则派生;所有其他排序规则派生都是隐式的。当需要组合多个排序规则时,例如在函数调用中,会使用以下规则:

The collation derivation of an expression can be implicit or explicit. This distinction affects how collations are combined when multiple different collations appear in an expression. An explicit collation derivation occurs when a COLLATE clause is used; all other collation derivations are implicit. When multiple collations need to be combined, for example in a function call, the following rules are used:

例如,考虑此表定义:

For example, consider this table definition:

CREATE TABLE test1 (
    a text COLLATE "de_DE",
    b text COLLATE "es_ES",
    ...
);

那么在

Then in

SELECT a < 'foo' FROM test1;

@{16} 比较是根据 de_DE 规则执行的,因为表达式将隐式派生的排序规则与默认排序规则组合在一起。但是在

the < comparison is performed according to de_DE rules, because the expression combines an implicitly derived collation with the default collation. But in

SELECT a < ('foo' COLLATE "fr_FR") FROM test1;

比较使用 fr_FR 规则执行,因为显式的排序规则派生覆盖了隐式派生。此外,给定

the comparison is performed using fr_FR rules, because the explicit collation derivation overrides the implicit one. Furthermore, given

SELECT a < b FROM test1;

解析器无法确定应用哪个排序规则,因为 ab 列的隐式排序规则冲突。由于 < 运算符确实需要知道要使用哪个排序规则,因此这将导致错误。可以通过将显式排序规则说明符附加到任一输入表达式来解决此错误,如下所示:

the parser cannot determine which collation to apply, since the a and b columns have conflicting implicit collations. Since the < operator does need to know which collation to use, this will result in an error. The error can be resolved by attaching an explicit collation specifier to either input expression, thus:

SELECT a < b COLLATE "de_DE" FROM test1;

或等效地

or equivalently

SELECT a COLLATE "de_DE" < b FROM test1;

另一方面,在结构上类似的情况下

On the other hand, the structurally similar case

SELECT a || b FROM test1;

不会导致错误,因为 || 运算符不关心排序规则:无论排序规则如何,其结果都是相同的。

does not result in an error, because the || operator does not care about collations: its result is the same regardless of the collation.

分配给函数或运算符的组合输入表达式的排序规则也被认为适用于函数或运算符的结果(如果函数或运算符提供可排序数据类型的结果)。所以在

The collation assigned to a function or operator’s combined input expressions is also considered to apply to the function or operator’s result, if the function or operator delivers a result of a collatable data type. So, in

SELECT * FROM test1 ORDER BY a || 'foo';

将根据 de_DE 规则执行排序。但这个查询:

the ordering will be done according to de_DE rules. But this query:

SELECT * FROM test1 ORDER BY a || b;

会导致错误,因为即使 || 运算符不需要知道排序规则,但 ORDER BY 子句需要。与以前一样,冲突可以用显式排序规则说明符解决:

results in an error, because even though the || operator doesn’t need to know a collation, the ORDER BY clause does. As before, the conflict can be resolved with an explicit collation specifier:

SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";

24.2.2. Managing Collations #

排序规则是将 SQL 名称映射到操作系统中安装的库提供的区域设置的 SQL 架构对象。排序规则定义有一个 provider,用于指定哪个库提供区域设置数据。一个标准提供程序名称是 libc,它使用操作系统 C 库提供的区域设置。这些区域设置是操作系统提供的大多数工具使用的区域设置。另一个提供程序是 icu,它使用外部 ICU 库。只有在构建 PostgreSQL 时配置了对 ICU 的支持时,才能使用 ICU 区域设置。

A collation is an SQL schema object that maps an SQL name to locales provided by libraries installed in the operating system. A collation definition has a provider that specifies which library supplies the locale data. One standard provider name is libc, which uses the locales provided by the operating system C library. These are the locales used by most tools provided by the operating system. Another provider is icu, which uses the external ICU library. ICU locales can only be used if support for ICU was configured when PostgreSQL was built.

libc 提供的并列对象映射到 LC_COLLATELC_CTYPE 设置的组合,为 setlocale() 系统库函数调用所接受。(顾名思义,并列的主要目的是设置 LC_COLLATE,它控制排序顺序。但是在实践中,很少需要设置与 LC_COLLATE 不同的 LC_CTYPE ,因此将这些合并到一个概念中比为每个表达式创建另一个设置 LC_CTYPE 的基础架构更为方便。)此外,libc 并列与字符集编码相关(参见 Section 24.3)。相同并列名称可能存在于不同的编码中。

A collation object provided by libc maps to a combination of LC_COLLATE and LC_CTYPE settings, as accepted by the setlocale() system library call. (As the name would suggest, the main purpose of a collation is to set LC_COLLATE, which controls the sort order. But it is rarely necessary in practice to have an LC_CTYPE setting that is different from LC_COLLATE, so it is more convenient to collect these under one concept than to create another infrastructure for setting LC_CTYPE per expression.) Also, a libc collation is tied to a character set encoding (see Section 24.3). The same collation name may exist for different encodings.

icu 提供的对比对象会映射到 ICU 库提供的已命名对比器。ICU 不支持单独的“对比”和“字符类型”设置,因此它们始终相同。此外,ICU 对比与编码无关,所以数据库中给定名称的 ICU 对比始终只有一个。

A collation object provided by icu maps to a named collator provided by the ICU library. ICU does not support separate “collate” and “ctype” settings, so they are always the same. Also, ICU collations are independent of the encoding, so there is always only one ICU collation of a given name in a database.

24.2.2.1. Standard Collations #

在所有平台上,已命名的 defaultCPOSIX 对比均可用。根据操作系统支持也可使用其他对比。default 对比会在数据库创建时选择 LC_COLLATELC_CTYPE 指定的值。CPOSIX 对比都指定了“传统 C”行为,其中只有 ASCII 字母“A”到“Z”被视为字母,排序严格按照字符代码字节值进行。

On all platforms, the collations named default, C, and POSIX are available. Additional collations may be available depending on operating system support. The default collation selects the LC_COLLATE and LC_CTYPE values specified at database creation time. The C and POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “A” through “Z” are treated as letters, and sorting is done strictly by character code byte values.

Note

CPOSIX 区域设置的行为可能会有所不同,具体取决于数据库编码。

The C and POSIX locales may behave differently depending on the database encoding.

此外,还提供了两个 SQL 标准对比名称:

Additionally, two SQL standard collation names are available:

  • unicode

    • This collation sorts using the Unicode Collation Algorithm with the Default Unicode Collation Element Table. It is available in all encodings. ICU support is required to use this collation. (This collation has the same behavior as the ICU root locale; see und-x-icu (for “undefined”).)

  • ucs_basic

    • This collation sorts by Unicode code point. It is only available for encoding UTF8. (This collation has the same behavior as the libc locale specification C in UTF8 encoding.)

24.2.2.2. Predefined Collations #

如果操作系统提供在单个程序(newlocale 和相关函数)内使用多个区域设置的支持,或者如果 ICU 支持已配置,那么在初始化数据库集群时,initdb 将使用操作系统中找到的所有区域设置填充系统目录 pg_collation 中的对比。

If the operating system provides support for using multiple locales within a single program (newlocale and related functions), or if support for ICU is configured, then when a database cluster is initialized, initdb populates the system catalog pg_collation with collations based on all the locales it finds in the operating system at the time.

要检查当前可用的区域设置,请使用查询 SELECT * FROM pg_collation 或 psql 中的命令 \dOS+

To inspect the currently available locales, use the query SELECT * FROM pg_collation, or the command \dOS+ in psql.

例如,操作系统可能提供了名为 de_DE.utf8 的区域设置。initdb 然后将创建针对编码为 UTF8 的对比,名为 de_DE.utf8,且 LC_COLLATELC_CTYPE 均设置为 de_DE.utf8。它还会从名称中去掉 .utf8 标签创建对比。因此,您也可以使用名称 de_DE 中的对比,它书写起来不那么繁琐,并且使名称对编码的依赖性降低。请注意,尽管如此,对比名称的初始集合仍取决于平台。

For example, the operating system might provide a locale named de_DE.utf8. initdb would then create a collation named de_DE.utf8 for encoding UTF8 that has both LC_COLLATE and LC_CTYPE set to de_DE.utf8. It will also create a collation with the .utf8 tag stripped off the name. So you could also use the collation under the name de_DE, which is less cumbersome to write and makes the name less encoding-dependent. Note that, nevertheless, the initial set of collation names is platform-dependent.

libc 提供的默认排序规则集直接映射到操作系统中安装的语言环境,可以使用 locale -a 命令列出这些语言环境。如果需要 libc 排序规则(其为 LC_COLLATELC_CTYPE 提供了不同的值),或者在数据库系统初始化后操作系统中安装了新的语言环境,则可以使用 CREATE COLLATION 命令创建新的排序规则。还可以使用 pg_import_system_collations() 函数一次性导入新的操作系统语言环境。

The default set of collations provided by libc map directly to the locales installed in the operating system, which can be listed using the command locale -a. In case a libc collation is needed that has different values for LC_COLLATE and LC_CTYPE, or if new locales are installed in the operating system after the database system was initialized, then a new collation may be created using the CREATE COLLATION command. New operating system locales can also be imported en masse using the pg_import_system_collations() function.

在任何特定的数据库内,只有使用该数据库编码的对比是相关的。pg_collation 中的其他条目都会被忽略。因此,剥离的对比名称(如 de_DE)可在给定的一个数据库中被视为唯一的,即使它在全局范围内不是唯一的。建议使用剥离的对比名称,因为如果您决定更改成另一种数据库编码,它将使您需要更改的内容减少一个。但是请注意,无论数据库编码如何,都可以使用 defaultCPOSIX 对比。

Within any particular database, only collations that use that database’s encoding are of interest. Other entries in pg_collation are ignored. Thus, a stripped collation name such as de_DE can be considered unique within a given database even though it would not be unique globally. Use of the stripped collation names is recommended, since it will make one fewer thing you need to change if you decide to change to another database encoding. Note however that the default, C, and POSIX collations can be used regardless of the database encoding.

PostgreSQL 认为不同的对比对象不兼容,即使它们具有相同的属性。因此,例如,

PostgreSQL considers distinct collation objects to be incompatible even when they have identical properties. Thus for example,

SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1;

将引发错误,即使 CPOSIX 对比具有相同行为。因此,不建议混合使用剥离的对比名称和未剥离的对比名称。

will draw an error even though the C and POSIX collations have identical behaviors. Mixing stripped and non-stripped collation names is therefore not recommended.

使用 ICU,枚举所有可能的区域设置名称是不明智的。ICU 对区域设置使用特定的命名系统,但为区域设置命名的方式比实际不同的区域设置要多得多。initdb 使用 ICU API 提取一组不同的区域设置,以填充对比的初始集合。ICU 提供的对比在 SQL 环境中创建,名称采用 BCP 47 语言标签格式,并附加“专用”扩展 -x-icu,以将它们与 libc 区域设置区分开来。

With ICU, it is not sensible to enumerate all possible locale names. ICU uses a particular naming system for locales, but there are many more ways to name a locale than there are actually distinct locales. initdb uses the ICU APIs to extract a set of distinct locales to populate the initial set of collations. Collations provided by ICU are created in the SQL environment with names in BCP 47 language tag format, with a “private use” extension -x-icu appended, to distinguish them from libc locales.

以下是一些可能创建的对照示例:

Here are some example collations that might be created:

  • de-x-icu #

    • German collation, default variant

  • de-AT-x-icu #

    • German collation for Austria, default variant

    • (There are also, say, de-DE-x-icu or de-CH-x-icu, but as of this writing, they are equivalent to de-x-icu.)

  • und-x-icu (for “undefined”) #

    • ICU “root” collation. Use this to get a reasonable language-agnostic sort order.

某些(不常使用)的编码不受 ICU 支持。当数据库编码是其中之一时,pg_collation 中的 ICU 排序条目将被忽略。尝试使用该条目会导致类似于 "排序 de-x-icu 对于编码 WIN874 不存在" 的错误。

Some (less frequently used) encodings are not supported by ICU. When the database encoding is one of these, ICU collation entries in pg_collation are ignored. Attempting to use one will draw an error along the lines of “collation "de-x-icu" for encoding "WIN874" does not exist”.

24.2.2.3. Creating New Collation Objects #

如果标准和预定义排序规则不够用,则用户可以使用 SQL 命令 CREATE COLLATION 创建自己的排序规则对象。

If the standard and predefined collations are not sufficient, users can create their own collation objects using the SQL command CREATE COLLATION.

标准和预定义的排序位于 pg_catalog 架构中,如同所有预定义对象那样。用户定义的排序应在用户架构中创建。这还可确保这些排序由 pg_dump 保存。

The standard and predefined collations are in the schema pg_catalog, like all predefined objects. User-defined collations should be created in user schemas. This also ensures that they are saved by pg_dump.

可以按照如下方式创建新的 libc 排序:

New libc collations can be created like this:

CREATE COLLATION german (provider = libc, locale = 'de_DE');

此命令中 locale 子句可接受的确切值取决于操作系统。在类似 Unix 的系统中,locale -a 会显示一个列表。

The exact values that are acceptable for the locale clause in this command depend on the operating system. On Unix-like systems, the command locale -a will show a list.

由于预定义的 libc 排序规则已经在数据库实例初始化时包含了操作系统中定义的所有排序规则,因此通常不需要手动创建新的排序规则。原因可能是希望使用不同的命名系统(在这种情况下,请参见 Section 24.2.2.3.3 ),或者操作系统已升级为提供新的语言环境定义(在这种情况下,请参见 pg_import_system_collations() )。

Since the predefined libc collations already include all collations defined in the operating system when the database instance is initialized, it is not often necessary to manually create new ones. Reasons might be if a different naming system is desired (in which case see also Section 24.2.2.3.3) or if the operating system has been upgraded to provide new locale definitions (in which case see also pg_import_system_collations()).

可以按照如下方式创建 ICU 排序:

ICU collations can be created like:

CREATE COLLATION german (provider = icu, locale = 'de-DE');

ICU 语言区域设置指定为 BCP 47 Language Tag,但也可以接受大多数 libc 风格语言区域设置名称。尽可能,lib-c 风格语言区域设置名称将转换为语言标记。

ICU locales are specified as a BCP 47 Language Tag, but can also accept most libc-style locale names. If possible, libc-style locale names are transformed into language tags.

通过在语言标记中包含并列属性,新的 ICU 并列可以广泛自定义并列行为。有关详情和示例,请参见 Section 24.2.3

New ICU collations can customize collation behavior extensively by including collation attributes in the language tag. See Section 24.2.3 for details and examples.

命令 CREATE COLLATION 还可以用于从现有排序规则创建一个新的排序规则,这对于在应用程序中使用与操作系统无关的排序规则名称、创建兼容性名称或在更易读的名称下使用 ICU 提供的排序规则非常有用。例如:

The command CREATE COLLATION can also be used to create a new collation from an existing collation, which can be useful to be able to use operating-system-independent collation names in applications, create compatibility names, or use an ICU-provided collation under a more readable name. For example:

CREATE COLLATION german FROM "de_DE";
CREATE COLLATION french FROM "fr-x-icu";

24.2.2.4. Nondeterministic Collations #

并列要么是 deterministic 要么是 nondeterministic 。确定性的并列使用确定的比较,这意味着只有当比较的字符串由相同的字节序列组成时,它才将字符串视为相同。非确定性比较即使由不同的字节组成,也可能将字符串确定为相同。典型的情况包括不区分大小写的比较、不区分重音的比较以及比较不同 Unicode 标准形式的字符串。由并列提供程序实际实现此类不敏感的比较;这个确定性的标记仅确定是否要使用按字节比较来打破平局。有关此术语的更多信息,请参阅 Unicode Technical Standard 10

A collation is either deterministic or nondeterministic. A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if they consist of the same byte sequence. Nondeterministic comparison may determine strings to be equal even if they consist of different bytes. Typical situations include case-insensitive comparison, accent-insensitive comparison, as well as comparison of strings in different Unicode normal forms. It is up to the collation provider to actually implement such insensitive comparisons; the deterministic flag only determines whether ties are to be broken using bytewise comparison. See also Unicode Technical Standard 10 for more information on the terminology.

要创建非确定性排序,请将属性 deterministic = false 指定为 CREATE COLLATION,例如:

To create a nondeterministic collation, specify the property deterministic = false to CREATE COLLATION, for example:

CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);

此示例将以非确定性方式使用标准 Unicode 排序。尤其是,这将允许正确比较不同规范形式的字符串。更有趣的示例利用上述 ICU 自定义功能。例如:

This example would use the standard Unicode collation in a nondeterministic way. In particular, this would allow strings in different normal forms to be compared correctly. More interesting examples make use of the ICU customization facilities explained above. For example:

CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);

所有标准和预定义的排序都是确定性的,所有用户定义的排序默认也是确定性的。虽然非确定性的排序更加“正确”,尤其是在考虑 Unicode 的全部能力及其许多特殊情况时,它们也有一些缺点。最重要的一点是,使用它们会影响性能。尤其是要注意,B 树无法在使用非确定性排序的索引中使用重复数据消除。此外,某些操作无法与非确定性排序一起使用,例如模式匹配操作。因此,仅应在特别需要时使用它们。

All standard and predefined collations are deterministic, all user-defined collations are deterministic by default. While nondeterministic collations give a more “correct” behavior, especially when considering the full power of Unicode and its many special cases, they also have some drawbacks. Foremost, their use leads to a performance penalty. Note, in particular, that B-tree cannot use deduplication with indexes that use a nondeterministic collation. Also, certain operations are not possible with nondeterministic collations, such as pattern matching operations. Therefore, they should be used only in cases where they are specifically wanted.

Tip

为了处理不同 Unicode 规范形式的文本,也可以使用函数/表达式 normalizeis normalized 来预处理或检查字符串,而不是使用非确定性排序。每种方法都有不同的权衡取舍。

To deal with text in different Unicode normalization forms, it is also an option to use the functions/expressions normalize and is normalized to preprocess or check the strings, instead of using nondeterministic collations. There are different trade-offs for each approach.

24.2.3. ICU Custom Collations #

ICU 允许通过在语言标记中将排序设置作为新排序的一部分来广泛地控制排序行为。这些设置可以修改排序顺序以满足各种需要。例如:

ICU allows extensive control over collation behavior by defining new collations with collation settings as a part of the language tag. These settings can modify the collation order to suit a variety of needs. For instance:

-- ignore differences in accents and case
CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true

-- upper case letters sort before lower case.
CREATE COLLATION upper_first (provider = icu, locale = 'und-u-kf-upper');
SELECT 'B' < 'b' COLLATE upper_first; -- true

-- treat digits numerically and ignore punctuation
CREATE COLLATION num_ignore_punct (provider = icu, deterministic = false, locale = 'und-u-ka-shifted-kn');
SELECT 'id-45' < 'id-123' COLLATE num_ignore_punct; -- true
SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true

Section 24.2.3.2中介绍了许多可用选项,或查看 Section 24.2.3.5 了解更多详情。

Many of the available options are described in Section 24.2.3.2, or see Section 24.2.3.5 for more details.

24.2.3.1. ICU Comparison Levels #

ICU 中两个字符串(并列)的比较由多级进程决定,其中文本特征分为“级别”。每个级别的处理由 collation settings 控制。较高级别对应于更精细的文本特征。

Comparison of two strings (collation) in ICU is determined by a multi-level process, where textual features are grouped into "levels". Treatment of each level is controlled by the collation settings. Higher levels correspond to finer textual features.

Table 24.1 展示了在给定级别确定相等性时哪些文本特征差异被视为显著的。Unicode 字符 U+2063 是一个不可见的分割符,如表中所示,在低于 identic 的所有比较级别中都会忽略它。

Table 24.1 shows which textual feature differences are considered significant when determining equality at the given level. The Unicode character U+2063 is an invisible separator, and as seen in the table, is ignored for at all levels of comparison less than identic.

Table 24.1. ICU Collation Levels

Level

Description

'f' = 'f'

'ab' = U&'a\2063b'

'x-y' = 'x_y'

'g' = 'G'

'n' = 'ñ'

'y' = 'z'

level1

Base Character

true

true

true

true

true

false

level2

Accents

true

true

true

true

false

false

level3

Case/Variants

true

true

true

false

false

false

level4

Punctuation

true

true

false

false

false

false

identic

All

true

false

false

false

false

false

在各等级中,即使在完全规范化关闭的情况下,也会执行基本规范化。例如,'á' 可能由代码点 U&'\0061\0301' 或单个代码点 U&'\00E1' 组成,并且即使在 identic 等级中,这些序列也会被视为相等。要将代码点表示中的任何差异视为不同,请使用 deterministic 设置为 true 创建的排序规则。

At every level, even with full normalization off, basic normalization is performed. For example, 'á' may be composed of the code points U&'\0061\0301' or the single code point U&'\00E1', and those sequences will be considered equal even at the identic level. To treat any difference in code point representation as distinct, use a collation created with deterministic set to true.

CREATE COLLATION level3 (provider = icu, deterministic = false, locale = 'und-u-ka-shifted-ks-level3');
CREATE COLLATION level4 (provider = icu, deterministic = false, locale = 'und-u-ka-shifted-ks-level4');
CREATE COLLATION identic (provider = icu, deterministic = false, locale = 'und-u-ka-shifted-ks-identic');

-- invisible separator ignored at all levels except identic
SELECT 'ab' = U&'a\2063b' COLLATE level4; -- true
SELECT 'ab' = U&'a\2063b' COLLATE identic; -- false

-- punctuation ignored at level3 but not at level 4
SELECT 'x-y' = 'x_y' COLLATE level3; -- true
SELECT 'x-y' = 'x_y' COLLATE level4; -- false

24.2.3.2. Collation Settings for an ICU Locale #

Table 24.2 展示了可用的并列设置,这些设置可以用作语言标记的一部分来自定义并列。

Table 24.2 shows the available collation settings, which can be used as part of a language tag to customize a collation.

Table 24.2. ICU Collation Settings

Key

Values

Default

Description

co

emoji, phonebk, standard, …​

standard

Collation type. See Section 24.2.3.5 for additional options and details.

ka

noignore, shifted

noignore

If set to shifted, causes some characters (e.g. punctuation or space) to be ignored in comparison. Key ks must be set to level3 or lower to take effect. Set key kv to control which character classes are ignored.

kb

true, false

false

Backwards comparison for the level 2 differences. For example, locale und-u-kb sorts 'àe' before 'aé'.

kc

true, false

false

Separates case into a "level 2.5" that falls between accents and other level 3 features.

If set to true and ks is set to level1, will ignore accents but take case into account.

kf

upper, lower, false

false

If set to upper, upper case sorts before lower case. If set to lower, lower case sorts before upper case. If set to false, the sort depends on the rules of the locale.

kn

true, false

false

If set to true, numbers within a string are treated as a single numeric value rather than a sequence of digits. For example, 'id-45' sorts before 'id-123'.

kk

true, false

false

Enable full normalization; may affect performance. Basic normalization is performed even when set to false. Locales for languages that require full normalization typically enable it by default.

Full normalization is important in some cases, such as when multiple accents are applied to a single character. For example, the code point sequences U&'\0065\0323\0302' and U&'\0065\0302\0323' represent an e with circumflex and dot-below accents applied in different orders. With full normalization on, these code point sequences are treated as equal; otherwise they are unequal.

kr

space, punct, symbol, currency, digit, script-id

Set to one or more of the valid values, or any BCP 47 script-id, e.g. latn ("Latin") or grek ("Greek"). Multiple values are separated by "-".

Redefines the ordering of classes of characters; those characters belonging to a class earlier in the list sort before characters belonging to a class later in the list. For instance, the value digit-currency-space (as part of a language tag like und-u-kr-digit-currency-space) sorts punctuation before digits and spaces.

ks

level1, level2, level3, level4, identic

level3

Sensitivity (or "strength") when determining equality, with level1 the least sensitive to differences and identic the most sensitive to differences. See Table 24.1 for details.

kv

space, punct, symbol, currency

punct

Classes of characters ignored during comparison at level 3. Setting to a later value includes earlier values; e.g. symbol also includes punct and space in the characters to be ignored. Key ka must be set to shifted and key ks must be set to level3 or lower to take effect.

默认设置可能取决于语言区域设置。上表并非旨在做到全面。有关其他选项和详细信息,请参阅 Section 24.2.3.5

Defaults may depend on locale. The above table is not meant to be complete. See Section 24.2.3.5 for additional options and details.

Note

对于许多校对设置,您必须使用_false_设置_deterministic_来创建校对设置,才能达到理想效果(见 Section 24.2.2.4)。此外,只有当密钥_ka_设置为_shifted_时,某些设置才能生效(见 Table 24.2)。

For many collation settings, you must create the collation with deterministic set to false for the setting to have the desired effect (see Section 24.2.2.4). Additionally, some settings only take effect when the key ka is set to shifted (see Table 24.2).

24.2.3.3. Collation Settings Examples #

  • CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk'); #

    • German collation with phone book collation type

  • CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji'); #

    • Root collation with Emoji collation type, per Unicode Technical Standard #51

  • CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn'); #

    • Sort Greek letters before Latin ones. (The default is Latin before Greek.)

  • CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper'); #

    • Sort upper-case letters before lower-case letters. (The default is lower-case letters first.)

  • CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn'); #

    • Combines both of the above options.

24.2.3.4. ICU Tailoring Rules #

如果上面显示的并列设置所提供的选项不足以使用,则可以使用调整规则来更改并列元素的顺序,其语法详细介绍在 https://unicode-org.github.io/icu/userguide/collation/customization/ 中。

If the options provided by the collation settings shown above are not sufficient, the order of collation elements can be changed with tailoring rules, whose syntax is detailed at https://unicode-org.github.io/icu/userguide/collation/customization/.

这个小示例创建了一个基于根区域设置的排序规则,并附加了定制规则:

This small example creates a collation based on the root locale with a tailoring rule:

CREATE COLLATION custom (provider = icu, locale = 'und', rules = '&V << w <<< W');

使用此规则,字母“W”按顺序排在字母“V”之后,但被视为次要差异,类似于重音。此类规则包含在某些语言的区域设置定义中。(当然,如果区域设置定义已经包含所需的规则,则不必再明确指定这些规则。)

With this rule, the letter “W” is sorted after “V”, but is treated as a secondary difference similar to an accent. Rules like this are contained in the locale definitions of some languages. (Of course, if a locale definition already contains the desired rules, then they don’t need to be specified again explicitly.)

下面是一个更为复杂的示例。以下语句设置了一个名为 ebcdic 的整理规则,其规则是按照 EBCDIC 编码对 US-ASCII 字符进行排序。

Here is a more complex example. The following statement sets up a collation named ebcdic with rules to sort US-ASCII characters in the order of the EBCDIC encoding.

CREATE COLLATION ebcdic (provider = icu, locale = 'und',
rules = $$
& ' ' < '.' < '<' < '(' < '+' < \|
< '&' < '!' < '$' < '*' < ')' < ';'
< '-' < '/' < ',' < '%' < '_' < '>' < '?'
< '`' < ':' < '#' < '@' < \' < '=' < '"'
<*a-r < '~' <*s-z < '^' < '[' < ']'
< '{' <*A-I < '}' <*J-R < '\' <*S-Z <*0-9
$$);

SELECT c
FROM (VALUES ('a'), ('b'), ('A'), ('B'), ('1'), ('2'), ('!'), ('^')) AS x(c)
ORDER BY c COLLATE ebcdic;
 c
---
 !
 a
 b
 ^
 A
 B
 1
 2

24.2.3.5. External References for ICU #

本部分 ( Section 24.2.3) 仅简要概述了 ICU 行为和语言标记。有关技术细节、其他选项和新行为,请参阅以下文档:

This section (Section 24.2.3) is only a brief overview of ICU behavior and language tags. Refer to the following documents for technical details, additional options, and new behavior: