Postgresql 中文操作指南

38.13. User-Defined Types #

Section 38.2中所述,可以扩展 PostgreSQL 以支持新的数据类型。本节介绍如何定义新基本类型,这些数据类型是在 SQL 语言级别下定义的数据类型。创建新基本类型需要实现函数来使用低级语言(通常为 C)操作该类型。

As described in Section 38.2, PostgreSQL can be extended to support new data types. This section describes how to define new base types, which are data types defined below the level of the SQL language. Creating a new base type requires implementing functions to operate on the type in a low-level language, usually C.

本节中的示例可以在源发行版的 src/tutorial 目录中的 complex.sqlcomplex.c 中找到。有关运行示例的说明,请参阅该目录中的 README 文件。

The examples in this section can be found in complex.sql and complex.c in the src/tutorial directory of the source distribution. See the README file in that directory for instructions about running the examples.

用户定义类型必须始终具有输入和输出函数。这些函数确定类型在字符串中的显示方式(以便用户输入和向用户输出)以及类型如何在内存中组织。输入函数采用以 null 结尾的字符串作为参数,并返回类型的内部(内存中)表示形式。输出函数采用类型的内部表示形式作为参数,并返回以 null 结尾的字符串。如果我们想对类型执行不仅仅是存储该类型,我们必须提供其他函数来实现我们希望对此类型执行的任何操作。

A user-defined type must always have input and output functions. These functions determine how the type appears in strings (for input by the user and output to the user) and how the type is organized in memory. The input function takes a null-terminated character string as its argument and returns the internal (in memory) representation of the type. The output function takes the internal representation of the type as argument and returns a null-terminated character string. If we want to do anything more with the type than merely store it, we must provide additional functions to implement whatever operations we’d like to have for the type.

假设我们要定义表示复数的 complex 类型。在内存中表示复数的自然方式是以下 C 结构:

Suppose we want to define a type complex that represents complex numbers. A natural way to represent a complex number in memory would be the following C structure:

typedef struct Complex {
    double      x;
    double      y;
} Complex;

由于此结构太大,无法放入单个 Datum 值,因此我们需要将其变成按引用传递的类型。

We will need to make this a pass-by-reference type, since it’s too large to fit into a single Datum value.

作为该类型的外部字符串表示形式,我们选择 (x,y) 形式的字符串。

As the external string representation of the type, we choose a string of the form (x,y).

输入和输出函数通常不难写,尤其是输出函数。但是,在定义类型的外部字符串表示形式时,请记住您最终必须为该表示形式编写一个完整且健壮的解析器作为您的输入函数。例如:

The input and output functions are usually not hard to write, especially the output function. But when defining the external string representation of the type, remember that you must eventually write a complete and robust parser for that representation as your input function. For instance:

PG_FUNCTION_INFO_V1(complex_in);

Datum
complex_in(PG_FUNCTION_ARGS)
{
    char       *str = PG_GETARG_CSTRING(0);
    double      x,
                y;
    Complex    *result;

    if (sscanf(str, " ( %lf , %lf )", &x, &y) != 2)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("invalid input syntax for type %s: \"%s\"",
                        "complex", str)));

    result = (Complex *) palloc(sizeof(Complex));
    result->x = x;
    result->y = y;
    PG_RETURN_POINTER(result);
}

输出函数可以简单为:

The output function can simply be:

PG_FUNCTION_INFO_V1(complex_out);

Datum
complex_out(PG_FUNCTION_ARGS)
{
    Complex    *complex = (Complex *) PG_GETARG_POINTER(0);
    char       *result;

    result = psprintf("(%g,%g)", complex->x, complex->y);
    PG_RETURN_CSTRING(result);
}

您应注意将输入和输出函数制作成对方的逆函数。如果不这样做,当您需要将数据转储到文件然后将其读回时,您将遇到严重的问题。包含浮点数时,这是一个特别常见的问题。

You should be careful to make the input and output functions inverses of each other. If you do not, you will have severe problems when you need to dump your data into a file and then read it back in. This is a particularly common problem when floating-point numbers are involved.

用户定义类型可以选择提供二进制输入和输出例程。二进制 I/O 通常比文本 I/O 更快,但可移植性较差。与文本 I/O 一样,您需要准确定义外部二进制表示形式。大多数内置数据类型都尝试提供与机器无关的二进制表示形式。对于 complex,我们将依赖于 float8 类型的二进制 I/O 转换器:

Optionally, a user-defined type can provide binary input and output routines. Binary I/O is normally faster but less portable than textual I/O. As with textual I/O, it is up to you to define exactly what the external binary representation is. Most of the built-in data types try to provide a machine-independent binary representation. For complex, we will piggy-back on the binary I/O converters for type float8:

PG_FUNCTION_INFO_V1(complex_recv);

Datum
complex_recv(PG_FUNCTION_ARGS)
{
    StringInfo  buf = (StringInfo) PG_GETARG_POINTER(0);
    Complex    *result;

    result = (Complex *) palloc(sizeof(Complex));
    result->x = pq_getmsgfloat8(buf);
    result->y = pq_getmsgfloat8(buf);
    PG_RETURN_POINTER(result);
}

PG_FUNCTION_INFO_V1(complex_send);

Datum
complex_send(PG_FUNCTION_ARGS)
{
    Complex    *complex = (Complex *) PG_GETARG_POINTER(0);
    StringInfoData buf;

    pq_begintypsend(&buf);
    pq_sendfloat8(&buf, complex->x);
    pq_sendfloat8(&buf, complex->y);
    PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}

一旦编写了 I/O 函数并将它们编译成共享库,我们可以在 SQL 中定义 complex 类型。首先,我们将其声明为外壳类型:

Once we have written the I/O functions and compiled them into a shared library, we can define the complex type in SQL. First we declare it as a shell type:

CREATE TYPE complex;

这种方法可以用作占位符,允许我们在定义其实例 I/O 函数时引用该类型。现在我们可以定义 I/O 函数:

This serves as a placeholder that allows us to reference the type while defining its I/O functions. Now we can define the I/O functions:

CREATE FUNCTION complex_in(cstring)
    RETURNS complex
    AS 'filename'
    LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION complex_out(complex)
    RETURNS cstring
    AS 'filename'
    LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION complex_recv(internal)
   RETURNS complex
   AS 'filename'
   LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION complex_send(complex)
   RETURNS bytea
   AS 'filename'
   LANGUAGE C IMMUTABLE STRICT;

最后,我们可以提供数据类型的完整定义:

Finally, we can provide the full definition of the data type:

CREATE TYPE complex (
   internallength = 16,
   input = complex_in,
   output = complex_out,
   receive = complex_recv,
   send = complex_send,
   alignment = double
);

当你定义一个新的基本类型时,PostgreSQL 会自动为该类型的数组提供支持。数组类型通常与基本类型同名,并添加了前缀下划线字符 ( _ )。

When you define a new base type, PostgreSQL automatically provides support for arrays of that type. The array type typically has the same name as the base type with the underscore character (_) prepended.

数据类型存在后,我们可以声明额外的函数,以对数据类型提供有用的操作。然后可以在函数之上定义运算符,并且如果需要,可以创建运算符类以支持对数据类型的索引编制。以下各节中将讨论这些额外的层。

Once the data type exists, we can declare additional functions to provide useful operations on the data type. Operators can then be defined atop the functions, and if needed, operator classes can be created to support indexing of the data type. These additional layers are discussed in following sections.

如果数据类型的内部表示是可变长的,则内部表示必须遵循可变长数据的标准布局:前四个字节必须是从未直接访问的_char[4]字段(通常命名为_vl_len_)。必须使用_SET_VARSIZE()_宏将数据体的总大小(包括长度字段本身)存储在此字段中,并使用_VARSIZE()_宏检索它。(这些宏存在是因为长度字段可能根据平台进行编码。)

If the internal representation of the data type is variable-length, the internal representation must follow the standard layout for variable-length data: the first four bytes must be a char[4] field which is never accessed directly (customarily named vl_len_). You must use the SET_VARSIZE() macro to store the total size of the datum (including the length field itself) in this field and VARSIZE() to retrieve it. (These macros exist because the length field may be encoded depending on platform.)

有关更多详细信息,请参见 CREATE TYPE 命令的描述。

For further details see the description of the CREATE TYPE command.

38.13.1. TOAST Considerations #

如果数据类型的值(以内部形式)大小各异,通常希望将该数据类型制作成可 TOAST 的(见 Section 73.2)。即使值总是太小而无法压缩或外部存储,也应这样做,因为 TOAST 还可以通过减少头开销节省小数据的空间。

If the values of your data type vary in size (in internal form), it’s usually desirable to make the data type TOAST-able (see Section 73.2). You should do this even if the values are always too small to be compressed or stored externally, because TOAST can save space on small data too, by reducing header overhead.

为支持 TOAST 存储,对数据类型进行操作的 C 函数始终必须小心地使用 PG_DETOAST_DATUM 解包其接收到的所有 TOAST 值。(此详细信息通常通过定义特定于类型的 GETARG_DATATYPE_P 宏来隐藏。)然后,在运行 CREATE TYPE 命令时,将内部长度指定为 variable,并选择 plain 以外的某些合适的存储选项。

To support TOAST storage, the C functions operating on the data type must always be careful to unpack any toasted values they are handed by using PG_DETOAST_DATUM. (This detail is customarily hidden by defining type-specific GETARG_DATATYPE_P macros.) Then, when running the CREATE TYPE command, specify the internal length as variable and select some appropriate storage option other than plain.

如果数据对齐不重要(对于特定函数来说如此,或者因为数据类型无论如何都指定了字节对齐),那么可以避免 PG_DETOAST_DATUM 带来的一些开销。你可以改用 PG_DETOAST_DATUM_PACKED(通常通过定义 GETARG_DATATYPE_PP 宏来隐藏)并使用宏 VARSIZE_ANY_EXHDRVARDATA_ANY 访问潜在的已打包数据集。同样,即使数据类型定义指定了对齐方式,这些宏返回的数据也不会对齐。如果对齐方式很重要,则必须使用常规 PG_DETOAST_DATUM 接口。

If data alignment is unimportant (either just for a specific function or because the data type specifies byte alignment anyway) then it’s possible to avoid some of the overhead of PG_DETOAST_DATUM. You can use PG_DETOAST_DATUM_PACKED instead (customarily hidden by defining a GETARG_DATATYPE_PP macro) and using the macros VARSIZE_ANY_EXHDR and VARDATA_ANY to access a potentially-packed datum. Again, the data returned by these macros is not aligned even if the data type definition specifies an alignment. If the alignment is important you must go through the regular PG_DETOAST_DATUM interface.

Note

旧代码通常将 vl_len__声明为 _int32_字段,而不是 _char[4]。只要结构定义具有至少 _int32_对齐的其他字段,这就可以了。但是,在使用可能未对齐的数据时使用此类结构定义是危险的;编译器可能会认为数据实际上是对齐的,这会导致在对齐方面严格的架构上崩溃。

Older code frequently declares vl_len_ as an int32 field instead of char[4]. This is OK as long as the struct definition has other fields that have at least int32 alignment. But it is dangerous to use such a struct definition when working with a potentially unaligned datum; the compiler may take it as license to assume the datum actually is aligned, leading to core dumps on architectures that are strict about alignment.

TOAST 支持启用的另一个特性是,可能会存在比磁盘上存储的格式更便于使用的一个 expanded 内存中数据表示。常规或“平面”varlena 存储格式最终只是一堆字节;例如,它不能包含指针,因为它可能会被复制到内存中的其他位置。对于复杂的数据类型而言,平面格式可能开销很大,因此 PostgreSQL 提供了一种将平面格式“扩展”为更适合计算的表示形式,然后在数据类型的函数之间以内存方式传递该格式。

Another feature that’s enabled by TOAST support is the possibility of having an expanded in-memory data representation that is more convenient to work with than the format that is stored on disk. The regular or “flat” varlena storage format is ultimately just a blob of bytes; it cannot for example contain pointers, since it may get copied to other locations in memory. For complex data types, the flat format may be quite expensive to work with, so PostgreSQL provides a way to “expand” the flat format into a representation that is more suited to computation, and then pass that format in-memory between functions of the data type.

若要使用扩展存储,数据类型必须定义一个扩展格式,该格式遵循 src/include/utils/expandeddatum.h 中给出的规则,并提供函数以将平面 varlena 值“扩展”为扩展格式,并将扩展格式“扁平化”回常规 varlena 表示。然后确保所有针对该数据类型编写的 C 函数都可以接受任一表示,可以立即在接收时将其中一个转换为另一个来实现。这不要求立即修复数据类型的所有现有函数,因为标准 PG_DETOAST_DATUM 宏被定义为将扩展输入转换为常规平面格式。因此,使用平面 varlena 格式的现有函数将继续使用扩展输入,尽管效率稍低;无需转换它们,直到(除非)更高的性能变得重要。

To use expanded storage, a data type must define an expanded format that follows the rules given in src/include/utils/expandeddatum.h, and provide functions to “expand” a flat varlena value into expanded format and “flatten” the expanded format back to the regular varlena representation. Then ensure that all C functions for the data type can accept either representation, possibly by converting one into the other immediately upon receipt. This does not require fixing all existing functions for the data type at once, because the standard PG_DETOAST_DATUM macro is defined to convert expanded inputs into regular flat format. Therefore, existing functions that work with the flat varlena format will continue to work, though slightly inefficiently, with expanded inputs; they need not be converted until and unless better performance is important.

通常,能够使用扩展表示的 C 函数可分为两类:只能处理扩展格式的函数以及能够处理扩展或平面 varlena 输入的函数。前者更容易编写,但总体效率可能较低,因为将平面输入转换为扩展格式以供单个函数使用可能会比针对扩展格式执行操作省的开销更大。当只需要处理扩展格式时,可以将平面输入转换为扩展格式的操作隐藏在参数获取宏中,以便函数看起来比使用传统 varlena 输入的函数复杂不了多少。为了处理两种类型的输入,编写一个参数获取函数,该函数将解除外部、短头和压缩 varlena 输入的 TOAST,但不解除扩展输入的 TOAST。可以将这样的函数定义为返回指向平面 varlena 格式和扩展格式的联合体的指针。调用者可以使用 VARATT_IS_EXPANDED_HEADER() 宏来确定他们收到了哪种格式。

C functions that know how to work with an expanded representation typically fall into two categories: those that can only handle expanded format, and those that can handle either expanded or flat varlena inputs. The former are easier to write but may be less efficient overall, because converting a flat input to expanded form for use by a single function may cost more than is saved by operating on the expanded format. When only expanded format need be handled, conversion of flat inputs to expanded form can be hidden inside an argument-fetching macro, so that the function appears no more complex than one working with traditional varlena input. To handle both types of input, write an argument-fetching function that will detoast external, short-header, and compressed varlena inputs, but not expanded inputs. Such a function can be defined as returning a pointer to a union of the flat varlena format and the expanded format. Callers can use the VARATT_IS_EXPANDED_HEADER() macro to determine which format they received.

TOAST 基础架构不仅允许将常规 varlena 值与扩展值区分开来,还可以区分指向扩展值的“读写”和“只读”指针。只需要检查扩展值或只会以安全且语义不可见的方式更改该值的 C 函数无需关心它们收到哪种类型的指针。生成输入值的修改版本并收到读写指针时,允许修改该输入值的扩展值,但在收到只读指针时不得修改该输入值;在那种情况下,它们必须先复制该值,生成一个新的要修改的值。构造了一个新扩展值的 C 函数始终应为其返回一个读写指针。此外,修改读写扩展值的 C 函数在中途失败时应注意使该值保持在正确的状态。

The TOAST infrastructure not only allows regular varlena values to be distinguished from expanded values, but also distinguishes “read-write” and “read-only” pointers to expanded values. C functions that only need to examine an expanded value, or will only change it in safe and non-semantically-visible ways, need not care which type of pointer they receive. C functions that produce a modified version of an input value are allowed to modify an expanded input value in-place if they receive a read-write pointer, but must not modify the input if they receive a read-only pointer; in that case they have to copy the value first, producing a new value to modify. A C function that has constructed a new expanded value should always return a read-write pointer to it. Also, a C function that is modifying a read-write expanded value in-place should take care to leave the value in a sane state if it fails partway through.

有关处理扩展值的示例,请参阅标准数组基础架构,特别是 src/backend/utils/adt/array_expanded.c

For examples of working with expanded values, see the standard array infrastructure, particularly src/backend/utils/adt/array_expanded.c.