Postgresql 中文操作指南
73.2. TOAST #
本部分概述了TOAST(超大属性存储技术)。
This section provides an overview of TOAST (The Oversized-Attribute Storage Technique).
PostgreSQL使用固定的页面大小(通常为8 kB),并且不允许元组跨越多个页面。因此,无法直接存储非常大的字段值。为了克服此限制,将大字段值压缩和/或分解为多个物理行。对于用户来说,这是透明的,对大部分后端代码的影响很小。该技术亲切地称为TOAST(或“自切片面包以来最好的东西”)。TOAST基础架构还用于改进对内存中大数据值的处理。
PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or “the best thing since sliced bread”). The TOAST infrastructure is also used to improve handling of large data values in-memory.
只有某些数据类型支持 TOAST——没有必要对无法生成大字段值的数据类型施加开销。为了支持 TOAST,数据类型必须具有可变长度( varlena )表示,通常情况下,任何存储值的第一个四字节单词包含该值的总长度(字节为单位),包括它本身。TOAST 不限制数据类型表示的其余部分。统称为 TOASTed values 的特殊表示通过修改或重新解释这个初始长度单词来工作。因此,支持可 TOAST 数据类型的 C 级函数必须小心它们处理潜在 TOAST 输入值的方式:在 detoasted 之前,输入可能实际上并不由一个四字节长度单词和内容组成。(这通常是通过在对输入值执行任何操作之前调用 PG_DETOAST_DATUM 来完成的,但在某些情况下,更有效的方法是可能的。有关更多详细信息,请参见 Section 38.13.1 。)
Only certain data types support TOAST — there is no need to impose the overhead on data types that cannot produce large field values. To support TOAST, a data type must have a variable-length (varlena) representation, in which, ordinarily, the first four-byte word of any stored value contains the total length of the value in bytes (including itself). TOAST does not constrain the rest of the data type’s representation. The special representations collectively called TOASTed values work by modifying or reinterpreting this initial length word. Therefore, the C-level functions supporting a TOAST-able data type must be careful about how they handle potentially TOASTed input values: an input might not actually consist of a four-byte length word and contents until after it’s been detoasted. (This is normally done by invoking PG_DETOAST_DATUM before doing anything with an input value, but in some cases more efficient approaches are possible. See Section 38.13.1 for more detail.)
TOAST 占用了varlena长度词(大端机器上的高位,小端机器上的低位)的两比特,从而将可TOAST数据类型的任何值的逻辑大小限制为1 GB(230-1字节)。当两个比特都为零时,该值是数据类型的普通非TOAST化值,并且长度词的其余比特以字节为单位给出了数据的总大小(包括长度词)。当最高位或最低位被设置时,该值仅具有单字节头,而不是常规四字节头,并且该字节的其余比特以字节为单位给出了数据的总大小(包括长度字节)。该备用方法支持空间高效地存储小于127字节的值,同时仍允许数据类型根据需要增长到1 GB。具有单字节头的值与任何特定边界没有对齐,而具有四字节头的值与至少四个字节边界对齐;省略对齐填充以提供与短值相比相当显著的额外空间节省。作为特殊情况,如果单字节头的其余比特都为零(对于自包含长度而言这是不可能的),则该值是脱机数据的指针,有多种可能的备用方案,如下所述。此类_TOAST pointer_的类型和大小由存储在数据第二字节中的代码确定。最后,当最高位或最低位清除,但相邻的比特被设置时,数据的原始内容已压缩,并且在使用之前必须对其进行解压缩。在这种情况下,四字节长度词的其余比特给出了压缩数据的总大小,而不是原始数据。注意,脱机数据也可以进行压缩,但varlena头不表示是否已经发生—相反,TOAST指针的内容会告诉这一点。
TOAST usurps two bits of the varlena length word (the high-order bits on big-endian machines, the low-order bits on little-endian machines), thereby limiting the logical size of any value of a TOAST-able data type to 1 GB (230 - 1 bytes). When both bits are zero, the value is an ordinary un-TOASTed value of the data type, and the remaining bits of the length word give the total datum size (including length word) in bytes. When the highest-order or lowest-order bit is set, the value has only a single-byte header instead of the normal four-byte header, and the remaining bits of that byte give the total datum size (including length byte) in bytes. This alternative supports space-efficient storage of values shorter than 127 bytes, while still allowing the data type to grow to 1 GB at need. Values with single-byte headers aren’t aligned on any particular boundary, whereas values with four-byte headers are aligned on at least a four-byte boundary; this omission of alignment padding provides additional space savings that is significant compared to short values. As a special case, if the remaining bits of a single-byte header are all zero (which would be impossible for a self-inclusive length), the value is a pointer to out-of-line data, with several possible alternatives as described below. The type and size of such a TOAST pointer are determined by a code stored in the second byte of the datum. Lastly, when the highest-order or lowest-order bit is clear but the adjacent bit is set, the content of the datum has been compressed and must be decompressed before use. In this case the remaining bits of the four-byte length word give the total size of the compressed datum, not the original data. Note that compression is also possible for out-of-line data but the varlena header does not tell whether it has occurred — the content of the TOAST pointer tells that, instead.
对于每一列,可以通过在_CREATE TABLE_或_ALTER TABLE_设置_COMPRESSION_列选项来选择用于行内或行外压缩数据的压缩技术。对于没有明确设置的列,其默认设置是在插入数据时咨询 default_toast_compression参数。
The compression technique used for either in-line or out-of-line compressed data can be selected for each column by setting the COMPRESSION column option in CREATE TABLE or ALTER TABLE. The default for columns with no explicit setting is to consult the default_toast_compression parameter at the time data is inserted.
如前所述,有多种类型的 TOAST 指针数据。最古老和最常见类型是存储在 TOAST table 中的离线数据,它与包含 TOAST 指针数据本身的表是分开的,但与之相关联。当要存储在磁盘上的元组太大而无法按原样进行存储时,TOAST 管理代码(在 access/common/toast_internals.c 中)会创建这些 on-disk 指针数据。更多详细信息显示在 Section 73.2.1 中。或者,TOAST 指针数据可以包含对存储在内存中其他位置的离线数据的指针。这些数据本质上是短时生存的,而且永远不会出现在磁盘上,但它们对于避免复制和冗余处理大数据值非常有用。更多详细信息显示在 Section 73.2.2 中。
As mentioned, there are multiple types of TOAST pointer datums. The oldest and most common type is a pointer to out-of-line data stored in a TOAST table that is separate from, but associated with, the table containing the TOAST pointer datum itself. These on-disk pointer datums are created by the TOAST management code (in access/common/toast_internals.c) when a tuple to be stored on disk is too large to be stored as-is. Further details appear in Section 73.2.1. Alternatively, a TOAST pointer datum can contain a pointer to out-of-line data that appears elsewhere in memory. Such datums are necessarily short-lived, and will never appear on-disk, but they are very useful for avoiding copying and redundant processing of large data values. Further details appear in Section 73.2.2.
73.2.1. Out-of-Line, On-Disk TOAST Storage #
如果表的任何列都可以TOAST化,则该表将具有关联的TOAST表,其OID存储在表的_pg_class__reltoastrelid_项中。脱机TOAST化值保存在TOAST表中,如下面更详细地所述。
If any of the columns of a table are TOAST-able, the table will have an associated TOAST table, whose OID is stored in the table’s pg_class.reltoastrelid entry. On-disk TOASTed values are kept in the TOAST table, as described in more detail below.
脱机值在(如果使用的话在压缩之后)被分成最多_TOAST_MAX_CHUNK_SIZE_字节的块(默认情况下选择此值,以便四个块行可以放入一个页面,使其大约为2000字节)。每个块都作为属于拥有者表中的TOAST表中的单独行进行存储。每个TOAST表都具有列_chunk_id_(用于识别特定TOAST化值OID)、chunk_seq(用于块值内的序列号)和_chunk_data_(块的实际数据)。chunk_id_和_chunk_seq_上的唯一索引提供了对值的快速检索。因此,表示脱机磁盘TOAST化值的指针数据需要存储要查找的TOAST表的OID和特定值(其_chunk_id)的OID。为了方便,指针数据还存储逻辑数据大小(原始未压缩数据长度)、物理存储大小(如果应用压缩则不同)以及使用的压缩方法(如果有)。考虑到varlena头字节,脱机TOAST指针数据的总大小因此为18字节,与所表示值的实际大小无关。
Out-of-line values are divided (after compression if used) into chunks of at most TOAST_MAX_CHUNK_SIZE bytes (by default this value is chosen so that four chunk rows will fit on a page, making it about 2000 bytes). Each chunk is stored as a separate row in the TOAST table belonging to the owning table. Every TOAST table has the columns chunk_id (an OID identifying the particular TOASTed value), chunk_seq (a sequence number for the chunk within its value), and chunk_data (the actual data of the chunk). A unique index on chunk_id and chunk_seq provides fast retrieval of the values. A pointer datum representing an out-of-line on-disk TOASTed value therefore needs to store the OID of the TOAST table in which to look and the OID of the specific value (its chunk_id). For convenience, pointer datums also store the logical datum size (original uncompressed data length), physical stored size (different if compression was applied), and the compression method used, if any. Allowing for the varlena header bytes, the total size of an on-disk TOAST pointer datum is therefore 18 bytes regardless of the actual size of the represented value.
只有当要存储在表中的行值宽度超过_TOAST_TUPLE_THRESHOLD_字节(通常为2 kB)时,才会触发TOAST管理代码。TOAST代码将压缩和/或移动字段值,使其脱机,直到行值小于_TOAST_TUPLE_TARGET_字节(通常也为2 kB,可调整)或无法再获得更多收益。在UPDATE操作期间,未更改字段的值通常保持原样;因此,如果没有任何脱机值更改,则对具有脱机值的行的UPDATE不会产生TOAST成本。
The TOAST management code is triggered only when a row value to be stored in a table is wider than TOAST_TUPLE_THRESHOLD bytes (normally 2 kB). The TOAST code will compress and/or move field values out-of-line until the row value is shorter than TOAST_TUPLE_TARGET bytes (also normally 2 kB, adjustable) or no more gains can be had. During an UPDATE operation, values of unchanged fields are normally preserved as-is; so an UPDATE of a row with out-of-line values incurs no TOAST costs if none of the out-of-line values change.
TOAST管理代码识别了四种将可TOAST化的列存储在磁盘上的不同策略:
The TOAST management code recognizes four different strategies for storing TOAST-able columns on disk:
每个可 TOAST 数据类型都为该数据类型列指定一个默认策略,但给定表列的策略可以使用 ALTER TABLE … SET STORAGE 更改。
Each TOAST-able data type specifies a default strategy for columns of that data type, but the strategy for a given table column can be altered with ALTER TABLE … SET STORAGE.
可以 ALTER TABLE … SET (toast_tuple_target = N) 的 TOAST_TUPLE_TARGET 为每个表进行调整
TOAST_TUPLE_TARGET can be adjusted for each table using ALTER TABLE … SET (toast_tuple_target = N)
与更直接的方法相比,此方案有许多优势,例如允许行值跨页面。假设通常通过与相对较小的键值进行比较来限定查询,执行者的大部分工作将使用主行条目来完成。只有在结果集发送给客户端时才提取 TOAST 化属性的大值(如果选中了该值)。因此,主表更小,与没有离线存储的情况相比,其更多行适合共享缓冲区缓存。排序组也会缩小,并且更多排序将完全在内存中进行。一个小测试表明,包含典型 HTML 页面及其 URL 的表存储在原始数据大小的一半左右,其中包括 TOAST 表,并且主表仅包含大约 10% 的整个数据(URL 和一些小型 HTML 页面)。与未 TOAST 化的比较表相比,运行时间没有差异,其中所有 HTML 页面都被缩减到 7 kB 以适应。
This scheme has a number of advantages compared to a more straightforward approach such as allowing row values to span pages. Assuming that queries are usually qualified by comparisons against relatively small key values, most of the work of the executor will be done using the main row entry. The big values of TOASTed attributes will only be pulled out (if selected at all) at the time the result set is sent to the client. Thus, the main table is much smaller and more of its rows fit in the shared buffer cache than would be the case without any out-of-line storage. Sort sets shrink also, and sorts will more often be done entirely in memory. A little test showed that a table containing typical HTML pages and their URLs was stored in about half of the raw data size including the TOAST table, and that the main table contained only about 10% of the entire data (the URLs and some small HTML pages). There was no run time difference compared to an un-TOASTed comparison table, in which all the HTML pages were cut down to 7 kB to fit.
73.2.2. Out-of-Line, In-Memory TOAST Storage #
TOAST 指针可以指向不在磁盘上而是当前服务器进程内存中的数据。显然此类指针不能长期存在,但它们仍然很有用。目前有两个子情况:指向 indirect 数据的指针和指向 expanded 数据的指针。
TOAST pointers can point to data that is not on disk, but is elsewhere in the memory of the current server process. Such pointers obviously cannot be long-lived, but they are nonetheless useful. There are currently two sub-cases: pointers to indirect data and pointers to expanded data.
间接 TOAST 指针只指向存储在内存中某个位置的非间接 varlena 值。此情况最初只是作为概念证明而创建的,但它目前用于逻辑解码中,以避免可能需要创建超过 1 GB 的物理元组(如将所有离线字段值提取到元组中将要执行的操作)。此情况的用途有限,因为指针数据的创建者完全负责在指针可能存在的时间内让引用的数据存活,并且没有基础设施来帮助完成此操作。
Indirect TOAST pointers simply point at a non-indirect varlena value stored somewhere in memory. This case was originally created merely as a proof of concept, but it is currently used during logical decoding to avoid possibly having to create physical tuples exceeding 1 GB (as pulling all out-of-line field values into the tuple might do). The case is of limited use since the creator of the pointer datum is entirely responsible that the referenced data survives for as long as the pointer could exist, and there is no infrastructure to help with this.
扩展 TOAST 指针适用于其磁盘表示不特别适合计算目的的复杂数据类型。作为一个示例,PostgreSQL 数组的标准 varlena 表示包括维数信息、如果有任何空元素则还有空位图,然后按顺序给出了所有元素的值。当元素类型本身是可变长度时,查找第 N 个元素的唯一方法是扫描所有前面元素。此表示与磁盘存储匹配,因为它非常紧凑,但对于数组计算,最好有一个“扩展”或“解构”表示,其中已标识了所有元素的起始位置。TOAST 指针机制通过允许传递引用 Datum 指向标准 varlena 值(磁盘表示)或指向内存中某个位置的扩展表示的 TOAST 指针来满足此需求。此扩展表示的详细信息取决于数据类型,但它必须具有标准标题并满足 src/include/utils/expandeddatum.h 中给出的其他 API 要求。对数据类型进行操作的 C 级函数可以选择处理任一表示。不知道扩展表示但只将 PG_DETOAST_DATUM 应用于其输入的函数将自动接收到传统的 varlena 表示;因此,对扩展表示的支持可以一次一个函数地逐步引入。
Expanded TOAST pointers are useful for complex data types whose on-disk representation is not especially suited for computational purposes. As an example, the standard varlena representation of a PostgreSQL array includes dimensionality information, a nulls bitmap if there are any null elements, then the values of all the elements in order. When the element type itself is variable-length, the only way to find the N'th element is to scan through all the preceding elements. This representation is appropriate for on-disk storage because of its compactness, but for computations with the array it’s much nicer to have an “expanded” or “deconstructed” representation in which all the element starting locations have been identified. The TOAST pointer mechanism supports this need by allowing a pass-by-reference Datum to point to either a standard varlena value (the on-disk representation) or a TOAST pointer that points to an expanded representation somewhere in memory. The details of this expanded representation are up to the data type, though it must have a standard header and meet the other API requirements given in src/include/utils/expandeddatum.h. C-level functions working with the data type can choose to handle either representation. Functions that do not know about the expanded representation, but simply apply PG_DETOAST_DATUM to their inputs, will automatically receive the traditional varlena representation; so support for an expanded representation can be introduced incrementally, one function at a time.
指向扩展值的 TOAST 指针进一步分解为 read-write 和 read-only 指针。指向的表示在两种情况下都是相同的,但接收读写指针的函数可以在原位修改引用的值,而接收只读指针的函数不能这样做;如果它想制作该值的修改版本,则它必须首先创建副本。这种区别和一些相关的约定可以在查询执行期间避免对扩展值进行不必要的复制。
TOAST pointers to expanded values are further broken down into read-write and read-only pointers. The pointed-to representation is the same either way, but a function that receives a read-write pointer is allowed to modify the referenced value in-place, whereas one that receives a read-only pointer must not; it must first create a copy if it wants to make a modified version of the value. This distinction and some associated conventions make it possible to avoid unnecessary copying of expanded values during query execution.
对于所有类型的内存中 TOAST 指针,TOAST 管理代码确保没有此类指针数据会意外存储在磁盘上。在存储之前,内存中 TOAST 指针会自动扩展为标准行内 varlena 值,如果包含元组会变得太大,则可能会将其转换为磁盘上 TOAST 指针。
For all types of in-memory TOAST pointer, the TOAST management code ensures that no such pointer datum can accidentally get stored on disk. In-memory TOAST pointers are automatically expanded to normal in-line varlena values before storage — and then possibly converted to on-disk TOAST pointers, if the containing tuple would otherwise be too big.