Postgresql 中文操作指南
73.6. Database Page Layout #
本节概述了 PostgreSQL 表和索引中使用的页面格式。 [17 ] 序列和 TOAST 表的格式与普通表相同。
This section provides an overview of the page format used within PostgreSQL tables and indexes.[17] Sequences and TOAST tables are formatted just like a regular table.
在以下说明中,假定一个 byte 包含 8 位。此外,术语 item 指存储在页面上的单独数据值。在表中,项目是一行;在索引中,项目是一个索引条目。
In the following explanation, a byte is assumed to contain 8 bits. In addition, the term item refers to an individual data value that is stored on a page. In a table, an item is a row; in an index, an item is an index entry.
每个表和索引都存储为一个 pages 数组,该数组具有一个固定的尺寸(通常为 8 kB,尽管在编译服务器时可以选择不同的页面尺寸)。在表中,所有页面在逻辑上都是等效的,因此可以将特定项目(行)存储在任何页面中。在索引中,第一个页面通常保留为 metapage,以保存控制信息,并且索引中可以有不同类型的页面,这取决于索引访问方法。
Every table and index is stored as an array of pages of a fixed size (usually 8 kB, although a different page size can be selected when compiling the server). In a table, all the pages are logically equivalent, so a particular item (row) can be stored in any page. In indexes, the first page is generally reserved as a metapage holding control information, and there can be different types of pages within the index, depending on the index access method.
Table 73.2显示了页面的整体布局。每个页面有五个部分。
Table 73.2 shows the overall layout of a page. There are five parts to each page.
Table 73.2. Overall Page Layout
Item |
Description |
PageHeaderData |
24 bytes long. Contains general information about the page, including free space pointers. |
ItemIdData |
Array of item identifiers pointing to the actual items. Each entry is an (offset,length) pair. 4 bytes per item. |
Free space |
The unallocated space. New item identifiers are allocated from the start of this area, new items from the end. |
Items |
The actual items themselves. |
Special space |
Index access method specific data. Different methods store different data. Empty in ordinary tables. |
每个页面的前 24 个字节包含页面头(PageHeaderData)。其格式在 Table 73.3中进行了详细说明。第一个字段跟踪与该页面相关的最新 WAL 条目。第二个字段在启用 data checksums时包含页面校验和。接下来是一个包含标志位的 2 字节字段。后面跟着三个 2 字节整数字段(pd_lower、pd_upper_和 _pd_special)。这些字段包含从页面开始到未分配空间开始、到未分配空间结束以及到特殊空间开始的字节偏移量。页面头接下来的 2 个字节,pd_pagesize_version,存储页面大小和版本指示符。从 PostgreSQL 8.3 开始,版本号为 4;PostgreSQL 8.1 和 8.2 使用版本号 3;PostgreSQL 8.0 使用版本号 2;PostgreSQL 7.3 和 7.4 使用版本号 1;早期版本使用版本号 0。(在这些版本中,基本页面布局和头格式没有发生改变,但是堆行头的布局发生了改变。)页面大小基本上只是作为一个交叉检查存在;在一个安装中,不支持拥有多个页面大小。最后一个字段是一个提示,表示整理页面是否有利可图:它跟踪页面上最旧的未整理 XMAX。
The first 24 bytes of each page consists of a page header (PageHeaderData). Its format is detailed in Table 73.3. The first field tracks the most recent WAL entry related to this page. The second field contains the page checksum if data checksums are enabled. Next is a 2-byte field containing flag bits. This is followed by three 2-byte integer fields (pd_lower, pd_upper, and pd_special). These contain byte offsets from the page start to the start of unallocated space, to the end of unallocated space, and to the start of the special space. The next 2 bytes of the page header, pd_pagesize_version, store both the page size and a version indicator. Beginning with PostgreSQL 8.3 the version number is 4; PostgreSQL 8.1 and 8.2 used version number 3; PostgreSQL 8.0 used version number 2; PostgreSQL 7.3 and 7.4 used version number 1; prior releases used version number 0. (The basic page layout and header format has not changed in most of these versions, but the layout of heap row headers has.) The page size is basically only present as a cross-check; there is no support for having more than one page size in an installation. The last field is a hint that shows whether pruning the page is likely to be profitable: it tracks the oldest un-pruned XMAX on the page.
Table 73.3. PageHeaderData Layout
Field |
Type |
Length |
Description |
pd_lsn |
PageXLogRecPtr |
8 bytes |
LSN: next byte after last byte of WAL record for last change to this page |
pd_checksum |
uint16 |
2 bytes |
Page checksum |
pd_flags |
uint16 |
2 bytes |
Flag bits |
pd_lower |
LocationIndex |
2 bytes |
Offset to start of free space |
pd_upper |
LocationIndex |
2 bytes |
Offset to end of free space |
pd_special |
LocationIndex |
2 bytes |
Offset to start of special space |
pd_pagesize_version |
uint16 |
2 bytes |
Page size and layout version number information |
pd_prune_xid |
TransactionId |
4 bytes |
Oldest unpruned XMAX on page, or zero if none |
所有详细信息都可以在 src/include/storage/bufpage.h 中找到。
All the details can be found in src/include/storage/bufpage.h.
在页面头后是项目标识符 (ItemIdData),每个标识符需要四个字节。项目标识符包含指向项目开始的字节偏移、其字节长度,以及影响其解释的几个属性位。根据需要从未分配空间的开始分配新的项目标识符。可以通过查看 pd_lower 来确定存在项目标识符的数量,该数量增加以分配一个新的标识符。在释放项目标识符之前永远不会对其移动,因此它的索引可以在长期基础上用于引用项目,即使项目本身在页面上移动以压缩空闲空间也是如此。实际上,PostgreSQL 创建的指向项目 (ItemPointer,也称为 CTID) 的每个指针都包括一个页码和一个项目标识符的索引。
Following the page header are item identifiers (ItemIdData), each requiring four bytes. An item identifier contains a byte-offset to the start of an item, its length in bytes, and a few attribute bits which affect its interpretation. New item identifiers are allocated as needed from the beginning of the unallocated space. The number of item identifiers present can be determined by looking at pd_lower, which is increased to allocate a new identifier. Because an item identifier is never moved until it is freed, its index can be used on a long-term basis to reference an item, even when the item itself is moved around on the page to compact free space. In fact, every pointer to an item (ItemPointer, also known as CTID) created by PostgreSQL consists of a page number and the index of an item identifier.
项目本身存储在从未分配空间的末端向后分配的空间中。确切的结构会根据表要包含的内容而异。表和序列都使用下面描述的结构 HeapTupleHeaderData。
The items themselves are stored in space allocated backwards from the end of unallocated space. The exact structure varies depending on what the table is to contain. Tables and sequences both use a structure named HeapTupleHeaderData, described below.
最后一节是“特殊节”,其中可以包含访问方法希望存储的任何内容。例如,b 树索引会存储到页面的左兄弟和右兄弟的链接,以及与索引结构相关的一些其他数据。常规表根本不使用特殊节(通过设置 pd_special 等于页面大小来指示)。
The final section is the “special section” which can contain anything the access method wishes to store. For example, b-tree indexes store links to the page’s left and right siblings, as well as some other data relevant to the index structure. Ordinary tables do not use a special section at all (indicated by setting pd_special to equal the page size).
Figure 73.1说明了这些部分如何在页面中布置。
Figure 73.1 illustrates how these parts are laid out in a page.
Figure 73.1. Page Layout
73.6.1. Table Row Layout #
所有的表行以相同的方式进行结构化。有一个固定大小的表头(在大多数机器上占用 23 个字节),后边是可选的空位图、可选的对象 ID 字段以及用户数据。详细介绍了表头 Table 73.4 中的内容。实际的用户数据(行的列)从 t_hoff 指示的偏移处开始,该偏移处必须始终是平台的 MAXALIGN 距离的倍数。仅当在 t_infomask 中设置 HEAP_HASNULL 位时,才会出现空位图。如果存在空位图,它会紧跟固定表头之后并占用足以有每列数据对应一个位(即,等于 t_infomask2 中属性数目的位数)的字节。在此位列表中:1 位表示非空,0 位表示空。当不存在位图时,假定所有列是非空的。仅当在 t_infomask 中设置 HEAP_HASOID_OLD 位时,才会出现对象 ID。如果存在对象 ID,它会紧挨 t_hoff 边界之前出现。为使 t_hoff 成为 MAXALIGN 倍数所需的任何填充内容将显示在空位图和对象 ID 之间。(反过来,这将确保对象 ID 的对齐方式合适。)
All table rows are structured in the same way. There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap, an optional object ID field, and the user data. The header is detailed in Table 73.4. The actual user data (columns of the row) begins at the offset indicated by t_hoff, which must always be a multiple of the MAXALIGN distance for the platform. The null bitmap is only present if the HEAP_HASNULL bit is set in t_infomask. If it is present it begins just after the fixed header and occupies enough bytes to have one bit per data column (that is, the number of bits that equals the attribute count in t_infomask2). In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null. The object ID is only present if the HEAP_HASOID_OLD bit is set in t_infomask. If present, it appears just before the t_hoff boundary. Any padding needed to make t_hoff a MAXALIGN multiple will appear between the null bitmap and the object ID. (This in turn ensures that the object ID is suitably aligned.)
Table 73.4. HeapTupleHeaderData Layout
Field |
Type |
Length |
Description |
t_xmin |
TransactionId |
4 bytes |
insert XID stamp |
t_xmax |
TransactionId |
4 bytes |
delete XID stamp |
t_cid |
CommandId |
4 bytes |
insert and/or delete CID stamp (overlays with t_xvac) |
t_xvac |
TransactionId |
4 bytes |
XID for VACUUM operation moving a row version |
t_ctid |
ItemPointerData |
6 bytes |
current TID of this or newer row version |
t_infomask2 |
uint16 |
2 bytes |
number of attributes, plus various flag bits |
t_infomask |
uint16 |
2 bytes |
various flag bits |
t_hoff |
uint8 |
1 byte |
offset to user data |
所有详细内容都可以在 src/include/access/htup_details.h 中找到。
All the details can be found in src/include/access/htup_details.h.
解释实际数据只能使用从其他表中获得的信息进行,主要为 pg_attribute。识别字段位置所需要的关键值是 attlen 和 attalign。除了只有固定宽度字段且没有任何空值的情况下以外,没有办法直接获得特定属性。所有这些技巧都封装在函数 heap_getattr、fastgetattr 和 heap_getsysattr 中。
Interpreting the actual data can only be done with information obtained from other tables, mostly pg_attribute. The key values needed to identify field locations are attlen and attalign. There is no way to directly get a particular attribute, except when there are only fixed width fields and no null values. All this trickery is wrapped up in the functions heap_getattr, fastgetattr and heap_getsysattr.
为了读取数据,你需要依次检查每个属性。首先,根据空值位图检查字段是否为空。如果它是空的,则转到下一个字段。然后,确保您具有正确的对齐方式。如果该字段是固定宽度字段,那么所有字节都将简单地放置。如果该字段是可变长度字段(attlen = -1),那么它将更加复杂。所有可变长度数据类型共享通用的标头结构 struct varlena,其中包括存储值的总长度和一些标志位。根据标志,数据可以是内嵌的,也可以是在 TOAST 表中;它也可以被压缩(参见 Section 73.2)。
To read the data you need to examine each attribute in turn. First check whether the field is NULL according to the null bitmap. If it is, go to the next. Then make sure you have the right alignment. If the field is a fixed width field, then all the bytes are simply placed. If it’s a variable length field (attlen = -1) then it’s a bit more complicated. All variable-length data types share the common header structure struct varlena, which includes the total length of the stored value and some flag bits. Depending on the flags, the data can be either inline or in a TOAST table; it might be compressed, too (see Section 73.2).
[17 ] 实际上,表或索引访问方法并不需要使用这种页面格式。 heap 表访问方法始终使用此格式。所有现有的索引方法也使用基本格式,但索引元页面上保存的数据通常不遵循条目布局规则。
[17] Actually, use of this page format is not required for either table or index access methods. The heap table access method always uses this format. All the existing index methods also use the basic format, but the data kept on index metapages usually doesn’t follow the item layout rules.