Postgresql 中文操作指南
73.1. Database File Layout #
本部分从文件和目录的级别描述存储格式。
This section describes the storage format at the level of files and directories.
传统上,数据库群集中使用的配置和数据文件,都存储在群集的数据目录中,该目录通常称为 PGDATA(根据可用来定义该目录的环境变量的名称)。PGDATA 的常见位置是 /var/lib/pgsql/data。同一台机器上可以存在多个群集,由不同的服务器实例管理。
Traditionally, the configuration and data files used by a database cluster are stored together within the cluster’s data directory, commonly referred to as PGDATA (after the name of the environment variable that can be used to define it). A common location for PGDATA is /var/lib/pgsql/data. Multiple clusters, managed by different server instances, can exist on the same machine.
PGDATA_目录包含几个子目录和控制文件,如 Table 73.1中所示。除了这些必需项之外,群集配置文件 _postgresql.conf、_pg_hba.conf_和 _pg_ident.conf_传统上存储在 _PGDATA_中,尽管也有可能将它们放在其他地方。
The PGDATA directory contains several subdirectories and control files, as shown in Table 73.1. In addition to these required items, the cluster configuration files postgresql.conf, pg_hba.conf, and pg_ident.conf are traditionally stored in PGDATA, although it is possible to place them elsewhere.
Table 73.1. Contents of PGDATA
Table 73.1. Contents of PGDATA
Item |
Description |
PG_VERSION |
A file containing the major version number of PostgreSQL |
base |
Subdirectory containing per-database subdirectories |
current_logfiles |
File recording the log file(s) currently written to by the logging collector |
global |
Subdirectory containing cluster-wide tables, such as pg_database |
pg_commit_ts |
Subdirectory containing transaction commit timestamp data |
pg_dynshmem |
Subdirectory containing files used by the dynamic shared memory subsystem |
pg_logical |
Subdirectory containing status data for logical decoding |
pg_multixact |
Subdirectory containing multitransaction status data (used for shared row locks) |
pg_notify |
Subdirectory containing LISTEN/NOTIFY status data |
pg_replslot |
Subdirectory containing replication slot data |
pg_serial |
Subdirectory containing information about committed serializable transactions |
pg_snapshots |
Subdirectory containing exported snapshots |
pg_stat |
Subdirectory containing permanent files for the statistics subsystem |
pg_stat_tmp |
Subdirectory containing temporary files for the statistics subsystem |
pg_subtrans |
Subdirectory containing subtransaction status data |
pg_tblspc |
Subdirectory containing symbolic links to tablespaces |
pg_twophase |
Subdirectory containing state files for prepared transactions |
pg_wal |
Subdirectory containing WAL (Write Ahead Log) files |
pg_xact |
Subdirectory containing transaction commit status data |
postgresql.auto.conf |
A file used for storing configuration parameters that are set by ALTER SYSTEM |
postmaster.opts |
A file recording the command-line options the server was last started with |
postmaster.pid |
A lock file recording the current postmaster process ID (PID), cluster data directory path, postmaster start timestamp, port number, Unix-domain socket directory path (could be empty), first valid listen_address (IP address or *, or empty if not listening on TCP), and shared memory segment ID (this file is not present after server shutdown) |
对于群集中每个数据库,PGDATA_/base_ 内都有一个子目录,其名称以 pg_database 中的数据库的 OID 命名。此子目录是数据库文件的默认位置;具体来说是其系统目录存储在那里。
For each database in the cluster there is a subdirectory within PGDATA_/base_, named after the database’s OID in pg_database. This subdirectory is the default location for the database’s files; in particular, its system catalogs are stored there.
请注意,以下各节介绍了内置 heap table access method和内置 index access methods的行为。由于 PostgreSQL 的可扩展特性,其他访问方法可能有所不同。
Note that the following sections describe the behavior of the builtin heap table access method, and the builtin index access methods. Due to the extensible nature of PostgreSQL, other access methods might work differently.
每个表和索引都存储在一个单独的文件中。对于普通关系,这些文件以表或索引的 filenode 编号命名,可以在 pg_class . relfilenode 中找到该编号。但对于临时关系,文件名采用 t_BBBFFF, where BBB 的形式,其中 t_BBBFFF, where BBB 是创建该文件的后端的后台 ID,而 FFF 是文件节点号。在任何一种情况下,除了主文件(也称为主分支)之外,每个表和索引都有一个 free space map (参见 Section 73.3 ),其中存储了有关关系中可用的可用空间的信息。可用空间映射存储在一个以文件节点号加上后缀 fsm 命名的文件中。表还有一个 visibility map ,存储在具有后缀 vm 的分支中,以跟踪哪些页面已知没有无效元组。可见性映射在 Section 73.4 中作了进一步描述。未记录的表和索引有一个称为初始化分支的第三个分支,该分支存储在一个以后缀 _init 为分叉中(参见 Section 73.5 )。
Each table and index is stored in a separate file. For ordinary relations, these files are named after the table or index’s filenode number, which can be found in pg_class.relfilenode. But for temporary relations, the file name is of the form t_BBB_FFF, where _BBB is the backend ID of the backend which created the file, and FFF is the filenode number. In either case, in addition to the main file (a/k/a main fork), each table and index has a free space map (see Section 73.3), which stores information about free space available in the relation. The free space map is stored in a file named with the filenode number plus the suffix fsm. Tables also have a visibility map, stored in a fork with the suffix vm, to track which pages are known to have no dead tuples. The visibility map is described further in Section 73.4. Unlogged tables and indexes have a third fork, known as the initialization fork, which is stored in a fork with the suffix _init (see Section 73.5).
Caution
请注意,虽然表的 filenode 通常与它的 OID 匹配,但这 not 不一定是这种情况;某些操作(如 TRUNCATE、REINDEX、CLUSTER 和某些形式的 ALTER TABLE)可以改变 filenode 同时保留 OID。不要想当然地认为 filenode 和表 OID 相同。同时,对于某些系统目录(包括 pg_class 本身),pg_class.relfilenode 包含零。这些目录的实际 filenode 号存储在较低级别的数据结构中,并且可以使用 pg_relation_filenode() 函数获取。
Note that while a table’s filenode often matches its OID, this is not necessarily the case; some operations, like TRUNCATE, REINDEX, CLUSTER and some forms of ALTER TABLE, can change the filenode while preserving the OID. Avoid assuming that filenode and table OID are the same. Also, for certain system catalogs including pg_class itself, pg_class.relfilenode contains zero. The actual filenode number of these catalogs is stored in a lower-level data structure, and can be obtained using the pg_relation_filenode() function.
当表或索引超过 1 GB 时,它将被分成 1GB 大小的 segments。第一个区段的文件名与 filenode 相同;后续区段分别命名为 filenode.1、filenode.2 等。这种安排避免了在具有文件大小限制的平台上出现问题。(实际上,1 GB 只是默认区段大小。在构建 PostgreSQL 时,可以使用配置选项 —with-segsize 调整区段大小。)原则上,空闲空间映射和可见性映射 fork 也可能需要多个区段,尽管在实践中不可能出现这种情况。
When a table or index exceeds 1 GB, it is divided into gigabyte-sized segments. The first segment’s file name is the same as the filenode; subsequent segments are named filenode.1, filenode.2, etc. This arrangement avoids problems on platforms that have file size limitations. (Actually, 1 GB is just the default segment size. The segment size can be adjusted using the configuration option —with-segsize when building PostgreSQL.) In principle, free space map and visibility map forks could require multiple segments as well, though this is unlikely to happen in practice.
具有可能会包含较大条目的列的表将具有一个关联的 TOAST 表,该表用于将太大而无法放在表行中的字段值存储在外联机中。pg_class.reltoastrelid 从表(如果存在)链接至其 TOAST 表。有关详细信息,请参阅 Section 73.2。
A table that has columns with potentially large entries will have an associated TOAST table, which is used for out-of-line storage of field values that are too large to keep in the table rows proper. pg_class.reltoastrelid links from a table to its TOAST table, if any. See Section 73.2 for more information.
表和索引的内容在 Section 73.6 中作了进一步讨论。
The contents of tables and indexes are discussed further in Section 73.6.
表空间使场景变得更加复杂。每个用户定义的表空间在 PGDATA/pg_tblspc_ 目录内都有一个符号链接,它指向物理表空间目录(即在表空间的 CREATE TABLESPACE 命令中指定的位置)。此符号链接以表空间的 OID 命名。在物理表空间目录内,有一个子目录,其名称取决于 PostgreSQL 服务器版本,例如 PG_9.0_201008051。(使用此子目录的目的是,以便数据库的后续版本可以使用同样的 CREATE TABLESPACE 位置值而不会发生冲突。)在特定于版本的子目录内,对于表空间中有元素的每个数据库,都有一个以数据库的 OID 命名。表和索引存储在该目录内,使用 filenode 命名方案。pg_default 表空间不是通过 pg_tblspc 访问的,而是对应于 PGDATA/base_。类似地,pg_global 表空间不是通过 pg_tblspc 访问的,而是对应于 PGDATA_/global_。
Tablespaces make the scenario more complicated. Each user-defined tablespace has a symbolic link inside the PGDATA/pg_tblspc_ directory, which points to the physical tablespace directory (i.e., the location specified in the tablespace’s CREATE TABLESPACE command). This symbolic link is named after the tablespace’s OID. Inside the physical tablespace directory there is a subdirectory with a name that depends on the PostgreSQL server version, such as PG_9.0_201008051. (The reason for using this subdirectory is so that successive versions of the database can use the same CREATE TABLESPACE location value without conflicts.) Within the version-specific subdirectory, there is a subdirectory for each database that has elements in the tablespace, named after the database’s OID. Tables and indexes are stored within that directory, using the filenode naming scheme. The pg_default tablespace is not accessed through pg_tblspc, but corresponds to PGDATA/base_. Similarly, the pg_global tablespace is not accessed through pg_tblspc, but corresponds to PGDATA_/global_.
pg_relation_filepath() 函数显示任何关系的完整路径(相对于 PGDATA)。它通常可以用作记住上述许多规则的替代方法。但请记住,此函数仅提供关系主 fork 的第一区段的名称 — 你可能需要附加区段号和/或 fsm、vm 或 _init 以找到与关系相关的所有文件。
The pg_relation_filepath() function shows the entire path (relative to PGDATA) of any relation. It is often useful as a substitute for remembering many of the above rules. But keep in mind that this function just gives the name of the first segment of the main fork of the relation — you may need to append a segment number and/or fsm, vm, or _init to find all the files associated with the relation.
临时文件(用于诸如对超过内存容量的数据进行排序之类的操作)在 PGDATA /base/pgsql_tmp_ 中创建,或者如果为它们指定了 pg_default 以外的表空间,则在表空间目录的 pgsql_tmp 子目录中创建。临时文件的名称采用 pgsql_tmp_PPP .NNN 的形式,其中 _, where _PPP is the PID of the owning backend and NNN 区分该后端的不同临时文件。
Temporary files (for operations such as sorting more data than can fit in memory) are created within PGDATA/base/pgsql_tmp_, or within a pgsql_tmp subdirectory of a tablespace directory if a tablespace other than pg_default is specified for them. The name of a temporary file has the form pgsql_tmp_PPP.NNN, where _PPP is the PID of the owning backend and _NNN distinguishes different temporary files of that backend.