Postgresql 中文操作指南
72.1. Overview #
PostgreSQL 包含一个持久化磁盘哈希索引的实现,该实现完全支持崩溃恢复。任何数据类型都可以通过哈希索引进行索引,包括没有明确线性排序的数据类型。哈希索引仅存储所索引数据的哈希值,因此对要索引的数据列的大小没有限制。
PostgreSQL includes an implementation of persistent on-disk hash indexes, which are fully crash recoverable. Any data type can be indexed by a hash index, including data types that do not have a well-defined linear ordering. Hash indexes store only the hash value of the data being indexed, thus there are no restrictions on the size of the data column being indexed.
哈希索引仅支持单列索引,并且不允许唯一性检查。
Hash indexes support only single-column indexes and do not allow uniqueness checking.
哈希索引仅支持 = 运算符,因此指定范围操作的 WHERE 子句将无法利用哈希索引。
Hash indexes support only the = operator, so WHERE clauses that specify range operations will not be able to take advantage of hash indexes.
每个哈希索引元组仅存储 4 字节的哈希值,而不是实际的列值。因此,当对较长的数据项(如 UUID、URL 等)进行索引时,哈希索引可能比 B 树小得多。没有列值也使所有哈希索引扫描都具有损耗。哈希索引可以参与位图索引扫描和反向扫描。
Each hash index tuple stores just the 4-byte hash value, not the actual column value. As a result, hash indexes may be much smaller than B-trees when indexing longer data items such as UUIDs, URLs, etc. The absence of the column value also makes all hash index scans lossy. Hash indexes may take part in bitmap index scans and backward scans.
哈希索引最适合于对大表使用相等性扫描的 SELECT 和 UPDATE 繁重的工作负载。在 B 树索引中,搜索必须向下遍历树,直到找到叶页面。在包含数百万行的数据表中,此类遍历会增加对数据的访问时间。哈希索引中叶页面等效项称为桶页面。相比之下,哈希索引允许直接访问桶页面,从而有可能减少大型表中的索引访问时间。在大于 shared_buffers/RAM 的索引/数据上,这种“逻辑 I/O”的减少会更加明显。
Hash indexes are best optimized for SELECT and UPDATE-heavy workloads that use equality scans on larger tables. In a B-tree index, searches must descend through the tree until the leaf page is found. In tables with millions of rows, this descent can increase access time to data. The equivalent of a leaf page in a hash index is referred to as a bucket page. In contrast, a hash index allows accessing the bucket pages directly, thereby potentially reducing index access time in larger tables. This reduction in "logical I/O" becomes even more pronounced on indexes/data larger than shared_buffers/RAM.
哈希索引已被设计为能应对哈希值的不均匀分布。如果哈希值均匀分布,则直接访问桶页面效果很好。当插入项使桶页面变满时,会将其他溢出页面链接到该特定桶页面,从而局部扩展与哈希值匹配的索引元组的存储空间。在查询期间扫描哈希桶时,我们需要扫描所有溢出页面。因此,对于某些数据,不平衡的哈希索引在所需的块访问数方面实际上可能比 B 树更差。
Hash indexes have been designed to cope with uneven distributions of hash values. Direct access to the bucket pages works well if the hash values are evenly distributed. When inserts mean that the bucket page becomes full, additional overflow pages are chained to that specific bucket page, locally expanding the storage for index tuples that match that hash value. When scanning a hash bucket during queries, we need to scan through all of the overflow pages. Thus an unbalanced hash index might actually be worse than a B-tree in terms of number of block accesses required, for some data.
由于会发生溢出情况,我们可以说哈希索引最适合唯一、几乎唯一的数据或每哈希桶中行数较少的数据。一种避免问题的可能方法是使用部分索引条件从索引中排除高度非唯一的值,但这在很多情况下可能不合适。
As a result of the overflow cases, we can say that hash indexes are most suitable for unique, nearly unique data or data with a low number of rows per hash bucket. One possible way to avoid problems is to exclude highly non-unique values from the index using a partial index condition, but this may not be suitable in many cases.
与 B 树类似,哈希索引执行简单的索引元组删除操作。这是一项延迟维护操作,它会删除已知可以安全删除(其项标识符的 LP_DEAD 位已设置)的索引元组。如果插入操作发现页面上没有可用空间,我们会尝试删除死索引元组以避免创建新的溢出页面。如果页面当时被固定,则无法删除。死索引指针的删除也会在 VACUUM 期间进行。
Like B-Trees, hash indexes perform simple index tuple deletion. This is a deferred maintenance operation that deletes index tuples that are known to be safe to delete (those whose item identifier’s LP_DEAD bit is already set). If an insert finds no space is available on a page we try to avoid creating a new overflow page by attempting to remove dead index tuples. Removal cannot occur if the page is pinned at that time. Deletion of dead index pointers also occurs during VACUUM.
如果可以,VACUUM 还会尝试将索引元组压缩到尽可能少的溢出页面上,从而最大限度地减少溢出链。如果溢出页面变为空,则溢出页面可以回收以在其他桶中重复使用,但我们绝不会将它们返回给操作系统。目前没有办法缩减哈希索引,只能通过使用 REINDEX 重建它。也没有办法减少桶的数量。
If it can, VACUUM will also try to squeeze the index tuples onto as few overflow pages as possible, minimizing the overflow chain. If an overflow page becomes empty, overflow pages can be recycled for reuse in other buckets, though we never return them to the operating system. There is currently no provision to shrink a hash index, other than by rebuilding it with REINDEX. There is no provision for reducing the number of buckets, either.
哈希索引在索引行数增加时,可能扩展存储桶页面的数量。哈希键到存储桶号码的映射被选择,以便可以渐增扩展索引。在新存储桶被添加到索引时,准确地一个已存在的存储桶需要“拆分”,并且其一些元组将根据更新的键到存储桶号码映射被转移到新存储桶。
Hash indexes may expand the number of bucket pages as the number of rows indexed grows. The hash key-to-bucket-number mapping is chosen so that the index can be incrementally expanded. When a new bucket is to be added to the index, exactly one existing bucket will need to be "split", with some of its tuples being transferred to the new bucket according to the updated key-to-bucket-number mapping.
扩展在前景发生,这可能会增加用户插入的执行时间。因此,哈希索引可能不适合行数快速增加的表。
The expansion occurs in the foreground, which could increase execution time for user inserts. Thus, hash indexes may not be suitable for tables with rapidly increasing number of rows.