Hibernate Search 中文操作指南

17. Lucene backend

17.1. Basic configuration

Lucene 后端的所有配置属性都是可选的,但默认值未必适合所有人。特别是,您可能想要 set the location of your indexes in the filesystem。

All configuration properties of the Lucene backend are optional, but the defaults might not suit everyone. In particular, you might want to set the location of your indexes in the filesystem.

其他配置属性在本文档的相关部分中提到。您可以在 the Lucene backend configuration properties appendix 中找到可用属性的完整参考。

Other configuration properties are mentioned in the relevant parts of this documentation. You can find a full reference of available properties in the Lucene backend configuration properties appendix.

17.2. Index storage (Directory)

在 Lucene 中负责索引存储的组件是 org.apache.lucene.store.Directory。目录的实现确定了索引的存储位置:在文件系统中、在 JVM 的堆中,…​

The component responsible for index storage in Lucene is the org.apache.lucene.store.Directory. The implementation of the directory determines where the index will be stored: on the filesystem, in the JVM’s heap, …​

默认情况下,Lucene 后端在文件系统中存储索引,位于 JVM 的工作目录中。

By default, the Lucene backend stores the indexes on the filesystem, in the JVM’s working directory.

可以按如下方式配置目录类型:

The type of directory can be configured as follows:

# To configure the defaults for all indexes:
hibernate.search.backend.directory.type = local-filesystem
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.directory.type = local-filesystem

有以下一些目录类型可用:

The following directory types are available:

  1. local-filesystem: Store the index on the local filesystem. See Local filesystem storage for details and configuration options.

  2. local-heap: Store the index in the local JVM heap. Local heap directories and all contained indexes are lost when the JVM shuts down. See Local heap storage for details and configuration options.

17.2.1. Local filesystem storage

local-filesystem 目录类型会将每个索引存储在配置的文件系统目录的一个子目录下。

The local-filesystem directory type will store each index in a subdirectory of a configured filesystem directory.

本地文件系统目录真正设计为仅供一台服务器和一个应用程序使用。

Local filesystem directories really are designed to be local to one server and one application.

特别是,它们不应该在多个 Hibernate Search 实例之间共享。即使网络共享允许共享索引的原始内容,从多个 Hibernate Search 使用相同的索引文件也需要更多内容:非独占锁,从一个节点向另一个节点路由写入请求…​ 这些附加功能在 local-filesystem 目录中根本不可用。

In particular, they should not be shared between multiple Hibernate Search instances. Even if network shares allow sharing the raw content of indexes, using the same index files from multiple Hibernate Search instances would require more than that: non-exclusive locking, routing of write requests from one node to another, …​ These additional features are simply not available on local-filesystem directories.

如果您需要在多个 Hibernate Search 实例之间共享索引,则 Elasticsearch 后端将是更好的选择。有关详细信息,请参阅 Architecture

If you need to share indexes between multiple Hibernate Search instances, the Elasticsearch backend will be a better choice. Refer to Architecture for more information.

Index location

每个索引会在根目录下分配一个子目录。

Each index is assigned a subdirectory under a root directory.

默认情况下,根目录是 JVM 的工作目录。可以按如下方式配置:

By default, the root directory is the JVM’s working directory. It can be configured as follows:

# To configure the defaults for all indexes:
hibernate.search.backend.directory.root = /path/to/my/root
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.directory.root = /path/to/my/root

例如,使用上述配置,名为 Order 的实体类型将被索引到目录 /path/to/my/root/Order/ 中。如果明确地为该实体分配了索引名称 orders(请参见 Entity/index mapping 中的 @Indexed(index = …​)),则会将其索引到目录 /path/to/my/root/orders/ 中。

For example, with the configuration above, an entity type named Order will be indexed in directory /path/to/my/root/Order/. If that entity is explicitly assigned the index name orders (see @Indexed(index = …​) in Entity/index mapping), it will instead be indexed in directory /path/to/my/root/orders/.
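
下面是一个简短的示意,展示如何通过 @Indexed(index = …​) 为实体显式指定索引名称;其中的实体和字段仅用于演示,并非固定写法:

Below is a minimal sketch showing how an explicit index name can be assigned with @Indexed(index = …​); the entity and fields here are purely illustrative assumptions:

@Entity
@Indexed(index = "orders") // Without the "index" attribute, the default index name would be "Order"
public class Order {

    @Id
    private Integer id;

    @FullTextField
    private String description;

    // Getters and setters
    // ...

}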

Filesystem access strategy

根据操作系统和架构自动决定访问文件系统的默认策略。在大多数情况下,它都应当运行良好。

The default strategy for accessing the filesystem is determined automatically based on the operating system and architecture. It should work well in most situations.

在需要其他文件系统访问策略的情况下,Hibernate Search 会公开一个配置属性:

For situations where a different filesystem access strategy is needed, Hibernate Search exposes a configuration property:

# To configure the defaults for all indexes:
hibernate.search.backend.directory.filesystem_access.strategy = auto
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.directory.filesystem_access.strategy = auto

允许的值有:

Allowed values are:

  1. auto: lets Lucene select the most appropriate implementation based on the operating system and architecture. This is the default value for the property.

  2. mmap: uses mmap for reading, and FSDirectory.FSIndexOutput for writing. See org.apache.lucene.store.MMapDirectory.

  3. nio: uses java.nio.channels.FileChannel's positional read for concurrent reading, and FSDirectory.FSIndexOutput for writing. See org.apache.lucene.store.NIOFSDirectory.

在更改此设置之前,请务必参考这些 Directory 实现的 Javadoc。提供更好性能的实现也会带来它们自身的问题。

Make sure to refer to Javadocs of these Directory implementations before changing this setting. Implementations offering better performance also bring issues of their own.

Other configuration options

local-filesystem 目录还允许配置 locking strategy

The local-filesystem directory also allows configuring a locking strategy.

17.2.2. Local heap storage

local-heap 目录类型会将索引存储在本地 JVM 堆中。

The local-heap directory type will store indexes in the local JVM’s heap.

因此,关闭 JVM 时,包含在 local-heap 目录中的索引将丢失。

As a result, indexes contained in a local-heap directory are lost when the JVM shuts down.

仅在测试配置中使用小型索引和低并发的情况下,才提供此目录类型,在这种情况下,它可以稍微提高性能。在需要使用较大索引和/或高并发的情况下, filesystem-based directory 将达到更好的性能。

This directory type is only provided for use in testing configurations with small indexes and low concurrency, where it could slightly improve performance. In setups requiring larger indexes and/or high concurrency, a filesystem-based directory will achieve better performance.

除 locking strategy 之外,local-heap 目录不提供任何特定选项。

The local-heap directory does not offer any specific option beyond the locking strategy.

17.2.3. Locking strategy

为了写入索引,Lucene 需要获取一个锁以确保没有其他应用程序实例同时写入同一索引。每个目录类型都带有默认锁定策略,在大多数情况下都足够好。

In order to write to an index, Lucene needs to acquire a lock to ensure no other application instance writes to the same index concurrently. Each directory type comes with a default locking strategy that should work well enough in most situations.

对于需要其他锁定策略的那些(非常)罕见的情况,Hibernate Search 公开了一个配置属性:

For those (very) rare situations where a different locking strategy is needed, Hibernate Search exposes a configuration property:

# To configure the defaults for all indexes:
hibernate.search.backend.directory.locking.strategy = native-filesystem
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.directory.locking.strategy = native-filesystem

有以下策略可用:

The following strategies are available:

  1. simple-filesystem: Locks the index by creating a marker file and checking it before write operations. This implementation is very simple and based on Java’s File API. If for some reason an application ends abruptly, the marker file will stay on the filesystem and will need to be removed manually.

此策略仅适用于基于文件系统的目录。

This strategy is only available for filesystem-based directories.

参见 org.apache.lucene.store.SimpleFSLockFactory

See org.apache.lucene.store.SimpleFSLockFactory.

  2. native-filesystem: Similarly to simple-filesystem, locks the index by creating a marker file, but using native OS file locks instead of Java’s File API, so that locks will be cleaned up if the application ends abruptly.

这是 local-filesystem 目录类型的默认策略。

This is the default strategy for the local-filesystem directory type.

该实现已知存在 NFS 问题:应该避免在网络共享上使用。

This implementation has known problems with NFS: it should be avoided on network shares.

此策略仅适用于基于文件系统的目录。

This strategy is only available for filesystem-based directories.

参见 org.apache.lucene.store.NativeFSLockFactory

See org.apache.lucene.store.NativeFSLockFactory.

  3. single-instance: Locks using a Java object held in the JVM’s heap. Since the lock is only accessible by the same JVM, this strategy will only work properly when it is known that only a single application will ever try to access the indexes.

这是 local-heap 目录类型的默认策略。

This is the default strategy for the local-heap directory type.

请参阅 org.apache.lucene.store.SingleInstanceLockFactory

See org.apache.lucene.store.SingleInstanceLockFactory.

  4. none: Does not use any lock. Concurrent writes from another application will result in index corruption. Test your application carefully and make sure you know what it means.

请参阅 org.apache.lucene.store.NoLockFactory

See org.apache.lucene.store.NoLockFactory.

17.3. Sharding

17.3.1. Basics

有关分片的初步介绍,包括它在 Hibernate Search 中的工作方式以及它的局限性是什么,请参阅 Sharding and routing

For a preliminary introduction to sharding, including how it works in Hibernate Search and what its limitations are, see Sharding and routing.

在 Lucene 后端中,分片默认处于禁用状态,但可以通过选择分片策略来启用它。有多个策略可用:

In the Lucene backend, sharding is disabled by default, but can be enabled by selecting a sharding strategy. Multiple strategies are available:

hash

# To configure the defaults for all indexes:
hibernate.search.backend.sharding.strategy = hash
hibernate.search.backend.sharding.number_of_shards = 2
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.sharding.strategy = hash
hibernate.search.backend.indexes.<index-name>.sharding.number_of_shards = 2

hash 策略需要通过 number_of_shards 属性设置分片数量。

The hash strategy requires setting the number of shards through the number_of_shards property.

此策略将设置显式配置数量的分片,编号从 0 到所选数量减一(例如,对于 2 个分片,将有分片“0”和分片“1”)。

This strategy will set up an explicitly configured number of shards, numbered from 0 to the chosen number minus one (e.g. for 2 shards, there will be shard "0" and shard "1").

在路由时,将对路由键进行哈希以将其分配给某个分片。如果路由键为 null,则将使用文档 ID。

When routing, the routing key will be hashed to assign it to a shard. If the routing key is null, the document ID will be used instead.

当映射中没有配置明确的路由键(configured in the mapping),或者当路由键的可能值数量很大、需要压缩为较小数量时(例如“所有整数”),此策略适用。

This strategy is suitable when there is no explicit routing key configured in the mapping, or when the routing key has a large number of possible values that need to be brought down to a smaller number (e.g. "all integers").

explicit

# To configure the defaults for all indexes:
hibernate.search.backend.sharding.strategy = explicit
hibernate.search.backend.sharding.shard_identifiers = fr,en,de
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.sharding.strategy = explicit
hibernate.search.backend.indexes.<index-name>.sharding.shard_identifiers = fr,en,de

explicit 策略需要通过 shard_identifiers 属性设置分片标识符列表。标识符必须作为包含以逗号分隔的多个分片标识符的字符串提供,或者作为包含分片标识符的 Collection<String> 提供。分片标识符可以是任何字符串。

The explicit strategy requires setting a list of shard identifiers through the shard_identifiers property. The identifiers must be provided as a String containing multiple shard identifiers separated by commas, or a Collection<String> containing shard identifiers. A shard identifier can be any string.

此策略将为每个配置的分片标识符设置一个分片。

This strategy will set up one shard per configured shard identifier.

在路由时,将验证路由键以确保它与分片标识符完全匹配。如果匹配,则文档将被路由到该分片。如果不匹配,则将引发异常。路由键不能为 null,并且将忽略文档 ID。

When routing, the routing key will be validated to make sure it matches a shard identifier exactly. If it does, the document will be routed to that shard. If it does not, an exception will be thrown. The routing key cannot be null, and the document ID will be ignored.

当存在明确的路由键 configured in the mapping,且该路由键具有在启动应用程序之前已知的有限数量的可能值时,此策略适用。

This strategy is suitable when there is an explicit routing key configured in the mapping, and that routing key has a limited number of possible values that are known before starting the application.

17.3.2. Per-shard configuration

在某些情况下,特别是当使用 explicit 分片策略时,可能需要以稍有不同的方式配置某些分片。例如,其中一个分片可能包含大量但很少访问的数据,这些数据应该存储在不同的驱动器上。

In some cases, in particular when using the explicit sharding strategy, it may be necessary to configure some shards in a slightly different way. For example, one of the shards may hold massive, but infrequently-accessed data, which should be stored on a different drive.

可以通过为特定分片添加配置属性来实现此目的:

This can be achieved by adding configuration properties for a specific shard:

# Default configuration for all shards of an index:
hibernate.search.backend.indexes.<index-name>.directory.root = /path/to/fast/drive/
# Configuration for a specific shard:
hibernate.search.backend.indexes.<index-name>.shards.<shard-identifier>.directory.root = /path/to/large/drive/

并不是所有设置都可以按分片覆盖;例如,不能按每个分片覆盖分片策略。

Not all settings can be overridden per shard; for example you can’t override the sharding strategy on a per-shard basis.

按分片覆盖主要用于与 directory 和 I/O 相关的设置。

Per-shard overriding is primarily intended for settings related to the directory and I/O.

有效的分片标识符取决于分片策略:

Valid shard identifiers depend on the sharding strategy:

  1. For the hash strategy, each shard is assigned a positive integer, from 0 to the chosen number of shards minus one.

  2. For the explicit strategy, each shard is assigned one of the identifiers defined with the shard_identifiers property.

17.4. Index format compatibility

虽然 Hibernate Search 致力于提供向后兼容的 API,让您轻松将应用程序移植到较新版本,但它仍委托 Apache Lucene 处理索引写入和搜索。这会创建对 Lucene 索引格式的依赖关系。当然,Lucene 开发人员会尝试保持稳定的索引格式,但有时无法避免格式更改。在这些情况下,您要么必须重新索引所有数据,要么使用索引升级工具。有时,Lucene 也能够读取旧格式,因此您无需采取具体操作(除了对索引进行备份)。

While Hibernate Search strives to offer a backwards compatible API, making it easy to port your application to newer versions, it still delegates to Apache Lucene to handle the index writing and searching. This creates a dependency on the Lucene index format. The Lucene developers of course attempt to keep a stable index format, but sometimes a change in the format cannot be avoided. In those cases you either have to re-index all your data or use an index upgrade tool. Sometimes, Lucene is also able to read the old format, so you don’t need to take specific actions (besides making a backup of your index).

虽然索引格式不兼容是一种罕见事件,但 Lucene 的分析器实现可能会稍微改变其行为。这可能导致一些文档不再匹配,尽管它们以前曾经匹配。

While an index format incompatibility is a rare event, it can happen more often that Lucene’s Analyzer implementations slightly change their behavior. This can lead to some documents not matching anymore, even though they used to.

为了避免这种分析器不兼容性,Hibernate Search 允许您配置分析器和其他 Lucene 类应遵循其行为的 Lucene 版本。

To avoid this analyzer incompatibility, Hibernate Search allows configuring to which version of Lucene the analyzers and other Lucene classes should conform their behavior.

此配置属性在后端级别设置:

This configuration property is set at the backend level:

hibernate.search.backend.lucene_version = LUCENE_8_1_1

根据您使用的 Lucene 的特定版本,您可能可以使用不同的选项:请参阅 lucene-core.jar 中包含的 org.apache.lucene.util.Version,以获取允许值的列表。

Depending on the specific version of Lucene you’re using, you might have different options available: see org.apache.lucene.util.Version contained in lucene-core.jar for a list of allowed values.

当未设置此选项时,Hibernate Search 将指示 Lucene 使用最新版本,这通常是新项目的最佳选择。不过,建议在配置中明确定义您正在使用的版本,以便在您进行升级时,Lucene 分析器不会改变行为。然后,您可以在以后更新此值,例如当您有机会从头开始重建索引时。

When this option is not set, Hibernate Search will instruct Lucene to use the latest version, which is usually the best option for new projects. Still, it’s recommended to define the version you’re using explicitly in the configuration, so that when you happen to upgrade Lucene, the analyzers will not change their behavior. You can then choose to update this value at a later time, for example when you have the chance to rebuild the index from scratch.

在使用 Hibernate Search API 时,此设置将得到一致的应用,但是如果您也通过绕过 Hibernate Search 使用 Lucene(例如,在您自己实例化一个分析器时),请务必使用相同的值。

The setting will be applied consistently when using Hibernate Search APIs, but if you are also making use of Lucene bypassing Hibernate Search (for example when instantiating an Analyzer yourself), make sure to use the same value.
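
下面是一个简短的示意(并非 Hibernate Search API),假设你在 Hibernate Search 之外自行实例化了一个 StandardAnalyzer,并希望它与上面配置的 lucene_version 保持一致:

Below is a minimal sketch (not a Hibernate Search API), assuming you instantiate a StandardAnalyzer yourself outside Hibernate Search and want it to match the lucene_version configured above:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// ...
Analyzer analyzer = new StandardAnalyzer();
// Mimic the behavior of the Lucene version configured for the backend (LUCENE_8_1_1 in this example)
analyzer.setVersion( Version.LUCENE_8_1_1 );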

有关可以升级到的 Hibernate Search 版本(同时保持与给定版本 Lucene API 的向后兼容性)的信息,请参阅 compatibility policy

For information about which versions of Hibernate Search you can upgrade to, while retaining backward compatibility with a given version of Lucene APIs, refer to the compatibility policy.

17.5. Schema

Lucene 实际上没有集中式模式(schema)的概念来指定每个字段的数据类型和功能,但 Hibernate Search 会在内存中维护这样的模式,以便记住可应用于每个字段的谓词/投影/排序。

Lucene does not really have a concept of centralized schema to specify the data type and capabilities of each field, but Hibernate Search maintains such a schema in memory, in order to remember which predicates/projections/sorts can be applied to each field.

在大多数情况下,模式由 the mapping configured through Hibernate Search’s mapping APIs 推断出来,它们是通用的并独立于 Lucene。

For the most part, the schema is inferred from the mapping configured through Hibernate Search’s mapping APIs, which are generic and independent of Lucene.

本节中说明了特定于 Lucene 后端的方面。

Aspects that are specific to the Lucene backend are explained in this section.

17.5.1. Field types

Available field types

某些类型不受 Lucene 后端直接支持,但仍然可以正常工作,因为它们由映射器“桥接”。例如,实体模型中的 java.util.Date “桥接”到 java.time.Instant,后者受 Lucene 后端支持。有关更多信息,请参阅 Supported property types

Some types are not supported directly by the Lucene backend, but will work anyway because they are "bridged" by the mapper. For example a java.util.Date in your entity model is "bridged" to java.time.Instant, which is supported by the Lucene backend. See Supported property types for more information.

不在此列表中的字段类型仍然可以使用,但需要多做一些工作:

Field types that are not in this list can still be used with a bit more work:

如果实体模型中的属性具有不受支持的类型,但可以转换为受支持的类型,则需要桥接。请参见 Binding and bridges

If a property in the entity model has an unsupported type, but can be converted to a supported type, you will need a bridge. See Binding and bridges.

如果您需要 Hibernate Search 不支持的特定类型的索引字段,您需要一个定义本机字段类型的桥梁。请参阅 Index field type DSL extensions

If you need an index field with a specific type that is not supported by Hibernate Search, you will need a bridge that defines a native field type. See Index field type DSL extensions.

表 11. Lucene 后端支持的字段类型

Table 11. Field types supported by the Lucene backend

Field type | Limitations
java.lang.String | -
java.lang.Byte | -
java.lang.Short | -
java.lang.Integer | -
java.lang.Long | -
java.lang.Double | -
java.lang.Float | -
java.lang.Boolean | -
java.math.BigDecimal | -
java.math.BigInteger | -
java.time.Instant | Lower range/resolution
java.time.LocalDate | Lower range/resolution
java.time.LocalTime | Lower range/resolution
java.time.LocalDateTime | Lower range/resolution
java.time.ZonedDateTime | Lower range/resolution
java.time.OffsetDateTime | Lower range/resolution
java.time.OffsetTime | Lower range/resolution
java.time.Year | Lower range/resolution
java.time.YearMonth | Lower range/resolution
java.time.MonthDay | -
org.hibernate.search.engine.spatial.GeoPoint | Lower resolution

日期/时间字段的范围和解析度

Range and resolution of date/time fields

日期/时间类型不支持 java.time 类型中可表示的全部年份范围:

Date/time types do not support the whole range of years that can be represented in java.time types:

java.time 可以表示从 -999.999.999 到 999.999.999 的年份。

java.time can represent years ranging from -999.999.999 to 999.999.999.

Lucene 后端支持从年份 -292.275.054 到年份 292.278.993 的日期。

The Lucene backend supports dates ranging from year -292.275.054 to year 292.278.993.

超出范围的值会触发索引失败。

Values that are out of range will trigger indexing failures.

时间类型的解析度也较低:

Resolution for time types is also lower:

java.time 支持纳秒精度。

java.time supports nanosecond-resolution.

Lucene 后端支持毫秒级分辨率。

The Lucene backend supports millisecond-resolution.

索引时,毫秒精度以上的精度会丢失。

Precision beyond the millisecond will be lost when indexing.

GeoPoint 字段的范围和解析度

Range and resolution of GeoPoint fields

在 Lucene 后端中,GeoPoint 会以 LatLonPoint 的形式编入索引。根据 LatLonPoint 的 javadoc,在对值编码时会出现精度损失:

GeoPoints are indexed as LatLonPoints in the Lucene backend. According to LatLonPoint's javadoc, there is a loss of precision when the values are encoded:

值以从原始 double 值中损失一些精度的方式编入索引(纬度组件为 4.190951585769653E-8 ,经度组件为 8.381903171539307E-8 )。

Values are indexed with some loss of precision from the original double values (4.190951585769653E-8 for the latitude component and 8.381903171539307E-8 for longitude).

实际上,这意味着在最坏的情况下,索引点可能会偏差约 13 厘米(5.2 英寸)。

This effectively means indexed points can be off by about 13 centimeters (5.2 inches) in the worst case.

Index field type DSL extensions

并非所有 Lucene 字段类型都在 Hibernate Search 中具有内置支持。不过,依然可以通过利用“native”字段类型来使用不受支持的字段类型。使用此字段类型,可以直接创建 Lucene IndexableField 实例,从而可以访问 Lucene 所能提供的一切。

Not all Lucene field types have built-in support in Hibernate Search. Unsupported field types can still be used, however, by taking advantage of the "native" field type. Using this field type, Lucene IndexableField instances can be created directly, giving access to everything Lucene can offer.

以下是如何使用 Lucene “原生”类型的示例。

Below is an example of how to use the Lucene "native" type.

示例 423. 使用 Lucene “本机”类型

Example 423. Using the Lucene "native" type

public class PageRankValueBinder implements ValueBinder { (1)
    @Override
    public void bind(ValueBindingContext<?> context) {
        context.bridge(
                Float.class,
                new PageRankValueBridge(),
                context.typeFactory() (2)
                        .extension( LuceneExtension.get() ) (3)
                        .asNative( (4)
                                Float.class, (5)
                                (absoluteFieldPath, value, collector) -> { (6)
                                    collector.accept( new FeatureField( absoluteFieldPath, "pageRank", value ) );
                                    collector.accept( new StoredField( absoluteFieldPath, value ) );
                                },
                                field -> (Float) field.numericValue() (7)
                        )
        );
    }

    private static class PageRankValueBridge implements ValueBridge<Float, Float> {
        @Override
        public Float toIndexedValue(Float value, ValueBridgeToIndexedValueContext context) {
            return value; (8)
        }

        @Override
        public Float fromIndexedValue(Float value, ValueBridgeFromIndexedValueContext context) {
            return value; (8)
        }
    }
}
@Entity
@Indexed
public class WebPage {

    @Id
    private Integer id;

    @NonStandardField( (1)
            valueBinder = @ValueBinderRef(type = PageRankValueBinder.class) (2)
    )
    private Float pageRank;

    // Getters and setters
    // ...

}

17.5.2. Multi-tenancy

根据在当前会话中定义的租户 ID,多租户功能得到支持并且会以透明的方式处理:

Multi-tenancy is supported and handled transparently, according to the tenant ID defined in the current session:

  1. documents will be indexed with the appropriate values, allowing later filtering;

  2. queries will filter results appropriately.

如果在映射器中启用了多租户,则在后端中会自动启用多租户,例如,如果 a multi-tenancy strategy is selected in Hibernate ORM,或者如果 multi-tenancy is explicitly configured in the Standalone POJO mapper

Multi-tenancy is automatically enabled in the backend if it is enabled in the mapper, e.g. if a multi-tenancy strategy is selected in Hibernate ORM, or if multi-tenancy is explicitly configured in the Standalone POJO mapper.

但是,可以手动启用多租户功能。

However, it is possible to enable multi-tenancy manually.

多租户策略是在后端级别设置的:

The multi-tenancy strategy is set at the backend level:

hibernate.search.backend.multi_tenancy.strategy = none

有关可用策略的详细信息,请参阅以下小节。

See the following subsections for details about available strategies.

none: single-tenancy

none 策略(默认策略)完全禁用多租户功能。

The none strategy (the default) disables multi-tenancy completely.

尝试设置租户 ID 会导致索引编制失败。

Attempting to set a tenant ID will lead to a failure when indexing.

discriminator: multi-tenancy using a discriminator field

使用 discriminator 策略,来自所有租户的所有文档都存储在同一个索引中。

With the discriminator strategy, all documents from all tenants are stored in the same index.

索引编制时,会为每个文档透明地填充一个保存租户 ID 的鉴别器字段。

When indexing, a discriminator field holding the tenant ID is populated transparently for each document.

搜索时,会将针对租户 ID 字段的筛选器透明地添加到搜索查询中,以便仅返回当前租户的搜索结果。

When searching, a filter targeting the tenant ID field is added transparently to the search query to only return search hits for the current tenant.

17.6. Analysis

17.6.1. Basics

Analysis 是由分析器执行的文本处理,包括在索引编制(文档处理)时和在搜索(查询处理)时。

Analysis is the text processing performed by analyzers, both when indexing (document processing) and when searching (query processing).

Lucene 后端会自带一些 default analyzers,但也可以明确配置分析。

The Lucene backend comes with some default analyzers, but analysis can also be configured explicitly.

要在 Lucene 后端中配置分析,你需要:

To configure analysis in a Lucene backend, you will need to:

  • Define a class that implements the org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer interface.

  • Configure the backend to use that implementation by setting the configuration property hibernate.search.backend.analysis.configurer to a bean reference pointing to the implementation, for example class:com.mycompany.MyAnalysisConfigurer.

Hibernate Search 将在启动时调用此实现的 configure 方法,配置器将能够利用 DSL 来定义 analyzers and normalizers,甚至(对于更高级的用法)定义 similarity。请参见下面的示例。

Hibernate Search will call the configure method of this implementation on startup, and the configurer will be able to take advantage of a DSL to define analyzers and normalizers or even (for more advanced use) the similarity. See below for examples.

17.6.2. Built-in analyzers

开箱即用的内置分析器不需要显式配置。如有必要,通过用相同名称定义自己的分析器可以覆盖它们。

Built-in analyzers are available out-of-the-box and don’t require explicit configuration. If necessary, they can be overridden by defining your own analyzer with the same name.

Lucene 后端附带一系列内置分析器;其名称在 org.hibernate.search.engine.backend.analysis.AnalyzerNames 中的常量中列出:

The Lucene backend comes with a series of built-in analyzers; their names are listed as constants in org.hibernate.search.engine.backend.analysis.AnalyzerNames:

default

@FullTextField 默认使用的分析器。

The analyzer used by default with @FullTextField.

默认实现:org.apache.lucene.analysis.standard.StandardAnalyzer

Default implementation: org.apache.lucene.analysis.standard.StandardAnalyzer.

默认行为:首先,使用标准标记化器进行标记化,该标记化器遵循 Unicode 文本分段算法的单词分隔规则,如 Unicode Standard Annex #29中指定的那样。然后,将每个标记小写。

Default behavior: first, tokenize using the standard tokenizer, which follows Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Then, lowercase each token.

standard

默认实现:org.apache.lucene.analysis.standard.StandardAnalyzer

Default implementation: org.apache.lucene.analysis.standard.StandardAnalyzer.

默认行为:首先,使用标准标记化器进行标记化,该标记化器遵循 Unicode 文本分段算法的单词分隔规则,如 Unicode Standard Annex #29中指定的那样。然后,将每个标记小写。

Default behavior: first, tokenize using the standard tokenizer, which follows Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Then, lowercase each token.

simple

默认实现: org.apache.lucene.analysis.core.SimpleAnalyzer

Default implementation: org.apache.lucene.analysis.core.SimpleAnalyzer.

默认行为:首先在非字母字符处拆分文本。然后将每个单词转换为小写。

Default behavior: first, split the text at non-letter characters. Then, lowercase each token.

whitespace

默认实现: org.apache.lucene.analysis.core.WhitespaceAnalyzer

Default implementation: org.apache.lucene.analysis.core.WhitespaceAnalyzer.

默认行为:在空格字符处拆分文本。不更改单词。

Default behavior: split the text at whitespace characters. Do not change the tokens.

stop

默认实现: org.apache.lucene.analysis.core.StopAnalyzer

Default implementation: org.apache.lucene.analysis.core.StopAnalyzer.

默认行为:首先在非字母字符处拆分文本。然后将每个单词转换为小写。最后,删除英语停用词。

Default behavior: first, split the text at non-letter characters. Then, lowercase each token. Finally, remove English stop words.

keyword

默认实现: org.apache.lucene.analysis.core.KeywordAnalyzer

Default implementation: org.apache.lucene.analysis.core.KeywordAnalyzer.

默认行为:不以任何方式更改文本。

Default behavior: do not change the text in any way.

通过这个分析器,全文字段的行为将类似于关键字字段,但功能更少:例如,不支持词条聚合(terms aggregations)。

With this analyzer a full text field would behave similarly to a keyword field, but with fewer features: no terms aggregations, for example.

请考虑改用 @KeywordField

Consider using a @KeywordField instead.

17.6.3. Built-in normalizers

Lucene 后端不提供任何内置规范化器。

The Lucene backend does not provide any built-in normalizer.

17.6.4. Custom analyzers and normalizers

Referencing components by name

传递给配置器的上下文采用了 DSL 来定义分析器和规范化器:

The context passed to the configurer exposes a DSL to define analyzers and normalizers:

示例 424. 实现并使用分析配置器,在 Lucene 后端定义分析器和规范化器

Example 424. Implementing and using an analysis configurer to define analyzers and normalizers with the Lucene backend

package org.hibernate.search.documentation.analysis;

import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class MyLuceneAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.analyzer( "english" ).custom() (1)
                .tokenizer( "standard" ) (2)
                .charFilter( "htmlStrip" ) (3)
                .tokenFilter( "lowercase" ) (4)
                .tokenFilter( "snowballPorter" ) (4)
                        .param( "language", "English" ) (5)
                .tokenFilter( "asciiFolding" );

        context.normalizer( "lowercase" ).custom() (6)
                .tokenFilter( "lowercase" )
                .tokenFilter( "asciiFolding" );

        context.analyzer( "french" ).custom() (7)
                .tokenizer( "standard" )
                .charFilter( "htmlStrip" )
                .tokenFilter( "lowercase" )
                .tokenFilter( "snowballPorter" )
                        .param( "language", "French" )
                .tokenFilter( "asciiFolding" );
    }
}
(1)
hibernate.search.backend.analysis.configurer = class:org.hibernate.search.documentation.analysis.MyLuceneAnalysisConfigurer

要了解有哪些字符过滤器、分词器和词元过滤器可用,可以在传递给分析配置器的上下文中调用 context.availableCharFilters()、context.availableTokenizers() 和 context.availableTokenFilters();这会返回所有有效名称的集合。

To know which character filters, tokenizers and token filters are available, you can call context.availableCharFilters(), context.availableTokenizers() and context.availableTokenFilters() on the context passed to your analysis configurer; this will return a set of all valid names.

要深入了解这些字符过滤器、标记化器和标记过滤器,请浏览 Lucene Javadoc,特别是查看 common analysis components的各个包,或阅读 Solr Wiki上的相应部分(无需 Solr 即可使用这些分析器,只是 Lucene 本身没有文档页面)。

To learn more about the behavior of these character filters, tokenizers and token filters, either browse the Lucene Javadoc, in particular, look through various packages of common analysis components, or read the corresponding section on the Solr Wiki (you don’t need Solr to use these analyzers, it’s just that there is no documentation page for Lucene proper).

在 Lucene Javadoc 中,每个工厂类的描述都包含后跟字符串常量的“SPI 名称”。在定义分析器时,应将此名称传递给该工厂以供使用。

In the Lucene Javadoc, the description of each factory class includes the mention "SPI Name" followed by a string constant. This is the name that should be passed to use that factory when defining analyzers.
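
定义分析器和规范化器之后,即可在映射中按名称引用它们。下面是一个简短的示意,假设使用了上文定义的 english 分析器和 lowercase 规范化器;实体本身仅用于演示:

Once defined, analyzers and normalizers can be referenced by name from the mapping. Below is a minimal sketch assuming the english analyzer and lowercase normalizer defined above; the entity itself is purely illustrative:

@Entity
@Indexed
public class Book {

    @Id
    private Integer id;

    @FullTextField(analyzer = "english") // References the "english" analyzer defined in the configurer
    private String title;

    @KeywordField(normalizer = "lowercase") // References the "lowercase" normalizer defined in the configurer
    private String genre;

    // Getters and setters
    // ...

}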

Referencing components by factory class

你可以传递 Lucene 工厂类,而不是名称,以引用特定的分词器、字符过滤器或分词过滤器实现。这些类会扩展 org.apache.lucene.analysis.TokenizerFactoryorg.apache.lucene.analysis.TokenFilterFactoryorg.apache.lucene.analysis.CharFilterFactory

Instead of names, you may also pass Lucene factory classes to refer to a particular tokenizer, char filter or token filter implementation. Those classes extend org.apache.lucene.analysis.TokenizerFactory, org.apache.lucene.analysis.TokenFilterFactory or org.apache.lucene.analysis.CharFilterFactory.

这会避免在代码中使用字符串常量,代价是直接编译时依赖 Lucene。

This avoids string constants in your code, at the cost of having a direct compile-time dependency to Lucene.

示例 425. 使用 Lucene factory 类实现分析配置器

Example 425. Analysis configurer implementation using Lucene factory classes

context.analyzer( "english" ).custom()
        .tokenizer( StandardTokenizerFactory.class )
        .charFilter( HTMLStripCharFilterFactory.class )
        .tokenFilter( LowerCaseFilterFactory.class )
        .tokenFilter( SnowballPorterFilterFactory.class )
                .param( "language", "English" )
        .tokenFilter( ASCIIFoldingFilterFactory.class );

context.normalizer( "lowercase" ).custom()
        .tokenFilter( LowerCaseFilterFactory.class )
        .tokenFilter( ASCIIFoldingFilterFactory.class );

要了解有哪些字符过滤器、标记化器和标记过滤器可用,请浏览 Lucene Javadoc,特别是查看 common analysis components的各个包,或阅读 Solr Wiki上的相应部分(无需 Solr 即可使用这些分析器,只是 Lucene 本身没有文档页面)。

To know which character filters, tokenizers and token filters are available, either browse the Lucene Javadoc, in particular, look through various packages of common analysis components, or read the corresponding section on the Solr Wiki (you don’t need Solr to use these analyzers, it’s just that there is no documentation page for Lucene proper).

Assigning names to analyzer instances

也可以给一个分析器实例指定一个名称:

It is also possible to assign a name to an analyzer instance:

示例 426. 在 Lucene 后端命名分析器实例

Example 426. Naming an analyzer instance in the Lucene backend

context.analyzer( "my-standard" ).instance( new StandardAnalyzer() );

17.6.5. Overriding the default analyzer

使用 @FullTextField 但未明确指定分析器时的默认分析器名为 default

The default analyzer when using @FullTextField without specifying an analyzer explicitly, is named default.

像任何其他 built-in analyzer 一样,可以通过定义具有相同名称的 custom analyzer 来覆盖默认分析器:

Like any other built-in analyzer, it is possible to override the default analyzer by defining a custom analyzer with the same name:

示例 427. 覆盖 Lucene 后端的默认分析器

Example 427. Overriding the default analyzer in the Lucene backend

package org.hibernate.search.documentation.analysis;

import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;
import org.hibernate.search.engine.backend.analysis.AnalyzerNames;

public class DefaultOverridingLuceneAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.analyzer( AnalyzerNames.DEFAULT ) (1)
                .custom() (2)
                .tokenizer( "standard" )
                .tokenFilter( "lowercase" )
                .tokenFilter( "snowballPorter" )
                        .param( "language", "French" )
                .tokenFilter( "asciiFolding" );
    }
}
(1)
hibernate.search.backend.analysis.configurer = class:org.hibernate.search.documentation.analysis.DefaultOverridingLuceneAnalysisConfigurer

17.6.6. Similarity

搜索时,将根据在索引时间记录的统计信息使用特定公式为文档分配分数。这些统计信息和公式由一个称为“相似性”的单个组件定义,实现 Lucene 的 org.apache.lucene.search.similarities.Similarity 接口。

When searching, scores are assigned to documents based on statistics recorded at index time, using a specific formula. Those statistics and the formula are defined by a single component called the similarity, implementing Lucene’s org.apache.lucene.search.similarities.Similarity interface.

默认情况下,Hibernate Search 使用 BM25Similarity 及其默认参数 (k1 = 1.2, b = 0.75)。这应该可以在大多数情况下提供令人满意的评分。

By default, Hibernate Search uses BM25Similarity with its default parameters (k1 = 1.2, b = 0.75). This should provide satisfying scoring in most situations.

如有高级需求,可以在分析配置器中设置自定义 Similarity,如下所示。

If you have advanced needs, you can set a custom Similarity in your analysis configurer, as shown below.

请记住,还要从您的配置属性中引用分析配置器,如 Custom analyzers and normalizers 中所述。

Remember to also reference the analysis configurer from your configuration properties, as explained in Custom analyzers and normalizers.

示例 428. 实现分析配置器以使用 Lucene 后端更改相似性

Example 428. Implementing an analysis configurer to change the Similarity with the Lucene backend

public class CustomSimilarityLuceneAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.similarity( new ClassicSimilarity() ); (1)

        context.analyzer( "english" ).custom() (2)
                .tokenizer( "standard" )
                .tokenFilter( "lowercase" )
                .tokenFilter( "asciiFolding" );
    }
}

有关 Similarity 、其各种实现以及每种实现的利弊的更多信息,请参阅 Similarity 和 Lucene 源代码的 javadoc。

For more information about Similarity, its various implementations, and the pros and cons of each implementation, see the javadoc of Similarity and Lucene’s source code.

还可以在网上找到有用的资源,例如在 Elasticsearch 的文档中。

You can also find useful resources on the web, for example in Elasticsearch’s documentation.

17.7. Threads

Lucene 后端依赖于内部线程池在索引上执行写操作。

The Lucene backend relies on an internal thread pool to execute write operations on the index.

默认情况下,此池包含的线程数恰好等于引导时 JVM 可用的处理器数。可以使用配置属性更改此设置:

By default, the pool contains exactly as many threads as the number of processors available to the JVM on bootstrap. That can be changed using a configuration property:

hibernate.search.backend.thread_pool.size = 4

这个数字是针对每个后端(per backend)的,而不是针对每个索引。添加更多索引不会增加更多线程。

This number is per backend, not per index. Adding more indexes will not add more threads.

在此线程池中发生的运算包括阻塞 I/O,因此将其大小提高到 JVM 可用处理器核心数之上可能是合理的,还能提升性能。

Operations happening in this thread-pool include blocking I/O, so raising its size above the number of processor cores available to the JVM can make sense and may improve performance.

17.8. Indexing queues

在 Hibernate Search 在索引上执行的所有写操作中,预计会出现许多“索引”操作来创建/更新/删除特定文档。我们通常希望在请求与同一文档相关时保留这些请求的相对顺序。

Among all the write operations performed by Hibernate Search on the indexes, it is expected that there will be a lot of "indexing" operations to create/update/delete a specific document. We generally want to preserve the relative order of these requests when they are about the same documents.

出于这个原因,Hibernate Search 会将这些操作推入内部队列并批处理应用这些操作。每个索引维护 10 个队列,每个队列最多包含 1000 个元素。这些队列会独立运行(并行),但每个队列都会依次执行一个操作,因此在任意给定时间,每个索引最多可以应用 10 批索引请求。

For this reason, Hibernate Search pushes these operations to internal queues and applies them in batches. Each index maintains 10 queues holding at most 1000 elements each. Queues operate independently (in parallel), but each queue applies one operation after the other, so at any given time there can be at most 10 batches of indexing requests being applied for each index.

相对于同一文档 ID 的索引操作始终会被推送到同一队列。

Indexing operations relative to the same document ID are always pushed to the same queue.

可以自定义队列以减少资源消耗,或者相反,改善吞吐量。这是通过以下配置属性来实现的:

It is possible to customize the queues in order to reduce resource consumption, or on the contrary to improve throughput. This is done through the following configuration properties:

# To configure the defaults for all indexes:
hibernate.search.backend.indexing.queue_count = 10
hibernate.search.backend.indexing.queue_size = 1000
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.indexing.queue_count = 10
hibernate.search.backend.indexes.<index-name>.indexing.queue_size = 1000
  1. indexing.queue_count defines the number of queues. Expects a strictly positive integer value. The default for this property is 10.

更高的值将导致更多索引操作并行执行,如果在索引时 CPU 能力是瓶颈,这可能会提高索引吞吐量。

Higher values will lead to more indexing operations being performed in parallel, which may lead to higher indexing throughput if CPU power is the bottleneck when indexing.

请注意,将此数字提高到 number of threads 以上永远没有用,因为线程数限制了可以并行处理多少个队列。

Note that raising this number above the number of threads is never useful, as the number of threads limits how many queues can be processed in parallel.

  2. indexing.queue_size defines the maximum number of elements each queue can hold. Expects a strictly positive integer value. The default for this property is 1000.

较低的值可能会导致较低的内存使用率,尤其是当有多个队列时,但过低的值会增加 application threads blocking 的可能性,因为队列已满,这可能导致较低的索引编制吞吐量。

Lower values may lead to lower memory usage, especially if there are many queues, but values that are too low will increase the likeliness of application threads blocking because the queue is full, which may lead to lower indexing throughput.

当队列已满时,任何请求索引的尝试都会阻塞,直到该请求可以放入队列。

When a queue is full, any attempt to request indexing will block until the request can be put into the queue.

为了达到合理的性能水平,务必将队列的大小设置为足够高的数字,以便仅在应用程序负载非常高时才会发生此类阻塞。

In order to achieve a reasonable level of performance, be sure to set the size of queues to a high enough number that this kind of blocking only happens when the application is under very high load.

启用 sharding 时,将为每个分片分配自己的队列集。

When sharding is enabled, each shard will be assigned its own set of queues.

如果你使用基于文档 ID(而不是提供的路由键)的 hash 分区策略,请务必将队列数设置为与分片数没有公因子的数字;否则,某些队列的使用率可能远低于其他队列。

If you use the hash sharding strategy based on the document ID (and not based on a provided routing key), make sure to set the number of queues to a number with no common denominator with the number of shards; otherwise, some queues may be used much less than others.

例如,如果你将分片数设为 8,并将队列数设为 4,则最终存储在分片 0 中的文档将始终存储在该分片队列 0 中。这是因为路由到分片和路由到队列都会对文档 ID 进行哈希运算,然后对该哈希应用模运算,并且 <some hash> % 8 == 0 (路由到分片 0)意味着 <some hash> % 4 == 0 (路由到分片 0 的队列 0)。同样,这仅在你依靠文档 ID 而不是提供的用于分区的路由键时才成立。

For example, if you set the number of shards to 8 and the number of queues to 4, documents ending up in the shard #0 will always end up in queue #0 of that shard. That’s because both the routing to a shard and the routing to a queue take the hash of the document ID then apply a modulo operation to that hash, and <some hash> % 8 == 0 (routed to shard #0) implies <some hash> % 4 == 0 (routed to queue #0 of shard #0). Again, this is only true if you rely on the document ID and not on a provided routing key for sharding.
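
下面用一个简化的 Java 片段来说明上述取模推理;这只是示意(documentId 为假设的变量),并非 Hibernate Search 的实际实现:

The modulo reasoning above can be illustrated with a simplified Java snippet; this is only an illustration (documentId is a hypothetical variable), not Hibernate Search's actual implementation:

// Illustration only: both routings are derived from the same hash of the document ID
int hash = Math.abs( documentId.hashCode() );
int shard = hash % 8; // 8 shards
int queue = hash % 4; // 4 queues per shard
// Since 4 divides 8, hash % 8 == 0 implies hash % 4 == 0:
// documents routed to shard #0 always end up in queue #0 of that shard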

17.9. Writing and reading

17.9.1. Commit

有关在 Hibernate Search 中写入和读取索引的初步介绍,特别是提交(commit)和刷新(refresh)的概念,请参阅 Commit and refresh。

For a preliminary introduction to writing to and reading from indexes in Hibernate Search, including in particular the concepts of commit and refresh, see Commit and refresh.

在 Lucene 术语中,提交(commit)是指将索引写入器中缓冲的更改推送到索引本身,这样崩溃或断电就不再会导致数据丢失。

In Lucene terminology, a commit is when changes buffered in an index writer are pushed to the index itself, so that a crash or power loss will no longer result in data loss.

某些操作非常重要,并且总是在被认为完成之前提交。对于由 listener-triggered indexing 触发的更改(除非 configured otherwise),以及对于大规模操作(例如 purge)来说,情况就是这样。当遇到此类操作时,将立即执行提交,以保证仅在所有更改都安全地存储在磁盘上后才将该操作视为完成。

Some operations are critical and are always committed before they are considered complete. This is the case for changes triggered by listener-triggered indexing (unless configured otherwise), and also for large-scale operations such as a purge. When such an operation is encountered, a commit will be performed immediately, guaranteeing that the operation is only considered complete after all changes are safely stored on disk.

但是,其他操作,例如由 mass indexer 贡献的更改,或当 indexing 使用 async synchronization strategy 时的更改,则不要求立即提交。

However, other operations, like changes contributed by the mass indexer or when indexing is using the async synchronization strategy, are not expected to be committed immediately.

就性能而言,提交可能是一项昂贵的操作,这就是 Hibernate Search 尝试不要太频繁地提交的原因。默认情况下,当不立即提交的更改应用于索引时,Hibernate Search 将延迟提交一秒钟。如果在那一秒钟内应用了其他更改,则它们将包含在相同的提交中。这极大地减少了在写入密集型场景(例如 mass indexing)中的提交数量,从而导致了更好的性能。

Performance-wise, committing may be an expensive operation, which is why Hibernate Search tries not to commit too often. By default, when changes that do not require an immediate commit are applied to the index, Hibernate Search will delay the commit by one second. If other changes are applied during that second, they will be included in the same commit. This dramatically reduces the amount of commits in write-intensive scenarios (e.g. mass indexing), leading to much better performance.

可以通过设置提交间隔(以毫秒为单位)来控制 Hibernate Search 执行提交的频率:

It is possible to control exactly how often Hibernate Search will commit by setting the commit interval (in milliseconds):

# To configure the defaults for all indexes:
hibernate.search.backend.io.commit_interval = 1000
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.io.commit_interval = 1000

此属性的默认值为 1000

The default for this property is 1000.

将提交间隔设置为 0 将强制 Hibernate Search 在每次更改批处理后提交,这可能会导致吞吐量大幅下降,尤其是对于 explicit or listener-triggered indexing,对 mass indexing 更是如此。

Setting the commit interval to 0 will force Hibernate Search to commit after every batch of changes, which may result in a much lower throughput, in particular for explicit or listener-triggered indexing but even more so for mass indexing.

记住,单独写入操作可能会强制进行提交,而这可能会抵消因设置较高的提交间隔而带来的潜在性能提升。

Remember that individual write operations may force a commit, which may cancel out the potential performance gains from setting a higher commit interval.

默认情况下,提交间隔可能只会提升 mass indexer 的吞吐量。如果你希望由 explicitly or by listeners 触发的更改也能从中受益,则需要选择非默认 synchronization strategy ,以免在每次更改后都需要进行提交。

By default, the commit interval may only improve throughput of the mass indexer. If you want changes triggered explicitly or by listeners to benefit from it too, you will need to select a non-default synchronization strategy, so as not to require a commit after each change.

17.9.2. Refresh

有关在 Hibernate Search 中写入和读取索引的初步介绍,特别是提交(commit)和刷新(refresh)的概念,请参阅 Commit and refresh。

For a preliminary introduction to writing to and reading from indexes in Hibernate Search, including in particular the concepts of commit and refresh, see Commit and refresh.

在 Lucene 术语中,刷新(refresh)是指打开一个新的索引读取器,使后续的搜索查询能够考虑到索引的最新更改。

In Lucene terminology, a refresh is when a new index reader is opened, so that the next search queries will take into account the latest changes to the index.

在性能方面,刷新可能是一项昂贵的操作,这就是 Hibernate Search 尝试不要过频繁地刷新的原因。索引读取器在每个搜索查询时都会刷新,但前提是自上次刷新以来发生了写操作。

Performance-wise, refreshing may be an expensive operation, which is why Hibernate Search tries not to refresh too often. The index reader is refreshed upon every search query, but only if writes have occurred since the last refresh.

在写密集型场景中,每次写操作后刷新仍然过于频繁,这时可以降低刷新频率,从而通过设置以毫秒为单位的刷新间隔提高读取吞吐量。当将其设置为大于 0 的值时,将不再在每个搜索查询时刷新索引读取器:如果搜索查询开始时刷新在 X 毫秒之前发生,则即使索引读取器可能已过时,也不会刷新它。

In write-intensive scenarios where refreshing after each write is still too frequent, it is possible to refresh less frequently and thus improve read throughput by setting a refresh interval in milliseconds. When set to a value higher than 0, the index reader will no longer be refreshed upon every search query: if, when a search query starts, the refresh occurred less than X milliseconds ago, then the index reader will not be refreshed, even though it may be out-of-date.

刷新间隔可以通过以下方式设置:

The refresh interval can be set this way:

# To configure the defaults for all indexes:
hibernate.search.backend.io.refresh_interval = 0
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.io.refresh_interval = 0

此属性的默认值为 0

The default for this property is 0.

17.9.3. IndexWriter settings

Hibernate Search 用来写入索引的 Lucene IndexWriter 公开了多个设置,可以对其进行微调以更好地适配你的应用程序,最终获得更好的性能。

Lucene’s IndexWriter, used by Hibernate Search to write to indexes, exposes several settings that can be tweaked to better fit your application, and ultimately get better performance.

Hibernate Search 通过索引级别的具有 io.writer. 前缀的配置属性公开这些设置。

Hibernate Search exposes these settings through configuration properties prefixed with io.writer., at the index level.

以下是所有索引作者设置的列表。它们都可以通过配置属性以类似的方式设置;例如,io.writer.ram_buffer_size 可以像这样设置:

Below is a list of all index writer settings. They can all be set similarly through configuration properties; for example, io.writer.ram_buffer_size can be set like this:

# To configure the defaults for all indexes:
hibernate.search.backend.io.writer.ram_buffer_size = 32
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.io.writer.ram_buffer_size = 32

表 12. IndexWriter 的配置属性

Table 12. Configuration properties for the IndexWriter

Property

Description

[…​].io.writer.max_buffered_docs

The maximum number of documents that can be buffered in-memory before they are flushed to the Directory. Large values mean faster indexing, but more RAM usage. When used together with ram_buffer_size, a flush occurs for whichever event happens first.

[…​].io.writer.ram_buffer_size

The maximum amount of RAM that may be used for buffering added documents and deletions before they are flushed to the Directory. Large values mean faster indexing, but more RAM usage. Generally, for faster indexing performance it’s best to use this setting rather than max_buffered_docs. When used together with max_buffered_docs, a flush occurs for whichever event happens first.

[…​].io.writer.infostream

Enables low-level trace information about Lucene’s internal components; true or false. Logs will be appended to the logger org.hibernate.search.backend.lucene.infostream at the TRACE level. This may cause significant performance degradation, even if the logger ignores the TRACE level, so this should only be used for troubleshooting purposes. Disabled by default.

请参阅 Lucene 文档,特别是 IndexWriterConfig 的 javadoc 和源代码获取更多关于设置及其默认值的信息。

Refer to Lucene’s documentation, in particular the javadoc and source code of IndexWriterConfig, for more information about the settings and their defaults.

17.9.4. Merge settings

Lucene 索引不会存储在单个连续的文件中。相反,每次刷新索引会生成一个小文件,包含加入到索引中的所有文档。该文件被称为“段”。在包含过多段的索引上,搜索可能会更慢,所以 Lucene 会定期合并小段以创建更少且更大的段。

A Lucene index is not stored in a single, continuous file. Instead, each flush to the index will generate a small file containing all the documents added to the index. This file is called a "segment". Search can be slower on an index with too many segments, so Lucene regularly merges small segments to create fewer, larger segments.

Lucene 的合并行为通过 MergePolicy 控制。Hibernate Search 使用 LogByteSizeMergePolicy,它公开了多个设置,可以对其进行微调以更好地适配你的应用程序,最终获得更好的性能。

Lucene’s merge behavior is controlled through a MergePolicy. Hibernate Search uses the LogByteSizeMergePolicy, which exposes several settings that can be tweaked to better fit your application, and ultimately get better performance.

以下是所有合并设置的列表。它们都可以通过配置属性以类似的方式设置;例如,io.merge.factor 可以像这样设置:

Below is a list of all merge settings. They can all be set similarly through configuration properties; for example, io.merge.factor can be set like this:

# To configure the defaults for all indexes:
hibernate.search.backend.io.merge.factor = 10
# To configure a specific index:
hibernate.search.backend.indexes.<index-name>.io.merge.factor = 10

表 13. 与合并相关的配置属性

Table 13. Configuration properties related to merges

Property

Description

[…​].io.merge.max_docs

The maximum number of documents that a segment can have before merging. Segments with more than this number of documents will not be merged. Smaller values perform better on frequently changing indexes, larger values provide better search performance if the index does not change often.

[…​].io.merge.factor

The number of segments that are merged at once. With smaller values, merging happens more often and thus uses more resources, but the total number of segments will be lower on average, increasing read performance. Thus, larger values (> 10) are best for mass indexing, and smaller values (< 10) are best for explicit or listener-triggered indexing. The value must not be lower than 2.

[…​].io.merge.min_size

The minimum target size of segments, in MB, for background merges. Segments smaller than this size are merged more aggressively. Setting this too large might result in expensive merge operations, even though they are less frequent.

[…​].io.merge.max_size

The maximum size of segments, in MB, for background merges. Segments larger than this size are never merged in the background. Setting this to a lower value helps reduce memory requirements and avoids some merging operations at the cost of optimal search speed. When forcefully merging an index, this value is ignored and max_forced_size is used instead (see below).

[…​].io.merge.max_forced_size

The maximum size of segments, in MB, for forced merges. This is the equivalent of io.merge.max_size for forceful merges. You will generally want to set this to the same value as max_size or lower, but setting it too low will degrade search performance as documents are deleted.

[…​].io.merge.calibrate_by_deletes

Whether the number of deleted documents in an index should be taken into account; true or false. When enabled, Lucene will consider that a segment with 100 documents, 50 of which are deleted, actually contains 50 documents. When disabled, Lucene will consider that such a segment contains 100 documents. Setting calibrate_by_deletes to false will lead to more frequent merges caused by io.merge.max_docs, but will more aggressively merge segments with many deleted documents, improving search performance.

请参阅 Lucene 文档,特别是 LogByteSizeMergePolicy 的 javadoc 和源代码获取更多关于设置及其默认值的信息。

Refer to Lucene’s documentation, in particular the javadoc and source code of LogByteSizeMergePolicy, for more information about the settings and their defaults.

选项 io.merge.max_sizeio.merge.max_forced_size 不会直接定义所有段文件允许的最大大小。

The options io.merge.max_size and io.merge.max_forced_size do not directly define the maximum size of all segment files.

首先,考虑合并段是指将段与另一个现有段组合在一起形成一个更大的段。 io.merge.max_size 是合并之前的段允许的最大大小,因此新合并后的段可能达到其两倍大小。

First, consider that merging a segment is about adding it together with another existing segment to form a larger one. io.merge.max_size is the maximum size of segments before merging, so newly merged segments can be up to twice that size.

其次,合并选项不会影响索引写入程序在合并之前创建的段的初始大小。此大小可以用设置 io.writer.ram_buffer_size 来限制,但 Lucene 依靠估算值来实施此限制;当这些估算值出现偏差时,新创建的段可能会略大于 io.writer.ram_buffer_size

Second, merge options do not affect the size of segments initially created by the index writer, before they are merged. This size can be limited with the setting io.writer.ram_buffer_size, but Lucene relies on estimates to implement this limit; when these estimates are off, it is possible for newly created segments to be slightly larger than io.writer.ram_buffer_size.

因此,例如,为了相当自信地确保没有文件增长到大于 15MB,请使用类似这样的设置:

So, for example, to be fairly confident no file grows larger than 15MB, use something like this:

hibernate.search.backend.io.writer.ram_buffer_size = 10
hibernate.search.backend.io.merge.max_size = 7
hibernate.search.backend.io.merge.max_forced_size = 7

17.10. Searching

使用 Lucene 后端进行搜索依赖于 same APIs as any other backend。

Searching with the Lucene backend relies on the same APIs as any other backend.

此章节详细介绍与搜索相关的 Lucene 特有配置。

This section details Lucene-specific configuration related to searching.

17.10.1. Low-level hit caching

此特性意味着应用程序代码直接依赖 Lucene API。

This feature implies that application code rely on Lucene APIs directly.

即使是针对 bug 修复(微)版本,升级 Hibernate Search 也可能需要升级 Lucene,这可能会导致 Lucene 中出现破坏性的(breaking)API 更改。

An upgrade of Hibernate Search, even for a bugfix (micro) release, may require an upgrade of Lucene, which may lead to breaking API changes in Lucene.

如果出现此情况,您将需要更改应用程序代码来应对这些更改。

If this happens, you will need to change application code to deal with the changes.

Lucene 支持缓存低级别命中,即缓存与给定 org.apache.lucene.search.Query 在给定索引段中匹配的文档列表。

Lucene supports caching low-level hits, i.e. caching the list of documents that match a given org.apache.lucene.search.Query in a given index segment.

在读密集型场景中,该缓存很有用,在这种场景中,相同的查询会在同一索引上执行很频繁,而且很少会写到索引中。

This cache can be useful in read-intensive scenarios, where the same query is executed very often on the same index, and the index is rarely written to.

要在 Lucene 后端中配置缓存,您需要:

To configure caching in a Lucene backend, you will need to:

  • Define a class that implements the org.hibernate.search.backend.lucene.cache.QueryCachingConfigurer interface.

  • Configure the backend to use that implementation by setting the configuration property hibernate.search.backend.query.caching.configurer to a bean reference pointing to the implementation, for example class:com.mycompany.MyQueryCachingConfigurer.

Hibernate Search 会在启动时调用此实现的 configure 方法,配置人员将能够利用 DSL 来定义 org.apache.lucene.search.QueryCacheorg.apache.lucene.search.QueryCachingPolicy

Hibernate Search will call the configure method of this implementation on startup, and the configurer will be able to take advantage of a DSL to define the org.apache.lucene.search.QueryCache and the org.apache.lucene.search.QueryCachingPolicy.
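
下面是一个简短的配置器示意,假设上下文公开了 queryCache(…​) 和 queryCachingPolicy(…​) 两个设置方法;缓存大小等数值仅用于演示:

Below is a minimal configurer sketch, assuming the context exposes queryCache(…​) and queryCachingPolicy(…​) setters; the cache sizes are purely illustrative:

package com.mycompany;

import org.apache.lucene.search.LRUQueryCache;
import org.apache.lucene.search.UsageTrackingQueryCachingPolicy;

import org.hibernate.search.backend.lucene.cache.QueryCachingConfigurationContext;
import org.hibernate.search.backend.lucene.cache.QueryCachingConfigurer;

public class MyQueryCachingConfigurer implements QueryCachingConfigurer {
    @Override
    public void configure(QueryCachingConfigurationContext context) {
        // Cache up to 1000 queries, using at most ~32 MB of heap (illustrative values)
        context.queryCache( new LRUQueryCache( 1000, 32 * 1024 * 1024 ) );
        // Only cache queries that are reused often enough to be worth caching
        context.queryCachingPolicy( new UsageTrackingQueryCachingPolicy() );
    }
}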