Hibernate Search 中文操作指南

19. Coordination

19.1. Basics

协调是一个复杂的话题,它解决的问题乍看之下可能不清楚。您可能会发现从更高级的角度来理解协调更容易。

Coordination is a complex topic, and the problems it solves may appear unclear at first sight. You may find it easier to approach coordination from a higher level.

有关涉及不同协调策略的一些示例体系结构和每个体系结构之间的差异摘要,请参阅 Examples of architectures

For a few example architectures involving different coordination strategies and a summary of the differences between each architecture, see Examples of architectures.

有关协调策略差异的摘要(特别是在监听器触发索引的上下文中),请参阅 Basics

For a summary of the differences between coordination strategies specifically in the context of listener-triggered indexing, see Basics.

使用 Hibernate Search 的应用程序通常依赖于多线程甚至多个应用程序实例,它们将同时更新数据库。

An application using Hibernate Search usually relies on multiple threads, or even multiple application instances, which will update the database concurrently.

协调策略定义了这些线程/节点如何互相协调以根据这些数据库更新更新索引,这种方式可确保一致性、防止数据丢失并优化性能。

The coordination strategy defines how these threads/nodes will coordinate with each other in order to update indexes according to these database updates, in a way that ensures consistency, prevents data loss, and optimizes performance.

该策略通过配置属性来设置:

This strategy is set through configuration properties:

hibernate.search.coordination.strategy = outbox-polling

该属性的默认值为 none

The default for this property is none.
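
下面是一个简要示意,展示如何在引导 JPA EntityManagerFactory 时以编程方式传入该属性;其中的持久化单元名称 "my-persistence-unit" 只是为说明而假设的。

A minimal sketch showing how this property could be passed programmatically when bootstrapping a JPA EntityManagerFactory; the persistence unit name "my-persistence-unit" is an assumption for illustration only.

// Assumes jakarta.persistence.Persistence and a persistence unit named "my-persistence-unit".
Map<String, Object> settings = new HashMap<>();
settings.put( "hibernate.search.coordination.strategy", "outbox-polling" );
EntityManagerFactory entityManagerFactory =
        Persistence.createEntityManagerFactory( "my-persistence-unit", settings );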

协调策略仅适用于 Hibernate ORM 的集成。

Coordination strategies are only available for the integration to Hibernate ORM.

有关独立 POJO 映射中的协调,请参阅 this section

See this section for information about coordination in the Standalone POJO mapper.

有关可用策略的详细信息,请参阅以下小节。

See the following subsections for details about available strategies.

19.2. No coordination

19.2.1. Basics

该默认策略最为简单,并且无需任何用于应用节点间通信的额外基础设施。

The default strategy is the simplest and does not involve any additional infrastructure for communication between application nodes.

所有 explicit or listener-triggered indexing 操作都是在应用程序线程中直接执行的,这赋予了此策略提供 synchronous indexing 的独特能力,但也带来了一些限制,详见下面的小节。

All explicit or listener-triggered indexing operations are executed directly in application threads, which gives this strategy the unique ability to provide synchronous indexing, at the cost of a few limitations, detailed in the subsections below.

协调是一个复杂的话题,它解决的问题乍看之下可能不清楚。您可能会发现从更高级的角度来理解协调更容易。

Coordination is a complex topic, and the problems it solves may appear unclear at first sight. You may find it easier to approach coordination from a higher level.

有关涉及不同协调策略的一些示例体系结构和每个体系结构之间的差异摘要,请参阅 Examples of architectures

For a few example architectures involving different coordination strategies and a summary of the differences between each architecture, see Examples of architectures.

有关协调策略差异的摘要(特别是在监听器触发索引的上下文中),请参阅 Basics

For a summary of the differences between coordination strategies specifically in the context of listener-triggered indexing, see Basics.

19.2.2. How indexing works without coordination

Changes have to occur in the ORM session in order to trigger indexing listeners

Associations must be updated on both sides

Only relevant changes trigger indexing

详情请参阅 Dirty checking

See Dirty checking for more details.

Entity data is extracted from entities upon session flushes or SearchSession.close()

When a Hibernate ORM session is flushed, or (with the Standalone POJO Mapper) when SearchSession.close() is called, Hibernate Search will extract data from the entities to build documents to index, and will put these documents in an internal buffer for later indexing. This extraction may involve loading extra data from the database.

使用 Hibernate ORM integration 时,此内部缓冲区会在 Hibernate ORM 会话刷新时被填充,这意味着您可以在 flush() 后安全地 clear() 会话:截至刷新时所执行的实体更改都将被正确索引。

With the Hibernate ORM integration, the fact that this internal buffer is populated on Hibernate ORM session flush means that you can safely clear() the session after a flush(): entity changes performed up to the flush will be indexed correctly.
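
下面是一个基于上述说明的简要示意,演示批处理中周期性的 flush()/clear() 模式;其中的 Book 实体只是为说明而假设的。

Building on the above, here is a minimal sketch of the periodic flush()/clear() pattern in a batch process; the Book entity is an assumption for illustration only.

// Book is a hypothetical indexed entity; plain JPA calls are enough here.
entityManager.getTransaction().begin();
for ( int i = 0; i < 10_000; i++ ) {
    entityManager.persist( new Book( "Title #" + i ) );
    if ( i % 100 == 0 ) {
        entityManager.flush(); // entity data is extracted into Hibernate Search's internal buffer here
        entityManager.clear(); // safe: changes performed up to the flush will still be indexed
    }
}
// Note: the internal buffer keeps growing until the transaction ends (see below).
entityManager.getTransaction().commit(); // the buffered documents are indexed now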

如果您之前使用的是 Hibernate Search 5 或早期版本,您会看到这是一个显著改进:不再需要在事务中间调用 flushToIndexes() 并更新索引,除非数据量较大(请参阅 Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan )。

If you come from Hibernate Search 5 or earlier, you may see this as a significant improvement: there is no need to call flushToIndexes() and update indexes in the middle of a transaction anymore, except for larger volumes of data (see Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan).

但是,如果您在事务中使用 Hibernate ORM integration 执行批处理,并定期调用 session.flush() / session.clear() 以节省内存,请注意用于保存要索引文档的 Hibernate Search 内部缓冲区将随着每次刷新而增大,并且直到该事务提交或回滚为止,才被清除。如果您因此遇到内存问题,请参阅 Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan 以了解一些解决方案。

However, if you perform a batch process inside a transaction with the Hibernate ORM integration, and call session.flush()/session.clear() regularly to save memory, be aware that Hibernate Search’s internal buffer holding documents to index will grow on each flush, and will not be cleared until the transaction is committed or rolled back. If you encounter memory issues because of that, see Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan for a few solutions.

Extraction of entity data may fetch extra data from the database

Even when you change only a single property of an indexed entity, if that property is indexed, Hibernate Search needs to rebuild the corresponding document in full.

Hibernate Search 会尝试只加载索引所必需的内容,但根据你的映射,这可能导致仅仅为了重新索引实体而加载懒加载关联,即使你的业务代码并不需要这些关联,这会给你的应用线程和数据库带来额外开销。

Hibernate Search tries to only load what is necessary for indexing, but depending on your mapping, this may lead to lazy associations being loaded just to reindex entities, even if you didn’t need them in your business code, which may represent an overhead for your application threads as well as your database.

使用 Hibernate ORM integration,可以通过以下方式在一定程度上减轻这些额外的成本:

With the Hibernate ORM integration, this extra cost can be mitigated to some extent by:

充分利用 Hibernate ORM 的批获取:请参阅 the batch_fetch_size property 和 the @BatchSize annotation(见本列表后的示意代码)。

leveraging Hibernate ORM’s batch fetching: see the batch_fetch_size property and the @BatchSize annotation (see the sketch after this list).

利用 Hibernate ORM 的 second-level cache,特别是对于从索引实体引用的不可变实体(例如,对于引用数据,如国家/地区、城市……​)。

leveraging Hibernate ORM’s second-level cache, especially for immutable entities referenced from indexed entities (e.g. for reference data such as countries, cities, …​).
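
作为上面第一条缓解措施(批获取)的示意,下面给出一组假设性的实体映射;Book、Author 及其字段名称均为示例假设。

As a sketch of the first mitigation above (batch fetching), here are hypothetical entity mappings; Book, Author and their field names are assumptions for illustration only.

// Hypothetical mapping: reindexing many books loads their authors 20 at a time instead of one by one.
@Entity
@Indexed
public class Book {
    @Id @GeneratedValue
    private Long id;

    @FullTextField
    private String title;

    // The author's name is part of the book's document, so reindexing a book may load its author:
    @ManyToOne(fetch = FetchType.LAZY)
    @IndexedEmbedded
    private Author author;
}

@Entity
@BatchSize(size = 20) // org.hibernate.annotations.BatchSize: batch-loads lazy Author proxies
public class Author {
    @Id @GeneratedValue
    private Long id;

    @GenericField
    private String name;
}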

Indexing is not guaranteed on commit, but only after the application thread returns

When entity changes happen inside a transaction, indexes are not updated immediately, but only after the transaction is successfully committed. That way, if a transaction is rolled back, the indexes will be left in a state consistent with the database, discarding all the index changes that were planned during the transaction.
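
下面是一个简要示意,说明事务回滚时,事务期间计划的索引更改会被丢弃;其中的 Book 实体只是为说明而假设的。

A minimal sketch showing that index changes planned during a transaction are discarded on rollback; the Book entity is an assumption for illustration only.

// With no coordination, nothing reaches the index until the commit succeeds.
entityManager.getTransaction().begin();
Book book = entityManager.find( Book.class, 1L );
book.setTitle( "Updated title" ); // an index update is planned for this change
entityManager.getTransaction().rollback(); // the planned update is discarded: the index stays consistent with the database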

同样,在使用 Standalone POJO Mapper 时,保证在 SearchSession.close() 返回后更新索引。

Similarly, when using the Standalone POJO Mapper, indexes are guaranteed to be updated after SearchSession.close() returns.

但是,如果在索引时后端发生错误,此行为意味着 index changes may be lost, leading to out-of-sync indexes。如果这对您来说是个问题,您应考虑切换到 another coordination strategy

However, if an error occurs in the backend while indexing, this behavior means that index changes may be lost, leading to out-of-sync indexes. If this is a problem for you, you should consider switching to another coordination strategy.

利用 Hibernate ORM integration ,在任何事务之外(不推荐)发生实体更改时,对索引的更新将立即在会话 flush() 时进行。如果没有该刷新,则索引将不会自动更新。

With the Hibernate ORM integration, when entity changes happen outside any transaction (not recommended), indexes are updated immediately upon session flush(). Without that flush, indexes will not be updated automatically.

Index changes may not be visible immediately

By default, indexing will resume the application thread after index changes are committed to the indexes. This means index changes are safely stored to disk, but this does not mean a search query ran immediately after indexing will take the changes into account: when using the Elasticsearch backend in particular, changes may take some time to be visible from search queries.

有关详细信息,请参阅 Synchronization with the indexes

For more information, see Synchronization with the indexes.
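
下面是一个简要示意,展示如何按会话调整同步策略,使提交在更改对搜索查询可见后才返回;这里假设使用较新版本 Hibernate Search 提供的 IndexingPlanSynchronizationStrategy API,Book 实体为示例假设。

A minimal sketch of tuning the synchronization strategy per session so that the commit only returns once changes are visible to search queries; this assumes the IndexingPlanSynchronizationStrategy API available in recent Hibernate Search versions, and Book is a hypothetical indexed entity.

// Assumes org.hibernate.search.mapper.orm.Search and
// org.hibernate.search.mapper.pojo.work.IndexingPlanSynchronizationStrategy.
SearchSession searchSession = Search.session( entityManager );
searchSession.indexingPlanSynchronizationStrategy( IndexingPlanSynchronizationStrategy.sync() );

entityManager.getTransaction().begin();
entityManager.persist( new Book( "The Caves of Steel" ) );
entityManager.getTransaction().commit(); // returns only once the change is committed and searchable

long hitCount = searchSession.search( Book.class )
        .where( f -> f.match().field( "title" ).matching( "steel" ) )
        .fetchTotalHitCount(); // the new book is visible here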

19.3. outbox-polling: additional event tables and polling in background processors

以下列出的特性尚处于 incubating 阶段:它们仍在积极开发中。

Features detailed below are incubating: they are still under active development.

通常 compatibility policy 不适用:孵化元素(例如类型、方法、配置属性等)的契约在后续版本中可能会以向后不兼容的方式更改,甚至可能被移除。

The usual compatibility policy does not apply: the contract of incubating elements (e.g. types, methods, configuration properties, etc.) may be altered in a backward-incompatible way — or even removed — in subsequent releases.

我们建议您使用孵化特性,以便开发团队可以收集反馈并对其进行改进,但在需要时您应做好更新依赖于这些特性的代码的准备。

You are encouraged to use incubating features so the development team can get feedback and improve them, but you should be prepared to update code which relies on them as needed.

19.3.1. Basics

outbox-polling 策略通过应用程序数据库中的 additional tables 实现协调。

The outbox-polling strategy implements coordination through additional tables in the application database.

Explicit and listener-triggered indexing 是通过在与实体更改相同的事务中将事件推送到 outbox 表,并由执行索引的后台处理器轮询该 outbox 表来实现的。

Explicit and listener-triggered indexing are implemented by pushing events to an outbox table within the same transaction as the entity changes, and polling this outbox table from background processors which perform indexing.

该策略可以提供无论后端中是否存在临时 I/O 错误均会索引实体的保证,代价是只能异步执行此索引。

This strategy is able to provide guarantees that entities will be indexed regardless of temporary I/O errors in the backend, at the cost of only being able to perform this indexing asynchronously.

outbox-polling 策略可以通过以下设置来启用:

The outbox-polling strategy can be enabled with the following settings:

hibernate.search.coordination.strategy = outbox-polling

如果已启用 multi-tenancy ,您将需要额外的配置。

If multi-tenancy is enabled, you will need extra configuration.

请参阅 Multi-tenancy

See Multi-tenancy.

您还需要添加此依赖关系:

You will also need to add this dependency:

<dependency>
   <groupId>org.hibernate.search</groupId>
   <artifactId>hibernate-search-mapper-orm-outbox-polling</artifactId>
   <version>7.2.0.Alpha2</version>
</dependency>

协调是一个复杂的话题,它解决的问题乍看之下可能不清楚。您可能会发现从更高级的角度来理解协调更容易。

Coordination is a complex topic, and the problems it solves may appear unclear at first sight. You may find it easier to approach coordination from a higher level.

有关涉及不同协调策略的一些示例体系结构和每个体系结构之间的差异摘要,请参阅 Examples of architectures

For a few example architectures involving different coordination strategies and a summary of the differences between each architecture, see Examples of architectures.

有关协调策略差异的摘要(特别是在监听器触发索引的上下文中),请参阅 Basics

For a summary of the differences between coordination strategies specifically in the context of listener-triggered indexing, see Basics.

19.3.2. How indexing works with outbox-polling coordination

Changes have to occur in the ORM session in order to trigger indexing listeners

Associations must be updated on both sides

Only relevant changes trigger indexing

详情请参阅 Dirty checking

See Dirty checking for more details.

Indexing happens in a background thread

当 Hibernate ORM 会话发生刷新时,Hibernate Search 将在同一个 Hibernate ORM 会话和同一个事务中保留实体更改事件。

When a Hibernate ORM session is flushed, Hibernate Search will persist entity change events within the same Hibernate ORM session and the same transaction.

一个 event processor 轮询数据库以获取新的实体更改事件,并在找到新事件(即在事务提交后)时异步执行适当实体的重新索引。

An event processor polls the database for new entity change events, and asynchronously performs reindexing of the appropriate entities when it finds new events (i.e. after the transaction is committed).

由于这些事件是在会话刷新时保留的,这意味着您可以在 flush() 后安全地 clear() 会话:截至刷新时检测到的实体更改事件都将被正确保留。

The fact that events are persisted on session flush means that you can safely clear() the session after a flush(): entity changes events detected up to the flush will be persisted correctly.

如果您之前使用的是 Hibernate Search 5 或早期版本,您会看到这是一个显著改进:不再需要在事务中间调用 flushToIndexes() 并更新索引。

If you come from Hibernate Search 5 or earlier, you may see this as a significant improvement: there is no need to call flushToIndexes() and update indexes in the middle of a transaction anymore.

The background processor will completely reload entities from the database

负责重新索引实体的后台处理器无法访问 first level cache 在实体发生更改时的状态,因为它是发生在不同的会话中。

The background processor responsible for reindexing entities does not have access to the state of the first level cache when the entity change occurred, because it occurred in a different session.

这意味着每次实体发生更改并且必须重新编入索引时,后台进程将完整加载该实体。根据您的映射,它可能还需要加载到其他实体的惰性关联。

This means each time an entity changes and has to be re-indexed, the background process will load that entity in full. Depending on your mapping, it may also need to load lazy associations to other entities.

这种额外开销可以通过以下方式在一定程度上缓解:

This extra cost can be mitigated to some extent by:

充分利用 Hibernate ORM 的批获取;请参阅 the batch_fetch_size property 和 the @BatchSize annotation。

leveraging Hibernate ORM’s batch fetching; see the batch_fetch_size property and the @BatchSize annotation.

利用 Hibernate ORM 的 second-level cache,特别是对于从索引实体引用的不可变实体(例如,对于引用数据,如国家/地区、城市……​)。

leveraging Hibernate ORM’s second-level cache, especially for immutable entities referenced from indexed entities (e.g. for reference data such as countries, cities, …​).

Indexing is guaranteed on transaction commit

当实体更改发生在事务中时,Hibernate Search 将在同一个事务中保留实体更改事件。

When entity changes happen inside a transaction, Hibernate Search will persist entity change events within the same transaction.

如果提交了该事务,那么这些事件也将被提交;如果回滚了该事务,那么这些事件也将被回滚。这保证了这些事件最终将由后台线程处理,并且索引将得到相应更新,但仅在(如果)事务成功时才执行上述操作。

If the transaction is committed, these events will be committed as well; if it rolls back, the events will be rolled back as well. This guarantees the events will eventually be processed by a background thread and that the indexes will be updated accordingly, but only when (if) the transaction succeeds.

当实体更改发生在任何事务之外时(不推荐),事件将在会话 flush() 后立即发送。如果没有该刷新,索引将不会自动更新。

When entity changes happen outside any transaction (not recommended), events are sent immediately after the session flush(). Without that flush, indexes will not be updated automatically.

Index changes will not be visible immediately

默认情况下,应用程序线程将在实体变更事件提交到数据库后恢复。这意味着这些变更已安全地存储到磁盘,但这并不意味着线程恢复后立即运行的搜索查询会考虑这些变更: indexing will happen at a later time, asynchronously, in a background processor

By default, the application thread will resume after entity change events are committed to the database. This means these changes are safely stored to disk, but this does not mean a search query ran immediately when the thread resumes will take the changes into account: indexing will happen at a later time, asynchronously, in a background processor.

你可以 configure this event processor 运行更频繁,但它仍将保持异步。

You can configure this event processor to run more often, but it will remain asynchronous.
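
下面是一个简要示意,说明在 outbox-polling 协调下,提交后立即执行的搜索可能还看不到更改,需要等待后台事件处理器完成索引;轮询循环仅用于演示,Book 实体为示例假设。

A minimal sketch showing that, with outbox-polling coordination, a search right after the commit may not see the change yet, and one has to wait for the background event processor; the polling loop is for illustration only and Book is a hypothetical indexed entity.

// Only the outbox event is committed with the transaction; indexing happens later, in the background.
entityManager.getTransaction().begin();
entityManager.persist( new Book( "The Caves of Steel" ) );
entityManager.getTransaction().commit();

SearchSession searchSession = Search.session( entityManager );
long hits = 0;
while ( hits == 0 ) {
    Thread.sleep( 100 ); // roughly the event processor's polling interval (InterruptedException handling omitted)
    hits = searchSession.search( Book.class )
            .where( f -> f.match().field( "title" ).matching( "steel" ) )
            .fetchTotalHitCount();
}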

19.3.3. Impact on the database schema

Basics

outbox-polling 协调策略需要将数据存储在应用数据库中的附加表中,以便这些数据能够被后台线程使用。

The outbox-polling coordination strategy needs to store data in additional tables in the application database, so that this data can be consumed by background threads.

这些附加表特别包括一个 Outbox 表,其中每当以需要重新索引的方式更改某个实体时,都会推送一行(表示一个更改事件)至该 Outbox 表。

This includes in particular an outbox table, to which one row (representing a change event) is pushed every time an entity is changed in a way that requires reindexing.

这也包括一个代理表,其中 Hibernate Search 注册了每个后台事件处理器以 dynamically assign shards 到每个应用程序实例,或仅仅检查 statically assigned shards 是否一致。

This also includes an agent table, where Hibernate Search registers every background event processor in order to dynamically assign shards to each application instance, or simply to check that statically assigned shards are consistent.

通过自动添加到 Hibernate ORM 配置的实体来访问这些表,因此当依赖于 Hibernate ORM 的 automatic schema generation 时,应该自动生成它们。

These tables are accessed through entities that are automatically added to the Hibernate ORM configuration, and as such they should be automatically generated when relying on Hibernate ORM’s automatic schema generation.

如果你需要将这些表的创建/删除过程集成到自己的脚本中,最简单的解决方案是让 Hibernate ORM 为整个架构生成 DDL 脚本,然后复制以 HSEARCH_ 为前缀的所有与结构(表、序列、…)相关的内容。请参阅 automatic schema generation,尤其是 Hibernate ORM 属性 javax.persistence.schema-generation.scripts.action、javax.persistence.schema-generation.scripts.create-target 和 javax.persistence.schema-generation.scripts.drop-target。

If you need to integrate the creation/dropping of these tables to your own script, the easiest solution is to have Hibernate ORM generate DDL scripts for your whole schema and copy everything related to constructs (tables, sequences, …​) prefixed with HSEARCH_. See automatic schema generation, in particular the Hibernate ORM properties javax.persistence.schema-generation.scripts.action, javax.persistence.schema-generation.scripts.create-target and javax.persistence.schema-generation.scripts.drop-target.
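
下面是一个简要示意,展示如何通过 Jakarta Persistence 的模式生成功能生成 DDL 脚本,再从中复制以 HSEARCH_ 为前缀的结构;持久化单元名称与输出路径均为示例假设,使用旧的 javax 命名空间时请改用上文提到的 javax.persistence.* 属性前缀。

A minimal sketch of generating DDL scripts through Jakarta Persistence schema generation and then copying the HSEARCH_-prefixed constructs from them; the persistence unit name and output path are assumptions, and with older javax-based setups the javax.persistence.* property prefix mentioned above applies instead.

// Assumes jakarta.persistence.Persistence and a persistence unit named "my-persistence-unit".
Map<String, Object> settings = new HashMap<>();
settings.put( "jakarta.persistence.schema-generation.scripts.action", "create" );
settings.put( "jakarta.persistence.schema-generation.scripts.create-target", "target/create.sql" );
Persistence.generateSchema( "my-persistence-unit", settings );
// Then copy the statements for tables/sequences prefixed with HSEARCH_ into your own scripts.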

Custom schema/table name/etc.

默认情况下,在上一节中提到的 outbox 表和代理程序表应该在默认目录/架构中找到,并且使用大写表名,前缀为 HSEARCH。用于这些表的标识生成器的名称以前缀 HSEARCH 开头,并以 _GENERATOR 结尾。

By default, outbox and agent tables, mentioned in the previous section, are expected to be found in the default catalog/schema, and are using uppercased table names prefixed with HSEARCH. Identity generator names used for these tables are prefixed with HSEARCH and suffixed with _GENERATOR.

有时,数据库对象有特定的命名约定,或者需要将领域表和技术表分开。为了在这方面提供一定的灵活性,Hibernate Search 提供了一组配置属性,用于为 outbox 事件表和代理表指定目录/架构/表名,以及自定义 UUID 生成器策略/数据类型:

Sometimes there are specific naming conventions for database objects, or a need to separate the domain and technical tables. To allow some flexibility in this area, Hibernate Search provides a set of configuration properties to specify catalog/schema/table names and a custom UUID generator strategy/data type for outbox event and agent tables:

# Configure the agent mapping:
hibernate.search.coordination.entity.mapping.agent.catalog=CUSTOM_CATALOG
hibernate.search.coordination.entity.mapping.agent.schema=CUSTOM_SCHEMA
hibernate.search.coordination.entity.mapping.agent.table=CUSTOM_AGENT_TABLE
hibernate.search.coordination.entity.mapping.agent.uuid_gen_strategy=time
hibernate.search.coordination.entity.mapping.agent.uuid_type=BINARY
# Configure the outbox event mapping:
hibernate.search.coordination.entity.mapping.outboxevent.catalog=CUSTOM_CATALOG
hibernate.search.coordination.entity.mapping.outboxevent.schema=CUSTOM_SCHEMA
hibernate.search.coordination.entity.mapping.outboxevent.table=CUSTOM_OUTBOX_TABLE
hibernate.search.coordination.entity.mapping.outboxevent.uuid_gen_strategy=time
hibernate.search.coordination.entity.mapping.outboxevent.uuid_type=BINARY
  1. agent.catalog defines the database catalog to use for the agent table.

默认为 Hibernate ORM 中配置的默认目录。

Defaults to the default catalog configured in Hibernate ORM.

  2. agent.schema defines the database schema to use for the agent table.

默认为 Hibernate ORM 中配置的默认架构。

Defaults to the default schema configured in Hibernate ORM.

  3. agent.table defines the name of the agent table.

默认为 HSEARCH_AGENT

Defaults to HSEARCH_AGENT.

  4. agent.uuid_gen_strategy defines the name of the UUID generator strategy used for the agent table. Available options are auto/random/time. auto is the default and is the same as random, which uses UUID#randomUUID(). time is an IP-based strategy consistent with IETF RFC 4122.

默认为 auto

Defaults to auto.

  5. agent.uuid_type defines the name of the Hibernate SqlType used for representing a UUID in the agent table. Hibernate Search provides a special default option that is used by default and will result in one of UUID/BINARY/CHAR depending on the database in use. While currently Hibernate Search uses the dialect’s default representation of the UUID in the database, this is not guaranteed. If a specific type is required, it is best to provide it explicitly via this property. Please refer to SqlTypes for the list of available type codes supported by Hibernate ORM. The SQL type code can be passed as the name of a corresponding constant in org.hibernate.type.SqlTypes or as an integer value.

默认为 default

Defaults to default.

  6. outboxevent.catalog defines the database catalog to use for the outbox event table.

默认为 Hibernate ORM 中配置的默认目录。

Defaults to the default catalog configured in Hibernate ORM.

  7. outboxevent.schema defines the database schema to use for the outbox event table.

默认为 Hibernate ORM 中配置的默认架构。

Defaults to the default schema configured in Hibernate ORM.

  8. outboxevent.table defines the name of the outbox events table.

默认为 HSEARCH_OUTBOX_EVENT

Defaults to HSEARCH_OUTBOX_EVENT.

  9. outboxevent.uuid_gen_strategy defines the name of the UUID generator strategy used for the outbox event table. Available options are auto/random/time. auto is the default and is the same as random, which uses UUID#randomUUID(). time is an IP-based strategy consistent with IETF RFC 4122.

默认为 auto

Defaults to auto.

  10. outboxevent.uuid_type defines the name of the Hibernate SqlType used for representing a UUID in the outbox event table. Hibernate Search provides a special default option that is used by default and will result in one of UUID/BINARY/CHAR depending on the database in use. While currently Hibernate Search uses the dialect’s default representation of the UUID in the database, this is not guaranteed. If a specific type is required, it is best to provide it explicitly via this property. Please refer to SqlTypes for the list of available type codes supported by Hibernate ORM. The SQL type code can be passed as the name of a corresponding constant in org.hibernate.type.SqlTypes or as an integer value.

默认为 default

Defaults to default.

如果应用程序依赖于 automatic database schema generation,请确保在指定目录/模式时,底层数据库支持它们。此外,检查名称长度和大小写敏感性是否有任何限制。

If your application relies on automatic database schema generation, make sure that the underlying database supports catalogs/schemas when specifying them. Also check if there are any constraints on name length and case sensitivity.

没有必要同时提供所有属性。例如,你只能自定义架构。未指定的属性将使用它们的默认值。

It is not required to provide all properties at the same time. For example, you can customize the schema only. Unspecified properties will use their defaults.

19.3.4. Sharding and pulse

为了避免在不同的应用程序节点上多次不必要地索引同一个实体,Hibernate Search 将实体划分到它所称的“分片”(shard)中:

In order to avoid unnecessarily indexing the same entity multiple times on different application nodes, Hibernate Search partitions the entities in what it calls "shards":

  1. Each entity belongs to exactly one shard.

  2. Each application node involved in event processing is uniquely assigned one or more shards, and will only process events related to entities in these shards.

为了可靠地分配分片,Hibernate Search adds an agent table to the database,并使用该表注册了参与索引的代理(大多数情况下,一个应用程序实例 = 一个代理 = event processor)。注册的代理形成一个集群。

In order to reliably assign shards, Hibernate Search adds an agent table to the database, and uses that table to register agents involved in indexing (most of the time, one application instance = one agent = the event processor). The registered agents form a cluster.

为了确保每个代理程序始终被分配一个分片,并且一个分片绝不会分配给多个代理程序,每个代理程序将定期执行“脉冲”,更新并检查代理表。

To make sure that agents are always assigned one shard, and that one shard is never assigned to more than one agent, each agent will periodically perform a "pulse", updating and inspecting the agent table.

脉冲期间发生的操作取决于代理程序的类型。在“脉冲”期间:

What happens during a pulse depends on the type of agent. During a "pulse":

  1. An event processor will:

更新其在代理表中的自己的条目,让其他代理知道它仍然处于活动状态;

update its own entry in the agent table, to let other agents know it’s still active;

如果检测到这些代理已过期(长时间未更新其条目),则强制删除其他代理的条目;

forcibly remove entries from other agents if it detects that these agents expired (did not update their entry for a long time);

如果使用 static sharding,则检测和报告配置错误,例如,两个代理分配到同一个分片;

detect and report configuration mistakes if using static sharding, e.g. two agents assigned to the same shard;

如果 mass indexer正在运行,则决定挂起自身;

decide to suspend itself if a mass indexer is running;

如果使用 dynamic sharding,则根据需要触发重新平衡;例如,当检测到新的代理最近加入集群或代理离开集群(自愿或被迫)时。

trigger rebalancing as necessary if using dynamic sharding; e.g. when it detects that new agents recently joined the cluster, or that agents left the cluster (voluntarily or forcibly).

  2. A mass indexer will:

更新其在代理表中的自己的条目,让其他代理知道它仍然处于活动状态;

update its own entry in the agent table, to let other agents know it’s still active;

如果检测到这些代理已过期(长时间未更新其条目),则强制删除其他代理的条目;

forcibly remove entries from other agents if it detects that these agents expired (did not update their entry for a long time);

如果注意到一些 event processors仍在运行,则切换到主动等待模式(频繁轮询);

switch to active waiting mode (frequent polling) if it notices some event processors are still running;

如果注意到不再运行任何 event processors,则切换到脉冲模式(不频繁轮询)并为启动大规模索引开启绿灯。

switch to pulse-only mode (infrequent polling) and give the green light for mass indexing to start if it notices no event processors are running anymore.

有关动态和静态分片的详细信息,请参阅以下部分。

For more details about dynamic and static sharding, see the following sections.

19.3.5. Event processor

Basics

在后台执行的 outbox-polling 协调策略的各个代理中,最重要的是事件处理器:它轮询 outbox 表以获取事件,并在发现新事件时重新索引相应的实体。

Among agents of the outbox-polling coordination strategy executing in the background, the most important one is the event processor: it polls the outbox table for events and then re-indexes the corresponding entities when new events are found.

使用以下配置属性可以配置事件处理器:

The event processor can be configured using the following configuration properties:

hibernate.search.coordination.event_processor.enabled = true
hibernate.search.coordination.event_processor.polling_interval = 100
hibernate.search.coordination.event_processor.pulse_interval = 2000
hibernate.search.coordination.event_processor.pulse_expiration = 30000
hibernate.search.coordination.event_processor.batch_size = 50
hibernate.search.coordination.event_processor.transaction_timeout = 10
hibernate.search.coordination.event_processor.retry_delay = 15
  1. event_processor.enabled defines whether the event processor is enabled, as a boolean value. The default for this property is true, but it can be set to false to disable event processing on some application nodes, for example to dedicate some nodes to HTTP request processing and other nodes to event processing.

  2. event_processor.polling_interval defines how long to wait for another query to the outbox events table after a query didn’t return any event, as an integer value in milliseconds. The default for this property is 100.

值越高,实体更改与索引中对应的更新之间的延迟越大,但当没有要处理的事件时,对数据库的压力越小。

High values mean higher latency between an entity change and the corresponding update in the index, but less stress on the database when there are no events to process.

较小值表示实体更改和索引中的相应更新之间的延迟较低,但当没有事件需要处理时,数据库压力较大。

Low values mean lower latency between an entity change and the corresponding update in the index, but more stress on the database when there are no events to process.

  3. event_processor.pulse_interval defines how long the event processor can poll for events before it must perform a "pulse", as an integer value in milliseconds. The default for this property is 2000.

有关“脉冲”的信息,请参阅 the sharding basics

See the sharding basics for information about "pulses".

脉冲间隔必须设置为轮询间隔(见上文)和过期间隔(见下文)的三分之一 (1/3) 之间的值。

The pulse interval must be set to a value between the polling interval (see above) and one third (1/3) of the expiration interval (see below).

较小值(更接近轮询间隔)表示节点加入或离开集群时,不处理事件而浪费的时间更少,降低了由于事件处理器被错误地认为已断开连接而不处理事件而浪费时间的风险,但由于更频繁地检查代理列表,加大了数据库压力。

Low values (closer to the polling interval) mean less time wasted not processing events when a node joins or leaves the cluster, and reduced risk of wasting time not processing events because an event processor is incorrectly considered disconnected, but more stress on the database because of more frequent checks of the list of agents.

较大值(更接近过期间隔)表示节点加入或离开集群时,不处理事件而浪费的时间更多,并且因事件处理器被错误地认为已断开连接而不处理事件而浪费时间风险增加,但由于不太频繁地检查代理列表,减轻了数据库压力。

High values (closer to the expiration interval) mean more time wasted not processing events when a node joins or leaves the cluster, and increased risk of wasting time not processing events because an event processor is incorrectly considered disconnected, but less stress on the database because of less frequent checks of the list of agents.

  4. event_processor.pulse_expiration defines how long an event processor "pulse" remains valid before considering the processor disconnected and forcibly removing it from the cluster, as an integer value in milliseconds. The default for this property is 30000.

有关“脉冲”的信息,请参阅 the sharding basics

See the sharding basics for information about "pulses".

过期间隔必须设置为至少比脉冲间隔(见上文)大 3 倍的值。

The expiration interval must be set to a value at least 3 times larger than the pulse interval (see above).

较小值(更接近脉冲间隔)表示节点因崩溃或网络故障而突然离开集群时,不处理事件而浪费的时间更少,但由于事件处理器被错误地认为已断开连接而不处理事件而浪费时间的风险增加。

Low values (closer to the pulse interval) mean less time wasted not processing events when a node abruptly leaves the cluster due to a crash or network failure, but increased risk of wasting time not processing events because an event processor is incorrectly considered disconnected.

较大值(远大于脉冲间隔)表示节点因崩溃或网络故障而突然离开集群时,不处理事件而浪费的时间更多,但由于事件处理器被错误地认为已断开连接而不处理事件而浪费时间风险降低。

High values (much larger than the pulse interval) mean more time wasted not processing events when a node abruptly leaves the cluster due to a crash or network failure, but reduced risk of wasting time not processing events because an event processor is incorrectly considered disconnected.

  5. event_processor.batch_size defines how many outbox events, at most, are processed in a single transaction as an integer value. The default for this property is 50.

较大的值表示后台进程打开的事务数量较少,且由于一级缓存(持久性上下文)可能会提高性能,但内存使用量会增加,极端情况下可能导致 OutOfMemoryErrors

High values mean a lower number of transactions opened by the background process and may increase performance thanks to the first-level cache (persistence context), but will increase memory usage and in extreme cases may lead to OutOfMemoryErrors.

  6. event_processor.transaction_timeout defines the timeout for transactions processing outbox events as an integer value in seconds.

仅在配置 JTA 事务管理器时有效。

Only effective when a JTA transaction manager is configured.

使用 JTA 但未设置此属性时,Hibernate Search 将使用 JTA 事务管理器中配置的任何默认事务超时。

When using JTA and this property is not set, Hibernate Search will use whatever default transaction timeout is configured in the JTA transaction manager.

  7. event_processor.retry_delay defines how long the event processor must wait before re-processing an event after its processing failed, as a positive integer value in seconds. The default for this property is 30.

使用值 0 以立即重新处理失败的事件,无延迟。

Use the value 0 to reprocess failed events as soon as possible, with no delay.

Sharding

默认情况下, sharding 是动态的:Hibernate Search 在数据库中注册每个应用程序实例,并使用该信息为每个应用程序实例动态分配一个唯一的单一分片,并在实例启动或停止时更新分配。动态分片除了 basics 之外不接受任何配置。

By default, sharding is dynamic: Hibernate Search registers each application instance in the database, and uses that information to dynamically assign a single, unique shard to each application instance, updating assignments as instances start or stop. Dynamic sharding does not accept any configuration beyond the basics.

如果您希望明确配置分片,则可以通过设置以下配置属性使用静态分片:

If you want to configure sharding explicitly, you can use static sharding by setting the following configuration properties:

hibernate.search.coordination.event_processor.shards.total_count = 4
hibernate.search.coordination.event_processor.shards.assigned = 0
  1. shards.total_count defines the total number of shards as an integer value. This property has no default and must be set explicitly if you want static sharding. It must be set to the same value on all application nodes with assigned shards. When this property is set, shards.assigned must also be set

  2. shards.assigned defines the shards assigned to the application node as an integer value, or multiple comma-separated integer values. This property has no default and must be set explicitly if you want static sharding. When this property is set, shards.total_count must also be set.

分片由 [0, total_count - 1] 范围内的索引引用(有关 total_count,请见上文)。给定应用程序节点必须被分配至少一个分片,但可以通过将 shards.assigned 设置为逗号分隔列表(例如 0,3)来分配多个分片。

Shards are referred to by an index in the range [0, total_count - 1] (see above for total_count). A given application node must be assigned at least one shard but may be assigned multiple shards by setting shards.assigned to a comma-separated list, e.g. 0,3.

每个分片必须分配给一个且仅一个应用程序节点。

Each shard must be assigned to one and only one application node.

在每个分片都恰好分配给一个节点之前,事件处理根本不会启动。

Event processing simply won’t start until every shard has exactly one node.

示例 441。静态分片设置示例

Example 441. Example of static sharding settings

# Node #0
hibernate.search.coordination.strategy = outbox-polling
hibernate.search.coordination.event_processor.shards.total_count = 2
hibernate.search.coordination.event_processor.shards.assigned = 0
# Node #1
hibernate.search.coordination.strategy = outbox-polling
hibernate.search.coordination.event_processor.shards.total_count = 2
hibernate.search.coordination.event_processor.shards.assigned = 1
# Node #2
hibernate.search.coordination.strategy = outbox-polling
hibernate.search.coordination.event_processor.enabled = false
# Node #3
hibernate.search.coordination.strategy = outbox-polling
hibernate.search.coordination.event_processor.enabled = false

Processing order

可以通过配置属性调整 outbox 事件的处理顺序:

The order in which outbox events are processed can be tuned with a configuration property:

hibernate.search.coordination.event_processor.order = auto

可用的选项为:

Available options are:

auto

默认值。

The default value.

根据方言和其他设置选择最安全、最合适的方式:

Picks the safest, most appropriate order based on the dialect and other settings:

在对出站事件使用基于时间的 UUID 时(请参见 Impact on the database schema),选择 id。

When using time-based UUIDs for outbox events (see Impact on the database schema), picks id.

否则,如果使用 Microsoft SQL Server 方言,则选择 none。

Otherwise, if using a Microsoft SQL Server dialect, picks none.

否则,选择 time

Otherwise, picks time.

none

不按特定顺序处理出站信箱事件。

Process outbox events in no particular order.

这实际上表示事件将按特定于数据库且不确定的顺序使用。

This essentially means events will be consumed in a database-specific, undetermined order.

在具有多个事件处理器的情况下,这将降低由事务死锁(尤其是 Microsoft SQL Server 中的死锁)造成的后台故障的频率,这在技术上并不会“修复”事件处理(无论如何,这些故障都会通过重试自动处理),但可能会提高性能,并减少日志中的不必要噪音。

In setups with multiple event processors, this reduces the rate of background failures caused by transaction deadlocks (in particular with Microsoft SQL Server), which does not technically "fix" event processing (those failures are handled automatically by trying again anyway), but may improve performance and reduce unnecessary noise in logs.

但是,这可能会导致这样的情况,即由于更新事件在特定事件之前经过处理,因此连续推迟该特定事件的处理,这在事件队列从不为空的写入密集型场景中可能是个问题。

However, this may lead to situations where the processing of one particular event is continuously postponed due to newer events being processed before that particular event, which can be a problem in write-intensive scenarios where the event queue is never empty.

time

按“时间”顺序处理出站信箱事件,即按创建事件的顺序。

Process outbox events in "time" order, i.e. in the order events are created.

这确保了事件或多或少按照创建它们的顺序进行处理,并避免了由于新事件在特定事件之前经过处理而持续推迟特定事件处理的情况。

This ensures events are processed more or less in the order they were created and avoids situations where the processing of one particular event is continuously postponed due to newer events being processed before that particular event.

但是,在具有多个事件处理程序的设置中,这可能会增加由事务死锁(尤其是 Microsoft SQL Server 中的死锁)引起的后台故障的比率,这在技术上不会中断事件处理(这些故障是通过重新尝试自动处理的),但可能会降低性能,并导致日志中出现不必要的噪音。

However, in setups with multiple event processors, this may increase the rate of background failures caused by transaction deadlocks (in particular with Microsoft SQL Server), which does not technically break event processing (those failures are handled automatically by trying again anyway), but may reduce performance and lead to unnecessary noise in logs.

id

按标识符顺序处理出站信箱事件。

Process outbox events in identifier order.

如果发件箱事件标识符是基于时间的 UUID(参见 Impact on the database schema),则其行为与 time 类似,但没有死锁风险。

If outbox event identifiers are time-based UUIDs (see Impact on the database schema), this behaves similarly to time, but without the risk of deadlocks.

如果发件箱事件标识符是随机 UUID(参见 Impact on the database schema),则其行为与 none 类似。

If outbox event identifiers are random UUIDs (see Impact on the database schema), this behaves similarly to none.

19.3.6. Mass indexer

Basics

在 mass indexing 期间,应用程序实例会例外地绕过 sharding,对任何分片中的实体进行索引。

During mass indexing, an application instance will exceptionally bypass sharding and index entities from any shard.

绕过分片可能会非常危险,因为在某些罕见情况下,同时从事件处理器和批量索引器对同一个实体进行索引可能会导致索引不同步。这就是为什么为了确保完全安全,在批量索引进行期间会暂停事件处理。事件仍然会产生并持久化,但它们的处理会延迟到批量索引完成为止。

Bypassing sharding can be dangerous, because indexing the same entity simultaneously from an event processor and the mass indexer could potentially result in an out-of-sync index in some rare situations. This is why, to be perfectly safe, event processing gets suspended while mass indexing is in progress. Events are still produced and persisted, but their processing gets delayed until mass indexing finishes.

通过在中介表中注册批量索引器中介来实现事件处理的暂停,最终事件处理器会检测中介并做出反应而自行暂停。批量索引完成后,批量索引器中介会从中介表中移除,事件处理器会检测该操作并恢复事件处理。

The suspension of event processing is achieved by registering a mass indexer agent in the agent table, which event processors will eventually detect and react to by suspending themselves. When mass indexing finishes, the mass indexer agent is removed from the agent table, event processors detect that and resume event processing.
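
下面是一个简要示意,展示如何启动批量索引:批量索引器会将自身注册为一个代理(agent),事件处理在其运行期间自动暂停,结束后自动恢复;其中的 Book 实体只是为说明而假设的。

A minimal sketch of starting mass indexing: the mass indexer registers itself as an agent, event processing is suspended automatically while it runs and resumes when it finishes; the Book entity is an assumption for illustration only.

// Assumes org.hibernate.search.mapper.orm.Search; startAndWait() throws InterruptedException.
SearchSession searchSession = Search.session( entityManager );
searchSession.massIndexer( Book.class )
        .startAndWait();
// From here on, outbox events that accumulated during mass indexing are processed again.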

可以使用以下配置属性配置批量索引器中介:

The mass indexer agent can be configured using the following configuration properties:

hibernate.search.coordination.mass_indexer.polling_interval = 100
hibernate.search.coordination.mass_indexer.pulse_interval = 2000
hibernate.search.coordination.mass_indexer.pulse_expiration = 30000
  1. mass_indexer.polling_interval defines how long to wait for another query to the agent table when actively waiting for event processors to suspend themselves, as an integer value in milliseconds. The default for this property is 100.

较低的值会缩短批量索引器中介检测到事件处理器最终自行暂停所需的时间,但在批量索引器中介处于活动等待状态时会增加数据库上的压力。

Low values will reduce the time it takes for the mass indexer agent to detect that event processors finally suspended themselves, but will increase the stress on the database while the mass indexer agent is actively waiting.

较高的值会增加批量索引器中介检测到事件处理器最终自行暂停所需的时间,但在批量索引器中介处于活动等待状态时会减小数据库上的压力。

High values will increase the time it takes for the mass indexer agent to detect that event processors finally suspended themselves, but will reduce the stress on the database while the mass indexer agent is actively waiting.

  2. mass_indexer.pulse_interval defines how long the mass indexer can wait before it must perform a "pulse", as an integer value in milliseconds. The default for this property is 2000.

有关“脉冲”的信息,请参阅 the sharding basics

See the sharding basics for information about "pulses".

脉冲间隔必须设置为轮询间隔(见上文)和过期间隔(见下文)的三分之一 (1/3) 之间的值。

The pulse interval must be set to a value between the polling interval (see above) and one third (1/3) of the expiration interval (see below).

较低的值(接近轮询间隔)意味着降低了批量索引器中介错误地被认为已断开连接,因此事件处理器在批量索引期间重新开始处理事件的风险,但由于更频繁地更新中介表中批量索引器中介的条目,因此会增加数据库上的压力。

Low values (closer to the polling interval) mean reduced risk of event processors starting to process events again during mass indexing because a mass indexer agent is incorrectly considered disconnected, but more stress on the database because of more frequent updates of the mass indexer agent’s entry in the agent table.

较高的值(接近过期间隔)意味着批量索引器中介错误地被认为已断开连接,因此事件处理器在批量索引期间重新开始处理事件的风险增加,但由于不太频繁地更新中介表中批量索引器中介的条目,因此会减小数据库上的压力。

High values (closer to the expiration interval) mean increased risk of event processors starting to process events again during mass indexing because a mass indexer agent is incorrectly considered disconnected, but less stress on the database because of less frequent updates of the mass indexer agent’s entry in the agent table.

  3. mass_indexer.pulse_expiration defines how long an event processor "pulse" remains valid before considering the processor disconnected and forcibly removing it from the cluster, as an integer value in milliseconds. The default for this property is 30000.

有关“脉冲”的信息,请参阅 the sharding basics

See the sharding basics for information about "pulses".

过期间隔必须设置为至少比脉冲间隔(见上文)大 3 倍的值。

The expiration interval must be set to a value at least 3 times larger than the pulse interval (see above).

较低的值(接近脉冲间隔)意味着当批量索引器中介由于崩溃而终止时,事件处理器不会处理事件所浪费的时间更少,但批量索引器中介错误地被认为已断开连接,因此事件处理器在批量索引期间重新开始处理事件的风险增加。

Low values (closer to the pulse interval) mean less time wasted with event processors not processing events when a mass indexer agent terminates due to a crash, but increased risk of event processors starting to process events again during mass indexing because a mass indexer agent is incorrectly considered disconnected.

较高的值(远大于脉冲间隔)意味着当批量索引器中介由于崩溃而终止时,事件处理器不会处理事件所浪费的时间更多,但批量索引器中介错误地被认为已断开连接,因此事件处理器在批量索引期间重新开始处理事件的风险降低。

High values (much larger than the pulse interval) mean more time wasted with event processors not processing events when a mass indexer agent terminates due to a crash, but reduced risk of event processors starting to process events again during mass indexing because a mass indexer agent is incorrectly considered disconnected.

19.3.7. Multi-tenancy

使用 outbox-polling 协调策略时,必须通过 hibernate.search.multi_tenancy.tenant_ids 配置属性显式列出所有租户标识符(见下方示例)。如果在该配置中遗漏了某个租户标识符,可能会导致事件在 outbox 表中不断堆积而永远不会被 processed,或者由于配置不完整而在 mass indexing 时抛出异常。

When using the outbox-polling coordination strategy, the full list of tenant identifiers must be provided explicitly through the hibernate.search.multi_tenancy.tenant_ids configuration property (see the example below). Failing to mention a tenant identifier in this configuration might result in events piling up in the outbox table without ever being processed, or in exceptions being thrown upon mass indexing due to incomplete configuration.

除此之外,多租户支持应该是相当透明的:Hibernate Search 将简单地为每个租户标识符复制事件处理器。

Apart from that, multi-tenancy support should be fairly transparent: Hibernate Search will simply duplicate event processors for each tenant identifier.

您可以使用不同的根节点来配置属性,从而对不同的租户使用不同的配置:

You can use different configuration for different tenants by using a different root for configuration properties:

  1. hibernate.search.coordination is the default root, whose properties will be used as a default for all tenants.

  2. hibernate.search.coordination.tenants.<tenant-identifier> is the tenant-specific root.

请参见下面的示例。

See below for an example.

示例 442。为单租户专用的节点配置多租户应用程序中的协调

Example 442. Configuration of coordination in a multi-tenant application with a node dedicated to a single tenant

# Configuration on nodes that should NOT process events for tenant1:
hibernate.search.multi_tenancy.tenant_ids=tenant1,tenant2,tenant3,tenant4
hibernate.search.coordination.strategy = outbox-polling
hibernate.search.coordination.tenants.tenant1.event_processor.enabled = false (1)
# Configuration on the node dedicated to processing events for tenant1:
hibernate.search.multi_tenancy.tenant_ids=tenant1,tenant2,tenant3,tenant4
hibernate.search.coordination.strategy = outbox-polling
hibernate.search.coordination.event_processor.enabled = false (1)
hibernate.search.coordination.tenants.tenant1.event_processor.enabled = true (2)

19.3.8. Aborted events

如果处理某个 outbox 事件时出现问题,事件处理器会尝试将该事件重新处理两次;之后,该事件将被标记为已中止(aborted)。已中止的事件不会再由事件处理器处理。

If something goes wrong while processing an outbox event, the event processor will try to re-process the event two times; after that, the event will be marked as aborted. Aborted events won’t be processed by the event processor.

Hibernate Search 公开一些 API,用于在中止的事件上执行操作。

Hibernate Search exposes some APIs to work on aborted events.

示例 443。针对已中止事件使用 API

Example 443. Use the API for aborted events

OutboxPollingSearchMapping searchMapping = Search.mapping( sessionFactory ).extension( OutboxPollingExtension.get() ); (1)
long count = searchMapping.countAbortedEvents(); (2)
searchMapping.reprocessAbortedEvents(); (3)
searchMapping.clearAllAbortedEvents(); (4)

如果启用了多租户,则需要传入租户标识符,以指定操作所针对的租户。

If multi-tenancy is enabled, it will be necessary to pass the tenant identifier to target the tenant on which the operations are performed.

示例 444。在多租户下针对已中止事件使用 API

Example 444. Use the API for aborted events with multi-tenancy

long count = searchMapping.countAbortedEvents( tenantId ); (1)
searchMapping.reprocessAbortedEvents( tenantId ); (2)
searchMapping.clearAllAbortedEvents( tenantId ); (3)