Hibernate Search Operations Guide
14. Indexing entities
14.1. Basics
There are multiple ways to index entities in Hibernate Search.
If you want to get to know the most popular ones, head directly to the following sections:
-
To keep indexes synchronized transparently as entities change in a Hibernate ORM Session, see listener-triggered indexing.
-
To index a large amount of data — for example the whole database, when adding Hibernate Search to an existing application — see the MassIndexer.
Otherwise, the following table may help you figure out what’s best for your use case.
Table 6. Comparison of indexing methods
| Name and link | Use case | API | Mapper |
|---|---|---|---|
| Listener-triggered indexing | Handle incremental changes in application transactions | None: works implicitly without API calls | Hibernate ORM integration only |
| MassIndexer | Reindex large volumes of data in batches | Specific to Hibernate Search | |
| Jakarta Batch mass indexing job | Reindex large volumes of data in batches | Jakarta EE standard | Hibernate ORM integration only |
14.2. Indexing plans
14.2.1. Basics
For listener-triggered indexing as well as some forms of explicit indexing, Hibernate Search relies on an "indexing plan" to aggregate "entity change" events and infer the resulting indexing operations to execute.
Here is how indexing plans work at a high level:
-
While the application performs entity changes, entity change events (entity created, updated, deleted) are added to the plan. For listener-triggered indexing (Hibernate ORM integration only) this happens implicitly as changes are performed, but it can also be done explicitly.
-
Eventually, the application decides changes are complete, and the plan processes change events added so far, either inferring which entities need to be reindexed and building the corresponding documents (no coordination) or building events to be sent to the outbox (outbox-polling coordination). For the Hibernate ORM integration this happens when the Hibernate ORM Session gets flushed (explicitly or as part of a transaction commit), while for the Standalone POJO Mapper this happens when the SearchSession is closed.
-
Finally the plan gets executed, triggering indexing, potentially asynchronously. For the Hibernate ORM integration this happens on transaction commit, while for the Standalone POJO Mapper this happens when the SearchSession is closed.
Below is a summary of key characteristics of indexing plans and how they vary depending on the configured coordination strategy.
Table 7. Comparison of indexing plans depending on the coordination strategy
| Coordination strategy | No coordination (default) | Outbox polling |
|---|---|---|
| Guarantee of index updates | Non-transactional, after the database transaction / _SearchSession.close()_ returns | Transactional, on database transaction commit |
| Visibility of index updates | | |
| Overhead for application threads | | |
| Overhead for the database (Hibernate ORM integration only) | | |
14.2.2. Synchronization with the indexes
Basics
For a preliminary introduction to writing to and reading from indexes in Hibernate Search, including in particular the concepts of commit and refresh, see Commit and refresh.
When using the outbox-polling coordination strategy, the actual indexing plan performing the index changes is created asynchronously in a background thread. Because of that, with that coordination strategy it does not make sense to set a non-default indexing plan synchronization strategy, and doing so will lead to an exception on startup.
When a transaction is committed (Hibernate ORM integration) or the SearchSession is closed (Standalone POJO Mapper), with default coordination settings, the execution of the indexing plan (implicit (listener-triggered) or explicit) can block the application thread until indexing reaches a certain level of completion.
There are two main reasons for blocking the thread:
-
Indexed data safety: if, once the database transaction completes, index data must be safely stored to disk, an index commit is necessary. Without it, index changes may only be safe after a few seconds, when a periodic index commit happens in the background.
-
Real-time search queries: if, once the database transaction completes (for the Hibernate ORM integration) or the SearchSession's close() method returns (for the Standalone POJO Mapper), any search query must immediately take the index changes into account, an index refresh is necessary. Without it, index changes may only be visible after a few seconds, when a periodic index refresh happens in the background.
These two requirements are controlled by the synchronization strategy. The default strategy is defined by the configuration property hibernate.search.indexing.plan.synchronization.strategy. Below is a reference of all available strategies and their guarantees.
The last three columns describe the guarantees that hold when the application thread resumes.

| Strategy | Throughput | Changes applied (with or without commit) | Changes safe from crash/power loss (commit) | Changes visible on search (refresh) |
|---|---|---|---|---|
| async | Best | No guarantee | No guarantee | No guarantee |
| write-sync (default) | Medium | Guaranteed | Guaranteed | No guarantee |
| read-sync | Medium to worst | Guaranteed | No guarantee | Guaranteed |
| sync | Worst | Guaranteed | Guaranteed | Guaranteed |
Depending on the backend and its configuration, the _sync_ and _read-sync_ strategies may lead to poor indexing throughput, because the backend may not be designed for frequent, on-demand index refreshes.
This is why these strategies are only recommended if you know your backend is designed for them, or for integration tests. In particular, the sync strategy will work fine with the default configuration of the Lucene backend, but will perform poorly with the Elasticsearch backend.
Indexing failures may be reported differently depending on the chosen strategy:
Failure to extract data from entities:
Regardless of the strategy, throws an exception in the application thread.
Failure to apply index changes (i.e. I/O operations on the index):
For strategies that apply changes immediately: throws an exception in the application thread.
For strategies that do not apply changes immediately: forwards the failure to the failure handler, which by default will simply log the failure.
Failure to commit index changes:
For strategies that guarantee an index commit: throws an exception in the application thread.
For strategies that do not guarantee an index commit: forwards the failure to the failure handler, which by default will simply log the failure.
Per-session override
While the configuration property mentioned above defines a default, it is possible to override this default on a particular session by calling SearchSession#indexingPlanSynchronizationStrategy(…) and passing a different strategy.
The built-in strategies can be retrieved by calling:
-
IndexingPlanSynchronizationStrategy.async()
-
IndexingPlanSynchronizationStrategy.writeSync()
-
IndexingPlanSynchronizationStrategy.readSync()
-
or IndexingPlanSynchronizationStrategy.sync()
. Example 141. Overriding the indexing plan synchronization strategy
SearchSession searchSession = /* ... */ (1)
searchSession.indexingPlanSynchronizationStrategy(
IndexingPlanSynchronizationStrategy.sync()
); (2)
entityManager.getTransaction().begin();
try {
Book book = entityManager.find( Book.class, 1 );
book.setTitle( book.getTitle() + " (2nd edition)" ); (3)
entityManager.getTransaction().commit(); (4)
}
catch (RuntimeException e) {
entityManager.getTransaction().rollback();
}
List<Book> result = searchSession.search( Book.class )
.where( f -> f.match().field( "title" ).matching( "2nd edition" ) )
.fetchHits( 20 ); (5)
Custom strategy
You can also implement a custom strategy (a sketch follows below). The custom strategy can then be set just like the built-in strategies:
-
as the default by setting the configuration property hibernate.search.indexing.plan.synchronization.strategy to a bean reference pointing to the custom implementation, for example class:com.mycompany.MySynchronizationStrategy.
-
at the session level by passing an instance of the custom implementation to SearchSession#indexingPlanSynchronizationStrategy(…).
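As an illustration, here is a minimal sketch of a custom strategy; it assumes the IndexingPlanSynchronizationStrategy contract mentioned above, forces an index commit (like write-sync) but no refresh, and inspects the indexing future asynchronously instead of blocking the application thread.
searchSession.indexingPlanSynchronizationStrategy( context -> {
    // Force a commit so changes are safe once indexing completes...
    context.documentCommitStrategy( DocumentCommitStrategy.FORCE );
    // ...but do not force a refresh: changes become visible eventually.
    context.documentRefreshStrategy( DocumentRefreshStrategy.NONE );
    // Do not block the application thread; inspect the report asynchronously.
    context.indexingFutureHandler( future -> future.thenAccept( report -> {
        // e.g. log any failures found in the report
    } ) );
} );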
14.2.3. Indexing plan filter
Features detailed below are incubating: they are still under active development.
The usual compatibility policy does not apply: the contract of incubating elements (e.g. types, methods, configuration properties, etc.) may be altered in a backward-incompatible way — or even removed — in subsequent releases.
You are encouraged to use incubating features so the development team can get feedback and improve them, but you should be prepared to update code which relies on them as needed.
In some scenarios, it might be helpful to pause the explicit and listener-triggered indexing programmatically, for example, when importing larger amounts of data. Hibernate Search allows configuring application-wide and session-level filters to manage which types are tracked for changes and indexed.
. Example 142. Configuring an application-wide filter
SearchMapping searchMapping = /* ... */ (1)
searchMapping.indexingPlanFilter( (2)
ctx -> ctx.exclude( EntityA.class ) (3)
.include( EntityExtendsA2.class )
);
. Example 143. Configuring a session-level filter
SearchSession session = /* ... */ (1)
session.indexingPlanFilter(
ctx -> ctx.exclude( EntityA.class ) (2)
.include( EntityExtendsA2.class )
);
A filter can be defined by providing indexed and contained types, as well as their supertypes. Interfaces are not allowed, and passing an interface class to any of the filter definition methods will result in an exception. If dynamic types represented by a Map are used, then their names must be used to configure the filter. The filter rules are:
-
If the type A is explicitly included by the filter, then a change to an object that is exactly of a type A is processed.
-
If the type A is explicitly excluded by the filter, then a change to an object that is exactly of a type A is ignored.
-
If the type A is explicitly included by the filter, then a change to an object that is exactly of a type B, which is a subtype of the type A, is processed unless the filter explicitly excludes a more specific supertype of the type B.
-
If the type A is explicitly excluded by the filter, then a change to an object that is exactly of a type B, which is a subtype of the type A, is ignored unless the filter explicitly includes a more specific supertype of the type B.
A session-level filter takes precedence over an application-wide one. If the session-level filter configuration does not include/exclude the exact type of an entity, either explicitly or through inheritance, then the decision will be made by the application-wide filter. If the application-wide filter also has no explicit configuration for a type, then this type is considered to be included.
In some cases we might need to disable the indexing entirely. Listing all entities one by one might be cumbersome, but since filter configuration is implicitly applied to subtypes, .exclude(Object.class) can be used to exclude all types. Conversely, .include(Object.class) can be used to enable indexing within a session filter when the application-wide filter disables indexing completely.
. Example 144. Disable all indexing within a session
SearchSession searchSession = /* ... */ (1)
searchSession.indexingPlanFilter(
ctx -> ctx.exclude( Object.class ) (2)
);
. Example 145. Enable indexing in the session while application-wide indexing is paused
SearchMapping searchMapping = /* ... */ (1)
searchMapping.indexingPlanFilter(
ctx -> ctx.exclude( Object.class ) (2)
);
SearchSession searchSession = /* ... */ (3)
searchSession.indexingPlanFilter(
ctx -> ctx.include( Object.class ) (4)
);
Trying to configure the same type as both included and excluded at the same time by the same filter will lead to an exception being thrown.
Only an application-wide filter is safe to use with the outbox-polling coordination strategy. When this coordination strategy is in use, entities are loaded and indexed in a different session from the one where they were changed. This might lead to unexpected results, as the session where events are processed will not apply the filter configured by the session in which entities were modified. An exception will be thrown if such a session-level filter is configured, unless that filter excludes all the types, to prevent any unexpected consequences of configuring session-level filters with this coordination strategy.
14.3. Implicit, listener-triggered indexing
14.3.1. Basics
This feature is only available with the Hibernate ORM integration.
It cannot be used with the Standalone POJO Mapper in particular.
By default, every time an entity is changed through a Hibernate ORM Session, if the entity type is mapped to an index, Hibernate Search updates the relevant index transparently.
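For example, with listener-triggered indexing enabled, persisting an indexed entity requires no Hibernate Search API call at all; a minimal sketch, assuming Book is an @Indexed entity as in the examples below:
entityManager.getTransaction().begin();
Book book = new Book();
book.setTitle( "The Caves of Steel" );
entityManager.persist( book ); // Hibernate Search detects the new entity...
entityManager.getTransaction().commit(); // ...and indexes it on commit.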
Here is how listener-triggered indexing works at a high level:
-
When the Hibernate ORM Session gets flushed (explicitly or as part of a transaction commit), Hibernate ORM determines what exactly changed (entity created, updated, deleted) and forwards the information to Hibernate Search.
-
Hibernate Search adds this information to a (session-scoped) indexing plan and the plan processes change events added so far, either inferring which entities need to be reindexed and building the corresponding documents (no coordination) or building events to be sent to the outbox (outbox-polling coordination).
-
On database transaction commit, the plan gets executed, either sending the document indexing/deletion request to the backend (no coordination) or sending the events to the database (outbox-polling coordination).
Below is a summary of key characteristics of listener-triggered indexing and how they vary depending on the configured coordination strategy.
Follow the links for more details.
Table 8. Comparison of listener-triggered indexing depending on the coordination strategy
| Coordination strategy | No coordination (default) | Outbox polling |
|---|---|---|
| Detects changes occurring in ORM sessions (session.persist(…), session.delete(…), setters, …) | Yes | Yes |
| Detects changes caused by JPQL or SQL queries (insert/update/delete) | No | No |
| Associations must be updated on both sides | Yes | Yes |
| Changes triggering reindexing | Only relevant changes | Only relevant changes |
| Guarantee of index updates | Non-transactional, after the database transaction / _SearchSession.close()_ returns | Transactional, on database transaction commit |
| Visibility of index updates | | |
| Overhead for application threads | | |
| Overhead for the database | | |
14.3.2. Configuration
Listener-triggered indexing may be unnecessary if your index is read-only, or if you update it regularly by reindexing, either using the MassIndexer, the Jakarta Batch mass indexing job, or explicit indexing.
You can disable listener-triggered indexing by setting the configuration property hibernate.search.indexing.listeners.enabled to false.
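For instance, here is a minimal sketch of setting this property when bootstrapping JPA programmatically; the persistence unit name "my-pu" is a placeholder:
Map<String, Object> properties = new HashMap<>();
properties.put( "hibernate.search.indexing.listeners.enabled", false );
EntityManagerFactory entityManagerFactory =
        Persistence.createEntityManagerFactory( "my-pu", properties );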
As listener-triggered indexing uses indexing plans under the hood, several configuration options affecting indexing plans will affect listener-triggered indexing as well, in particular the synchronization strategy and the indexing plan filter described above.
14.3.3. In-session entity change detection and limitations
Hibernate Search uses internal events of Hibernate ORM in order to detect changes. These events will be triggered if you actually manipulate managed entity objects in your code: calls to session.persist(…), session.delete(…), to entity setters, etc.
This works great for most applications, but you need to consider some limitations:
-
Changes performed via JPQL or SQL insert, update or delete queries are not detected (see the table above).
-
Associations must be updated on both sides for Hibernate Search to detect the change (see the table above).
14.3.4. Dirty checking
Hibernate Search is aware of the entity properties that are accessed when building indexed documents. When processing Hibernate ORM entity change events, it is also aware of which properties actually changed. Thanks to that knowledge, it is able to detect which entity changes are actually relevant to indexing, and to skip reindexing when a property is modified but does not affect the indexed document.
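For instance, in the following sketch, assuming a hypothetical internalNotes property of Book that is not mapped to the index, the change would be detected but reindexing would be skipped:
entityManager.getTransaction().begin();
Book book = entityManager.find( Book.class, 1 );
// Hypothetical property that is not part of the indexed document:
book.setInternalNotes( "Reviewed for accuracy" );
entityManager.getTransaction().commit(); // No reindexing happens here.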
14.4. Indexing a large amount of data with the MassIndexer
14.4.1. Basics
There are cases where listener-triggered or explicit indexing is not enough, because pre-existing data has to be indexed:
-
when restoring a database backup;
-
when indexes had to be wiped, for example because the Hibernate Search mapping or some core settings changed;
-
when entities cannot be indexed as they change (e.g. with listener-triggered indexing) for performance reasons, and periodic reindexing (every night, …) is preferred.
To address these situations, Hibernate Search provides the MassIndexer: a tool to rebuild indexes completely based on the content of an external datastore (for the Hibernate ORM integration, that datastore is the database). The MassIndexer can be told to reindex a few selected indexed types, or all of them.
The MassIndexer takes the following approach to achieve a reasonably high throughput:
-
Indexes are purged completely when mass indexing starts.
-
Mass indexing is performed by several parallel threads, each loading data from the database and sending indexing requests to the indexes, not triggering any commit or refresh.
-
An implicit flush (commit) and refresh are performed upon mass indexing completion, except for Amazon OpenSearch Serverless, since it doesn’t support explicit flushes or refreshes.
Because of the initial index purge, and because mass indexing is a very resource-intensive operation, it is recommended to take your application offline while the MassIndexer is working.
Querying the index while a MassIndexer is busy may be slower than usual and will likely return incomplete results.
The following snippet of code will rebuild the index of all indexed entities, deleting the index and then reloading all entities from the database.
. Example 146. Reindexing everything using a MassIndexer
SearchSession searchSession = /* ... */ (1)
searchSession.massIndexer() (2)
.startAndWait(); (3)
The MassIndexer creates its own, separate sessions and (read-only) transactions, so there is no need to begin a database transaction before the MassIndexer is started or to commit a transaction after it is done.
A note to MySQL users: the MassIndexer uses forward-only scrollable results to iterate on the primary keys to be loaded, but MySQL’s JDBC driver will preload all values in memory.
To avoid this "optimization", set the idFetchSize parameter to Integer.MIN_VALUE.
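A sketch of that workaround, using the idFetchSize parameter documented below:
searchSession.massIndexer()
        .idFetchSize( Integer.MIN_VALUE ) // lets MySQL's driver stream results
        .startAndWait();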
Although the MassIndexer is simple to use, some tweaking is recommended to speed up the process. Several optional parameters are available, and can be set as shown below, before the mass indexer starts. See MassIndexer parameters for a reference of all available parameters, and Tuning the MassIndexer for best performance for details about key topics.
. Example 147. Using a tuned MassIndexer
searchSession.massIndexer() (1)
.idFetchSize( 150 ) (2)
.batchSizeToLoadObjects( 25 ) (3)
.threadsToLoadObjects( 12 ) (4)
.startAndWait(); (5)
Running the MassIndexer with many threads may require many connections to the database. If you don’t have a sufficiently large connection pool, the MassIndexer itself and/or your other applications could starve and be unable to serve other requests: make sure you size your connection pool according to the mass indexing parameters, as explained in Threads and connections.
14.4.2. Selecting types to be indexed
You can select entity types when creating a mass indexer, to reindex only these types (and their indexed subtypes, if any):
. Example 148. Reindexing selected types using a MassIndexer
searchSession.massIndexer( Book.class ) (1)
.startAndWait(); (2)
14.4.3. Mass indexing multiple tenants
Examples in sections above create a mass indexer from a given session, which will always limit mass indexing to the tenant targeted by that session.
When using multi-tenancy you can reindex multiple tenants at once by retrieving the mass indexer from a SearchScope and passing a collection of tenant identifiers:
. Example 149. Reindexing multiple tenants listed explicitly using a MassIndexer
SearchMapping searchMapping = /* ... */ (1)
searchMapping.scope( Object.class ) (2)
.massIndexer( asSet( "tenant1", "tenant2" ) ) (3)
.startAndWait(); (4)
With the Hibernate ORM mapper, if you included the comprehensive list of tenants in Hibernate Search’s configuration, you can simply call scope.massIndexer() without any argument, and the resulting mass indexer will target all configured tenants:
. Example 150. Reindexing multiple tenants configured implicitly using a MassIndexer
SearchMapping searchMapping = /* ... */ (1)
searchMapping.scope( Object.class ) (2)
.massIndexer() (3)
.startAndWait(); (4)
14.4.4. Running the mass indexer asynchronously
It is possible to run the mass indexer asynchronously, because it does not rely on the original Hibernate ORM session. When used asynchronously, the mass indexer will return a completion stage to track the completion of mass indexing:
. Example 151. Reindexing asynchronously using a MassIndexer
searchSession.massIndexer() (1)
.start() (2)
.thenRun( () -> { (3)
log.info( "Mass indexing succeeded!" );
} )
.exceptionally( throwable -> {
log.error( "Mass indexing failed!", throwable );
return null;
} );
// OR
Future<?> future = searchSession.massIndexer()
.start()
.toCompletableFuture(); (4)
14.4.5. Conditional reindexing
This feature is only available with the Hibernate ORM integration.
It cannot be used with the Standalone POJO Mapper in particular.
You can select a subset of target entities to be reindexed by passing a condition as a string to the mass indexer. The condition will be applied when querying the database for entities to index.
The condition string is expected to follow the Hibernate Query Language (HQL) syntax. Accessible entity properties are those of the entity being reindexed (and nothing more).
. Example 152. Use of conditional reindexing
SearchSession searchSession = /* ... */ (1)
MassIndexer massIndexer = searchSession.massIndexer(); (2)
massIndexer.type( Book.class ).reindexOnly( "publicationYear < 1950" ); (3)
massIndexer.type( Author.class ).reindexOnly( "birthDate < :cutoff" ) (4)
.param( "cutoff", Year.of( 1950 ).atDay( 1 ) ); (5)
massIndexer.startAndWait(); (6)
Even if the reindexing is applied on a subset of entities, by default all entities will be purged at the start. The purge can be disabled completely, but when enabled there is no way to filter the entities that will be purged.
See HSEARCH-3304 for more information.
14.4.6. MassIndexer parameters
Table 9. MassIndexer parameters
| Setter | Default value | Description |
|---|---|---|
| typesToIndexInParallel(int) | 1 | The number of types to index in parallel. |
| threadsToLoadObjects(int) | 6 | The number of threads for entity loading, for each type indexed in parallel. That is to say, the number of threads spawned for entity loading will be typesToIndexInParallel * threadsToLoadObjects (+ 1 thread per type to retrieve the IDs of entities to load). |
| idFetchSize(int) | 100 | Only supported with the *Hibernate ORM integration.* The fetch size to be used when loading primary keys. Some databases accept special values; for example, MySQL might benefit from using Integer#MIN_VALUE, otherwise it will attempt to preload everything in memory. |
| batchSizeToLoadObjects(int) | 10 | Only supported with the *Hibernate ORM integration.* The fetch size to be used when loading entities from the database. |
| dropAndCreateSchemaOnStart(boolean) | false | Drops the indexes and their schema (if they exist) and re-creates them before indexing. Indexes will be unavailable for a short time during the dropping and re-creation, so this should only be used when failures of concurrent operations on the indexes (listener-triggered indexing, …) are acceptable. This should be used when the existing schema is known to be obsolete, for example when the Hibernate Search mapping changed and some fields now have a different type, a different analyzer, new capabilities (projectable, …), etc. This may also be used when the schema is up-to-date, since it can be faster than a purge (purgeAllOnStart) on large indexes, especially with the Elasticsearch backend. As an alternative to this parameter, you can also use a schema manager to manage schemas manually at the time of your choosing: see Manual schema management. |
| purgeAllOnStart(boolean) | Default value depends on dropAndCreateSchemaOnStart(boolean): false if the mass indexer is configured to drop and create the schema on start, true otherwise. | Removes all entities from the indexes before indexing. Only set this to false if you know the index is already empty; otherwise, you will end up with duplicates in the index. |
| mergeSegmentsAfterPurge(boolean) | true in general, false on Amazon OpenSearch Serverless | Force merging of each index into a single segment after the initial index purge, just before indexing. This setting has no effect if purgeAllOnStart is set to false. |
| mergeSegmentsOnFinish(boolean) | false | Force merging of each index into a single segment after indexing. This operation does not always improve performance: see Merging segments and performance. |
| cacheMode(CacheMode) | CacheMode.IGNORE | Only supported with the *Hibernate ORM integration.* The Hibernate CacheMode when loading entities. The default is CacheMode.IGNORE, and it will be the most efficient choice in most cases, but using another mode such as CacheMode.GET may be more efficient if many of the entities being indexed refer to a small set of other entities. |
| transactionTimeout | - | Only supported in JTA-enabled environments and with the *Hibernate ORM integration.* Timeout of transactions for loading ids and entities to be re-indexed. The timeout should be long enough to load and index all entities of one type. Note that these transactions are read-only, so choosing a large value (e.g. 1800, meaning 30 minutes) should not cause any problem. |
| limitIndexedObjectsTo(long) | - | Only supported with the *Hibernate ORM integration.* The maximum number of results to load per entity type. This parameter lets you define a threshold value to avoid accidentally loading too many entities. The value defined must be greater than 0. The parameter is not used by default. It is equivalent to the keyword LIMIT in SQL. |
| monitor(MassIndexingMonitor) | A logging monitor. | The component responsible for monitoring progress of mass indexing. As a MassIndexer can take some time to finish its job, it is often necessary to monitor its progress. The default, built-in monitor logs progress periodically at the INFO level, but a custom monitor can be set by implementing the MassIndexingMonitor interface and passing an instance using the monitor method. Implementations of MassIndexingMonitor must be thread-safe. |
| failureHandler(MassIndexingFailureHandler) | A failure handler. | The component responsible for handling failures occurring during mass indexing. A MassIndexer performs multiple operations in parallel, some of which can fail without stopping the whole mass indexing process. As a result, it may be necessary to trace individual failures. The default, built-in failure handler just forwards the failures to the global background failure handler, which by default will log them at the ERROR level, but a custom handler can be set by implementing the MassIndexingFailureHandler interface and passing an instance using the failureHandler method. This can be used to simply log failures in a context specific to the mass indexer, e.g. a web interface in a maintenance console from which mass indexing was requested, or for more advanced use cases, such as cancelling mass indexing on the first failure. Implementations of MassIndexingFailureHandler must be thread-safe. |
| environment(MassIndexingEnvironment) | An empty environment (no threadlocals, …). | This feature is *incubating*: it is still under active development. The contract of incubating elements (e.g. types, methods, configuration properties, etc.) may be altered in a backward-incompatible way — or even removed — in subsequent releases. The component responsible for setting up an environment (threadlocals, …) on mass indexing threads before mass indexing starts, and tearing down that environment after mass indexing. Implementations should handle their exceptions unless it is an unrecoverable situation in which further mass indexing does not make sense: any exception thrown by the MassIndexingEnvironment will abort mass indexing. |
| failureFloodingThreshold(long) | 100 with the default failure handler (see description) | This feature is *incubating*: it is still under active development. The maximum number of failures to be handled per indexed type. Any failures exceeding this number will be ignored and not sent for processing by MassIndexingFailureHandler. Can be set to Long.MAX_VALUE if none of the failures should be ignored. Defaults to a threshold defined by the failure handler in use; see MassIndexingFailureHandler#failureFloodingThreshold and FailureHandler#failureFloodingThreshold. For the default log-based failure handler, the default threshold is 100. |
14.4.7. Tuning the MassIndexer for best performance
Basics
The MassIndexer was designed to finish the re-indexing task as quickly as possible, but there is no one-size-fits-all solution, so some configuration is required to get the best out of it.
Performance optimization can get quite complex, so keep the following in mind while you attempt to configure the MassIndexer:
-
Always test your changes to assess their actual effect: advice provided in this section is true in general, but each application and environment is different, and some options, when combined, may produce unexpected results.
-
Take baby steps: before tuning mass indexing with 40 indexed entity types with two million instances each, try a more reasonable scenario with only one entity type, optionally limiting the number of entities to index to assess performance more quickly.
-
Tune your entity types individually before you try to tune a mass indexing operation that indexes multiple entity types in parallel.
Threads and connections
Increasing parallelism usually helps as the bottleneck usually is the latency to the database/datastore connection: it’s probably worth it to experiment with a number of threads significantly higher than the number of actual cores available.
However, each thread requires one connection (e.g. a JDBC connection), and connections are usually in limited supply. In order to increase the number of threads safely:
-
You should make sure your database/datastore can actually handle the resulting number of connections.
-
Your connection pool should be configured to provide a sufficient number of connections.
-
The above should take into account the rest of your application (request threads in a web application): ignoring this may bring other processes to a halt while the MassIndexer is working.
There is a simple formula to understand how the different options applied to the MassIndexer affect the number of used worker threads and connections:

    total loading threads = typesToIndexInParallel * (threadsToLoadObjects + 1)

Each of these threads holds a connection while it works; the extra thread per type is the one that retrieves the IDs of the entities to load (see the parameter reference above).
Here are a few suggestions for a roughly sane tuning starting point for the parameters that affect parallelism:
typesToIndexInParallel
Should probably be a low value, like 1 or 2, depending on how many of your CPUs have spare cycles and how slow a database round trip will be.
threadsToLoadObjects
Higher values increase the preloading rate of entities picked from the database, but also increase memory usage and the pressure on the threads working on subsequent indexing. Note that each thread will extract data from the entity to reindex, which, depending on your mapping, might require accessing lazy associations and loading associated entities, thus making blocking calls to the database/datastore, so you will probably need a high number of threads working in parallel.
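As a worked example of the formula above, the following sketch uses hypothetical values: 2 types indexed in parallel with 6 loading threads each give 2 * (6 + 1) = 14 loading threads, and thus up to 14 database connections.
searchSession.massIndexer()
        .typesToIndexInParallel( 2 )
        .threadsToLoadObjects( 6 )
        .startAndWait();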
All internal thread groups have meaningful names prefixed with "Hibernate Search", so they should be easily identified with most diagnostic tools, including simply thread dumps.
14.5. Indexing a large amount of data with the Jakarta Batch integration
14.5.1. Basics
This feature is only available with the Hibernate ORM integration.
It cannot be used with the Standalone POJO Mapper in particular.
Hibernate Search provides a Jakarta Batch job to perform mass indexing. It covers not only the existing functionality of the mass indexer described above, but also benefits from some powerful standard features of Jakarta Batch, such as failure recovery using checkpoints, chunk oriented processing, and parallel execution. This batch job accepts different entity type(s) as input, loads the relevant entities from the database, then rebuilds the full-text index from these.
Executing this job requires a batch runtime that is not provided by Hibernate Search. You are free to choose one that fits your needs, e.g. the default batch runtime embedded in your Jakarta EE container. Hibernate Search provides full integration with the JBeret implementation (see how to configure it here). Other implementations can also be used, but they will require a bit more configuration on your side.
If the runtime is JBeret, you need to add the following dependency:
<dependency>
<groupId>org.hibernate.search</groupId>
<artifactId>hibernate-search-mapper-orm-jakarta-batch-jberet</artifactId>
<version>7.2.0.Alpha2</version>
</dependency>
For any other runtime, you need to add the following dependency:
<dependency>
<groupId>org.hibernate.search</groupId>
<artifactId>hibernate-search-mapper-orm-jakarta-batch-core</artifactId>
<version>7.2.0.Alpha2</version>
</dependency>
Here is an example of how to run a batch instance:
. Example 153. Reindexing everything using a Jakarta Batch mass-indexing job
Properties jobProps = MassIndexingJob.parameters() (1)
.forEntities( Book.class, Author.class ) (2)
.build();
JobOperator jobOperator = BatchRuntime.getJobOperator(); (3)
long executionId = jobOperator.start( MassIndexingJob.NAME, jobProps ); (4)
14.5.2. Job Parameters
The following table contains all the job parameters you can use to customize the mass-indexing job.
Table 10. Job Parameters in Jakarta Batch Integration
| Parameter Name / Builder Method | Default value | Description |
|---|---|---|
| entityTypes / .forEntity(Class<?>), .forEntities(Class<?>, Class<?>…) | - | This parameter is always required. The entity types to index in this job execution, comma-separated. |
| purgeAllOnStart / .purgeAllOnStart(boolean) | True | Specify whether the existing index should be purged at the beginning of the job. This operation takes place before indexing. Only affects the indexes targeted by the entityTypes parameter. |
| dropAndCreateSchemaOnStart / .dropAndCreateSchemaOnStart(boolean) | False | Specify whether the existing schema should be dropped and re-created at the beginning of the job. This operation takes place before indexing. Only affects the indexes targeted by the entityTypes parameter. |
| mergeSegmentsAfterPurge / .mergeSegmentsAfterPurge(boolean) | True | Specify whether the mass indexer should merge segments at the beginning of the job. This operation takes place after the purge operation and before indexing. |
| mergeSegmentsOnFinish / .mergeSegmentsOnFinish(boolean) | True | Specify whether the mass indexer should merge segments at the end of the job. This operation takes place after indexing. |
| cacheMode / .cacheMode(CacheMode) | IGNORE | Specify the Hibernate CacheMode when loading entities. The default is IGNORE, and it will be the most efficient choice in most cases, but using another mode such as GET may be more efficient if many of the entities being indexed are already present in the Hibernate ORM second-level cache before mass indexing. Enabling caches has an effect only if the entity id is also the document id, which is the default. PUT or NORMAL values may lead to bad performance, because all the entities are also loaded into the Hibernate second-level cache. |
| idFetchSize / .idFetchSize(int) | 1000 | Specifies the fetch size to be used when loading primary keys. Some databases accept special values; for example, MySQL might benefit from using Integer#MIN_VALUE, otherwise it will attempt to preload everything in memory. |
| entityFetchSize / .entityFetchSize(int) | 200, or the value of checkpointInterval if it is smaller | Specifies the fetch size to be used when loading entities from the database. The value defined must be greater than 0, and equal to or less than the value of checkpointInterval. |
| customQueryHQL / .restrictedBy(String) | - | Use HQL / JPQL to index entities of a target entity type. Your query should contain only one entity type. Mixing this approach with the criteria restriction is not allowed. Please note that there is no query validation for your input. See #mapper-orm-indexing-jakarta-batch-indexing-mode for more details and limitations. |
| maxResultsPerEntity / .maxResultsPerEntity(int) | - | The maximum number of results to load per entity type. This parameter lets you define a threshold value to avoid accidentally loading too many entities. The value defined must be greater than 0. The parameter is not used by default. It is equivalent to the keyword LIMIT in SQL. |
| rowsPerPartition / .rowsPerPartition(int) | 20,000 | The maximum number of rows to process per partition. The value defined must be greater than 0, and equal to or greater than the value of checkpointInterval. |
| maxThreads / .maxThreads(int) | The number of partitions | The maximum number of threads to use for processing the job. Note that the batch runtime cannot guarantee that the requested number of threads is available; it will use as many as it can, up to the requested maximum. |
| checkpointInterval / .checkpointInterval(int) | 2,000, or the value of rowsPerPartition if it is smaller | The number of entities to process before triggering a checkpoint. The value defined must be greater than 0, and equal to or less than the value of rowsPerPartition. |
| entityManagerFactoryReference / .entityManagerFactoryReference(String) | - | This parameter is required when there is more than one persistence unit. The string that will identify the EntityManagerFactory. |
| entityManagerFactoryNamespace / .entityManagerFactoryNamespace(String) | - | Specifies how the EntityManagerFactory is referenced; possible values are persistence-unit-name (the default) and session-factory-name (see Selecting the persistence unit below). |
14.5.3. Conditional indexing
You can select a subset of target entities to be indexed by passing a condition as string to the mass indexing job. The condition will be applied when querying the database for entities to index.
The condition string is expected to follow the Hibernate Query Language (HQL) syntax. Accessible entity properties are those of the entity being reindexed (and nothing more).
. Example 154. Conditional indexing using a reindexOnly HQL parameter
Properties jobProps = MassIndexingJob.parameters() (1)
.forEntities( Author.class ) (2)
.reindexOnly( "birthDate < :cutoff", (3)
Map.of( "cutoff", Year.of( 1950 ).atDay( 1 ) ) ) (4)
.build();
JobOperator jobOperator = BatchRuntime.getJobOperator(); (5)
long executionId = jobOperator.start( MassIndexingJob.NAME, jobProps ); (6)
Even if the reindexing is applied on a subset of entities, by default all entities will be purged at the start. The purge can be disabled completely, but when enabled there is no way to filter the entities that will be purged.
See HSEARCH-3304 for more information.
14.5.4. Parallel indexing
For better performance, indexing is performed in parallel using multiple threads. The set of entities to index is split into multiple partitions. Each thread processes one partition at a time.
The following sections explain how to tune the parallel execution.
The "sweet spot" of number of threads, fetch size, partition size, etc. to achieve best performance is highly dependent on your overall architecture, database design and even data values.
You should experiment with these settings to find out what’s best in your particular case.
Threads
The maximum number of threads used by the job execution is defined through the method maxThreads(). Within the N threads given, 1 thread is reserved for the core, so only N - 1 threads are available for different partitions. If N = 1, the program will work, and all batch elements will run in the same thread. The default number of threads used in Hibernate Search is 10. You can override it with your preferred number.
MassIndexingJob.parameters()
.maxThreads( 5 )
...
Note that the batch runtime cannot guarantee that the requested number of threads is available; it will use as many as possible, up to the requested maximum (Jakarta Batch Specification v2.1 Final Release, page 29). Note also that all batch jobs share the same thread pool, so it's not always a good idea to execute jobs concurrently.
Rows per partition
Each partition consists of a fixed number of elements to index. You may tune exactly how many elements a partition will hold with rowsPerPartition.
MassIndexingJob.parameters()
.rowsPerPartition( 5000 )
...
This property has nothing to do with "chunk size", which is how many elements are processed together between each write. That aspect of processing is addressed by chunking.
Instead, rowsPerPartition is more about how parallel your mass indexing job will be.
See the Chunking section below to learn how to tune chunking.
When rowsPerPartition is low, there will be many small partitions, so processing threads will be less likely to starve (stay idle because there’s no more partition to process), but on the other hand you will only be able to take advantage of a small fetch size, which will increase the number of database accesses. Also, due to the failure recovery mechanisms, there is some overhead in starting a new partition, so with an unnecessarily large number of partitions, this overhead will add up.
When rowsPerPartition is high, there will be a few big partitions, so you will be able to take advantage of a higher chunk size, and thus a higher fetch size, which will reduce the number of database accesses, and the overhead of starting a new partition will be less noticeable, but on the other hand you may not use all the threads available.
Each partition deals with one root entity type, so two different entity types will never run under the same partition.
14.5.5. Chunking and session clearing
The mass indexing job supports restarting a suspended or failed job more or less from where it stopped.
This is made possible by splitting each partition into several consecutive chunks of entities, and saving process information in a checkpoint at the end of each chunk. When a job is restarted, it will resume from the last checkpoint.
The size of each chunk is determined by the checkpointInterval parameter.
MassIndexingJob.parameters()
.checkpointInterval( 1000 )
...
But the size of a chunk is not only about saving progress, it is also about performance:
-
a new Hibernate session is opened for each chunk;
-
a new transaction is started for each chunk;
-
inside a chunk, the session is cleared periodically according to the entityFetchSize parameter, which must thereby be smaller than (or equal to) the chunk size;
-
documents are flushed to the index at the end of each chunk.
In general, the checkpoint interval should be small compared to the number of rows per partition.
Indeed, due to the failure recovery mechanism, the elements before the first checkpoint of each partition will take longer to process than the others, so in a 1000-element partition, having a 100-element checkpoint interval will be faster than having a 1000-element checkpoint interval.
On the other hand, chunks shouldn't be too small in absolute terms. Performing a checkpoint means your Jakarta Batch runtime will write information about the progress of the job execution to its persistent storage, which also has a cost. Also, a new transaction and session are created for each chunk, which doesn't come for free, and implies that setting the fetch size to a value higher than the chunk size is pointless. Finally, the index flush performed at the end of each chunk is an expensive operation that involves a global lock, which essentially means that the less often you do it, the faster indexing will be. Thus having a 1-element checkpoint interval is definitely not a good idea.
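Putting these constraints together, here is a hedged sketch of a job configuration where entityFetchSize <= checkpointInterval <= rowsPerPartition; the values are illustrative only:
Properties jobProps = MassIndexingJob.parameters()
        .forEntities( Book.class )
        .rowsPerPartition( 5000 )   // partition size
        .checkpointInterval( 500 )  // chunk size: new session/transaction per chunk
        .entityFetchSize( 100 )     // session cleared every 100 entities within a chunk
        .build();
long executionId = BatchRuntime.getJobOperator()
        .start( MassIndexingJob.NAME, jobProps );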
14.5.6. Selecting the persistence unit (EntityManagerFactory)
Regardless of how the entity manager factory is retrieved, you must make sure that the entity manager factory used by the mass indexer will stay open during the whole mass indexing process.
JBeret
If your Jakarta Batch runtime is JBeret (used in WildFly in particular), you can use CDI to retrieve the EntityManagerFactory.
If you use only one persistence unit, the mass indexer will be able to access your database automatically without any special configuration.
If you want to use multiple persistence units, you will have to register the EntityManagerFactories as beans in the CDI context. Note that entity manager factories will probably not be considered as beans by default, in which case you will have to register them yourself. You may use an application-scoped bean to do so:
@ApplicationScoped
public class EntityManagerFactoriesProducer {
@PersistenceUnit(unitName = "db1")
private EntityManagerFactory db1Factory;
@PersistenceUnit(unitName = "db2")
private EntityManagerFactory db2Factory;
@Produces
@Singleton
@Named("db1") // The name to use when referencing the bean
public EntityManagerFactory createEntityManagerFactoryForDb1() {
return db1Factory;
}
@Produces
@Singleton
@Named("db2") // The name to use when referencing the bean
public EntityManagerFactory createEntityManagerFactoryForDb2() {
return db2Factory;
}
}
Once the entity manager factories are registered in the CDI context, you can instruct the mass indexer to use one in particular by naming it using the entityManagerFactoryReference parameter.
Due to limitations of the CDI APIs, it is not currently possible to reference an entity manager factory by its persistence unit name when using the mass indexer with CDI.
Other DI-enabled Jakarta Batch implementations
If you want to use a different Jakarta Batch implementation that happens to allow dependency injection:
-
You must map the following two scope annotations to the relevant scope in the dependency injection mechanism:
-
org.hibernate.search.jakarta.batch.core.inject.scope.spi.HibernateSearchJobScoped
-
org.hibernate.search.jakarta.batch.core.inject.scope.spi.HibernateSearchPartitionScoped
-
You must make sure that the dependency injection mechanism will register all injection-annotated classes (@Named, …) from the hibernate-search-mapper-orm-jakarta-batch-core module in the dependency injection context. For instance this can be achieved in Spring DI using the @ComponentScan annotation.
-
You must register a single bean in the dependency injection context that will implement the EntityManagerFactoryRegistry interface.
Plain Java environment (no dependency injection at all)
The following will work only if your Jakarta Batch runtime does not support dependency injection at all, i.e. it ignores @Inject annotations in batch artifacts. This is the case for JBatch in Java SE mode, for instance.
If you use only one persistence unit, the mass indexer will be able to access your database automatically without any special configuration: you only have to make sure to create the EntityManagerFactory (or SessionFactory) in your application before launching the mass indexer.
If you want to use multiple persistence units, you will have to add two parameters when launching the mass indexer:
-
entityManagerFactoryReference: this is the string that will identify the EntityManagerFactory.
-
entityManagerFactoryNamespace: this allows to select how you want to reference the EntityManagerFactory. Possible values are:
-
persistence-unit-name (the default): use the persistence unit name defined in persistence.xml.
-
session-factory-name: use the session factory name defined in the Hibernate configuration by the hibernate.session_factory_name configuration property.
如果您在 Hibernate 配置中设置了 hibernate.session_factory_name 属性并且没有使用 JNDI,则还必须将 hibernate.session_factory_name_is_jndi 设置为 false。
If you set the hibernate.session_factory_name property in the Hibernate configuration, and you don’t use JNDI, you will also have to set hibernate.session_factory_name_is_jndi to false.
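To make these two parameters concrete, here is a hedged launch sketch for the session-factory-name namespace; the factory name "mySessionFactory" is hypothetical:

Properties jobProps = MassIndexingJob.parameters()
        .forEntities( Book.class )
        .entityManagerFactoryNamespace( "session-factory-name" )
        // the value of the hibernate.session_factory_name configuration property:
        .entityManagerFactoryReference( "mySessionFactory" )
        .build();
BatchRuntime.getJobOperator().start( MassIndexingJob.NAME, jobProps );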
14.6. Explicit indexing
14.6.1. Basics
虽然 listener-triggered indexing 和 MassIndexer 或 the mass indexing job 应该能满足大多数需求,但有时需要手动控制索引。
While listener-triggered indexing and the MassIndexer or the mass indexing job should take care of most needs, it is sometimes necessary to control indexing manually.
尤其是当 listener-triggered indexing 被 disabled 或根本不受支持(例如 with the Standalone POJO Mapper)时,或者当侦听器触发的索引无法检测到实体更改(such as JPQL/SQL insert, update or delete queries)时,就需要这样做。
The need arises in particular when listener-triggered indexing is disabled or simply not supported (e.g. with the Standalone POJO Mapper), or when listener-triggered indexing cannot detect entity changes — such as JPQL/SQL insert, update or delete queries.
为解决这些用例,Hibernate Search 提供了以下部分中解释的多个 API。
To address these use cases, Hibernate Search exposes several APIs explained in the following sections.
14.6.2. Configuration
由于显式索引在底层使用 indexing plans,因此影响索引计划的多个配置选项也将影响显式索引:
As explicit indexing uses indexing plans under the hood, several configuration options affecting indexing plans will affect explicit indexing as well:
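For example, one such option is the indexing plan synchronization strategy; below is a hedged bootstrap sketch, assuming the hibernate.search.indexing.plan.synchronization.strategy property (see the synchronization documentation mentioned earlier) and a hypothetical persistence unit named "my-pu":

// Map/HashMap come from java.util; Persistence from jakarta.persistence.
Map<String, Object> props = new HashMap<>();
// Assumed property name; valid values include e.g. "async", "write-sync", "read-sync", "sync".
props.put( "hibernate.search.indexing.plan.synchronization.strategy", "sync" );
EntityManagerFactory emf = Persistence.createEntityManagerFactory( "my-pu", props );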
14.6.3. Using a SearchIndexingPlan manually
使用 SearchIndexingPlan 接口通过 SearchSession 的上下文来显式访问 indexing plan 。该接口表示在一个会话上下文中计划的一系列(可变)更改,并且在事务提交(对于 Hibernate ORM integration )或关闭 SearchSession (对于 Standalone POJO Mapper )时应用于索引。
Explicit access to the indexing plan is done in the context of a SearchSession using the SearchIndexingPlan interface. This interface represents the (mutable) set of changes that are planned in the context of a session, and will be applied to indexes upon transaction commit (for the Hibernate ORM integration) or upon closing the SearchSession (for the Standalone POJO Mapper).
基于 indexing plan 的显式索引的高层工作方式如下:
Here is how explicit indexing based on an indexing plan works at a high level:
-
When the application wants an index change, it calls one of the add/addOrUpdate/delete methods on the indexing plan of the current SearchSession.For the Hibernate ORM integration the current SearchSession is bound to the Hibernate ORM Session, while for the Standalone POJO Mapper the SearchSession is created explicitly by the application.
-
Eventually, the application decides changes are complete, and the plan processes change events added so far, either inferring which entities need to be reindexed and building the corresponding documents (no coordination) or building events to be sent to the outbox (outbox-polling coordination).The application may trigger this explicitly using the indexing plan’s process method, but it is generally not necessary as it happens automatically: for the Hibernate ORM integration this happens when the Hibernate ORM Session gets flushed (explicitly or as part of a transaction commit), while for the Standalone POJO Mapper this happens when the SearchSession is closed.
-
Finally the plan gets executed, triggering indexing, potentially asynchronously.The application may trigger this explicitly using the indexing plan’s execute method, but it is generally not necessary as it happens automatically: for the Hibernate ORM integration this happens on transaction commit, while for the Standalone POJO Mapper this happens when the SearchSession is closed.
SearchIndexingPlan 接口提供以下方法:
The SearchIndexingPlan interface offers the following methods:
add(Object entity)
(仅适用于 Standalone POJO Mapper 。)
(Available with the Standalone POJO Mapper only.)
如果实体类型映射到索引,则向索引中添加文档 (@Indexed)。
Add a document to the index if the entity type is mapped to an index (@Indexed).
如果文档已经存在,这可能会在索引中创建重复项。除非你对自己非常确定并且需要(稍有)性能提升,否则优先使用 addOrUpdate 。
This may create duplicates in the index if the document already exists. Prefer addOrUpdate unless you are really sure of yourself and need a (slight) performance boost.
addOrUpdate(Object entity)
如果实体类型映射到索引( @Indexed ),则在索引中添加或更新文档,并重新索引嵌入此实体的文档(例如,通过 @IndexedEmbedded )。
Add or update a document in the index if the entity type is mapped to an index (@Indexed), and re-index documents that embed this entity (through @IndexedEmbedded for example).
delete(Object entity)
如果实体类型映射到索引( @Indexed ),则从索引中删除文档,并重新索引嵌入此实体的文档(例如,通过 @IndexedEmbedded )。
Delete a document from the index if the entity type is mapped to an index (@Indexed), and re-index documents that embed this entity (through @IndexedEmbedded for example).
purge(Class<?> entityType, Object id)
从索引中删除实体,但不要尝试重新索引嵌入此实体的文档。
Delete the entity from the index, but do not try to re-index documents that embed this entity.
与 delete 相比,这主要在以下情况下有用:实体已从数据库中删除,并且即使处于分离状态也无法在会话中获得。在这种情况下,重新索引关联实体将成为用户的责任,因为 Hibernate Search 无法知道哪些实体与一个已不存在的实体相关联。
Compared to delete, this is mainly useful if the entity has already been deleted from the database and is not available, even in a detached state, in the session. In that case, reindexing associated entities will be the user’s responsibility, since Hibernate Search cannot know which entities are associated to an entity that no longer exists.
purge(String entityName, Object id)
与 purge(Class<?> entityType, Object id) 相同,但实体类型由其名称引用(请参阅 @jakarta.persistence.Entity#name )。
Same as purge(Class<?> entityType, Object id), but the entity type is referenced by its name (see @jakarta.persistence.Entity#name).
process()
(仅适用于 Hibernate ORM integration 。)
(Available with the Hibernate ORM integration only.)
处理迄今为止添加的更改事件,推断出哪些实体需要重新索引并构建对应的文档 ( no coordination ),或构建需要发送到出站收件箱的事件 ( outbox-polling coordination )。
Process change events added so far, either inferring which entities need to be reindexed and building the corresponding documents (no coordination) or building events to be sent to the outbox (outbox-polling coordination).
此方法通常会自动执行(请参阅本节开头的概述),因此仅在批处理大量项目时才需要显式调用它,如 Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan 中所述。
This method is generally executed automatically (see the high-level description near the top of this section), so calling it explicitly is only useful for batching when processing a large number of items, as explained in Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan.
execute()
(仅适用于 Hibernate ORM integration 。)
(Available with the Hibernate ORM integration only.)
执行索引计划,触发索引,可能异步。
Execute the indexing plan, triggering indexing, potentially asynchronously.
此方法通常会自动执行(请参阅本节开头的概述),因此仅在非常罕见的情况下才需要显式调用它,即在批处理大量项目且无法使用事务时,如 Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan 中所述。
This method is generally executed automatically (see the high-level description near the top of this section), so calling it explicitly is only useful in very rare cases, for batching when processing a large number of items and transactions are not an option, as explained in Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan.
以下是使用 addOrUpdate 和 delete 的示例。
Below are examples of using addOrUpdate and delete.
. Example 155. Explicitly adding or updating an entity in the index using SearchIndexingPlan
// Not shown: open a transaction if relevant
SearchSession searchSession = /* ... */ (1)
SearchIndexingPlan indexingPlan = searchSession.indexingPlan(); (2)
Book book = entityManager.getReference( Book.class, 5 ); (3)
indexingPlan.addOrUpdate( book ); (4)
// Not shown: commit the transaction or close the session if relevant
. Example 156. Explicitly deleting an entity from the index using SearchIndexingPlan
// Not shown: open a transaction if relevant
SearchSession searchSession = /* ... */ (1)
SearchIndexingPlan indexingPlan = searchSession.indexingPlan(); (2)
Book book = entityManager.getReference( Book.class, 5 ); (3)
indexingPlan.delete( book ); (4)
// Not shown: commit the transaction or close the session if relevant
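For completeness, here is a similar sketch for purge, assuming the two-argument variant described above and that the Book with id 5 has already been deleted from the database:

// Not shown: open a transaction if relevant
SearchSession searchSession = /* ... */;
SearchIndexingPlan indexingPlan = searchSession.indexingPlan();
// Documents embedding this entity (e.g. through @IndexedEmbedded) will NOT be reindexed automatically.
indexingPlan.purge( Book.class, 5 );
// Not shown: commit the transaction or close the session if relevant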
在单个索引计划中可以执行多个操作。甚至可以多次更改同一个实体,例如添加然后移除:Hibernate Search 会按预期简化操作。 |
Multiple operations can be performed in a single indexing plan. The same entity can even be changed multiple times, for example added and then removed: Hibernate Search will simplify the operation as expected. |
对于任何合理数量的实体,这都能很好地工作,但在单个会话中更改(或仅仅是加载)大量实体时,使用 Hibernate ORM 需要特别小心,使用 Hibernate Search 时还需要额外注意。有关更多信息,请参阅 Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan 。
This will work fine for any reasonable number of entities, but changing or simply loading large numbers of entities in a single session requires special care with Hibernate ORM, and then some extra care with Hibernate Search. See Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan for more information.
14.6.4. Hibernate ORM and the periodic "flush-clear" pattern with SearchIndexingPlan
此功能仅可通过 Hibernate ORM integration 使用。 |
This feature is only available with the Hibernate ORM integration. |
尤其不能与 Standalone POJO Mapper 一起使用。
It cannot be used with the Standalone POJO Mapper in particular.
使用 JPA 操作大型数据集时,一个相当常见的用例是 periodic "flush-clear" pattern:循环在每次迭代时读取或写入实体,并每 n 次迭代刷新(flush)然后清除(clear)一次会话。此模式允许在处理大量实体的同时将内存占用保持在合理的较低水平。
A fairly common use case when manipulating large datasets with JPA is the periodic "flush-clear" pattern, where a loop reads or writes entities for every iteration and flushes then clears the session every n iterations. This pattern allows processing a large number of entities while keeping the memory footprint reasonably low.
以下是不使用 Hibernate Search 时保留大量实体的模式示例。
Below is an example of this pattern to persist a large number of entities when not using Hibernate Search.
. Example 157. A batch process with JPA
entityManager.getTransaction().begin();
try {
    for ( int i = 0; i < NUMBER_OF_BOOKS; ++i ) { (1)
        Book book = newBook( i );
        entityManager.persist( book ); (2)
        if ( ( i + 1 ) % BATCH_SIZE == 0 ) {
            entityManager.flush(); (3)
            entityManager.clear(); (4)
        }
    }
    entityManager.getTransaction().commit();
}
catch (RuntimeException e) {
    entityManager.getTransaction().rollback();
    throw e;
}
对于 Hibernate Search 6(与 Hibernate Search 5 及更早版本不同),此模式将按预期工作:
With Hibernate Search 6 (contrary to Hibernate Search 5 and earlier), this pattern will work as expected:
-
with coordination disabled (the default), documents will be built on flushes, and sent to the index upon transaction commit.
-
with outbox-polling coordination, entity change events will be persisted on flushes, and committed along with the rest of the changes upon transaction commit.
但是,每次 flush 调用都可能将数据添加到内部缓冲区;对于大量数据,这可能导致 OutOfMemoryError,具体取决于 JVM 堆大小、coordination strategy 以及文档的复杂性和数量。
However, each flush call will potentially add data to an internal buffer, which for large volumes of data may lead to an OutOfMemoryError, depending on the JVM heap size, the coordination strategy and the complexity and number of documents.
如果你遇到了内存问题,第一个解决方案是将批处理分解成多个事务,每个事务处理较少数量的元素:内部文档缓冲区将在每次事务后清除。
If you run into memory issues, the first solution is to break down the batch process into multiple transactions, each handling a smaller number of elements: the internal document buffer will be cleared after each transaction.
请参见下面的示例。
See below for an example.
使用此模式时,如果某个事务失败,部分数据将已存在于数据库和索引中,且无法回滚更改。
With this pattern, if one transaction fails, part of the data will already be in the database and in indexes, with no way to roll back the changes.
但是,索引将与数据库一致,并且可以从失败的最后一个事务(手动)重新启动进程。
However, the indexes will be consistent with the database, and it will be possible to (manually) restart the process from the last transaction that failed.
. Example 158. A batch process with Hibernate Search using multiple transactions
try {
    int i = 0;
    while ( i < NUMBER_OF_BOOKS ) { (1)
        entityManager.getTransaction().begin(); (2)
        int end = Math.min( i + BATCH_SIZE, NUMBER_OF_BOOKS ); (3)
        for ( ; i < end; ++i ) {
            Book book = newBook( i );
            entityManager.persist( book ); (4)
        }
        entityManager.getTransaction().commit(); (5)
    }
}
catch (RuntimeException e) {
    entityManager.getTransaction().rollback();
    throw e;
}
多事务解决方案和最初的 flush() / clear() 循环模式可以合并,将进程分解为多个中等大小的事务,并在每个事务中定期调用 flush / clear 。 |
The multi-transaction solution and the original flush()/clear() loop pattern can be combined, breaking down the process in multiple medium-sized transactions, and periodically calling flush/clear inside each transaction. |
此组合解决方案是最灵活的,因此如果要微调批处理,它是最合适的。
This combined solution is the most flexible, hence the most suitable if you want to fine-tune your batch process.
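A hedged sketch of this combined approach follows, reusing newBook from the previous examples; TRANSACTION_SIZE and BATCH_SIZE are illustrative constants, not defined by Hibernate Search:

try {
    int i = 0;
    while ( i < NUMBER_OF_BOOKS ) {
        entityManager.getTransaction().begin(); // one medium-sized transaction...
        int end = Math.min( i + TRANSACTION_SIZE, NUMBER_OF_BOOKS );
        for ( ; i < end; ++i ) {
            Book book = newBook( i );
            entityManager.persist( book );
            if ( ( i + 1 ) % BATCH_SIZE == 0 ) {
                entityManager.flush(); // ...with a periodic flush...
                entityManager.clear(); // ...and clear inside it
            }
        }
        entityManager.getTransaction().commit();
    }
}
catch (RuntimeException e) {
    entityManager.getTransaction().rollback();
    throw e;
}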
如果将批处理分解为多个事务不是一种选择,另一个解决方案是在调用 session.flush()/session.clear() 后直接写入索引,而无需等待数据库事务提交:在每次写入索引后,内部文档缓冲区将被清除。
If breaking down the batch process into multiple transactions is not an option, a second solution is to just write to indexes after the call to session.flush()/session.clear(), without waiting for the database transaction to be committed: the internal document buffer will be cleared after each write to indexes.
这可以通过调用索引计划中的 execute() 方法来完成,如下例所示。
This is done by calling the execute() method on the indexing plan, as shown in the example below.
使用此模式时,如果抛出异常,部分数据将已经写入索引且无法回滚,而数据库更改则会被回滚。因此,索引将与数据库不一致。
With this pattern, if an exception is thrown, part of the data will already be in the index, with no way to roll back the changes, while the database changes will have been rolled back. The index will thus be inconsistent with the database.
为从该情况中恢复,您必须手动执行完全相同的数据库更改(使数据库与索引重新同步),或手动执行受事务影响的 reindex the entities (使索引与数据库重新同步)。
To recover from that situation, you will have to either execute the exact same database changes that failed manually (to get the database back in sync with the index), or reindex the entities affected by the transaction manually (to get the index back in sync with the database).
当然,如果您能承受较长时间使索引脱机,则一个更简单的解决方案是擦除索引并 reindex everything 。
Of course, if you can afford to take the indexes offline for a longer period of time, a simpler solution would be to wipe the indexes clean and reindex everything.
. Example 159. A batch process with Hibernate Search using execute()
SearchSession searchSession = Search.session( entityManager ); (1)
SearchIndexingPlan indexingPlan = searchSession.indexingPlan(); (2)

entityManager.getTransaction().begin();
try {
    for ( int i = 0; i < NUMBER_OF_BOOKS; ++i ) {
        Book book = newBook( i );
        entityManager.persist( book ); (3)
        if ( ( i + 1 ) % BATCH_SIZE == 0 ) {
            entityManager.flush();
            entityManager.clear();
            indexingPlan.execute(); (4)
        }
    }
    entityManager.getTransaction().commit(); (5)
}
catch (RuntimeException e) {
    entityManager.getTransaction().rollback();
    throw e;
}