Hibernate Search Operations Guide
5. Architecture
5.1. Components of Hibernate Search
From the user’s perspective, Hibernate Search consists of two components:
Mapper
The mapper "maps" the user model to an index model, and provides APIs consistent with the user model to perform indexing and searching.
Most applications rely on the Hibernate ORM mapper, which offers the ability to index properties of Hibernate ORM entities, but there is also a Standalone POJO mapper that can be used without Hibernate ORM.
The mapper is configured partly through annotations on the domain model, and partly through configuration properties.
Backend
The backend is the abstraction over the full-text engines, where "things get done". It implements generic indexing and searching interfaces for use by the mapper through "index managers", each providing access to one index.
For instance, the Lucene backend delegates to the Lucene library, and the Elasticsearch backend delegates to a remote Elasticsearch cluster.
The backend is configured partly by the mapper, which tells the backend which indexes must exist and what fields they must have, and partly through configuration properties.
The mapper and the backend work together to provide three main features:
Mass indexing
This is how Hibernate Search rebuilds indexes from scratch based on the content of a database.
The mapper queries the database to retrieve the identifier of every entity, then processes these identifiers in batches: it loads the entities, then processes them to generate documents that are sent to the backend for indexing. The backend puts the documents in an internal queue and indexes them in batches, in background processes, notifying the mapper when it’s done.
See Indexing a large amount of data with the MassIndexer for details.
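The flow described above (identifiers first, then entities loaded batch by batch, then documents queued for the backend) can be pictured with a small, self-contained model. This is only a conceptual sketch using plain JDK collections: `MassIndexingSketch`, `reindexAll`, the in-memory "database" map and the document format are all hypothetical names made up for illustration, and none of them is part of the Hibernate Search API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

// Conceptual sketch only: none of these types belong to Hibernate Search.
final class MassIndexingSketch {

    /**
     * Reads all identifiers first, then processes them in batches.
     * "database" is a hypothetical stand-in for a table (entity id -> title);
     * "backendQueue" stands in for the backend's internal document queue.
     * Returns the number of batches that were processed.
     */
    static int reindexAll(Map<Integer, String> database, int batchSize, Queue<String> backendQueue) {
        List<Integer> ids = new ArrayList<>(database.keySet()); // step 1: identifiers only
        int batches = 0;
        for (int from = 0; from < ids.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ids.size());
            for (Integer id : ids.subList(from, to)) {
                String title = database.get(id);                            // step 2: load the entity
                backendQueue.add("{id=" + id + ", title=" + title + "}");   // step 3: queue a document
            }
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        Map<Integer, String> database = new TreeMap<>();
        for (int i = 1; i <= 25; i++) {
            database.put(i, "Book " + i);
        }
        Queue<String> backendQueue = new ArrayDeque<>();
        int batches = reindexAll(database, 10, backendQueue);
        System.out.println(batches + " batches, " + backendQueue.size() + " documents queued");
    }
}
```

Loading identifiers up front keeps memory use bounded: only one batch of entities needs to be held at a time, whatever the size of the table.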
Explicit and listener-triggered indexing
Explicit and listener-triggered indexing rely on indexing plans (SearchIndexingPlan) to index specific entities as a result of limited changes.
With explicit indexing, the caller explicitly passes information about changes on entities to an indexing plan; with listener-triggered indexing, entity changes are detected transparently by the Hibernate ORM integration (with a few exceptions) and added to the indexing plan automatically.
Listener-triggered indexing only makes sense in the context of the Hibernate ORM integration; there is no such feature available for the Standalone POJO Mapper.
In both cases, the indexing plan will deduce from those changes whether entities need to be reindexed, be it the changed entity itself or other entities that embed the changed entity in their index.
Upon transaction commit, changes in the indexing plan are processed (either in the same thread or in a background process, depending on the coordination strategy), and documents are generated, then sent to the backend for indexing. The backend puts the documents in an internal queue and indexes them in batches, in background processes, notifying the mapper when it’s done.
See Implicit, listener-triggered indexing for details.
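To make the role of an indexing plan more concrete, here is a minimal model of the behaviour described above: changes are collected during a transaction, entities that embed a changed entity are scheduled for reindexing too, and nothing is sent to the backend until commit. It is a sketch under stated assumptions, not the real SearchIndexingPlan: the class `IndexingPlanSketch`, the `embeddedIn` map and the string-based entity keys are hypothetical, and the method names only loosely mirror the real API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: this is NOT Hibernate Search's SearchIndexingPlan.
final class IndexingPlanSketch {

    enum Change { ADD_OR_UPDATE, DELETE }

    // Hypothetical metadata: entity key -> keys of entities that embed it in their index.
    private final Map<String, List<String>> embeddedIn;

    // Pending changes, one per entity; a later explicit change replaces an earlier one.
    private final Map<String, Change> pending = new LinkedHashMap<>();

    IndexingPlanSketch(Map<String, List<String>> embeddedIn) {
        this.embeddedIn = embeddedIn;
    }

    void addOrUpdate(String entity) {
        pending.put(entity, Change.ADD_OR_UPDATE);
        // Entities embedding the changed entity must be reindexed as well,
        // unless a change (for example a deletion) is already planned for them.
        for (String container : embeddedIn.getOrDefault(entity, List.of())) {
            pending.putIfAbsent(container, Change.ADD_OR_UPDATE);
        }
    }

    void delete(String entity) {
        pending.put(entity, Change.DELETE);
    }

    /** Called at commit: turns pending changes into backend operations and clears the plan. */
    List<String> commit() {
        List<String> operations = new ArrayList<>();
        pending.forEach((entity, change) -> operations.add(change + ":" + entity));
        pending.clear();
        return operations;
    }

    public static void main(String[] args) {
        IndexingPlanSketch plan = new IndexingPlanSketch(
                Map.of("author#1", List.of("book#1", "book#2")));
        plan.addOrUpdate("author#1"); // also schedules book#1 and book#2
        plan.delete("book#2");        // replaces the scheduled update of book#2
        System.out.println(plan.commit());
    }
}
```

Deferring all work to commit means that several changes to the same entity within one transaction result in a single indexing operation.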
Searching
This is how Hibernate Search provides ways to query an index.
The mapper exposes entry points to the search DSL, allowing selection of entity types to query. When one or more entity types are selected, the mapper delegates to the corresponding index managers to provide a search DSL and ultimately create the search query. Upon query execution, the backend submits a list of entity references to the mapper, which loads the corresponding entities. The entities are then returned by the query.
See Searching for details.
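The division of labour at query time (the backend returns entity references, the mapper loads the entities) can be illustrated with a toy model. The sketch below is only an assumption-laden illustration in plain Java: `SearchFlowSketch`, `backendSearch`, the substring matching and the in-memory maps are hypothetical and do not reflect the actual search DSL or how a real backend matches documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: none of these types belong to Hibernate Search.
final class SearchFlowSketch {

    // Hypothetical entity type.
    static final class Book {
        final int id;
        final String title;

        Book(int id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    /** Backend side: returns references (ids) of matching documents, never entities. */
    static List<Integer> backendSearch(Map<Integer, String> index, String term) {
        List<Integer> references = new ArrayList<>();
        String needle = term.toLowerCase(Locale.ROOT);
        index.forEach((id, text) -> {
            if (text.toLowerCase(Locale.ROOT).contains(needle)) {
                references.add(id);
            }
        });
        return references;
    }

    /** Mapper side: asks the backend for references, then loads the entities. */
    static List<Book> search(Map<Integer, String> index, Map<Integer, Book> database, String term) {
        List<Book> hits = new ArrayList<>();
        for (Integer reference : backendSearch(index, term)) {
            Book book = database.get(reference);
            if (book != null) { // skip references whose entity no longer exists
                hits.add(book);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<Integer, Book> database = new TreeMap<>();
        database.put(1, new Book(1, "Refactoring"));
        database.put(2, new Book(2, "Java in Action"));
        database.put(3, new Book(3, "Effective Java"));

        Map<Integer, String> index = new TreeMap<>();
        for (Book book : database.values()) {
            index.put(book.id, book.title);
        }

        for (Book hit : search(index, database, "java")) {
            System.out.println(hit.id + ": " + hit.title);
        }
    }
}
```

Because only references travel from the backend to the mapper, the entities returned by a query reflect the current state of the database, not the state at indexing time.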
5.2. Examples of architectures
5.2.1. Overview
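Table 2. Comparison of architectures
| Architecture | Single-node with Lucene | No coordination with Elasticsearch | Outbox polling with Elasticsearch |
|---|---|---|---|
| Compatible mappers | Hibernate ORM integration and Standalone POJO Mapper | Hibernate ORM integration and Standalone POJO Mapper | Hibernate ORM integration only |
| Application topology | Single-node | Single-node or multi-node | Single-node or multi-node |
| Extra bits to maintain | Indexes on filesystem | Elasticsearch cluster | Elasticsearch cluster |
| Guarantee of index updates | Non-transactional, after the database transaction / SearchSession.close() returns | Non-transactional, after the database transaction / SearchSession.close() returns | |
| Visibility of index updates | | | |
| Native features | Mostly for experts | For anyone | For anyone |
| Overhead for application threads | | | |
| Overhead for the database | | | |
| Impact on database schema | None | None | Extra tables |
| Limitations | Listener-triggered indexing ignores: JPQL/SQL queries, asymmetric association updates. Out-of-sync indexes in rare situations: concurrent @IndexedEmbedded, backend I/O errors | Listener-triggered indexing ignores: JPQL/SQL queries, asymmetric association updates. Out-of-sync indexes in rare situations: concurrent @IndexedEmbedded, backend I/O errors | Listener-triggered indexing ignores: JPQL/SQL queries, asymmetric association updates. No other known limitation |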
5.2.2. Single-node application with the Lucene backend
Description
With the Lucene backend, indexes are local to a given application node (JVM). They are accessed through direct calls to the Lucene library, without going through the network.
This mode is only relevant to single-node applications.
Pros and cons
Pros:
- Simplicity: no external services are required, everything lives on the same server.
- Immediate visibility (~milliseconds) of index updates. While other architectures can perform comparably well for most use cases, a single-node application with the Lucene backend is the best way to implement indexing if you need changes to be visible immediately after the database changes.
Cons:
- Without coordination, backend errors during indexing may lead to out-of-sync indexes.
- Not so easy to extend: experienced developers can access a lot of Lucene features, even those that are not exposed by Hibernate Search, by providing native Lucene objects; however, Lucene APIs are not very easy to figure out for developers unfamiliar with Lucene. If you’re interested, see for example Query-based predicates.
- Overhead for application threads: reindexing is done directly in application threads, and it may require additional time to load data that must be indexed from the database. Depending on the amount of data to load, this may increase the application’s latency and/or decrease its throughput.
- No horizontal scalability: there can only be one application node, and all indexes need to live on the same server.
Getting started
To implement this architecture, use the following Maven dependencies:
With the Hibernate ORM integration
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-mapper-orm</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-backend-lucene</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
With the Standalone POJO Mapper (no Hibernate ORM)
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-mapper-pojo-standalone</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-backend-lucene</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
5.2.3. Single-node or multi-node application, without coordination and with the Elasticsearch backend
Description
With the Elasticsearch backend, indexes are not tied to the application node. They are managed by a separate cluster of Elasticsearch nodes, and accessed through calls to REST APIs.
Thus, it is possible to set up multiple application nodes in such a way that they all perform index updates and search queries independently, without coordinating with each other.
Pros and cons
Pros:
- Easy to extend: you can easily access most Elasticsearch features, even those that are not exposed by Hibernate Search, by providing your own JSON. See for example JSON-defined predicates, or JSON-defined aggregations, or leveraging advanced features with JSON manipulation.
- Horizontal scalability of the indexes: you can size the Elasticsearch cluster according to your needs. See "Scalability and resilience" in the Elasticsearch documentation.
- Horizontal scalability of the application: you can have as many instances of the application as you need (though high concurrency increases the likelihood of some problems with this architecture, see "Cons" below).
Cons:
- Without coordination, backend errors during indexing may lead to out-of-sync indexes.
- Need to manage an additional service: the Elasticsearch cluster.
- Overhead for application threads: reindexing is done directly in application threads, and it may require additional time to load data that must be indexed from the database. Depending on the amount of data to load, this may increase the application’s latency and/or decrease its throughput.
- Delayed visibility (~1 second) of index updates (near-real-time). While changes can be made visible as soon as possible after the database changes, Elasticsearch is near-real-time by nature, and won’t perform very well if you need changes to be visible immediately after the database changes.
Getting started
To implement this architecture, use the following Maven dependencies:
With the Hibernate ORM integration
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-mapper-orm</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-backend-elasticsearch</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
With the Standalone POJO Mapper (no Hibernate ORM)
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-mapper-pojo-standalone</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-backend-elasticsearch</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
5.2.4. Multi-node application with outbox polling and Elasticsearch backend
Features detailed below are incubating: they are still under active development.
The usual compatibility policy does not apply: the contract of incubating elements (e.g. types, methods, configuration properties, etc.) may be altered in a backward-incompatible way, or even removed, in subsequent releases.
You are encouraged to use incubating features so the development team can get feedback and improve them, but you should be prepared to update code which relies on them as needed.
Description
With Hibernate Search’s outbox-polling coordination strategy, entity change events are not processed immediately in the ORM session where they arise, but are pushed to an outbox table in the database.
A background process polls that outbox table for new events, and processes them asynchronously, updating the indexes as necessary. Since that queue can be sharded, multiple application nodes can share the workload of indexing.
This requires the Elasticsearch backend so that indexes are not tied to a single application node and can be updated or queried from multiple application nodes.
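The outbox pattern with a sharded queue can be sketched in a few lines of plain Java. The model below is purely illustrative and makes several assumptions: the class `OutboxPollingSketch`, the in-memory list standing in for the outbox table, and the hash-based shard assignment are hypothetical, and they are not a description of how Hibernate Search actually stores or distributes events.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Conceptual sketch only: none of these types belong to Hibernate Search.
final class OutboxPollingSketch {

    // Hypothetical stand-in for the outbox table; in the real pattern, events are
    // inserted in the same database transaction as the entity change itself.
    private final List<String> outbox = new ArrayList<>();
    private final int totalShards;

    OutboxPollingSketch(int totalShards) {
        this.totalShards = totalShards;
    }

    /** Application thread: only appends an event, no reindexing happens here. */
    void onEntityChange(String entityId) {
        outbox.add(entityId);
    }

    /** Assumed sharding rule for this sketch: hash of the entity id modulo the shard count. */
    static int shardOf(String entityId, int totalShards) {
        return Math.floorMod(entityId.hashCode(), totalShards);
    }

    /** One polling pass by the node assigned to the given shard. */
    List<String> poll(int shard) {
        List<String> reindexed = new ArrayList<>();
        Iterator<String> events = outbox.iterator();
        while (events.hasNext()) {
            String entityId = events.next();
            if (shardOf(entityId, totalShards) == shard) {
                if (!reindexed.contains(entityId)) {
                    reindexed.add(entityId); // reload and reindex each entity once per pass
                }
                events.remove(); // the event is consumed only once it has been processed
            }
        }
        return reindexed;
    }

    int pendingEvents() {
        return outbox.size();
    }

    public static void main(String[] args) {
        OutboxPollingSketch coordinator = new OutboxPollingSketch(2);
        for (String id : new String[] { "a", "b", "a", "c" }) {
            coordinator.onEntityChange(id);
        }
        System.out.println("shard 0 reindexes " + coordinator.poll(0));
        System.out.println("shard 1 reindexes " + coordinator.poll(1));
        System.out.println("pending events: " + coordinator.pendingEvents());
    }
}
```

Routing all events for a given entity to the same shard means that two nodes never reindex the same entity concurrently, which is what lets the workload be split safely.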
Pros and cons
Pros:
- Safest:
  - the possibility of out-of-sync indexes caused by indexing errors in the backend, which affects other architectures, is eliminated here, because entity change events are persisted in the same transaction as the entity changes, allowing retries for as long as necessary.
  - the possibility of out-of-sync indexes caused by concurrent updates, which affects other architectures, is eliminated here, because each entity instance is reloaded from the database within a new transaction before being re-indexed.
- Easy to extend: you can easily access most Elasticsearch features, even those that are not exposed by Hibernate Search, by providing your own JSON. See for example JSON-defined predicates, or JSON-defined aggregations, or leveraging advanced features with JSON manipulation.
- Minimal overhead for application threads: application threads only need to append events to the queue, they don’t perform reindexing themselves.
- Horizontal scalability of the indexes: you can size the Elasticsearch cluster according to your needs. See "Scalability and resilience" in the Elasticsearch documentation.
- Horizontal scalability of the application: you can have as many instances of the application as you need.
Cons:
- Need to manage an additional service: the Elasticsearch cluster.
- Delayed visibility (~1 second or more, depending on load and hardware) of index updates, partly because Elasticsearch is near-real-time by nature, and partly because the event queue introduces additional delays.
- Impact on the database schema: additional tables must be created in the database to hold the data necessary for coordination.
- Overhead for the database: the background process that reads entity changes and performs reindexing needs to read changed entities from the database.
Getting started
The outbox-polling coordination strategy requires an extra dependency. To implement this architecture, use the following Maven dependencies:
With the Hibernate ORM integration
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-mapper-orm</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-mapper-orm-outbox-polling</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-backend-elasticsearch</artifactId>
    <version>7.2.0.Alpha2</version>
</dependency>
Also, configure coordination as explained in outbox-polling: additional event tables and polling in background processors.
With the Standalone POJO Mapper (no Hibernate ORM)
This architecture cannot be implemented with the Standalone POJO Mapper at the moment, because this mapper does not support coordination.