Hibernate Search Operation Guide
4. Concepts
4.1. Full-text search
Full-text search is a set of techniques for searching, in a corpus of text documents, the documents that best match a given query.
The main difference with traditional search — for example in an SQL database — is that the stored text is not considered as a single block of text, but as a collection of tokens (words).
Hibernate Search relies on either Apache Lucene or Elasticsearch to implement full-text search. Since Elasticsearch uses Lucene internally, they share a lot of characteristics and their general approach to full-text search.
To simplify, these search engines are based on the concept of inverted indexes: a dictionary where the key is a token (word) found in a document, and the value is the list of identifiers of every document containing this token.
Still simplifying, once all documents are indexed, searching for documents involves three steps (a toy sketch follows the list):
-
extracting tokens (words) from the query;
-
looking up these tokens in the index to find matching documents;
-
aggregating the results of the lookups to produce a list of matching documents.
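To make these three steps concrete, here is a toy Java sketch of an inverted index, purely for illustration: it is not Hibernate Search or Lucene API, and real engines add analysis, scoring and much more.

import java.util.*;

public class ToyInvertedIndex {
    // The inverted index: token (word) -> identifiers of documents containing it.
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Indexing: extract tokens from the document and record its identifier.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(token, t -> new HashSet<>()).add(docId);
        }
    }

    // Searching: extract tokens from the query, look each one up in the index,
    // and aggregate the lookups (here, a simple union of matching documents).
    public Set<Integer> search(String query) {
        Set<Integer> hits = new HashSet<>();
        for (String token : query.toLowerCase().split("\\s+")) {
            hits.addAll(index.getOrDefault(token, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "The quick brown fox");
        index.add(2, "A quick update");
        System.out.println(index.search("quick fox")); // prints [1, 2]
    }
}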
Lucene and Elasticsearch are not limited to text search: they also support numeric data such as integers, doubles, longs and dates. These types are indexed and queried using a slightly different approach, which obviously does not involve text processing.
4.2. Entity types
When it comes to the domain model of applications, Hibernate Search distinguishes between types (Java classes) that are considered entities, and those that are not.
The defining characteristic of entity types in Hibernate Search is that their instances have a distinct lifecycle: an entity instance may be saved into a datastore, or retrieved from it, without requiring the saving or retrieval of an instance of another type. For that purpose, each entity instance is assumed to carry an immutable, unique identifier.
These characteristics allow Hibernate Search to map entity types to indexes, but only entity types. "Embeddable" types that are referenced from or contained within an entity, but whose lifecycle is completely tied to that entity, cannot be mapped to an index.
Multiple aspects of Hibernate Search involve the concept of entity types:
-
Each entity type has an entity name, distinct from the type name. E.g. for a class named com.acme.Book, the entity name could be Book (the default), or any arbitrarily chosen string.
-
Properties pointing to an entity type (called associations) have specific mechanics; in particular, in order to handle reindexing, Hibernate Search needs to know about the inverse side of associations.
-
For the purposes of change tracking when reindexing (e.g. in indexing plans), entity types represent the smallest scope Hibernate Search considers. This means the paths representing "changed properties" in Hibernate Search always have an entity as their starting point, and the components within these paths never reach into another entity (but may point to one, when an association changes).
-
Hibernate Search may need additional configuration to enable loading of entity types from an external datastore, be it to load entities matching a query from an external source or to load all entity instances from an external source for full reindexing.
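To illustrate these aspects, here is a minimal sketch using the Hibernate ORM integration and Jakarta Persistence annotations. The Book and Author entities and their properties are assumptions made up for the example; note how the association is mapped on both sides, which gives Hibernate Search the inverse side it needs for reindexing.

import jakarta.persistence.*;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.IndexedEmbedded;

import java.util.ArrayList;
import java.util.List;

@Entity
@Indexed // entity type mapped to an index; the entity name defaults to "Book"
public class Book {
    @Id @GeneratedValue
    private Long id; // the immutable, unique identifier

    @FullTextField
    private String title;

    @ManyToOne // owning side of the association
    @IndexedEmbedded
    private Author author;

    // getters and setters omitted for brevity
}

@Entity
class Author {
    @Id @GeneratedValue
    private Long id;

    @FullTextField
    private String name;

    @OneToMany(mappedBy = "author") // inverse side, used to reindex Books when an Author changes
    private List<Book> books = new ArrayList<>();
}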
4.3. Mapping
Applications targeted by Hibernate Search use an entity-based model to represent data. In this model, each entity is a single object with a few properties of atomic types (String, Integer, LocalDate, …). Each entity can contain non-root aggregates ("embeddable" types), and each can also have multiple associations to one or even many other entities.
By contrast, Lucene and Elasticsearch work with documents. Each document is a collection of "fields", each field being assigned a name — a unique string — and a value — which can be text, but also numeric data such as an integer or a date. Fields also have a type, which not only determines the type of values (text/numeric), but more importantly the way this value will be stored: indexed, stored, with doc values, etc. Each document can contain nested aggregates ("objects"/"nested documents"), but there cannot really be associations between top-level documents.
Thus:
-
Entities are organized as a graph, where each node is an entity and each association is an edge.
-
Documents are organized, at best, as a collection of trees, where each tree is a document, optionally with nested documents.
There are multiple mismatches between the entity model and the document model: simple property types vs. more complex field types, associations vs. no associations, graph vs. collection of trees.
The goal of mapping, in Hibernate Search, is to resolve these mismatches by defining how to transform one or more entities into a document, and how to resolve a search hit back into the original entity. This is the main added value of Hibernate Search, the basis for everything else from indexing to the various search DSLs.
Mapping is usually configured using annotations in the entity model, but this can also be achieved using a programmatic API. To learn more about how to configure mapping, see Mapping entities to indexes.
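For example, here is a hedged sketch of annotation-based mapping, with a made-up Book entity and PublisherInfo embeddable: atomic properties become document fields, and the embeddable becomes a nested group of fields within the same document.

import jakarta.persistence.*;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.GenericField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.IndexedEmbedded;

import java.time.LocalDate;

@Embeddable
class PublisherInfo { // "embeddable": its lifecycle is tied to the owning entity
    @FullTextField
    String name;

    @GenericField
    LocalDate foundedOn;
}

@Entity
@Indexed
public class Book {
    @Id @GeneratedValue
    Long id;

    @FullTextField // String property -> full-text field in the document
    String title;

    @GenericField // LocalDate property -> date field in the document
    LocalDate publicationDate;

    @Embedded
    @IndexedEmbedded // embeddable -> nested "publisher.name", "publisher.foundedOn" fields
    PublisherInfo publisher;
}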
To learn how to index the resulting documents, see Indexing entities (hint: for the Hibernate ORM integration, it’s automatic).
To learn how to search with an API that takes advantage of the mapping to be closer to the entity model, in particular by returning hits as entities instead of just document identifiers, see Searching.
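As a hedged sketch of that last point, here is a query through the Hibernate ORM integration, reusing the made-up Book entity from the earlier examples; hits come back as managed entities rather than raw documents.

import jakarta.persistence.EntityManager;
import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.session.SearchSession;

import java.util.List;

public class SearchExample {
    public static List<Book> booksAbout(EntityManager entityManager, String terms) {
        SearchSession searchSession = Search.session(entityManager);
        return searchSession.search(Book.class)   // search the Book index
                .where(f -> f.match()
                        .field("title")           // document field defined by the mapping
                        .matching(terms))         // analyzed, then looked up in the index
                .fetchHits(20);                   // hits resolved back to Book entities
    }
}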
4.4. Binding
While the mapping definition is declarative, these declarations need to be interpreted and actually applied to the domain model.
That’s what Hibernate Search calls "binding": during startup, a given mapping instruction (e.g. @GenericField) will result in a "binder" being instantiated and called, giving it an opportunity to inspect the part of the domain model it’s applied to and to "bind" (assign) a component to that part of the model — for example a "bridge", responsible for extracting data from an entity during indexing.
Hibernate Search comes with binders and bridges for many common use cases, and also provides the ability to plug in custom binders and bridges.
For more information, in particular on how to plug in custom binders and bridges, see Binding and bridges.
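As an illustration, here is a hedged sketch of a custom value bridge; the Isbn type and all names are invented for the example, and the real binding contracts offer many more options.

import org.hibernate.search.mapper.pojo.bridge.ValueBridge;
import org.hibernate.search.mapper.pojo.bridge.runtime.ValueBridgeToIndexedValueContext;

// A custom property type that Hibernate Search cannot index out of the box.
class Isbn {
    private final String value;
    Isbn(String value) { this.value = value; }
    String stringValue() { return value; }
}

// The "bridge": extracts an indexable String from an Isbn during indexing.
public class IsbnBridge implements ValueBridge<Isbn, String> {
    @Override
    public String toIndexedValue(Isbn value, ValueBridgeToIndexedValueContext context) {
        return value == null ? null : value.stringValue();
    }
}

// In the entity, the mapping instruction then references the bridge, e.g.:
//   @GenericField(valueBridge = @ValueBridgeRef(type = IsbnBridge.class))
//   private Isbn isbn;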
4.5. Analysis
As mentioned in Full-text search, the full-text engine works on tokens, which means text has to be processed both when indexing (document processing, to build the token → document index) and when searching (query processing, to generate a list of tokens to look up).
However, the processing is not just about "tokenizing". Index lookups are exact lookups, which means that looking up Great (capitalized) will not return documents containing only great (all lowercase). An extra step is performed when processing text to address this caveat: token filtering, which normalizes tokens. Thanks to that "normalization", Great will be indexed as great, so that an index lookup for the query great will match as expected.
In the Lucene world (Lucene, Elasticsearch, Solr, …), text processing during both the indexing and searching phases is called "analysis" and is performed by an "analyzer".
The analyzer is made up of three types of components, which will each process the text successively in the following order:
-
Character filter: transforms the input characters. Replaces, adds or removes characters.
-
Tokenizer: splits the text into several words, called "tokens".
-
Token filter: transforms the tokens. Replaces, adds or removes characters in a token, derives new tokens from the existing ones, removes tokens based on some condition, …
The tokenizer usually splits on whitespace (though there are other options). Token filters are usually where customization takes place. They can remove accented characters, remove meaningless suffixes (-ing, -s, …) or tokens (a, the, …), replace tokens with a chosen spelling (wi-fi ⇒ wifi), etc.
Character filters, though useful, are rarely used, because they have no knowledge of token boundaries.
Unless you know what you are doing, you should generally favor token filters.
In some cases, it is necessary to index text in one block, without any tokenization:
-
For some types of text, such as SKUs or other business codes, tokenization simply does not make sense: the text is a single "keyword".
-
For sorts by field value, tokenization is not necessary. It is also forbidden in Hibernate Search due to performance issues; only non-tokenized fields can be sorted on.
To address these use cases, a special type of analyzer, called "normalizer", is available. Normalizers are simply analyzers that are guaranteed not to use a tokenizer: they can only use character filters and token filters.
In Hibernate Search, analyzers and normalizers are referenced by their name, for example when defining a full-text field. Analyzers and normalizers have two separate namespaces.
Some names are already assigned to built-in analyzers (in Elasticsearch in particular), but it is possible (and recommended) to assign names to custom analyzers and normalizers, assembled using built-in components (tokenizers, filters) to address your specific needs.
Each backend exposes its own APIs to define analyzers and normalizers, and more generally to configure analysis. See the documentation of each backend for more information.
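For instance, with the Lucene backend, analysis can be configured roughly as below. This is a sketch assuming a recent Hibernate Search version where components are referenced by name; "standard", "lowercase" and "asciiFolding" are built-in Lucene component names, while the "english" and "sort" analyzer/normalizer names are arbitrary choices for the example.

import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class MyLuceneAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // An analyzer: optional character filters, a tokenizer, then token filters.
        context.analyzer("english").custom()
                .tokenizer("standard")        // split the text into tokens
                .tokenFilter("lowercase")     // Great -> great
                .tokenFilter("asciiFolding"); // café -> cafe

        // A normalizer: no tokenizer, the whole text stays one "keyword" token.
        context.normalizer("sort").custom()
                .tokenFilter("lowercase")
                .tokenFilter("asciiFolding");
    }
}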
4.6. Commit and refresh
In order to get the best throughput when indexing and when searching, both Elasticsearch and Lucene rely on "buffers" when writing to and reading from the index:
-
When writing, changes are not directly written to the index, but to an "index writer" that buffers changes in-memory or in temporary files.
The changes are "pushed" to the actual index when the writer is committed. Until the commit happens, uncommitted changes are in an "unsafe" state: if the application crashes or if the server suffers from a power loss, uncommitted changes will be lost.
-
When reading, e.g. when executing a search query, data is not read directly from the index, but from an "index reader" that exposes a view of the index as it was at some point in the past.
The view is updated when the reader is refreshed. Until the refresh happens, results of search queries might be slightly out of date: documents added since the last refresh will be missing, documents deleted since the last refresh will still be there, etc.
Unsafe changes and out-of-sync indexes are obviously undesirable, but they are a trade-off that improves performance.
Different factors influence when refreshes and commits happen:
-
Listener-triggered indexing and explicit indexing will, by default, require that a commit of the index writer is performed after each set of changes, meaning the changes are safe after the Hibernate ORM transaction commit returns (for the Hibernate ORM integration) or the SearchSession's close() method returns (for the Standalone POJO Mapper). However, no refresh is requested by default, meaning the changes may only be visible at a later time, when the backend decides to refresh the index reader. This behavior can be customized by setting a different synchronization strategy.
-
The mass indexer will not require any commit or refresh until the very end of mass indexing, to maximize indexing throughput.
-
Whenever there are no particular commit or refresh requirements, backend defaults will apply:
-
See here for Elasticsearch.
-
See here for Lucene.
-
A commit may be forced explicitly through the flush() API.
-
A refresh may be forced explicitly through the refresh() API, as shown in the sketch below.
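Here is a hedged sketch of the two explicit APIs from the last two points, using the Hibernate ORM integration and the made-up Book entity from earlier examples.

import jakarta.persistence.EntityManager;
import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.session.SearchSession;
import org.hibernate.search.mapper.orm.work.SearchWorkspace;

public class CommitRefreshExample {
    public static void forceCommitAndRefresh(EntityManager entityManager) {
        SearchSession searchSession = Search.session(entityManager);
        SearchWorkspace workspace = searchSession.workspace(Book.class);
        workspace.flush();   // force a commit of the index writer: changes become safe
        workspace.refresh(); // force a refresh of the index reader: changes become visible
    }
}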
Even though we use the word "commit", this is not the same concept as a commit in relational database transactions: there is no transaction and no "rollback" is possible.
There is no concept of isolation, either. After a refresh, all changes to the index are taken into account: those committed to the index, but also those that are still buffered in the index writer.
For this reason, commits and refreshes can be treated as completely orthogonal concepts: certain setups will occasionally lead to committed changes not being visible in search queries, while others will allow even uncommitted changes to be visible in search queries.
4.7. Sharding and routing
Sharding consists in splitting index data into multiple "smaller indexes", called shards, in order to improve performance when dealing with large amounts of data.
In Hibernate Search, similarly to Elasticsearch, another concept is closely related to sharding: routing. Routing consists in resolving a document identifier, or generally any string called a "routing key", into the corresponding shard.
When indexing:
-
A document identifier and optionally a routing key are generated from the indexed entity.
-
The document, along with its identifier and optionally its routing key, is passed to the backend.
-
The backend "routes" the document to the correct shard, and adds the routing key (if any) to a special field in the document (so that it’s indexed).
-
The document is indexed in that shard.
When searching:
-
The search query can optionally be passed one or more routing keys.
-
If no routing key is passed, the query will be executed on all shards.
-
If one or more routing keys are passed:
-
The backend resolves these routing keys into a set of shards, and the query will be executed on those shards only, ignoring the other shards.
-
A filter is added to the query so that only documents indexed with one of the given routing keys are matched.
Sharding, then, can be leveraged to boost performance in two ways:
-
When indexing: a sharded index can spread the "stress" onto multiple shards, which can be located on different disks (Lucene) or different servers (Elasticsearch).
-
When searching: if one property, let’s call it category, is often used to select a subset of documents, this property can be defined as a routing key in the mapping, so that it’s used to route documents instead of the document ID. As a result, documents with the same value for category will be indexed in the same shard. Then when searching, if a query already filters documents so that it is known that the hits will all have the same value for category, the query can be manually routed to the shards containing documents with this value, and the other shards can be ignored.
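For the category example in the last point, here is a hedged sketch of a manually routed query, assuming category was mapped as a routing key (see the routing bridge sketch further below):

import jakarta.persistence.EntityManager;
import org.hibernate.search.mapper.orm.Search;

import java.util.List;

public class RoutedSearchExample {
    public static List<Book> booksInCategory(EntityManager entityManager, String category) {
        return Search.session(entityManager).search(Book.class)
                .where(f -> f.match().field("title").matching("robot"))
                // Resolve the routing key to its shard(s); other shards are ignored.
                .routing(category)
                .fetchHits(20);
    }
}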
To enable sharding, some configuration is required:
-
The backends require explicit configuration: see here for Lucene and here for Elasticsearch.
-
In most cases, document IDs are used to route documents to shards by default. This does not allow taking advantage of routing when searching, which requires multiple documents to share the same routing key. Applying routing to a search query in that case will return at most one result. To explicitly define the routing key to assign to each document, assign routing bridges to your entities.
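For illustration, here is a hedged sketch of such a routing bridge, routing documents by a made-up category property on the Book entity. The previousRoutes() method must list every routing key the document may previously have been indexed with, so that stale copies in other shards can be deleted.

import org.hibernate.search.mapper.pojo.bridge.RoutingBridge;
import org.hibernate.search.mapper.pojo.bridge.binding.RoutingBindingContext;
import org.hibernate.search.mapper.pojo.bridge.mapping.programmatic.RoutingBinder;
import org.hibernate.search.mapper.pojo.bridge.runtime.RoutingBridgeRouteContext;
import org.hibernate.search.mapper.pojo.route.DocumentRoutes;

public class BookCategoryRoutingBinder implements RoutingBinder {
    @Override
    public void bind(RoutingBindingContext context) {
        context.dependencies().use("category"); // reindex when "category" changes
        context.bridge(Book.class, new Bridge());
    }

    public static class Bridge implements RoutingBridge<Book> {
        @Override
        public void route(DocumentRoutes routes, Object entityIdentifier,
                Book entity, RoutingBridgeRouteContext context) {
            routes.addRoute().routingKey(entity.getCategory());
        }

        @Override
        public void previousRoutes(DocumentRoutes routes, Object entityIdentifier,
                Book entity, RoutingBridgeRouteContext context) {
            // Assuming the category never changes once the book is indexed;
            // otherwise every possible previous routing key must be added here.
            route(routes, entityIdentifier, entity, context);
        }
    }
}
// Applied to the entity with:
//   @Indexed(routingBinder = @RoutingBinderRef(type = BookCategoryRoutingBinder.class))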
Sharding is static by nature: each index is expected to have the same shards, with the same identifiers, from one boot to the next. Changing the number of shards or their identifiers will require full reindexing.