Weaviate

This section will walk you through setting up the Weaviate VectorStore to store document embeddings and perform similarity searches.

What is Weaviate?

Weaviate is an open-source vector database. It allows you to store data objects and vector embeddings from your favorite ML-models and scale seamlessly into billions of data objects. It provides tools to store document embeddings, content, and metadata and to search through those embeddings, including metadata filtering.

Prerequisites

  1. EmbeddingClient instance to compute the document embeddings. Several options are available:

    • Transformers Embedding - computes the embedding in your local environment. Follow the ONNX Transformers Embedding instructions.

    • OpenAI Embedding - uses the OpenAI embedding endpoint. You need to create an account at OpenAI Signup and generate an API key at API Keys.

    • You can also use the Azure OpenAI Embedding or the PostgresML Embedding Client.

  2. Weaviate cluster. You can set up a cluster locally in a Docker container or create one with Weaviate Cloud Service. For the latter, you need to create a Weaviate account, set up a cluster, and get your API key from the dashboard details.

On startup, the WeaviateVectorStore creates the required SpringAiWeaviate object schema if it’s not already provisioned.

Dependencies

Add these dependencies to your project:

  • Embedding Client boot starter, required for calculating embeddings.

  • Transformers Embedding (local): add the dependency below and follow the ONNX Transformers Embedding instructions.

<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-transformers-spring-boot-starter</artifactId>
</dependency>

or use OpenAI (Cloud)

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

You’ll need to provide your OpenAI API Key. Set it as an environment variable like so:

export SPRING_AI_OPENAI_API_KEY='Your_OpenAI_API_Key'

  • Add the Weaviate VectorStore dependency:

<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-weaviate-store</artifactId>
</dependency>

Refer to the Dependency Management section to add the Spring AI BOM to your build file.

Usage

Create a WeaviateVectorStore instance connected to the local Weaviate cluster:

@Bean
public VectorStore vectorStore(EmbeddingClient embeddingClient) {
  WeaviateVectorStoreConfig config = WeaviateVectorStoreConfig.builder()
     .withScheme("http")
     .withHost("localhost:8080")
     // Define the metadata fields to be used
     // in the similarity search filters.
     .withFilterableMetadataFields(List.of(
        MetadataField.text("country"),
        MetadataField.number("year"),
        MetadataField.bool("active")))
     // Consistency level can be: ONE, QUORUM, or ALL.
     .withConsistencyLevel(ConsistentLevel.ONE)
     .build();

  return new WeaviateVectorStore(config, embeddingClient);
}

You must explicitly list the name and type (BOOLEAN, TEXT, or NUMBER) of every metadata field used as a key in a filter expression. The withFilterableMetadataFields call above registers three filterable metadata fields: country of type TEXT, year of type NUMBER, and active of type BOOLEAN.

If you add new entries to the filterable metadata fields, you must re-upload or update the documents that carry this metadata.
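
To illustrate why each filterable field needs a declared type, here is a minimal, self-contained sketch. It is not Spring AI code: the FieldType enum and the declaredFields and valueKeyFor helpers are hypothetical names invented for this example. It assumes that the declared type decides which typed value key a Weaviate where filter carries (valueText and valueNumber appear in the converted filter later in this section; valueBoolean is the corresponding Weaviate key for booleans).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FilterableFieldSketch {

    // Hypothetical stand-in for the three supported metadata field types.
    enum FieldType { TEXT, NUMBER, BOOLEAN }

    // The same three fields that the configuration above registers.
    static Map<String, FieldType> declaredFields() {
        Map<String, FieldType> fields = new LinkedHashMap<>();
        fields.put("country", FieldType.TEXT);
        fields.put("year", FieldType.NUMBER);
        fields.put("active", FieldType.BOOLEAN);
        return fields;
    }

    // Returns the typed value key for a registered field,
    // and rejects keys that were never declared as filterable.
    static String valueKeyFor(Map<String, FieldType> declared, String key) {
        FieldType type = declared.get(key);
        if (type == null) {
            throw new IllegalArgumentException("Metadata key not registered as filterable: " + key);
        }
        switch (type) {
            case TEXT:
                return "valueText";
            case NUMBER:
                return "valueNumber";
            default:
                return "valueBoolean";
        }
    }

    public static void main(String[] args) {
        Map<String, FieldType> declared = declaredFields();
        for (String key : declared.keySet()) {
            System.out.println(key + " -> " + valueKeyFor(declared, key));
        }
    }
}
```

In this sketch an undeclared key such as "city" fails fast with an exception. How the real store reacts to an undeclared key is not covered here.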

You can use the following Weaviate system metadata fields without explicit definition: id, _creationTimeUnix, and _lastUpdateTimeUnix.

Then in your main code, create some documents:

List<Document> documents = List.of(
   new Document("Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!!", Map.of("country", "UK", "active", true, "year", 2020)),
   new Document("The World is Big and Salvation Lurks Around the Corner", Map.of()),
   new Document("You walk forward facing the past and you turn back toward the future.", Map.of("country", "NL", "active", false, "year", 2023)));

Now add the documents to your vector store:

vectorStore.add(documents);

And finally, retrieve documents similar to a query:

List<Document> results = vectorStore.similaritySearch(
   SearchRequest
      .query("Spring")
      .withTopK(5));

If all goes well, you should retrieve the document containing the text "Spring AI rocks!!".
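
As a conceptual aid, the following self-contained sketch shows what a top-K similarity search does in principle: it ranks stored vectors by their similarity to a query vector and keeps the best K. It is only an illustration under stated assumptions. The vectors are toy values rather than real embeddings from an EmbeddingClient, cosine similarity is assumed as the metric, and the brute-force ranking here does not reflect how Weaviate indexes or searches vectors internally. The names TopKSketch, cosine, and topK are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopKSketch {

    // Cosine similarity between two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the indices of the k stored vectors most similar to the query,
    // ordered from most to least similar.
    static List<Integer> topK(double[] query, List<double[]> stored, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < stored.size(); i++) {
            indices.add(i);
        }
        indices.sort(Comparator.comparingDouble((Integer i) -> -cosine(query, stored.get(i))));
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }

    public static void main(String[] args) {
        // Toy two-dimensional "embeddings"; real ones have hundreds of dimensions.
        List<double[]> stored = List.of(
                new double[] {1.0, 0.1},
                new double[] {0.0, 1.0},
                new double[] {-1.0, 0.0});
        double[] query = {1.0, 0.0};
        System.out.println(topK(query, stored, 2));
    }
}
```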

Metadata filtering

You can leverage the generic, portable metadata filters with WeaviateVectorStore as well.

For example, you can use either the text expression language:

vectorStore.similaritySearch(
   SearchRequest
      .query("The World")
      .withTopK(TOP_K)
      .withSimilarityThreshold(SIMILARITY_THRESHOLD)
      .withFilterExpression("country in ['UK', 'NL'] && year >= 2020"));

or programmatically using the expression DSL:

FilterExpressionBuilder b = new FilterExpressionBuilder();

vectorStore.similaritySearch(
   SearchRequest
      .query("The World")
      .withTopK(TOP_K)
      .withSimilarityThreshold(SIMILARITY_THRESHOLD)
      .withFilterExpression(b.and(
         b.in("country", "UK", "NL"),
         b.gte("year", 2020)).build()));

The portable filter expressions are automatically converted into Weaviate's proprietary where filters. For example, the following portable filter expression:

country in ['UK', 'NL'] && year >= 2020

is converted into the following Weaviate GraphQL where filter expression:

operator:And
   operands:
      [{
         operator:Or
         operands:
            [{
               path:["meta_country"]
               operator:Equal
               valueText:"UK"
            },
            {
               path:["meta_country"]
               operator:Equal
               valueText:"NL"
            }]
      },
      {
         path:["meta_year"]
         operator:GreaterThanEqual
         valueNumber:2020
      }]
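
The shape of this conversion can be sketched in a few lines of plain Java. This is a simplified illustration, not the converter that Spring AI ships: the class and method names are hypothetical, only the in, gte, and and operations from the example are covered, and the output is a compact single-line string rather than the indented form shown above. The meta_ prefix and the expansion of in into an Or of Equal operands are taken from the converted filter above.

```java
import java.util.List;
import java.util.stream.Collectors;

class WhereFilterSketch {

    // Prefix applied to metadata keys, as seen in the converted filter above.
    static final String PREFIX = "meta_";

    // A single equality test on a text field.
    static String textEquals(String key, String value) {
        return "{path:[\"" + PREFIX + key + "\"] operator:Equal valueText:\"" + value + "\"}";
    }

    // "key in [v1, v2, ...]" expands to an Or over one Equal operand per value.
    static String in(String key, List<String> values) {
        String operands = values.stream()
                .map(value -> textEquals(key, value))
                .collect(Collectors.joining(","));
        return "{operator:Or operands:[" + operands + "]}";
    }

    // "key >= value" on a numeric field.
    static String gte(String key, long value) {
        return "{path:[\"" + PREFIX + key + "\"] operator:GreaterThanEqual valueNumber:" + value + "}";
    }

    // Combines two operands under a top-level And.
    static String and(String left, String right) {
        return "operator:And operands:[" + left + "," + right + "]";
    }

    public static void main(String[] args) {
        // country in ['UK', 'NL'] && year >= 2020
        System.out.println(and(in("country", List.of("UK", "NL")), gte("year", 2020)));
    }
}
```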

Run a Weaviate cluster in a Docker container

Start Weaviate in a Docker container:

docker run -it --rm --name weaviate \
  -e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true \
  -e PERSISTENCE_DATA_PATH=/var/lib/weaviate \
  -e QUERY_DEFAULTS_LIMIT=25 \
  -e DEFAULT_VECTORIZER_MODULE=none \
  -e CLUSTER_HOSTNAME=node1 \
  -p 8080:8080 \
  semitechnologies/weaviate:1.22.4

This starts a Weaviate cluster at http://localhost:8080/v1 with scheme=http, host=localhost:8080, and apiKey="". Then follow the usage instructions.
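
Before wiring up the vector store, it can help to confirm that the container is actually accepting requests. The sketch below is one possible way to do that with the JDK's built-in HTTP client (Java 11 or later). It assumes Weaviate's readiness endpoint at /v1/.well-known/ready and the scheme and host values from the command above; the class and method names are hypothetical and are not part of Spring AI.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class WeaviateReadyCheck {

    // Builds the readiness URL from the same scheme and host used in the store configuration.
    static URI readinessUri(String scheme, String host) {
        return URI.create(scheme + "://" + host + "/v1/.well-known/ready");
    }

    // Returns true only if the endpoint answers with HTTP 200 within the timeout.
    static boolean isReady(String scheme, String host) {
        try {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(2))
                    .build();
            HttpRequest request = HttpRequest.newBuilder(readinessUri(scheme, host))
                    .timeout(Duration.ofSeconds(2))
                    .GET()
                    .build();
            HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
            return response.statusCode() == 200;
        } catch (IOException e) {
            // Connection refused, timeout, or similar: treat as not ready.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Weaviate ready: " + isReady("http", "localhost:8080"));
    }
}
```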