ETL Pipeline

The Extract, Transform, and Load (ETL) framework serves as the backbone of data processing within the Retrieval Augmented Generation (RAG) use case.

The ETL pipeline orchestrates the flow from raw data sources to a structured vector store, ensuring data is in the optimal format for retrieval by the AI model.

The RAG use case is a technique for augmenting the capabilities of generative models by retrieving relevant information from a body of data to enhance the quality and relevance of the generated output.

API Overview

There are three main components of the ETL pipeline:

  • DocumentReader that implements Supplier<List<Document>>

  • DocumentTransformer that implements Function<List<Document>, List<Document>>

  • DocumentWriter that implements Consumer<List<Document>>

The Document class contains text and metadata and is created from PDFs, text files and other document types via the DocumentReader.

To construct a simple ETL pipeline, you can chain together an instance of each type.

[Image: ETL pipeline]

Let's say we have the following instances of those three ETL types:

  • PagePdfDocumentReader an implementation of DocumentReader

  • TokenTextSplitter an implementation of DocumentTransformer

  • VectorStore an implementation of DocumentWriter

To perform the basic loading of data into a Vector Database for use with the Retrieval Augmented Generation pattern, use the following code.

vectorStore.accept(tokenTextSplitter.apply(pdfReader.get()));
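
Wired into a Spring component, the same chain might look like the following sketch. The bean injection, sample PDF path, and default splitter settings here are illustrative assumptions, not part of the documented API.

```java
@Component
public class RagDataLoader {

	// A VectorStore bean serves as the DocumentWriter at the end of the pipeline.
	private final VectorStore vectorStore;

	public RagDataLoader(VectorStore vectorStore) {
		this.vectorStore = vectorStore;
	}

	public void load() {
		// Extract: read one Document per PDF page (the sample path is hypothetical).
		PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample.pdf");
		// Transform: split pages into token-bounded chunks using default settings.
		TokenTextSplitter tokenTextSplitter = new TokenTextSplitter();
		// Load: write the chunks into the vector store.
		vectorStore.accept(tokenTextSplitter.apply(pdfReader.get()));
	}
}
```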

Getting Started

To begin creating a Spring AI RAG application, follow these steps:

  1. Download the latest Spring CLI Release and follow the installation instructions.

  2. To create a simple OpenAI-based application, use the command:

spring boot new --from ai-rag --name myrag

  3. Consult the generated README.md file for guidance on obtaining an OpenAI API Key and running your first AI RAG application.

ETL Interfaces and Implementations

The ETL pipeline is composed of the following interfaces and implementations. A detailed class diagram is shown in the ETL Class Diagram section.

DocumentReader

Provides a source of documents from diverse origins.

public interface DocumentReader extends Supplier<List<Document>> {

}
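
Because DocumentReader is just a Supplier, writing a custom reader is straightforward. The following in-memory reader is an illustrative sketch, not a class shipped with Spring AI:

```java
public class InMemoryDocumentReader implements DocumentReader {

	private final List<String> texts;

	public InMemoryDocumentReader(List<String> texts) {
		this.texts = texts;
	}

	@Override
	public List<Document> get() {
		// Wrap each raw string in a Document with no extra metadata.
		return texts.stream().map(Document::new).toList();
	}
}
```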

JsonReader

The JsonReader parses documents in JSON format.

Example:

@Component
public class MyAiApp {

	@Value("classpath:bikes.json") // This is the json document to load
	private Resource resource;

	List<Document> loadJsonAsDocuments() {
		JsonReader jsonReader = new JsonReader(resource, "description");
		return jsonReader.get();
	}
}

TextReader

The TextReader processes plain text documents.

Example:

@Component
public class MyTextReader {

	@Value("classpath:text-source.txt") // This is the text document to load
	private Resource resource;

	List<Document> loadText() {
		TextReader textReader = new TextReader(resource);
		textReader.getCustomMetadata().put("filename", "text-source.txt");

		return textReader.get();
	}
}

PagePdfDocumentReader

The PagePdfDocumentReader uses the Apache PdfBox library to parse PDF documents.

Example:

@Component
public class MyPagePdfDocumentReader {

	List<Document> getDocsFromPdf() {

		PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample1.pdf",
				PdfDocumentReaderConfig.builder()
					.withPageTopMargin(0)
					.withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
						.withNumberOfTopTextLinesToDelete(0)
						.build())
					.withPagesPerDocument(1)
					.build());

		return pdfReader.get();
	}
}

ParagraphPdfDocumentReader

The ParagraphPdfDocumentReader uses the PDF catalog (i.e., TOC) information to split the input PDF into text paragraphs and outputs a single Document per paragraph. NOTE: Not all PDF documents contain a PDF catalog.

Example:

@Component
public class MyParagraphPdfDocumentReader {

	List<Document> getDocsFromPdfWithCatalog() {

		ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader("classpath:/sample1.pdf",
				PdfDocumentReaderConfig.builder()
					.withPageTopMargin(0)
					.withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
						.withNumberOfTopTextLinesToDelete(0)
						.build())
					.withPagesPerDocument(1)
					.build());

		return pdfReader.get();
	}
}

TikaDocumentReader

The TikaDocumentReader uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of supported formats, refer to the Tika documentation.

Example:

@Component
public class MyTikaDocumentReader {

	@Value("classpath:/word-sample.docx") // This is the Word document to load
	private Resource resource;

	List<Document> loadText() {
		TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(resource);
		return tikaDocumentReader.get();
	}
}

DocumentTransformer

Transforms a batch of documents as part of the processing workflow.

public interface DocumentTransformer extends Function<List<Document>, List<Document>> {

}
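
A custom transformer only needs to implement apply. The following whitespace-trimming transformer is an illustrative sketch (it assumes a getContent() accessor on Document), not a class shipped with Spring AI:

```java
public class WhitespaceTrimmer implements DocumentTransformer {

	@Override
	public List<Document> apply(List<Document> documents) {
		// Rebuild each Document with trimmed text, preserving its metadata.
		return documents.stream()
			.map(doc -> new Document(doc.getContent().trim(), doc.getMetadata()))
			.toList();
	}
}
```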

TextSplitter

The TextSplitter is an abstract base class that helps divide documents to fit the AI model's context window.

TokenTextSplitter

Splits documents while preserving token-level integrity.
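
A minimal usage sketch follows; the no-arg constructor is assumed to apply the splitter's default chunk settings, so consult the TokenTextSplitter API for tuning options.

```java
@Component
public class MyTokenSplitter {

	List<Document> splitDocuments(List<Document> documents) {
		// Default chunk settings are an assumption; the splitter may expose tuning parameters.
		TokenTextSplitter tokenTextSplitter = new TokenTextSplitter();
		return tokenTextSplitter.apply(documents);
	}
}
```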

ContentFormatTransformer

Ensures uniform content formats across all documents.

KeywordMetadataEnricher

Augments documents with essential keyword metadata.

SummaryMetadataEnricher

Enriches documents with summarization metadata for enhanced retrieval.

DocumentWriter

Manages the final stage of the ETL process, preparing documents for storage.

public interface DocumentWriter extends Consumer<List<Document>> {

}
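
As with the reader and transformer, a custom writer only needs to implement accept. This console-logging writer is an illustrative sketch, not part of Spring AI:

```java
public class ConsoleDocumentWriter implements DocumentWriter {

	@Override
	public void accept(List<Document> documents) {
		// Print a short summary of each document instead of persisting it;
		// purely for demonstration.
		documents.forEach(doc -> System.out.println(doc.getMetadata()));
	}
}
```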

FileDocumentWriter

Persists documents to a file.

VectorStore

Provides integration with various vector stores. See Vector DB Documentation for a full listing.

ETL Class Diagram

The following class diagram illustrates the ETL interfaces and implementations.

[Image: ETL class diagram]