ETL Pipeline
The Extract, Transform, and Load (ETL) framework serves as the backbone of data processing within the Retrieval Augmented Generation (RAG) use case.
The ETL pipeline orchestrates the flow from raw data sources to a structured vector store, ensuring data is in the optimal format for retrieval by the AI model.
The RAG use case augments the capabilities of generative models by retrieving relevant information from a body of data to enhance the quality and relevance of the generated output.
API Overview
There are three main components of the ETL pipeline:

- DocumentReader that implements Supplier<List<Document>>
- DocumentTransformer that implements Function<List<Document>, List<Document>>
- DocumentWriter that implements Consumer<List<Document>>
The Document class contains text and metadata and is created from PDFs, text files, and other document types via the DocumentReader.
To construct a simple ETL pipeline, you can chain together an instance of each type.
Let’s say we have the following instances of those three ETL types:

- PagePdfDocumentReader, an implementation of DocumentReader
- TokenTextSplitter, an implementation of DocumentTransformer
- VectorStore, an implementation of DocumentWriter
To perform the basic loading of data into a Vector Database for use with the Retrieval Augmented Generation pattern, use the following code.
vectorStore.accept(tokenTextSplitter.apply(pdfReader.get()));
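Because the three components are plain java.util.function types, the one-liner above is simple function composition. Below is a minimal, self-contained sketch of the same chain under stated assumptions: the Document record is a simplified stand-in for Spring AI's actual Document class, and the lambdas play the roles of the PDF reader, TokenTextSplitter, and vector store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Simplified stand-in for Spring AI's Document class (text only, for illustration)
record Document(String text) {}

class EtlPipelineSketch {

    // Reader: supplies raw documents (hard-coded text instead of a PDF)
    static Supplier<List<Document>> reader =
            () -> List.of(new Document("spring ai etl pipeline"));

    // Transformer: splits each document into word-sized chunks,
    // playing the role of TokenTextSplitter
    static Function<List<Document>, List<Document>> splitter =
            docs -> docs.stream()
                    .flatMap(d -> Arrays.stream(d.text().split(" ")).map(Document::new))
                    .toList();

    // Writer: collects the chunks in memory, standing in for a VectorStore
    static List<Document> store = new ArrayList<>();
    static Consumer<List<Document>> writer = store::addAll;

    public static void main(String[] args) {
        // Same shape as vectorStore.accept(tokenTextSplitter.apply(pdfReader.get()))
        writer.accept(splitter.apply(reader.get()));
        System.out.println(store.size()); // prints 4
    }
}
```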
Getting Started
To begin creating a Spring AI RAG application, follow these steps:
- Download the latest Spring CLI Release and follow the installation instructions.
- To create a simple OpenAI-based application, use the command:

spring boot new --from ai-rag --name myrag

- Consult the generated README.md file for guidance on obtaining an OpenAI API Key and running your first AI RAG application.
ETL Interfaces and Implementations
The ETL pipeline is composed of the following interfaces and implementations. A detailed ETL class diagram is shown in the ETL Class Diagram section.
DocumentReader
Provides a source of documents from diverse origins.
public interface DocumentReader extends Supplier<List<Document>> {
}
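Any class that supplies a List<Document> can act as a reader. As a minimal sketch (the InMemoryDocumentReader name and the simplified Document record are illustrative stand-ins, not Spring AI types), a reader that serves fixed in-memory strings might look like this; the built-in readers below pull their content from external resources instead.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative stand-in for Spring AI's Document class
record Document(String text) {}

// Hypothetical reader that supplies fixed in-memory strings as documents;
// real readers (JsonReader, TextReader, PagePdfDocumentReader, ...) read
// from external resources instead
class InMemoryDocumentReader implements Supplier<List<Document>> {
    private final List<String> texts;

    InMemoryDocumentReader(List<String> texts) {
        this.texts = texts;
    }

    @Override
    public List<Document> get() {
        // Wrap each string in a Document
        return texts.stream().map(Document::new).toList();
    }
}
```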
JsonReader
The JsonReader parses documents in JSON format.
Example:
@Component
public class MyAiApp {

    @Value("classpath:bikes.json") // This is the json document to load
    private Resource resource;

    List<Document> loadJsonAsDocuments() {
        JsonReader jsonReader = new JsonReader(resource, "description");
        return jsonReader.get();
    }
}
TextReader
The TextReader processes plain text documents.
Example:
@Component
public class MyTextReader {

    @Value("classpath:text-source.txt") // This is the text document to load
    private Resource resource;

    List<Document> loadText() {
        TextReader textReader = new TextReader(resource);
        textReader.getCustomMetadata().put("filename", "text-source.txt");
        return textReader.get();
    }
}
PagePdfDocumentReader
The PagePdfDocumentReader uses the Apache PdfBox library to parse PDF documents.
Example:
@Component
public class MyPagePdfDocumentReader {

    List<Document> getDocsFromPdf() {
        PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample1.pdf",
                PdfDocumentReaderConfig.builder()
                        .withPageTopMargin(0)
                        .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                                .withNumberOfTopTextLinesToDelete(0)
                                .build())
                        .withPagesPerDocument(1)
                        .build());
        return pdfReader.get();
    }
}
ParagraphPdfDocumentReader
The ParagraphPdfDocumentReader uses the PDF catalog (e.g., TOC) information to split the input PDF into text paragraphs and outputs a single Document per paragraph.
NOTE: Not all PDF documents contain the PDF catalog.
Example:
@Component
public class MyParagraphPdfDocumentReader {

    List<Document> getDocsFromPdfWithCatalog() {
        ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader("classpath:/sample1.pdf",
                PdfDocumentReaderConfig.builder()
                        .withPageTopMargin(0)
                        .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                                .withNumberOfTopTextLinesToDelete(0)
                                .build())
                        .withPagesPerDocument(1)
                        .build());
        return pdfReader.get();
    }
}
TikaDocumentReader
The TikaDocumentReader uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of supported formats, refer to the Tika documentation.
Example:
@Component
public class MyTikaDocumentReader {

    @Value("classpath:/word-sample.docx") // This is the word document to load
    private Resource resource;

    List<Document> loadText() {
        TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(resource);
        return tikaDocumentReader.get();
    }
}
DocumentTransformer
Transforms a batch of documents as part of the processing workflow.
public interface DocumentTransformer extends Function<List<Document>, List<Document>> {
}
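Since DocumentTransformer is just a Function over document batches, a custom transformer only needs an apply method. A minimal sketch, assuming a simplified stand-in Document record (the StampingTransformer name and the tag field are illustrative, not Spring AI API):

```java
import java.util.List;
import java.util.function.Function;

// Illustrative stand-in for Spring AI's Document: text plus a single metadata tag
record Document(String text, String tag) {}

// Hypothetical transformer that tags every document in the batch;
// a real implementation like TokenTextSplitter would instead split text into chunks
class StampingTransformer implements Function<List<Document>, List<Document>> {
    @Override
    public List<Document> apply(List<Document> documents) {
        // Produce a new batch with the tag applied to each document
        return documents.stream()
                .map(d -> new Document(d.text(), "etl-processed"))
                .toList();
    }
}
```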
DocumentWriter
Manages the final stage of the ETL process, preparing documents for storage.
public interface DocumentWriter extends Consumer<List<Document>> {
}
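A writer is any Consumer of a document batch. As a hedged sketch (the InMemoryDocumentWriter name and the simplified Document record are illustrative stand-ins), a writer that merely collects documents in memory looks like this; a real DocumentWriter such as a VectorStore implementation would embed and persist them instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative stand-in for Spring AI's Document class
record Document(String text) {}

// Hypothetical writer that stores documents in an in-memory list
class InMemoryDocumentWriter implements Consumer<List<Document>> {
    private final List<Document> stored = new ArrayList<>();

    @Override
    public void accept(List<Document> documents) {
        // Append the whole batch to the backing store
        stored.addAll(documents);
    }

    public List<Document> stored() {
        return stored;
    }
}
```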
VectorStore
Provides integration with various vector stores. See the Vector DB Documentation for a full listing.