Source-Code Walkthrough of the RAG ETL Pipeline

Original article: SpringAI (GA): Source-Code Walkthrough of ETL under RAG

About This Tutorial

Note: this tutorial is based on the official GA release of May 20, 2025, and provides:

  1. Quick-start tutorials for the core feature modules
  2. Source-level walkthroughs of the core feature modules
  3. Quick starts plus source-level walkthroughs of the Spring AI Alibaba enhancements

Versions: JDK 21 + Spring Boot 3.4.5 + Spring AI 1.0.0 + Spring AI Alibaba 1.0.0.2

The chapters below will be published over time. This installment is the ETL-pipeline source-code walkthrough from Chapter 6 (improving answer quality with RAG).

The code is open source at: https://github.com/GTyingzi/spring-ai-tutorial

Earlier WeChat-post walkthroughs:

Chapter 1

SpringAI (GA) Chat: Quick Start + Auto-Configuration Source Walkthrough

SpringAI (GA): ChatClient Call-Chain Walkthrough

Chapter 2

Spring AI Advisors: Quick Start + Source Walkthrough

SpringAI (GA): Quick Start with SQLite, MySQL, and Redis Message Storage

Chapter 3

SpringAI (GA): Tool Integration Quick Start

Chapter 5

SpringAI (GA): Quick Start with In-Memory, Redis, and ES Vector Stores

SpringAI (GA): Vector Store Theory and Source Walkthrough + Redis/ES Integration Source

Chapter 6

SpringAI (GA): RAG Quick Start + Modular Walkthrough

SpringAI (GA): ETL Quick Start under RAG

For a better reading experience, paid access to the Feishu cloud doc of the latest Spring AI tutorial is available (currently ¥49.9; the price will rise as the content grows).

Note: the Feishu docs for the M6 quick-start and source-walkthrough tutorials are already free.

ETL Pipeline Source-Code Walkthrough

DocumentReader (interface for reading document data)

package org.springframework.ai.document;

import java.util.List;
import java.util.function.Supplier;

public interface DocumentReader extends Supplier<List<Document>> {
    default List<Document> read() {
        return this.get();
    }
}
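
Because DocumentReader is nothing more than a Supplier<List<Document>>, any class (or lambda) that produces a list of documents can act as a reader. A minimal stdlib-only sketch of the pattern, using a simplified stand-in Document record rather than the real org.springframework.ai.document.Document:

```java
import java.util.List;
import java.util.function.Supplier;

// Simplified stand-ins for illustration only; the real Document and
// DocumentReader live in org.springframework.ai.document.
record Document(String text) {}

interface DocumentReader extends Supplier<List<Document>> {
    // read() is just an alias for get(), mirroring the Spring AI interface
    default List<Document> read() {
        return get();
    }
}

class InMemoryReader implements DocumentReader {
    private final List<String> lines;

    InMemoryReader(List<String> lines) {
        this.lines = lines;
    }

    @Override
    public List<Document> get() {
        // One Document per input line
        return lines.stream().map(Document::new).toList();
    }
}
```

Every concrete reader below (TextReader, JsonReader, and so on) follows this shape: all reading logic lives in get(), and read() simply delegates to it.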

TextReader

Reads text content from a resource and converts it into a Document object.

  • Resource resource: the resource to read
  • Map<String, Object> customMetadata: metadata associated with the resulting Document
  • Charset charset: the character set used when reading the text; defaults to UTF-8

Method descriptions

  • TextReader: constructs the reader from a resource URL or a Resource object
  • setCharset: sets the charset used when reading text; defaults to UTF-8
  • getCharset: returns the charset currently in use
  • getCustomMetadata: returns the custom metadata map
  • get: reads the text and returns a list of Document objects
  • getResourceIdentifier: returns a unique identifier for the resource (filename, URI, URL, or description)
package org.springframework.ai.reader;

import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.StreamUtils;

public class TextReader implements DocumentReader {
    public static final String CHARSET_METADATA = "charset";
    public static final String SOURCE_METADATA = "source";
    private final Resource resource;
    private final Map<String, Object> customMetadata;
    private Charset charset;

    public TextReader(String resourceUrl) {
        this((new DefaultResourceLoader()).getResource(resourceUrl));
    }

    public TextReader(Resource resource) {
        this.customMetadata = new HashMap<>();
        this.charset = StandardCharsets.UTF_8;
        Objects.requireNonNull(resource, "The Spring Resource must not be null");
        this.resource = resource;
    }

    public Charset getCharset() {
        return this.charset;
    }

    public void setCharset(Charset charset) {
        Objects.requireNonNull(charset, "The charset must not be null");
        this.charset = charset;
    }

    public Map<String, Object> getCustomMetadata() {
        return this.customMetadata;
    }

    public List<Document> get() {
        try {
            String document = StreamUtils.copyToString(this.resource.getInputStream(), this.charset);
            this.customMetadata.put("charset", this.charset.name());
            this.customMetadata.put("source", this.resource.getFilename());
            // the second put overwrites the filename with a more robust identifier
            this.customMetadata.put("source", this.getResourceIdentifier(this.resource));
            return List.of(new Document(document, this.customMetadata));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    protected String getResourceIdentifier(Resource resource) {
        String filename = resource.getFilename();
        if (filename != null && !filename.isEmpty()) {
            return filename;
        } else {
            try {
                URI uri = resource.getURI();
                if (uri != null) {
                    return uri.toString();
                }
            } catch (IOException e) {
                // ignored: fall back to the next identifier strategy
            }

            try {
                URL url = resource.getURL();
                if (url != null) {
                    return url.toString();
                }
            } catch (IOException e) {
                // ignored: fall back to the description below
            }

            return resource.getDescription();
        }
    }
}

JsonReader

Reads data from a JSON resource and converts it into Document objects.

  • Resource resource: the JSON resource to read
  • JsonMetadataGenerator jsonMetadataGenerator: generates metadata for the JSON data
  • ObjectMapper objectMapper: parses the JSON data
  • List<String> jsonKeysToUse: the JSON fields to extract as document content; if empty, the whole JSON object is used

Method descriptions

  • JsonReader: constructs the reader from a Resource object and the field names to extract
  • get: reads the JSON file and returns a list of Document objects
package org.springframework.ai.reader;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.core.io.Resource;

public class JsonReader implements DocumentReader {
    private final Resource resource;
    private final JsonMetadataGenerator jsonMetadataGenerator;
    private final ObjectMapper objectMapper;
    private final List<String> jsonKeysToUse;

    public JsonReader(Resource resource) {
        this(resource, new String[0]);
    }

    public JsonReader(Resource resource, String... jsonKeysToUse) {
        this(resource, new EmptyJsonMetadataGenerator(), jsonKeysToUse);
    }

    public JsonReader(Resource resource, JsonMetadataGenerator jsonMetadataGenerator, String... jsonKeysToUse) {
        this.objectMapper = new ObjectMapper();
        Objects.requireNonNull(jsonKeysToUse, "keys must not be null");
        Objects.requireNonNull(jsonMetadataGenerator, "jsonMetadataGenerator must not be null");
        Objects.requireNonNull(resource, "The Spring Resource must not be null");
        this.resource = resource;
        this.jsonMetadataGenerator = jsonMetadataGenerator;
        this.jsonKeysToUse = List.of(jsonKeysToUse);
    }

    public List<Document> get() {
        try {
            JsonNode rootNode = this.objectMapper.readTree(this.resource.getInputStream());
            return rootNode.isArray()
                    ? StreamSupport.stream(rootNode.spliterator(), true)
                            .map(jsonNode -> this.parseJsonNode(jsonNode, this.objectMapper))
                            .toList()
                    : Collections.singletonList(this.parseJsonNode(rootNode, this.objectMapper));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private Document parseJsonNode(JsonNode jsonNode, ObjectMapper objectMapper) {
        Map<String, Object> item = objectMapper.convertValue(jsonNode, new TypeReference<Map<String, Object>>() {
        });
        StringBuilder sb = new StringBuilder();
        this.jsonKeysToUse.stream()
                .filter(item::containsKey)
                .forEach(key -> sb.append(key).append(": ").append(item.get(key)).append(System.lineSeparator()));
        Map<String, Object> metadata = this.jsonMetadataGenerator.generate(item);
        String content = sb.isEmpty() ? item.toString() : sb.toString();
        return new Document(content, metadata);
    }

    protected List<Document> get(JsonNode rootNode) {
        return rootNode.isArray()
                ? StreamSupport.stream(rootNode.spliterator(), true)
                        .map(jsonNode -> this.parseJsonNode(jsonNode, this.objectMapper))
                        .toList()
                : Collections.singletonList(this.parseJsonNode(rootNode, this.objectMapper));
    }

    public List<Document> get(String pointer) {
        try {
            JsonNode rootNode = this.objectMapper.readTree(this.resource.getInputStream());
            JsonNode targetNode = rootNode.at(pointer);
            if (targetNode.isMissingNode()) {
                throw new IllegalArgumentException("Invalid JSON Pointer: " + pointer);
            } else {
                return this.get(targetNode);
            }
        } catch (IOException e) {
            throw new RuntimeException("Error reading JSON resource", e);
        }
    }
}
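
The heart of JsonReader is the content-building step in parseJsonNode: keep only the configured keys as "key: value" lines, and fall back to the whole object when no key matches. A stdlib-only sketch of that step (the class name JsonContentExtractor is made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JsonContentExtractor {
    // Mirrors JsonReader.parseJsonNode's content-building step:
    // keep only the configured keys, one "key: value" line each;
    // if none of the keys are present, fall back to the whole map.
    static String content(Map<String, Object> item, List<String> keysToUse) {
        StringBuilder sb = new StringBuilder();
        keysToUse.stream()
                .filter(item::containsKey)
                .forEach(key -> sb.append(key).append(": ")
                        .append(item.get(key)).append(System.lineSeparator()));
        return sb.isEmpty() ? item.toString() : sb.toString();
    }
}
```

This is why passing a narrow jsonKeysToUse list (say, just "title" and "body") produces compact embeddings-friendly documents, while an empty list dumps the entire JSON object into the content.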

JsoupDocumentReader

Extracts text content from HTML documents and converts it into Document objects.

Field meanings:

  • Resource htmlResource: the HTML resource to read
  • JsoupDocumentReaderConfig config: controls how the HTML is read, including the charset, the CSS selector, whether to extract all elements, and whether to group by element

Method descriptions

  • JsoupDocumentReader: constructs the reader from a resource URL, a Resource object, and/or an HTML-parsing config
  • get: reads the HTML file and returns a list of Document objects
package org.springframework.ai.reader.jsoup;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.jsoup.config.JsoupDocumentReaderConfig;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;

public class JsoupDocumentReader implements DocumentReader {
    private final Resource htmlResource;
    private final JsoupDocumentReaderConfig config;

    public JsoupDocumentReader(String htmlResource) {
        this((new DefaultResourceLoader()).getResource(htmlResource));
    }

    public JsoupDocumentReader(Resource htmlResource) {
        this(htmlResource, JsoupDocumentReaderConfig.defaultConfig());
    }

    public JsoupDocumentReader(String htmlResource, JsoupDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(htmlResource), config);
    }

    public JsoupDocumentReader(Resource htmlResource, JsoupDocumentReaderConfig config) {
        this.htmlResource = htmlResource;
        this.config = config;
    }

    public List<Document> get() {
        try (InputStream inputStream = this.htmlResource.getInputStream()) {
            org.jsoup.nodes.Document doc = Jsoup.parse(inputStream, this.config.charset, "");
            List<Document> documents = new ArrayList<>();
            if (this.config.allElements) {
                String allText = doc.body().text();
                Document document = new Document(allText);
                this.addMetadata(doc, document);
                documents.add(document);
            } else if (this.config.groupByElement) {
                for(Element element : doc.select(this.config.selector)) {
                    String elementText = element.text();
                    Document document = new Document(elementText);
                    this.addMetadata(doc, document);
                    documents.add(document);
                }
            } else {
                Elements elements = doc.select(this.config.selector);
                String text = elements.stream().map(Element::text).collect(Collectors.joining(this.config.separator));
                Document document = new Document(text);
                this.addMetadata(doc, document);
                documents.add(document);
            }

            return documents;
        } catch (IOException e) {
            throw new RuntimeException("Failed to read HTML resource: " + String.valueOf(this.htmlResource), e);
        }
    }

    private void addMetadata(org.jsoup.nodes.Document jsoupDoc, Document springDoc) {
        Map<String, Object> metadata = new HashMap<>();
        metadata.put("title", jsoupDoc.title());

        for(String metaTag : this.config.metadataTags) {
            String value = jsoupDoc.select("meta[name=" + metaTag + "]").attr("content");
            if (!value.isEmpty()) {
                metadata.put(metaTag, value);
            }
        }

        if (this.config.includeLinkUrls) {
            Elements links = jsoupDoc.select("a[href]");
            List<String> linkUrls = links.stream().map((link) -> link.attr("abs:href")).toList();
            metadata.put("linkUrls", linkUrls);
        }

        metadata.putAll(this.config.additionalMetadata);
        springDoc.getMetadata().putAll(metadata);
    }
}
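
The three branches of get() above amount to: allElements takes the whole body text as one Document, groupByElement emits one Document per selected element, and otherwise the selected texts are joined with the configured separator into a single Document. A stdlib-only sketch of that dispatch, with plain strings standing in for Jsoup elements (HtmlExtractionModes is a hypothetical name):

```java
import java.util.List;

class HtmlExtractionModes {
    // Mirrors JsoupDocumentReader.get()'s three branches, with the
    // selected elements' texts standing in for real Jsoup Elements.
    static List<String> extract(String bodyText, List<String> selectedTexts,
                                boolean allElements, boolean groupByElement,
                                String separator) {
        if (allElements) {
            return List.of(bodyText);                  // whole <body> as one Document
        }
        if (groupByElement) {
            return List.copyOf(selectedTexts);         // one Document per element
        }
        return List.of(String.join(separator, selectedTexts)); // joined into one
    }
}
```

Note that allElements wins over groupByElement when both are set, matching the if/else-if order in the real reader.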

JsoupDocumentReaderConfig

A configuration holder for JsoupDocumentReader:

  • String charset: the character encoding used to read the HTML document; defaults to "UTF-8"
  • String selector: the CSS selector used to pick HTML elements; defaults to "body"
  • String separator: the separator used when joining the text of multiple elements; defaults to "\n"
  • boolean allElements: whether to extract the text of all elements in the document into a single Document; defaults to false
  • boolean groupByElement: whether to extract text per selected element, producing one Document per element; defaults to false
  • boolean includeLinkUrls: whether to include the document's link URLs in the metadata; defaults to false
  • List<String> metadataTags: which of the HTML document's <meta> tags to extract as metadata; defaults to "description" and "keywords"
  • Map<String, Object> additionalMetadata: extra metadata added to the generated Document objects
package org.springframework.ai.reader.jsoup.config;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.util.Assert;

public final class JsoupDocumentReaderConfig {
    public final String charset;
    public final String selector;
    public final String separator;
    public final boolean allElements;
    public final boolean groupByElement;
    public final boolean includeLinkUrls;
    public final List<String> metadataTags;
    public final Map<String, Object> additionalMetadata;

    private JsoupDocumentReaderConfig(Builder builder) {
        this.charset = builder.charset;
        this.selector = builder.selector;
        this.separator = builder.separator;
        this.allElements = builder.allElements;
        this.includeLinkUrls = builder.includeLinkUrls;
        this.metadataTags = builder.metadataTags;
        this.groupByElement = builder.groupByElement;
        this.additionalMetadata = builder.additionalMetadata;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static JsoupDocumentReaderConfig defaultConfig() {
        return builder().build();
    }

    public static final class Builder {
        private String charset = "UTF-8";
        private String selector = "body";
        private String separator = "\n";
        private boolean allElements = false;
        private boolean includeLinkUrls = false;
        private List<String> metadataTags = new ArrayList<>(List.of("description", "keywords"));
        private boolean groupByElement = false;
        private Map<String, Object> additionalMetadata = new HashMap<>();

        private Builder() {
        }

        public Builder charset(String charset) {
            this.charset = charset;
            return this;
        }

        public Builder selector(String selector) {
            this.selector = selector;
            return this;
        }

        public Builder separator(String separator) {
            this.separator = separator;
            return this;
        }

        public Builder allElements(boolean allElements) {
            this.allElements = allElements;
            return this;
        }

        public Builder groupByElement(boolean groupByElement) {
            this.groupByElement = groupByElement;
            return this;
        }

        public Builder includeLinkUrls(boolean includeLinkUrls) {
            this.includeLinkUrls = includeLinkUrls;
            return this;
        }

        public Builder metadataTag(String metadataTag) {
            this.metadataTags.add(metadataTag);
            return this;
        }

        public Builder metadataTags(List<String> metadataTags) {
            this.metadataTags = new ArrayList<>(metadataTags);
            return this;
        }

        public Builder additionalMetadata(String key, Object value) {
            Assert.notNull(key, "key must not be null");
            Assert.notNull(value, "value must not be null");
            this.additionalMetadata.put(key, value);
            return this;
        }

        public Builder additionalMetadata(Map<String, Object> additionalMetadata) {
            Assert.notNull(additionalMetadata, "additionalMetadata must not be null");
            this.additionalMetadata = additionalMetadata;
            return this;
        }

        public JsoupDocumentReaderConfig build() {
            return new JsoupDocumentReaderConfig(this);
        }
    }
}

MarkdownDocumentReader

Reads content from a Markdown file and converts it into Document objects. It parses the Markdown with the CommonMark library and can group headings, paragraphs, code blocks, and other content into Document objects with matching metadata.

  • Resource markdownResource: the Markdown resource to read
  • MarkdownDocumentReaderConfig config: controls how the Markdown is read, including whether horizontal rules split documents and whether code blocks and blockquotes are included
  • Parser parser: the CommonMark parser that turns the Markdown text into a node tree

DocumentVisitor is a static nested class extending CommonMark's AbstractVisitor. It walks the parsed Markdown syntax tree and, per the configuration, groups and extracts its content into structured Document objects:

  1. Traverses the parsed node tree, grouping content according to the configuration (e.g. whether to split on horizontal rules, whether to include code blocks and blockquotes)
  2. Recognizes headings, paragraphs, code blocks, and blockquotes, extracting their text and metadata to build Document objects
  3. Attaches category, title, language, and similar metadata to the different content types, which helps downstream AI processing
Method descriptions

  • MarkdownDocumentReader: constructs the reader from a resource URL, a Resource object, and/or a Markdown-parsing config
  • get: reads the Markdown file and returns a list of Document objects
package org.springframework.ai.reader.markdown;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import org.commonmark.node.AbstractVisitor;
import org.commonmark.node.BlockQuote;
import org.commonmark.node.Code;
import org.commonmark.node.FencedCodeBlock;
import org.commonmark.node.HardLineBreak;
import org.commonmark.node.Heading;
import org.commonmark.node.ListItem;
import org.commonmark.node.Node;
import org.commonmark.node.SoftLineBreak;
import org.commonmark.node.Text;
import org.commonmark.node.ThematicBreak;
import org.commonmark.parser.Parser;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.markdown.config.MarkdownDocumentReaderConfig;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;

public class MarkdownDocumentReader implements DocumentReader {
    private final Resource markdownResource;
    private final MarkdownDocumentReaderConfig config;
    private final Parser parser;

    public MarkdownDocumentReader(String markdownResource) {
        this((new DefaultResourceLoader()).getResource(markdownResource), MarkdownDocumentReaderConfig.defaultConfig());
    }

    public MarkdownDocumentReader(String markdownResource, MarkdownDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(markdownResource), config);
    }

    public MarkdownDocumentReader(Resource markdownResource, MarkdownDocumentReaderConfig config) {
        this.markdownResource = markdownResource;
        this.config = config;
        this.parser = Parser.builder().build();
    }

    public List<Document> get() {
        try (InputStream input = this.markdownResource.getInputStream()) {
            Node node = this.parser.parseReader(new InputStreamReader(input));
            DocumentVisitor documentVisitor = new DocumentVisitor(this.config);
            node.accept(documentVisitor);
            return documentVisitor.getDocuments();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static class DocumentVisitor extends AbstractVisitor {
        private final List<Document> documents = new ArrayList<>();
        private final List<String> currentParagraphs = new ArrayList<>();
        private final MarkdownDocumentReaderConfig config;
        private Document.Builder currentDocumentBuilder;

        DocumentVisitor(MarkdownDocumentReaderConfig config) {
            this.config = config;
        }

        public void visit(org.commonmark.node.Document document) {
            this.currentDocumentBuilder = Document.builder();
            super.visit(document);
        }

        public void visit(Heading heading) {
            this.buildAndFlush();
            super.visit(heading);
        }

        public void visit(ThematicBreak thematicBreak) {
            if (this.config.horizontalRuleCreateDocument) {
                this.buildAndFlush();
            }

            super.visit(thematicBreak);
        }

        public void visit(SoftLineBreak softLineBreak) {
            this.translateLineBreakToSpace();
            super.visit(softLineBreak);
        }

        public void visit(HardLineBreak hardLineBreak) {
            this.translateLineBreakToSpace();
            super.visit(hardLineBreak);
        }

        public void visit(ListItem listItem) {
            this.translateLineBreakToSpace();
            super.visit(listItem);
        }

        public void visit(BlockQuote blockQuote) {
            if (!this.config.includeBlockquote) {
                this.buildAndFlush();
            }

            this.translateLineBreakToSpace();
            this.currentDocumentBuilder.metadata("category", "blockquote");
            super.visit(blockQuote);
        }

        public void visit(Code code) {
            this.currentParagraphs.add(code.getLiteral());
            this.currentDocumentBuilder.metadata("category", "code_inline");
            super.visit(code);
        }

        public void visit(FencedCodeBlock fencedCodeBlock) {
            if (!this.config.includeCodeBlock) {
                this.buildAndFlush();
            }

            this.translateLineBreakToSpace();
            this.currentParagraphs.add(fencedCodeBlock.getLiteral());
            this.currentDocumentBuilder.metadata("category", "code_block");
            this.currentDocumentBuilder.metadata("lang", fencedCodeBlock.getInfo());
            this.buildAndFlush();
            super.visit(fencedCodeBlock);
        }

        public void visit(Text text) {
            Node parent = text.getParent();
            if (parent instanceof Heading heading) {
                this.currentDocumentBuilder.metadata("category", "header%d".formatted(heading.getLevel())).metadata("title", text.getLiteral());
            } else {
                this.currentParagraphs.add(text.getLiteral());
            }

            super.visit(text);
        }

        public List<Document> getDocuments() {
            this.buildAndFlush();
            return this.documents;
        }

        private void buildAndFlush() {
            if (!this.currentParagraphs.isEmpty()) {
                String content = String.join("", this.currentParagraphs);
                Document.Builder builder = this.currentDocumentBuilder.text(content);
                this.config.additionalMetadata.forEach(builder::metadata);
                this.documents.add(builder.build());
                this.currentParagraphs.clear();
            }

            this.currentDocumentBuilder = Document.builder();
        }

        private void translateLineBreakToSpace() {
            if (!this.currentParagraphs.isEmpty()) {
                this.currentParagraphs.add(" ");
            }

        }
    }
}
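
The grouping behavior of DocumentVisitor boils down to: text accumulates until the next heading (or the end of input) flushes it into a Document, mirroring buildAndFlush() and getDocuments(). A stdlib-only sketch of that flush-on-heading logic (MarkdownGrouper is a hypothetical name, using a leading '#' as the heading marker):

```java
import java.util.ArrayList;
import java.util.List;

class MarkdownGrouper {
    // Mirrors DocumentVisitor's flush-on-heading behavior: text accumulates
    // in currentParagraphs until the next heading (or end of input) closes
    // the current Document.
    static List<String> group(List<String> lines) {
        List<String> documents = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith("#")) {          // heading: flush what came before
                if (!current.isEmpty()) {
                    documents.add(String.join("", current));
                    current.clear();
                }
            } else {
                current.add(line);
            }
        }
        if (!current.isEmpty()) {                // final flush, as in getDocuments()
            documents.add(String.join("", current));
        }
        return documents;
    }
}
```

In the real visitor, the same flush is also triggered by thematic breaks (when horizontalRuleCreateDocument is set) and by excluded code blocks and blockquotes.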

MarkdownDocumentReaderConfig

Configures the behavior of MarkdownDocumentReader:

  • boolean horizontalRuleCreateDocument: whether text separated by horizontal rules starts a new Document
  • boolean includeCodeBlock: whether code blocks stay in the surrounding paragraph's Document or get their own Document
  • boolean includeBlockquote: whether blockquotes stay in the surrounding paragraph's Document or get their own Document
  • Map<String, Object> additionalMetadata: extra metadata added to every generated Document
package org.springframework.ai.reader.markdown.config;

import java.util.HashMap;
import java.util.Map;
import org.springframework.util.Assert;

public class MarkdownDocumentReaderConfig {
    public final boolean horizontalRuleCreateDocument;
    public final boolean includeCodeBlock;
    public final boolean includeBlockquote;
    public final Map<String, Object> additionalMetadata;

    public MarkdownDocumentReaderConfig(Builder builder) {
        this.horizontalRuleCreateDocument = builder.horizontalRuleCreateDocument;
        this.includeCodeBlock = builder.includeCodeBlock;
        this.includeBlockquote = builder.includeBlockquote;
        this.additionalMetadata = builder.additionalMetadata;
    }

    public static MarkdownDocumentReaderConfig defaultConfig() {
        return builder().build();
    }

    public static Builder builder() {
        return new Builder();
    }

    public static final class Builder {
        private boolean horizontalRuleCreateDocument = false;
        private boolean includeCodeBlock = false;
        private boolean includeBlockquote = false;
        private Map<String, Object> additionalMetadata = new HashMap<>();

        private Builder() {
        }

        public Builder withHorizontalRuleCreateDocument(boolean horizontalRuleCreateDocument) {
            this.horizontalRuleCreateDocument = horizontalRuleCreateDocument;
            return this;
        }

        public Builder withIncludeCodeBlock(boolean includeCodeBlock) {
            this.includeCodeBlock = includeCodeBlock;
            return this;
        }

        public Builder withIncludeBlockquote(boolean includeBlockquote) {
            this.includeBlockquote = includeBlockquote;
            return this;
        }

        public Builder withAdditionalMetadata(String key, Object value) {
            Assert.notNull(key, "key must not be null");
            Assert.notNull(value, "value must not be null");
            this.additionalMetadata.put(key, value);
            return this;
        }

        public Builder withAdditionalMetadata(Map<String, Object> additionalMetadata) {
            Assert.notNull(additionalMetadata, "additionalMetadata must not be null");
            this.additionalMetadata = additionalMetadata;
            return this;
        }

        public MarkdownDocumentReaderConfig build() {
            return new MarkdownDocumentReaderConfig(this);
        }
    }
}

PagePdfDocumentReader

Parses a PDF file into multiple Document objects grouped by page. Each Document can hold one or more pages, and custom grouping and page cropping are supported.

  • PDDocument document: the PDF document to read
  • String resourceFileName: the PDF file's name
  • PdfDocumentReaderConfig config: controls how the PDF is read, including the number of pages per Document and the page margins

Method descriptions

  • PagePdfDocumentReader: constructs the reader from a resource URL, a Resource object, and/or a PDF-parsing config
  • get: reads the PDF and returns a list of Document objects
  • toDocument: wraps the given pages' content and metadata in a Document
package org.springframework.ai.reader.pdf;

import java.awt.Rectangle;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.CollectionUtils;
import org.springframework.util.StringUtils;

public class PagePdfDocumentReader implements DocumentReader {
    public static final String METADATA_START_PAGE_NUMBER = "page_number";
    public static final String METADATA_END_PAGE_NUMBER = "end_page_number";
    public static final String METADATA_FILE_NAME = "file_name";
    private static final String PDF_PAGE_REGION = "pdfPageRegion";
    protected final PDDocument document;
    private final Logger logger;
    protected String resourceFileName;
    private PdfDocumentReaderConfig config;

    public PagePdfDocumentReader(String resourceUrl) {
        this((new DefaultResourceLoader()).getResource(resourceUrl));
    }

    public PagePdfDocumentReader(Resource pdfResource) {
        this(pdfResource, PdfDocumentReaderConfig.defaultConfig());
    }

    public PagePdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(resourceUrl), config);
    }

    public PagePdfDocumentReader(Resource pdfResource, PdfDocumentReaderConfig config) {
        this.logger = LoggerFactory.getLogger(this.getClass());

        try {
            PDFParser pdfParser = new PDFParser(new RandomAccessReadBuffer(pdfResource.getInputStream()));
            this.document = pdfParser.parse();
            this.resourceFileName = pdfResource.getFilename();
            this.config = config;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public List<Document> get() {
        List<Document> readDocuments = new ArrayList<>();

        try {
            PDFLayoutTextStripperByArea pdfTextStripper = new PDFLayoutTextStripperByArea();
            int pageNumber = 0;
            int pagesPerDocument = 0;
            int startPageNumber = pageNumber;
            List<String> pageTextGroupList = new ArrayList<>();
            int totalPages = this.document.getDocumentCatalog().getPages().getCount();
            int logFrequency = totalPages > 10 ? totalPages / 10 : 1;
            int counter = 0;
            PDPage lastPage = (PDPage)this.document.getDocumentCatalog().getPages().iterator().next();

            for(PDPage page : this.document.getDocumentCatalog().getPages()) {
                lastPage = page;
                if (counter % logFrequency == 0 && counter / logFrequency < 10) {
                    this.logger.info("Processing PDF page: {}", counter + 1);
                }

                ++counter;
                ++pagesPerDocument;
                if (this.config.pagesPerDocument != 0 && pagesPerDocument >= this.config.pagesPerDocument) {
                    pagesPerDocument = 0;
                    String aggregatedPageTextGroup = pageTextGroupList.stream().collect(Collectors.joining());
                    if (StringUtils.hasText(aggregatedPageTextGroup)) {
                        readDocuments.add(this.toDocument(page, aggregatedPageTextGroup, startPageNumber, pageNumber));
                    }

                    pageTextGroupList.clear();
                    startPageNumber = pageNumber + 1;
                }

                int x0 = (int)page.getMediaBox().getLowerLeftX();
                int xW = (int)page.getMediaBox().getWidth();
                int y0 = (int)page.getMediaBox().getLowerLeftY() + this.config.pageTopMargin;
                int yW = (int)page.getMediaBox().getHeight() - (this.config.pageTopMargin + this.config.pageBottomMargin);
                pdfTextStripper.addRegion("pdfPageRegion", new Rectangle(x0, y0, xW, yW));
                pdfTextStripper.extractRegions(page);
                String pageText = pdfTextStripper.getTextForRegion("pdfPageRegion");
                if (StringUtils.hasText(pageText)) {
                    pageText = this.config.pageExtractedTextFormatter.format(pageText, pageNumber);
                    pageTextGroupList.add(pageText);
                }

                ++pageNumber;
                pdfTextStripper.removeRegion("pdfPageRegion");
            }

            if (!CollectionUtils.isEmpty(pageTextGroupList)) {
                readDocuments.add(this.toDocument(lastPage, pageTextGroupList.stream().collect(Collectors.joining()), startPageNumber, pageNumber));
            }

            this.logger.info("Processing {} pages", totalPages);
            return readDocuments;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    protected Document toDocument(PDPage page, String docText, int startPageNumber, int endPageNumber) {
        Document doc = new Document(docText);
        doc.getMetadata().put("page_number", startPageNumber);
        if (startPageNumber != endPageNumber) {
            doc.getMetadata().put("end_page_number", endPageNumber);
        }

        doc.getMetadata().put("file_name", this.resourceFileName);
        return doc;
    }
}
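
The paging arithmetic in get() can be condensed as follows: with pagesPerDocument == 0 every page lands in one group, otherwise each group holds at most pagesPerDocument pages. A stdlib-only sketch of the resulting (start, end) page ranges (PageGrouping is a hypothetical name):

```java
import java.util.ArrayList;
import java.util.List;

class PageGrouping {
    // Mirrors PagePdfDocumentReader.get()'s grouping arithmetic:
    // pagesPerDocument == 0 merges all pages into one group,
    // otherwise every group holds at most pagesPerDocument pages.
    static List<int[]> group(int totalPages, int pagesPerDocument) {
        List<int[]> ranges = new ArrayList<>();
        if (pagesPerDocument == 0) {
            ranges.add(new int[] {0, totalPages - 1});
            return ranges;
        }
        for (int start = 0; start < totalPages; start += pagesPerDocument) {
            int end = Math.min(start + pagesPerDocument, totalPages) - 1;
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }
}
```

For a 5-page PDF with pagesPerDocument = 2, this yields three groups (pages 0-1, 2-3, and 4), which is why toDocument only records an end-page number when a group spans more than one page.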

PdfDocumentReaderConfig

The PDF reader's configuration class, controlling parsing and grouping behavior:

  • int ALL_PAGES: constant with value 0, meaning all pages are merged into one Document
  • boolean reversedParagraphPosition: whether to reverse the paragraph order within each page; defaults to false
  • int pagesPerDocument: how many pages each Document holds; 0 merges all pages; defaults to 1
  • int pageTopMargin: the margin cropped from the top of each page (in PDF units); defaults to 0
  • int pageBottomMargin: the margin cropped from the bottom of each page (in PDF units); defaults to 0
  • ExtractedTextFormatter pageExtractedTextFormatter: a formatter applied to each page's extracted text; allows custom per-page post-processing
package org.springframework.ai.reader.pdf.config;

import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.util.Assert;

public final class PdfDocumentReaderConfig {
    public static final int ALL_PAGES = 0;
    public final boolean reversedParagraphPosition;
    public final int pagesPerDocument;
    public final int pageTopMargin;
    public final int pageBottomMargin;
    public final ExtractedTextFormatter pageExtractedTextFormatter;

    private PdfDocumentReaderConfig(Builder builder) {
        this.pagesPerDocument = builder.pagesPerDocument;
        this.pageBottomMargin = builder.pageBottomMargin;
        this.pageTopMargin = builder.pageTopMargin;
        this.pageExtractedTextFormatter = builder.pageExtractedTextFormatter;
        this.reversedParagraphPosition = builder.reversedParagraphPosition;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static PdfDocumentReaderConfig defaultConfig() {
        return builder().build();
    }

    public static final class Builder {
        private int pagesPerDocument = 1;
        private int pageTopMargin = 0;
        private int pageBottomMargin = 0;
        private ExtractedTextFormatter pageExtractedTextFormatter = ExtractedTextFormatter.defaults();
        private boolean reversedParagraphPosition = false;

        private Builder() {
        }

        public Builder withPageExtractedTextFormatter(ExtractedTextFormatter pageExtractedTextFormatter) {
            Assert.notNull(pageExtractedTextFormatter, "PageExtractedTextFormatter must not be null.");
            this.pageExtractedTextFormatter = pageExtractedTextFormatter;
            return this;
        }

        public Builder withPagesPerDocument(int pagesPerDocument) {
            Assert.isTrue(pagesPerDocument >= 0, "Page count must be a positive value.");
            this.pagesPerDocument = pagesPerDocument;
            return this;
        }

        public Builder withPageTopMargin(int topMargin) {
            Assert.isTrue(topMargin >= 0, "Page margins must be a positive value.");
            this.pageTopMargin = topMargin;
            return this;
        }

        public Builder withPageBottomMargin(int bottomMargin) {
            Assert.isTrue(bottomMargin >= 0, "Page margins must be a positive value.");
            this.pageBottomMargin = bottomMargin;
            return this;
        }

        public Builder withReversedParagraphPosition(boolean reversedParagraphPosition) {
            this.reversedParagraphPosition = reversedParagraphPosition;
            return this;
        }

        public PdfDocumentReaderConfig build() {
            return new PdfDocumentReaderConfig(this);
        }
    }
}
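A hedged usage sketch of the builder above (assumes spring-ai-pdf-document-reader on the classpath): group every five pages into one Document and crop 40 points from the top and bottom of each page.

```java
// Configuration sketch only; values are illustrative.
PdfDocumentReaderConfig config = PdfDocumentReaderConfig.builder()
        .withPagesPerDocument(5)
        .withPageTopMargin(40)
        .withPageBottomMargin(40)
        .withPageExtractedTextFormatter(ExtractedTextFormatter.defaults())
        .build();

// Passing ALL_PAGES (0) instead would merge the whole PDF into one Document:
PdfDocumentReaderConfig allInOne = PdfDocumentReaderConfig.builder()
        .withPagesPerDocument(PdfDocumentReaderConfig.ALL_PAGES)
        .build();
```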

ParagraphPdfDocumentReader

Splits a PDF into multiple Document objects by paragraph, based on the document's table of contents (TOC/bookmarks); each Document typically corresponds to one paragraph.

  • PDDocument document: the PDF document to read
  • String resourceFileName: name of the PDF file
  • PdfDocumentReaderConfig config: controls reading behavior, such as pages per document and page margins
  • ParagraphManager paragraphTextExtractor: parses the PDF and extracts paragraph information

| Method | Description |
| --- | --- |
| ParagraphPdfDocumentReader | Constructs the reader from a resource URL or Resource, optionally with a PdfDocumentReaderConfig |
| get | Reads a PDF with a TOC and returns the Document list |
| toDocument | Wraps the content between two paragraphs, plus metadata, into a Document |
| addMetadata | Adds metadata (title, page numbers, level, file name) to a Document |
| getTextBetweenParagraphs | Extracts the text between two paragraphs |
package org.springframework.ai.reader.pdf;

import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.pdf.config.ParagraphManager;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.CollectionUtils;
import org.springframework.util.StringUtils;

public class ParagraphPdfDocumentReader implements DocumentReader {
    private static final String METADATA_START_PAGE = "page_number";
    private static final String METADATA_END_PAGE = "end_page_number";
    private static final String METADATA_TITLE = "title";
    private static final String METADATA_LEVEL = "level";
    private static final String METADATA_FILE_NAME = "file_name";
    protected final PDDocument document;
    private final Logger logger;
    private final ParagraphManager paragraphTextExtractor;
    protected String resourceFileName;
    private PdfDocumentReaderConfig config;

    public ParagraphPdfDocumentReader(String resourceUrl) {
        this((new DefaultResourceLoader()).getResource(resourceUrl));
    }

    public ParagraphPdfDocumentReader(Resource pdfResource) {
        this(pdfResource, PdfDocumentReaderConfig.defaultConfig());
    }

    public ParagraphPdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config) {
        this((new DefaultResourceLoader()).getResource(resourceUrl), config);
    }

    public ParagraphPdfDocumentReader(Resource pdfResource, PdfDocumentReaderConfig config) {
        this.logger = LoggerFactory.getLogger(this.getClass());

        try {
            PDFParser pdfParser = new PDFParser(new RandomAccessReadBuffer(pdfResource.getInputStream()));
            this.document = pdfParser.parse();
            this.config = config;
            this.paragraphTextExtractor = new ParagraphManager(this.document);
            this.resourceFileName = pdfResource.getFilename();
        } catch (IllegalArgumentException iae) {
            throw iae;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public List<Document> get() {
        List<ParagraphManager.Paragraph> paragraphs = this.paragraphTextExtractor.flatten();
        List<Document> documents = new ArrayList(paragraphs.size());
        if (!CollectionUtils.isEmpty(paragraphs)) {
            this.logger.info("Start processing paragraphs from PDF");
            Iterator<ParagraphManager.Paragraph> itr = paragraphs.iterator();
            ParagraphManager.Paragraph current = (ParagraphManager.Paragraph)itr.next();
            ParagraphManager.Paragraph next;
            if (!itr.hasNext()) {
                documents.add(this.toDocument(current, current));
            } else {
                for(; itr.hasNext(); current = next) {
                    next = (ParagraphManager.Paragraph)itr.next();
                    Document document = this.toDocument(current, next);
                    if (document != null && StringUtils.hasText(document.getText())) {
                        documents.add(this.toDocument(current, next));
                    }
                }
            }
        }

        this.logger.info("End processing paragraphs from PDF");
        return documents;
    }

    protected Document toDocument(ParagraphManager.Paragraph from, ParagraphManager.Paragraph to) {
        String docText = this.getTextBetweenParagraphs(from, to);
        if (!StringUtils.hasText(docText)) {
            return null;
        } else {
            Document document = new Document(docText);
            this.addMetadata(from, to, document);
            return document;
        }
    }

    protected void addMetadata(ParagraphManager.Paragraph from, ParagraphManager.Paragraph to, Document document) {
        document.getMetadata().put("title", from.title());
        document.getMetadata().put("page_number", from.startPageNumber());
        document.getMetadata().put("end_page_number", to.startPageNumber());
        document.getMetadata().put("level", from.level());
        document.getMetadata().put("file_name", this.resourceFileName);
    }

    public String getTextBetweenParagraphs(ParagraphManager.Paragraph fromParagraph, ParagraphManager.Paragraph toParagraph) {
        int startPage = fromParagraph.startPageNumber() - 1;
        int endPage = toParagraph.startPageNumber() - 1;

        try {
            StringBuilder sb = new StringBuilder();
            PDFLayoutTextStripperByArea pdfTextStripper = new PDFLayoutTextStripperByArea();
            pdfTextStripper.setSortByPosition(true);

            for(int pageNumber = startPage; pageNumber <= endPage; ++pageNumber) {
                PDPage page = this.document.getPage(pageNumber);
                int fromPosition = fromParagraph.position();
                int toPosition = toParagraph.position();
                if (this.config.reversedParagraphPosition) {
                    fromPosition = (int)(page.getMediaBox().getHeight() - (float)fromPosition);
                    toPosition = (int)(page.getMediaBox().getHeight() - (float)toPosition);
                }

                int x0 = (int)page.getMediaBox().getLowerLeftX();
                int xW = (int)page.getMediaBox().getWidth();
                int y0 = (int)page.getMediaBox().getLowerLeftY();
                int yW = (int)page.getMediaBox().getHeight();
                if (pageNumber == startPage) {
                    y0 = fromPosition;
                    yW = (int)page.getMediaBox().getHeight() - fromPosition;
                }

                if (pageNumber == endPage) {
                    yW = toPosition - y0;
                }

                if (y0 + yW == (int)page.getMediaBox().getHeight()) {
                    yW -= this.config.pageBottomMargin;
                }

                if (y0 == 0) {
                    y0 += this.config.pageTopMargin;
                    yW -= this.config.pageTopMargin;
                }

                pdfTextStripper.addRegion("pdfPageRegion", new Rectangle(x0, y0, xW, yW));
                pdfTextStripper.extractRegions(page);
                String text = pdfTextStripper.getTextForRegion("pdfPageRegion");
                if (StringUtils.hasText(text)) {
                    sb.append(text);
                }

                pdfTextStripper.removeRegion("pdfPageRegion");
            }

            String text = sb.toString();
            if (StringUtils.hasText(text)) {
                text = this.config.pageExtractedTextFormatter.format(text, startPage);
            }

            return text;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
ParagraphManager

Manages the paragraph structure of a PDF document: it parses the PDF's table of contents (TOC/bookmarks) into a paragraph tree, which can be flattened into a paragraph list for subsequent content extraction and grouping.

  • Paragraph rootParagraph: root node of the paragraph tree, holding the full paragraph hierarchy
  • PDDocument document: the PDFBox PDDocument currently being processed

| Method | Description |
| --- | --- |
| ParagraphManager | Takes a PDF document and parses its TOC into a paragraph tree |
| flatten | Flattens the paragraph tree into a Paragraph list for sequential traversal |
| getParagraphsByLevel | Returns the paragraphs at a given level, optionally including inter-level text paragraphs |
| Paragraph | Static nested record holding paragraph metadata (title, level, start/end page numbers, position, children) |
| generateParagraphs | Core recursive method: walks the TOC tree, converts each outline item (PDOutlineItem) into a Paragraph, and builds the full chapter hierarchy |
package org.springframework.ai.reader.pdf.config;

import java.io.IOException;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDPageXYZDestination;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
import org.springframework.util.Assert;
import org.springframework.util.CollectionUtils;

public class ParagraphManager {
    private final Paragraph rootParagraph;
    private final PDDocument document;

    public ParagraphManager(PDDocument document) {
        Assert.notNull(document, "PDDocument must not be null");
        Assert.notNull(document.getDocumentCatalog().getDocumentOutline(), "Document outline (e.g. TOC) is null. Make sure the PDF document has a table of contents (TOC). If not, consider the PagePdfDocumentReader or the TikaDocumentReader instead.");

        try {
            this.document = document;
            this.rootParagraph = this.generateParagraphs(new Paragraph((Paragraph)null, "root", -1, 1, this.document.getNumberOfPages(), 0), this.document.getDocumentCatalog().getDocumentOutline(), 0);
            this.printParagraph(this.rootParagraph, System.out);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public List<Paragraph> flatten() {
        List<Paragraph> paragraphs = new ArrayList();

        for(Paragraph child : this.rootParagraph.children()) {
            this.flatten(child, paragraphs);
        }

        return paragraphs;
    }

    private void flatten(Paragraph current, List<Paragraph> paragraphs) {
        paragraphs.add(current);

        for(Paragraph child : current.children()) {
            this.flatten(child, paragraphs);
        }

    }

    private void printParagraph(Paragraph paragraph, PrintStream printStream) {
        printStream.println(paragraph);

        for(Paragraph childParagraph : paragraph.children()) {
            this.printParagraph(childParagraph, printStream);
        }

    }

    protected Paragraph generateParagraphs(Paragraph parentParagraph, PDOutlineNode bookmark, Integer level) throws IOException {
        for(PDOutlineItem current = bookmark.getFirstChild(); current != null; current = current.getNextSibling()) {
            int pageNumber = this.getPageNumber(current);
            int nextSiblingNumber = this.getPageNumber(current.getNextSibling());
            if (nextSiblingNumber < 0) {
                nextSiblingNumber = this.getPageNumber(current.getLastChild());
            }

            int paragraphPosition = current.getDestination() instanceof PDPageXYZDestination ? ((PDPageXYZDestination)current.getDestination()).getTop() : 0;
            Paragraph currentParagraph = new Paragraph(parentParagraph, current.getTitle(), level, pageNumber, nextSiblingNumber, paragraphPosition);
            parentParagraph.children().add(currentParagraph);
            this.generateParagraphs(currentParagraph, current, level + 1);
        }

        return parentParagraph;
    }

    private int getPageNumber(PDOutlineItem current) throws IOException {
        if (current == null) {
            return -1;
        } else {
            PDPage currentPage = current.findDestinationPage(this.document);
            PDPageTree pages = this.document.getDocumentCatalog().getPages();

            for(int i = 0; i < pages.getCount(); ++i) {
                PDPage page = pages.get(i);
                if (page.equals(currentPage)) {
                    return i + 1;
                }
            }

            return -1;
        }
    }

    public List<Paragraph> getParagraphsByLevel(Paragraph paragraph, int level, boolean interLevelText) {
        List<Paragraph> resultList = new ArrayList();
        if (paragraph.level() < level) {
            if (!CollectionUtils.isEmpty(paragraph.children())) {
                if (interLevelText) {
                    Paragraph interLevelParagraph = new Paragraph(paragraph.parent(), paragraph.title(), paragraph.level(), paragraph.startPageNumber(), ((Paragraph)paragraph.children().get(0)).startPageNumber(), paragraph.position());
                    resultList.add(interLevelParagraph);
                }

                for(Paragraph child : paragraph.children()) {
                    resultList.addAll(this.getParagraphsByLevel(child, level, interLevelText));
                }
            }
        } else if (paragraph.level() == level) {
            resultList.add(paragraph);
        }

        return resultList;
    }

    public static record Paragraph(Paragraph parent, String title, int level, int startPageNumber, int endPageNumber, int position, List<Paragraph> children) {
        public Paragraph(Paragraph parent, String title, int level, int startPageNumber, int endPageNumber, int position) {
            this(parent, title, level, startPageNumber, endPageNumber, position, new ArrayList());
        }

        public String toString() {
            String indent = this.level < 0 ? "" : (new String(new char[this.level * 2])).replace('\u0000', ' ');
            return indent + " " + this.level + ") " + this.title + " [" + this.startPageNumber + "," + this.endPageNumber + "], children = " + this.children.size() + ", pos = " + this.position;
        }
    }
}
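The tree walk behind flatten() is a plain pre-order traversal. A stdlib-only sketch with a simplified stand-in for the Paragraph record (not the Spring AI type):

```java
import java.util.ArrayList;
import java.util.List;

public class FlattenDemo {

    // Simplified stand-in for ParagraphManager.Paragraph (title + children only)
    record Paragraph(String title, List<Paragraph> children) {}

    // Pre-order walk, identical in shape to ParagraphManager.flatten()
    static void flatten(Paragraph current, List<Paragraph> out) {
        out.add(current);  // parent first, then its subtree
        for (Paragraph child : current.children()) {
            flatten(child, out);
        }
    }

    public static void main(String[] args) {
        Paragraph leaf = new Paragraph("1.1", List.of());
        Paragraph root = new Paragraph("root",
                List.of(new Paragraph("1", List.of(leaf)), new Paragraph("2", List.of())));

        List<Paragraph> flat = new ArrayList<>();
        for (Paragraph child : root.children()) {  // the synthetic root itself is skipped
            flatten(child, flat);
        }
        flat.forEach(p -> System.out.println(p.title()));
    }
}
```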

TikaDocumentReader

Extracts text from many document formats (PDF, DOC/DOCX, PPT/PPTX, HTML, etc.) and wraps it in Document objects. Built on Apache Tika, it supports a wide range of formats.

  • AutoDetectParser parser: parser that auto-detects the document type and extracts its text
  • ContentHandler handler: handler that manages content extraction
  • Metadata metadata: metadata associated with the document being read
  • ParseContext context: context information for the parsing process
  • Resource resource: resource object pointing to the document
  • ExtractedTextFormatter textFormatter: formats the extracted text

| Method | Description |
| --- | --- |
| TikaDocumentReader | Constructs the reader from a resource URL, Resource, content handler, or text formatter |
| get | Reads a document in any supported format and returns a Document list |
package org.springframework.ai.reader.tika;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Objects;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.core.io.DefaultResourceLoader;
import org.springframework.core.io.Resource;
import org.springframework.util.StringUtils;
import org.xml.sax.ContentHandler;

public class TikaDocumentReader implements DocumentReader {
    public static final String METADATA_SOURCE = "source";
    private final AutoDetectParser parser;
    private final ContentHandler handler;
    private final Metadata metadata;
    private final ParseContext context;
    private final Resource resource;
    private final ExtractedTextFormatter textFormatter;

    public TikaDocumentReader(String resourceUrl) {
        this(resourceUrl, ExtractedTextFormatter.defaults());
    }

    public TikaDocumentReader(String resourceUrl, ExtractedTextFormatter textFormatter) {
        this((new DefaultResourceLoader()).getResource(resourceUrl), textFormatter);
    }

    public TikaDocumentReader(Resource resource) {
        this(resource, ExtractedTextFormatter.defaults());
    }

    public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
        this(resource, new BodyContentHandler(-1), textFormatter);
    }

    public TikaDocumentReader(Resource resource, ContentHandler contentHandler, ExtractedTextFormatter textFormatter) {
        this.parser = new AutoDetectParser();
        this.handler = contentHandler;
        this.metadata = new Metadata();
        this.context = new ParseContext();
        this.resource = resource;
        this.textFormatter = textFormatter;
    }

    public List<Document> get() {
        try (InputStream stream = this.resource.getInputStream()) {
            this.parser.parse(stream, this.handler, this.metadata, this.context);
            return List.of(this.toDocument(this.handler.toString()));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private Document toDocument(String docText) {
        docText = (String)Objects.requireNonNullElse(docText, "");
        docText = this.textFormatter.format(docText);
        Document doc = new Document(docText);
        doc.getMetadata().put("source", this.resourceName());
        return doc;
    }

    private String resourceName() {
        try {
            String resourceName = this.resource.getFilename();
            if (!StringUtils.hasText(resourceName)) {
                resourceName = this.resource.getURI().toString();
            }

            return resourceName;
        } catch (IOException e) {
            return String.format("Invalid source URI: %s", e.getMessage());
        }
    }
}

DocumentTransformer (interface for transforming document data)

package org.springframework.ai.document;

import java.util.List;
import java.util.function.Function;

public interface DocumentTransformer extends Function<List<Document>, List<Document>> {
    default List<Document> transform(List<Document> transform) {
        return (List)this.apply(transform);
    }
}

TextSplitter

Abstract base class for splitting long text Documents into smaller chunks; it provides the common framework for concrete splitting strategies (by length, by sentence, by paragraph, etc.).

  • boolean copyContentFormatter: whether the parent document's content formatter is copied to the split child documents

| Method | Description |
| --- | --- |
| apply | Splits the input document list and returns the split documents |
| split | Convenience overloads that split a document list or a single document |
| setCopyContentFormatter | Controls whether the content formatter is inherited |
| isCopyContentFormatter | Returns the current value of copyContentFormatter |
package org.springframework.ai.transformer.splitter;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.ContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;

public abstract class TextSplitter implements DocumentTransformer {
    private static final Logger logger = LoggerFactory.getLogger(TextSplitter.class);
    private boolean copyContentFormatter = true;

    public List<Document> apply(List<Document> documents) {
        return this.doSplitDocuments(documents);
    }

    public List<Document> split(List<Document> documents) {
        return this.apply(documents);
    }

    public List<Document> split(Document document) {
        return this.apply(List.of(document));
    }

    public boolean isCopyContentFormatter() {
        return this.copyContentFormatter;
    }

    public void setCopyContentFormatter(boolean copyContentFormatter) {
        this.copyContentFormatter = copyContentFormatter;
    }

    private List<Document> doSplitDocuments(List<Document> documents) {
        List<String> texts = new ArrayList();
        List<Map<String, Object>> metadataList = new ArrayList();
        List<ContentFormatter> formatters = new ArrayList();

        for(Document doc : documents) {
            texts.add(doc.getText());
            metadataList.add(doc.getMetadata());
            formatters.add(doc.getContentFormatter());
        }

        return this.createDocuments(texts, formatters, metadataList);
    }

    private List<Document> createDocuments(List<String> texts, List<ContentFormatter> formatters, List<Map<String, Object>> metadataList) {
        List<Document> documents = new ArrayList();

        for(int i = 0; i < texts.size(); ++i) {
            String text = (String)texts.get(i);
            Map<String, Object> metadata = (Map)metadataList.get(i);
            List<String> chunks = this.splitText(text);
            if (chunks.size() > 1) {
                logger.info("Splitting up document into " + chunks.size() + " chunks.");
            }

            for(String chunk : chunks) {
                Map<String, Object> metadataCopy = (Map)metadata.entrySet().stream().filter((e) -> e.getKey() != null && e.getValue() != null).collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
                Document newDoc = new Document(chunk, metadataCopy);
                if (this.copyContentFormatter) {
                    newDoc.setContentFormatter((ContentFormatter)formatters.get(i));
                }

                documents.add(newDoc);
            }
        }

        return documents;
    }

    protected abstract List<String> splitText(String text);
}
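A concrete subclass only has to implement splitText(). The contract can be illustrated with a stdlib-only, fixed-length character splitter (a deliberately naive stand-in for the real token- or sentence-based strategies):

```java
import java.util.ArrayList;
import java.util.List;

public class FixedLengthSplitDemo {

    // The only thing a concrete TextSplitter must supply:
    // one string in, a list of chunks out.
    static List<String> splitText(String text, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += size) {
            chunks.add(text.substring(i, Math.min(i + size, text.length())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(splitText("abcdefghij", 4));
    }
}
```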
TokenTextSplitter

Splits text into chunks of a target token size, based on the jtokkit library. It suits scenarios that need token-granular text handling, such as preparing LLM input.

  • int chunkSize: target token count per chunk, default 800
  • int minChunkSizeChars: minimum character count per chunk, default 350
  • int minChunkLengthToEmbed: chunks shorter than this are discarded, default 5
  • int maxNumChunks: maximum number of chunks generated from one text, default 10000
  • boolean keepSeparator: whether separators (such as newlines) are kept, default true
  • EncodingRegistry registry: registry used to obtain encodings
  • Encoding encoding: encoder used to encode and decode tokens

| Method | Description |
| --- | --- |
| splitText | Implements TextSplitter: splits the text into token-based chunks and returns the chunk list |
| doSplit | Core chunking logic that cuts the text by token length |
package org.springframework.ai.transformer.splitter;

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import com.knuddels.jtokkit.api.IntArrayList;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import org.springframework.util.Assert;

public class TokenTextSplitter extends TextSplitter {
    private static final int DEFAULT_CHUNK_SIZE = 800;
    private static final int MIN_CHUNK_SIZE_CHARS = 350;
    private static final int MIN_CHUNK_LENGTH_TO_EMBED = 5;
    private static final int MAX_NUM_CHUNKS = 10000;
    private static final boolean KEEP_SEPARATOR = true;
    private final EncodingRegistry registry;
    private final Encoding encoding;
    private final int chunkSize;
    private final int minChunkSizeChars;
    private final int minChunkLengthToEmbed;
    private final int maxNumChunks;
    private final boolean keepSeparator;

    public TokenTextSplitter() {
        this(800, 350, 5, 10000, true);
    }

    public TokenTextSplitter(boolean keepSeparator) {
        this(800, 350, 5, 10000, keepSeparator);
    }

    public TokenTextSplitter(int chunkSize, int minChunkSizeChars, int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator) {
        this.registry = Encodings.newLazyEncodingRegistry();
        this.encoding = this.registry.getEncoding(EncodingType.CL100K_BASE);
        this.chunkSize = chunkSize;
        this.minChunkSizeChars = minChunkSizeChars;
        this.minChunkLengthToEmbed = minChunkLengthToEmbed;
        this.maxNumChunks = maxNumChunks;
        this.keepSeparator = keepSeparator;
    }

    public static Builder builder() {
        return new Builder();
    }

    protected List<String> splitText(String text) {
        return this.doSplit(text, this.chunkSize);
    }

    protected List<String> doSplit(String text, int chunkSize) {
        if (text != null && !text.trim().isEmpty()) {
            List<Integer> tokens = this.getEncodedTokens(text);
            List<String> chunks = new ArrayList();
            int numchunks = 0;

            while(!tokens.isEmpty() && numchunks < this.maxNumChunks) {
                List<Integer> chunk = tokens.subList(0, Math.min(chunkSize, tokens.size()));
                String chunkText = this.decodeTokens(chunk);
                if (chunkText.trim().isEmpty()) {
                    tokens = tokens.subList(chunk.size(), tokens.size());
                } else {
                    int lastPunctuation = Math.max(chunkText.lastIndexOf('.'), Math.max(chunkText.lastIndexOf('?'), Math.max(chunkText.lastIndexOf('!'), chunkText.lastIndexOf('\n'))));
                    if (lastPunctuation != -1 && lastPunctuation > this.minChunkSizeChars) {
                        chunkText = chunkText.substring(0, lastPunctuation + 1);
                    }

                    String chunkTextToAppend = this.keepSeparator ? chunkText.trim() : chunkText.replace(System.lineSeparator(), " ").trim();
                    if (chunkTextToAppend.length() > this.minChunkLengthToEmbed) {
                        chunks.add(chunkTextToAppend);
                    }

                    tokens = tokens.subList(this.getEncodedTokens(chunkText).size(), tokens.size());
                    ++numchunks;
                }
            }

            if (!tokens.isEmpty()) {
                String remainingtext = this.decodeTokens(tokens).replace(System.lineSeparator(), " ").trim();
                if (remainingtext.length() > this.minChunkLengthToEmbed) {
                    chunks.add(remainingtext);
                }
            }

            return chunks;
        } else {
            return new ArrayList();
        }
    }

    private List<Integer> getEncodedTokens(String text) {
        Assert.notNull(text, "Text must not be null");
        return this.encoding.encode(text).boxed();
    }

    private String decodeTokens(List<Integer> tokens) {
        Assert.notNull(tokens, "Tokens must not be null");
        IntArrayList tokensIntArray = new IntArrayList(tokens.size());
        Objects.requireNonNull(tokensIntArray);
        tokens.forEach(tokensIntArray::add);
        return this.encoding.decode(tokensIntArray);
    }

    public static final class Builder {
        private int chunkSize = 800;
        private int minChunkSizeChars = 350;
        private int minChunkLengthToEmbed = 5;
        private int maxNumChunks = 10000;
        private boolean keepSeparator = true;

        private Builder() {
        }

        public Builder withChunkSize(int chunkSize) {
            this.chunkSize = chunkSize;
            return this;
        }

        public Builder withMinChunkSizeChars(int minChunkSizeChars) {
            this.minChunkSizeChars = minChunkSizeChars;
            return this;
        }

        public Builder withMinChunkLengthToEmbed(int minChunkLengthToEmbed) {
            this.minChunkLengthToEmbed = minChunkLengthToEmbed;
            return this;
        }

        public Builder withMaxNumChunks(int maxNumChunks) {
            this.maxNumChunks = maxNumChunks;
            return this;
        }

        public Builder withKeepSeparator(boolean keepSeparator) {
            this.keepSeparator = keepSeparator;
            return this;
        }

        public TokenTextSplitter build() {
            return new TokenTextSplitter(this.chunkSize, this.minChunkSizeChars, this.minChunkLengthToEmbed, this.maxNumChunks, this.keepSeparator);
        }
    }
}
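The chunking loop in doSplit() can be sketched without jtokkit by treating whitespace-separated words as "tokens": take up to chunkSize tokens, then prefer to cut at the last sentence-ending punctuation past a minimum character count. A stdlib-only approximation (the real class counts BPE tokens, not words):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkDemo {

    // Same shape as doSplit(): slice off up to chunkSize tokens, then cut the
    // chunk back to the last '.', '?' or '!' beyond minChunkChars characters.
    static List<String> split(String text, int chunkSize, int minChunkChars) {
        List<String> tokens = new ArrayList<>(Arrays.asList(text.split(" ")));
        List<String> chunks = new ArrayList<>();
        while (!tokens.isEmpty()) {
            List<String> chunk = tokens.subList(0, Math.min(chunkSize, tokens.size()));
            String chunkText = String.join(" ", chunk);
            int consumed = chunk.size();
            int lastPunct = Math.max(chunkText.lastIndexOf('.'),
                    Math.max(chunkText.lastIndexOf('?'), chunkText.lastIndexOf('!')));
            if (lastPunct != -1 && lastPunct > minChunkChars) {
                chunkText = chunkText.substring(0, lastPunct + 1);
                consumed = chunkText.split(" ").length;  // tokens actually used
            }
            chunks.add(chunkText.trim());
            tokens = tokens.subList(consumed, tokens.size());
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String c : split("One two three. Four five six. Seven eight", 4, 5)) {
            System.out.println(c);
        }
    }
}
```

Note how the second sentence is pulled whole into the second chunk: only the tokens actually kept are consumed, so the cut never splits a sentence that ends within the window.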

ContentFormatTransformer

Applies a content formatter to each document in a Document list to format its content.

  • boolean disableTemplateRewrite: whether rewriting of the content formatter's text template is disabled
  • ContentFormatter contentFormatter: the formatter instance used to format document content
package org.springframework.ai.transformer;

import java.util.ArrayList;
import java.util.List;
import org.springframework.ai.document.ContentFormatter;
import org.springframework.ai.document.DefaultContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;

package org.springframework.ai.transformer;

import java.util.ArrayList;
import java.util.List;

import org.springframework.ai.document.ContentFormatter;
import org.springframework.ai.document.DefaultContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;

public class ContentFormatTransformer implements DocumentTransformer {

    private final boolean disableTemplateRewrite;

    private final ContentFormatter contentFormatter;

    public ContentFormatTransformer(ContentFormatter contentFormatter) {
        this(contentFormatter, false);
    }

    public ContentFormatTransformer(ContentFormatter contentFormatter, boolean disableTemplateRewrite) {
        this.contentFormatter = contentFormatter;
        this.disableTemplateRewrite = disableTemplateRewrite;
    }

    public List<Document> apply(List<Document> documents) {
        if (this.contentFormatter != null) {
            documents.forEach(this::processDocument);
        }
        return documents;
    }

    private void processDocument(Document document) {
        // If both the document's current formatter and the transformer's formatter
        // are DefaultContentFormatter instances, merge them; otherwise override.
        if (document.getContentFormatter() instanceof DefaultContentFormatter docFormatter
                && this.contentFormatter instanceof DefaultContentFormatter toUpdateFormatter) {
            this.updateFormatter(document, docFormatter, toUpdateFormatter);
        }
        else {
            this.overrideFormatter(document);
        }
    }

    private void updateFormatter(Document document, DefaultContentFormatter docFormatter,
            DefaultContentFormatter toUpdateFormatter) {
        List<String> updatedEmbedExcludeKeys = new ArrayList<>(docFormatter.getExcludedEmbedMetadataKeys());
        updatedEmbedExcludeKeys.addAll(toUpdateFormatter.getExcludedEmbedMetadataKeys());

        List<String> updatedInferenceExcludeKeys = new ArrayList<>(docFormatter.getExcludedInferenceMetadataKeys());
        updatedInferenceExcludeKeys.addAll(toUpdateFormatter.getExcludedInferenceMetadataKeys());

        DefaultContentFormatter.Builder builder = DefaultContentFormatter.builder()
            .withExcludedEmbedMetadataKeys(updatedEmbedExcludeKeys)
            .withExcludedInferenceMetadataKeys(updatedInferenceExcludeKeys)
            .withMetadataTemplate(docFormatter.getMetadataTemplate())
            .withMetadataSeparator(docFormatter.getMetadataSeparator());

        if (!this.disableTemplateRewrite) {
            builder.withTextTemplate(docFormatter.getTextTemplate());
        }

        document.setContentFormatter(builder.build());
    }

    private void overrideFormatter(Document document) {
        document.setContentFormatter(this.contentFormatter);
    }
}
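The merge step in `updateFormatter` is easy to miss: the document's own excluded keys are preserved and the transformer's keys are appended on top. A minimal, self-contained sketch of that merge (plain Java; `MergeKeysSketch` is our own name, not a Spring AI class):

```java
import java.util.ArrayList;
import java.util.List;

public class MergeKeysSketch {

    // Mirrors updateFormatter(): the document's existing excluded keys are
    // kept first, and the transformer's keys are appended after them.
    static List<String> mergeExcludedKeys(List<String> fromDocument, List<String> fromTransformer) {
        List<String> merged = new ArrayList<>(fromDocument);
        merged.addAll(fromTransformer);
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(mergeExcludedKeys(List.of("file_name"), List.of("page_number")));
    }
}
```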
ContentFormatter (content formatting interface)
public interface ContentFormatter {

    String format(Document document, MetadataMode mode);

}
DefaultContentFormatter

Formats a Document's content and metadata, controlling how the document is rendered through templates and configuration

  • String metadataTemplate: template for rendering metadata, with {key} and {value} placeholders
  • String metadataSeparator: separator between metadata entries
  • String textTemplate: template for the document text, with {content} and {metadata_string} placeholders
  • List<String> excludedInferenceMetadataKeys: metadata keys excluded in inference mode
  • List<String> excludedEmbedMetadataKeys: metadata keys excluded in embed mode
package org.springframework.ai.document;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.springframework.util.Assert;

public final class DefaultContentFormatter implements ContentFormatter {

    private static final String TEMPLATE_CONTENT_PLACEHOLDER = "{content}";

    private static final String TEMPLATE_METADATA_STRING_PLACEHOLDER = "{metadata_string}";

    private static final String TEMPLATE_VALUE_PLACEHOLDER = "{value}";

    private static final String TEMPLATE_KEY_PLACEHOLDER = "{key}";

    private static final String DEFAULT_METADATA_TEMPLATE = String.format("%s: %s",
            TEMPLATE_KEY_PLACEHOLDER, TEMPLATE_VALUE_PLACEHOLDER);

    private static final String DEFAULT_METADATA_SEPARATOR = System.lineSeparator();

    private static final String DEFAULT_TEXT_TEMPLATE = String.format("%s\n\n%s",
            TEMPLATE_METADATA_STRING_PLACEHOLDER, TEMPLATE_CONTENT_PLACEHOLDER);

    private final String metadataTemplate;
    private final String metadataSeparator;
    private final String textTemplate;
    private final List<String> excludedInferenceMetadataKeys;
    private final List<String> excludedEmbedMetadataKeys;

    private DefaultContentFormatter(Builder builder) {
        this.metadataTemplate = builder.metadataTemplate;
        this.metadataSeparator = builder.metadataSeparator;
        this.textTemplate = builder.textTemplate;
        this.excludedInferenceMetadataKeys = builder.excludedInferenceMetadataKeys;
        this.excludedEmbedMetadataKeys = builder.excludedEmbedMetadataKeys;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static DefaultContentFormatter defaultConfig() {
        return builder().build();
    }

    public String format(Document document, MetadataMode metadataMode) {
        Map<String, Object> metadata = this.metadataFilter(document.getMetadata(), metadataMode);
        String metadataText = metadata.entrySet()
            .stream()
            .map(metadataEntry -> this.metadataTemplate.replace(TEMPLATE_KEY_PLACEHOLDER, metadataEntry.getKey())
                .replace(TEMPLATE_VALUE_PLACEHOLDER, metadataEntry.getValue().toString()))
            .collect(Collectors.joining(this.metadataSeparator));
        return this.textTemplate.replace(TEMPLATE_METADATA_STRING_PLACEHOLDER, metadataText)
            .replace(TEMPLATE_CONTENT_PLACEHOLDER, document.getText());
    }

    protected Map<String, Object> metadataFilter(Map<String, Object> metadata, MetadataMode metadataMode) {
        if (metadataMode == MetadataMode.ALL) {
            return new HashMap<>(metadata);
        }
        if (metadataMode == MetadataMode.NONE) {
            return new HashMap<>(Collections.emptyMap());
        }
        Set<String> usableMetadataKeys = new HashSet<>(metadata.keySet());
        if (metadataMode == MetadataMode.INFERENCE) {
            usableMetadataKeys.removeAll(this.excludedInferenceMetadataKeys);
        }
        else if (metadataMode == MetadataMode.EMBED) {
            usableMetadataKeys.removeAll(this.excludedEmbedMetadataKeys);
        }
        return new HashMap<>(metadata.entrySet()
            .stream()
            .filter(e -> usableMetadataKeys.contains(e.getKey()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)));
    }

    public String getMetadataTemplate() {
        return this.metadataTemplate;
    }

    public String getMetadataSeparator() {
        return this.metadataSeparator;
    }

    public String getTextTemplate() {
        return this.textTemplate;
    }

    public List<String> getExcludedInferenceMetadataKeys() {
        return Collections.unmodifiableList(this.excludedInferenceMetadataKeys);
    }

    public List<String> getExcludedEmbedMetadataKeys() {
        return Collections.unmodifiableList(this.excludedEmbedMetadataKeys);
    }

    public static final class Builder {

        private String metadataTemplate = DEFAULT_METADATA_TEMPLATE;
        private String metadataSeparator = DEFAULT_METADATA_SEPARATOR;
        private String textTemplate = DEFAULT_TEXT_TEMPLATE;
        private List<String> excludedInferenceMetadataKeys = new ArrayList<>();
        private List<String> excludedEmbedMetadataKeys = new ArrayList<>();

        private Builder() {
        }

        public Builder from(DefaultContentFormatter fromFormatter) {
            this.withExcludedEmbedMetadataKeys(fromFormatter.getExcludedEmbedMetadataKeys())
                .withExcludedInferenceMetadataKeys(fromFormatter.getExcludedInferenceMetadataKeys())
                .withMetadataSeparator(fromFormatter.getMetadataSeparator())
                .withMetadataTemplate(fromFormatter.getMetadataTemplate())
                .withTextTemplate(fromFormatter.getTextTemplate());
            return this;
        }

        public Builder withMetadataTemplate(String metadataTemplate) {
            Assert.hasText(metadataTemplate, "Metadata Template must not be empty");
            this.metadataTemplate = metadataTemplate;
            return this;
        }

        public Builder withMetadataSeparator(String metadataSeparator) {
            Assert.notNull(metadataSeparator, "Metadata separator must not be empty");
            this.metadataSeparator = metadataSeparator;
            return this;
        }

        public Builder withTextTemplate(String textTemplate) {
            Assert.hasText(textTemplate, "Document's text template must not be empty");
            this.textTemplate = textTemplate;
            return this;
        }

        public Builder withExcludedInferenceMetadataKeys(List<String> excludedInferenceMetadataKeys) {
            Assert.notNull(excludedInferenceMetadataKeys, "Excluded inference metadata keys must not be null");
            this.excludedInferenceMetadataKeys = excludedInferenceMetadataKeys;
            return this;
        }

        public Builder withExcludedInferenceMetadataKeys(String... keys) {
            Assert.notNull(keys, "Excluded inference metadata keys must not be null");
            this.excludedInferenceMetadataKeys.addAll(Arrays.asList(keys));
            return this;
        }

        public Builder withExcludedEmbedMetadataKeys(List<String> excludedEmbedMetadataKeys) {
            Assert.notNull(excludedEmbedMetadataKeys, "Excluded Embed metadata keys must not be null");
            this.excludedEmbedMetadataKeys = excludedEmbedMetadataKeys;
            return this;
        }

        public Builder withExcludedEmbedMetadataKeys(String... keys) {
            Assert.notNull(keys, "Excluded Embed metadata keys must not be null");
            this.excludedEmbedMetadataKeys.addAll(Arrays.asList(keys));
            return this;
        }

        public DefaultContentFormatter build() {
            return new DefaultContentFormatter(this);
        }
    }
}
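The template mechanics of `format(...)` can be replayed without the library. The sketch below (our own `FormatterSketch`, reusing the same default templates and placeholder names) shows how a metadata map and the document text are spliced together:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FormatterSketch {

    static final String METADATA_TEMPLATE = "{key}: {value}";
    static final String TEXT_TEMPLATE = "{metadata_string}\n\n{content}";

    // Mirrors DefaultContentFormatter#format with MetadataMode.ALL:
    // render each metadata entry through the metadata template, join the
    // entries with the separator, then splice into the text template.
    static String format(Map<String, Object> metadata, String content) {
        String metadataText = metadata.entrySet().stream()
                .map(e -> METADATA_TEMPLATE
                        .replace("{key}", e.getKey())
                        .replace("{value}", e.getValue().toString()))
                .collect(Collectors.joining(System.lineSeparator()));
        return TEXT_TEMPLATE
                .replace("{metadata_string}", metadataText)
                .replace("{content}", content);
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("source", "a.txt");
        metadata.put("page", 1);
        System.out.println(format(metadata, "Hello ETL"));
    }
}
```

With the default templates, the formatted text is the metadata block, a blank line, then the content, which is exactly what ends up being embedded or sent to the model.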

KeywordMetadataEnricher

Extracts keywords from each document and adds them to its metadata. The keywords are generated by calling a ChatModel and stored under the document's metadata

  • ChatModel chatModel: the LLM used to generate the keywords
  • int keywordCount: the number of keywords to extract
package org.springframework.ai.model.transformer;

import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.util.Assert;

public class KeywordMetadataEnricher implements DocumentTransformer {

    public static final String CONTEXT_STR_PLACEHOLDER = "context_str";

    public static final String KEYWORDS_TEMPLATE = "{context_str}. Give %s unique keywords for this\ndocument. Format as comma separated. Keywords:";

    private static final String EXCERPT_KEYWORDS_METADATA_KEY = "excerpt_keywords";

    private final ChatModel chatModel;

    private final int keywordCount;

    public KeywordMetadataEnricher(ChatModel chatModel, int keywordCount) {
        Assert.notNull(chatModel, "ChatModel must not be null");
        Assert.isTrue(keywordCount >= 1, "Document count must be >= 1");
        this.chatModel = chatModel;
        this.keywordCount = keywordCount;
    }

    public List<Document> apply(List<Document> documents) {
        for (Document document : documents) {
            PromptTemplate template = new PromptTemplate(String.format(KEYWORDS_TEMPLATE, this.keywordCount));
            Prompt prompt = template.create(Map.of(CONTEXT_STR_PLACEHOLDER, document.getText()));
            String keywords = this.chatModel.call(prompt).getResult().getOutput().getText();
            document.getMetadata().putAll(Map.of(EXCERPT_KEYWORDS_METADATA_KEY, keywords));
        }
        return documents;
    }
}
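Running the real enricher needs a live ChatModel, so the sketch below substitutes the model call with a pluggable function to show the prompt construction and the `excerpt_keywords` metadata write; `KeywordEnricherSketch` and its `llm` parameter are our own stand-ins, not Spring AI API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class KeywordEnricherSketch {

    // Stand-in for the ChatModel round trip: fill the prompt template with the
    // document text and keyword count, then delegate generation to `llm`.
    static Map<String, Object> enrich(String documentText, int keywordCount, UnaryOperator<String> llm) {
        String prompt = String.format(
                "%s. Give %s unique keywords for this%ndocument. Format as comma separated. Keywords:",
                documentText, keywordCount);
        Map<String, Object> metadata = new HashMap<>();
        metadata.put("excerpt_keywords", llm.apply(prompt));
        return metadata;
    }

    public static void main(String[] args) {
        // A canned "model" that always answers with two fixed keywords.
        Map<String, Object> meta = enrich("Spring AI ETL pipeline", 2, prompt -> "spring, etl");
        System.out.println(meta);
    }
}
```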

SummaryMetadataEnricher

Extracts summaries from documents and adds them as metadata. It can produce summaries of the current, previous, and next document, and stores them in each document's metadata

  • ChatModel chatModel: the LLM used to generate the summaries

  • List<SummaryType> summaryTypes: which summaries to produce (PREVIOUS, CURRENT, NEXT)

  • MetadataMode metadataMode: controls how document content is formatted

    • ALL: include all metadata (author, page number, title, etc.); suited to scenarios that need rich context
    • EMBED: only embedding-related metadata; typically used for vector-store retrieval, keeping out metadata that does not help the embedding
    • INFERENCE: only inference-related metadata; suited to reasoning and Q&A scenarios, filtering out irrelevant content
    • NONE: plain text only, no metadata; suited to scenarios that only care about the body text
  • String summaryTemplate: the template used to generate the summary

package org.springframework.ai.model.transformer;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.ai.document.MetadataMode;
import org.springframework.util.Assert;
import org.springframework.util.CollectionUtils;

public class SummaryMetadataEnricher implements DocumentTransformer {

    public static final String DEFAULT_SUMMARY_EXTRACT_TEMPLATE = "Here is the content of the section:\n{context_str}\n\nSummarize the key topics and entities of the section.\n\nSummary:";

    private static final String SECTION_SUMMARY_METADATA_KEY = "section_summary";

    private static final String NEXT_SECTION_SUMMARY_METADATA_KEY = "next_section_summary";

    private static final String PREV_SECTION_SUMMARY_METADATA_KEY = "prev_section_summary";

    private static final String CONTEXT_STR_PLACEHOLDER = "context_str";

    private final ChatModel chatModel;

    private final List<SummaryType> summaryTypes;

    private final MetadataMode metadataMode;

    private final String summaryTemplate;

    public SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes) {
        this(chatModel, summaryTypes, DEFAULT_SUMMARY_EXTRACT_TEMPLATE, MetadataMode.ALL);
    }

    public SummaryMetadataEnricher(ChatModel chatModel, List<SummaryType> summaryTypes, String summaryTemplate,
            MetadataMode metadataMode) {
        Assert.notNull(chatModel, "ChatModel must not be null");
        Assert.hasText(summaryTemplate, "Summary template must not be empty");
        this.chatModel = chatModel;
        this.summaryTypes = CollectionUtils.isEmpty(summaryTypes) ? List.of(SummaryType.CURRENT) : summaryTypes;
        this.metadataMode = metadataMode;
        this.summaryTemplate = summaryTemplate;
    }

    public List<Document> apply(List<Document> documents) {
        List<String> documentSummaries = new ArrayList<>();

        for (Document document : documents) {
            String documentContext = document.getFormattedContent(this.metadataMode);
            Prompt prompt = new PromptTemplate(this.summaryTemplate)
                .create(Map.of(CONTEXT_STR_PLACEHOLDER, documentContext));
            documentSummaries.add(this.chatModel.call(prompt).getResult().getOutput().getText());
        }

        for (int i = 0; i < documentSummaries.size(); i++) {
            Map<String, Object> summaryMetadata = this.getSummaryMetadata(i, documentSummaries);
            documents.get(i).getMetadata().putAll(summaryMetadata);
        }

        return documents;
    }

    private Map<String, Object> getSummaryMetadata(int i, List<String> documentSummaries) {
        Map<String, Object> summaryMetadata = new HashMap<>();
        if (i > 0 && this.summaryTypes.contains(SummaryType.PREVIOUS)) {
            summaryMetadata.put(PREV_SECTION_SUMMARY_METADATA_KEY, documentSummaries.get(i - 1));
        }
        if (i < documentSummaries.size() - 1 && this.summaryTypes.contains(SummaryType.NEXT)) {
            summaryMetadata.put(NEXT_SECTION_SUMMARY_METADATA_KEY, documentSummaries.get(i + 1));
        }
        if (this.summaryTypes.contains(SummaryType.CURRENT)) {
            summaryMetadata.put(SECTION_SUMMARY_METADATA_KEY, documentSummaries.get(i));
        }
        return summaryMetadata;
    }

    public enum SummaryType {

        PREVIOUS, CURRENT, NEXT

    }
}
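The prev/current/next windowing in `getSummaryMetadata` can be isolated from the LLM calls. A self-contained sketch (our own `SummaryWindowSketch`, assuming all three summary types are enabled):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SummaryWindowSketch {

    // Mirrors getSummaryMetadata() with all SummaryTypes enabled: for document i,
    // attach the previous, current and next summaries where they exist.
    static Map<String, Object> summaryMetadata(int i, List<String> summaries) {
        Map<String, Object> metadata = new HashMap<>();
        if (i > 0) {
            metadata.put("prev_section_summary", summaries.get(i - 1));
        }
        if (i < summaries.size() - 1) {
            metadata.put("next_section_summary", summaries.get(i + 1));
        }
        metadata.put("section_summary", summaries.get(i));
        return metadata;
    }

    public static void main(String[] args) {
        List<String> summaries = List.of("s0", "s1", "s2");
        // The middle document gets all three keys; boundary documents get fewer.
        System.out.println(summaryMetadata(1, summaries));
    }
}
```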

DocumentWriter (document writing interface)

package org.springframework.ai.document;

import java.util.List;
import java.util.function.Consumer;

public interface DocumentWriter extends Consumer<List<Document>> {
    default void write(List<Document> documents) {
        this.accept(documents);
    }
}

FileDocumentWriter

Writes the content of a list of Document objects to a given file, with support for appending, per-document markers, and metadata formatting

  • String fileName: name of the file to write
  • boolean withDocumentMarkers: whether to include per-document markers (document index, page numbers)
  • MetadataMode metadataMode: controls how document content is formatted
  • boolean append: whether to append to the end of the file instead of overwriting it

| Method | Description |
| --- | --- |
| FileDocumentWriter | Constructors taking the file name, marker flag, metadata mode, and append flag |
| accept | Writes the documents to the file, with optional markers and metadata formatting |
package org.springframework.ai.writer;

import java.io.FileWriter;
import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentWriter;
import org.springframework.ai.document.MetadataMode;
import org.springframework.util.Assert;

public class FileDocumentWriter implements DocumentWriter {

    public static final String METADATA_START_PAGE_NUMBER = "page_number";

    public static final String METADATA_END_PAGE_NUMBER = "end_page_number";

    private final String fileName;

    private final boolean withDocumentMarkers;

    private final MetadataMode metadataMode;

    private final boolean append;

    public FileDocumentWriter(String fileName) {
        this(fileName, false, MetadataMode.NONE, false);
    }

    public FileDocumentWriter(String fileName, boolean withDocumentMarkers) {
        this(fileName, withDocumentMarkers, MetadataMode.NONE, false);
    }

    public FileDocumentWriter(String fileName, boolean withDocumentMarkers, MetadataMode metadataMode, boolean append) {
        Assert.hasText(fileName, "File name must have a text.");
        Assert.notNull(metadataMode, "MetadataMode must not be null.");
        this.fileName = fileName;
        this.withDocumentMarkers = withDocumentMarkers;
        this.metadataMode = metadataMode;
        this.append = append;
    }

    public void accept(List<Document> docs) {
        try (FileWriter writer = new FileWriter(this.fileName, this.append)) {
            int index = 0;
            for (Document doc : docs) {
                if (this.withDocumentMarkers) {
                    writer.write(String.format("%n### Doc: %s, pages:[%s,%s]\n", index,
                            doc.getMetadata().get(METADATA_START_PAGE_NUMBER),
                            doc.getMetadata().get(METADATA_END_PAGE_NUMBER)));
                }
                writer.write(doc.getFormattedContent(this.metadataMode));
                index++;
            }
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
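The marker-and-content write loop can be exercised with plain strings instead of `Document` objects. A minimal sketch (our own `FileWriterSketch`; it writes to a temp file rather than a fixed name, and wraps I/O errors in a RuntimeException the same way `accept` does):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class FileWriterSketch {

    // Mirrors FileDocumentWriter#accept: optional marker line per document,
    // then the (here: plain-text) content, all through one FileWriter.
    static void write(Path target, List<String> contents, boolean withMarkers) {
        try (FileWriter writer = new FileWriter(target.toFile(), false)) {
            int index = 0;
            for (String content : contents) {
                if (withMarkers) {
                    writer.write(String.format("%n### Doc: %s%n", index));
                }
                writer.write(content);
                index++;
            }
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Convenience for the demo: write the contents to a temp file and read it back.
    static String roundTrip(List<String> contents, boolean withMarkers) {
        try {
            Path file = Files.createTempFile("docs", ".txt");
            write(file, contents, withMarkers);
            return Files.readString(file);
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(List.of("first", "second"), true));
    }
}
```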

VectorStore

VectorStore extends the DocumentWriter interface; see the earlier chapter 《Vector Databases》 for details.
