[RAG][query_engine01] Multi-Document Auto-Retrieval Analysis

news 2026/5/13 15:55:07

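1. Case Objectives

This case shows how to implement Structured Hierarchical Retrieval, an advanced architecture for multi-document RAG (retrieval-augmented generation). Given a user query, the architecture first selects the relevant documents dynamically and then selects the relevant content from within those documents.

The main objectives are to:

- Demonstrate how to represent each document as a concise metadata dictionary with several attributes.
- Show how to store these metadata dictionaries in a vector database so that they can be used as filters.
- Explain how to implement auto-retrieval, which infers both the semantic query and the set of filters from the user query.
- Show how text-to-SQL-style structured filtering and semantic search can be combined to query the data.

2. Tech Stack and Core Dependencies

Core framework:

- LlamaIndex: query engine and indexing framework
- VectorIndexAutoRetriever: auto-retriever
- Weaviate: vector database

Key components:

- IndexNode: index node
- SummaryIndex: summary index
- MetadataFilters: metadata filters
- GitHubRepositoryIssuesReader: GitHub issues reader

3. Environment Setup

Installation steps (notebook cells):

    %pip install llama-index-readers-github
    %pip install llama-index-vector-stores-weaviate
    %pip install llama-index-llms-openai
    !pip install llama-index llama-hub

API key configuration:

    import os

    # Configure the GitHub and OpenAI API keys
    os.environ["GITHUB_TOKEN"] = "ghp_..."
    os.environ["OPENAI_API_KEY"] = "sk-..."

Import the required modules:

    import nest_asyncio
    import weaviate

    from llama_index.readers.github import (
        GitHubRepositoryIssuesReader,
        GitHubIssuesClient,
    )
    from llama_index.vector_stores.weaviate import WeaviateVectorStore
    from llama_index.core import VectorStoreIndex, StorageContext, SummaryIndex
    from llama_index.core.async_utils import run_jobs
    from llama_index.llms.openai import OpenAI
    from llama_index.core.schema import IndexNode
    from llama_index.core.vector_stores import (
        FilterOperator,
        MetadataFilter,
        MetadataFilters,
        MetadataInfo,
        VectorStoreInfo,
    )
    from llama_index.core.retrievers import VectorIndexAutoRetriever
    from llama_index.core.query_engine import RetrieverQueryEngine

Async environment configuration:

    # Apply nest_asyncio so that async calls work inside a notebook event loop
    nest_asyncio.apply()

4. Case Implementation

4.1 Data Preparation and Loading

Load the issue data from a GitHub repository:

    # Create the GitHub client and the loader
    github_client = GitHubIssuesClient()
    loader = GitHubRepositoryIssuesReader(
        github_client,
        owner="run-llama",
        repo="llama_index",
        verbose=True,
    )

    # Load the data
    orig_docs = loader.load_data()

    # Cap the number of documents and attach an integer index ID to each one
    limit = 100
    docs = []
    for idx, doc in enumerate(orig_docs):
        doc.metadata["index_id"] = int(doc.id_)
        if idx >= limit:
            break
        docs.append(doc)

4.2 Vector Database Setup

Configure the Weaviate vector database and create the storage context:

    # Configure the Weaviate client (v3-style client API).
    # Replace the API key and cluster URL with those of your own instance.
    auth_config = weaviate.AuthApiKey(
        api_key="XRa15cDIkYRT7AkrpqT6jLfE4wropK1c1TGk"
    )
    client = weaviate.Client(
        "https://llama-index-test-v0oggsoz.weaviate.network",
        auth_client_secret=auth_config,
    )

    # Create the vector store and the storage context
    class_name = "LlamaIndex_docs"
    vector_store = WeaviateVectorStore(
        weaviate_client=client, index_name=class_name
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Build the index over the raw documents
    doc_index = VectorStoreIndex.from_documents(
        docs, storage_context=storage_context
    )

4.3 Creating Index Nodes

For each document, create an index node that carries a summary and the structured metadata:

    async def aprocess_doc(doc, include_summary: bool = True):
        """Process one document and build its index node."""
        metadata = doc.metadata

        # Extract the creation date (ISO format, e.g. 2024-01-11T...)
        date_tokens = metadata["created_at"].split("T")[0].split("-")
        year = int(date_tokens[0])
        month = int(date_tokens[1])
        day = int(date_tokens[2])

        # Extract the assignee and the size label, defaulting to ""
        assignee = (
            "" if "assignee" not in doc.metadata else doc.metadata["assignee"]
        )
        size = ""
        if len(doc.metadata["labels"]) > 0:
            size_arr = [l for l in doc.metadata["labels"] if "size:" in l]
            size = size_arr[0].split(":")[1] if len(size_arr) > 0 else ""

        # Build the new, compact metadata dictionary
        new_metadata = {
            "state": metadata["state"],
            "year": year,
            "month": month,
            "day": day,
            "assignee": assignee,
            "size": size,
        }

        # Generate a one-sentence summary of the document
        summary_index = SummaryIndex.from_documents([doc])
        query_str = "Give a one-sentence concise summary of this issue."
        query_engine = summary_index.as_query_engine(
            llm=OpenAI(model="gpt-3.5-turbo")
        )
        summary_txt = await query_engine.aquery(query_str)
        summary_txt = str(summary_txt)

        # Build a filter that restricts retrieval to this document only
        index_id = doc.metadata["index_id"]
        filters = MetadataFilters(
            filters=[
                MetadataFilter(
                    key="index_id",
                    operator=FilterOperator.EQ,
                    value=int(index_id),
                ),
            ]
        )

        # Create the index node: the summary text points to a retriever
        # over the raw document index, scoped by the filter above
        index_node = IndexNode(
            text=summary_txt,
            metadata=new_metadata,
            obj=doc_index.as_retriever(filters=filters),
            index_id=doc.id_,
        )
        return index_node


    # The text calls aprocess_docs without showing it; this is a minimal
    # version built on the run_jobs helper imported above (worker count is
    # an assumption).
    async def aprocess_docs(docs):
        """Process all documents concurrently and return their index nodes."""
        tasks = [aprocess_doc(doc) for doc in docs]
        return await run_jobs(tasks, show_progress=True, workers=3)


    # Process the documents in batch (top-level await works in a notebook)
    index_nodes = await aprocess_docs(docs)

4.4 Creating the Auto-Retrieval Index

Create a separate vector index for the summaries and their metadata:

    # Create a new vector store for the summary metadata
    class_name = "LlamaIndex_auto"
    vector_store_auto = WeaviateVectorStore(
        weaviate_client=client, index_name=class_name
    )
    storage_context_auto = StorageContext.from_defaults(
        vector_store=vector_store_auto
    )

    # Build the summary index from the index nodes
    index = VectorStoreIndex(
        objects=index_nodes, storage_context=storage_context_auto
    )

4.5 Defining the Vector Store Schema

Define the schema of the vector store, including a description of every metadata field. The auto-retriever uses these descriptions to decide which filters to infer:

    vector_store_info = VectorStoreInfo(
        content_info="Github Issues",
        metadata_info=[
            MetadataInfo(
                name="state",
                description="Whether the issue is open or closed",
                type="string",
            ),
            MetadataInfo(
                name="year",
                description="The year issue was created",
                type="integer",
            ),
            MetadataInfo(
                name="month",
                description="The month issue was created",
                type="integer",
            ),
            MetadataInfo(
                name="day",
                description="The day issue was created",
                type="integer",
            ),
            MetadataInfo(
                name="assignee",
                description="The assignee of the ticket",
                type="string",
            ),
            MetadataInfo(
                name="size",
                description="How big the issue is (XS, S, M, L, XL, XXL)",
                type="string",
            ),
        ],
    )

4.6 Creating the Auto-Retriever

Create the auto-retriever from the schema defined above:

    # Create the auto-retriever
    retriever = VectorIndexAutoRetriever(
        index,
        vector_store_info=vector_store_info,
        similarity_top_k=2,
        empty_query_top_k=10,  # limit used when only metadata filters are inferred
        verbose=True,
    )

4.7 Creating the Query Engine

Combine the retriever with a query engine to obtain the complete RAG pipeline:

    # Create the query engine
    llm = OpenAI(model="gpt-3.5-turbo")
    query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)

    # Run a query
    response = query_engine.query("Tell me about some issues on 01/11")
    print(str(response))

5. Results

5.1 Auto-Retrieval

The system infers the semantic query and the metadata filters from the user query:

    # Example output
    Using query str: issues
    Using filters: [('day', '==', '11'), ('month', '==', '01')]

5.2 Structured Queries

The system can run structured queries based on metadata:

    # Example output
    Using query str: agents
    Using filters: [('state', '==', 'open')]

5.3 Hierarchical Retrieval

The system first retrieves the relevant documents and then retrieves specific content from within them:

    # Example output
    Retrieval entering 9995: VectorIndexRetriever
    Retrieving from object VectorIndexRetriever with query issues
    Retrieval entering 9985: VectorIndexRetriever
    Retrieving from object VectorIndexRetriever with query issues

5.4 Synthesized Response

The system generates a synthesized response from the retrieved content:

    # Example output
    There are two issues that were created on 01/11. The first issue is related to ensuring backwards compatibility with the new Pinecone client version bifurcation. The second issue is a feature request to implement the Language Agent Tree Search (LATS) agent in llama-index.

6. Implementation Approach

6.1 Overall Architecture

The case proceeds through the following stages:

- Data preparation: load the issue data from GitHub and process its metadata.
- Summary generation: generate a concise summary for each document.
- Index node creation: create index nodes that carry the summary and the metadata.
- Two-level indexing: build one index over the raw documents and another over the summaries and metadata.
- Auto-retrieval: infer the semantic query and the filters from the user query.
- Hierarchical retrieval: retrieve the relevant documents first, then retrieve specific content from within them.
- Response generation: synthesize a response from the retrieved content.

6.2 Key Technical Points

- Structured metadata: each document is represented as a metadata dictionary with several attributes.
- Auto-retrieval: text-to-SQL-style structured filtering is combined with semantic search to query the data.
- Hierarchical retrieval: relevant documents are selected first, then content is selected from within them.
- Index nodes: IndexNode links each summary to a retriever over the original document.

6.3 Distinctive Ideas

- Represent each document as a concise metadata dictionary with several attributes.
- Store the metadata dictionaries in the vector database so they can be used as filters.
- Implement auto-retrieval that infers the semantic query and the set of filters.
- Combine text-to-SQL-style structured filtering with semantic search to query the data.

7. Suggested Extensions

Functional extensions:

- Support more metadata fields and types.
- Add custom filter operators.
- Implement multi-level hierarchical retrieval.
- Support cross-document relational queries.
- Add sorting and pagination of query results.

Performance optimization:

- Optimize summary generation.
- Process documents in parallel.
- Cache query results.
- Optimize the vector index structure.
- Support incremental index updates.

User experience improvements:

- Provide a visual interface for the query process.
- Highlight and annotate query results.
- Implement query suggestions and auto-completion.
- Provide query history and bookmarking.
- Add export of query results.

Additional application scenarios:

- Enterprise knowledge base retrieval
- Academic paper search
- Legal document retrieval
- Medical record lookup
- Customer support systems

8. Summary

The multi-document auto-retrieval case demonstrates an advanced RAG architecture that queries a multi-document collection through structured hierarchical retrieval. Because documents are selected by inferred metadata filters before any content is retrieved from them, irrelevant documents are excluded early, which improves the precision of the retrieved context.

The core value of the case is that it:

- Provides a systematic way to handle multi-document RAG.
- Achieves more precise document selection through structured metadata.
- Combines the strengths of semantic search and structured queries.
- Automates query inference and execution.

Note: this structured hierarchical retrieval method is particularly well suited to collections with a large number of documents in which every document has rich metadata. As data volumes and query complexity grow, hierarchical retrieval based on structured metadata is a practical path for building retrieval systems that must answer queries over many documents.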
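To make the two-stage idea concrete without any external services, the following is a toy, dependency-free sketch of hierarchical retrieval. It is not the LlamaIndex implementation: the `DocRecord` class, the `matches`, `overlap`, and `hierarchical_retrieve` helpers, and the sample records are all hypothetical, exact-match filtering stands in for the inferred metadata filters, and word overlap stands in for vector similarity.

```python
from dataclasses import dataclass, field


@dataclass
class DocRecord:
    """One document: a short summary, a metadata dict, and its raw chunks."""
    doc_id: int
    summary: str
    metadata: dict
    chunks: list = field(default_factory=list)


def matches(metadata: dict, filters: dict) -> bool:
    """Exact-match filtering, a stand-in for EQ metadata filters."""
    return all(metadata.get(key) == value for key, value in filters.items())


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct words shared with the query."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def hierarchical_retrieve(docs, query, filters, top_k_docs=2, top_k_chunks=1):
    # Stage 1: keep documents whose metadata passes the filters,
    # ranked by how well their summary matches the query.
    candidates = [d for d in docs if matches(d.metadata, filters)]
    candidates.sort(key=lambda d: overlap(query, d.summary), reverse=True)

    # Stage 2: inside each selected document, pick the best-matching chunks.
    results = []
    for doc in candidates[:top_k_docs]:
        ranked = sorted(doc.chunks, key=lambda c: overlap(query, c), reverse=True)
        results.extend((doc.doc_id, chunk) for chunk in ranked[:top_k_chunks])
    return results


# Made-up sample records, for illustration only.
docs = [
    DocRecord(1, "pinecone client compatibility issue",
              {"month": 1, "day": 11, "state": "open"},
              ["support both pinecone client versions", "unrelated ci note"]),
    DocRecord(2, "feature request for lats agent",
              {"month": 1, "day": 11, "state": "open"},
              ["implement lats agent", "docs typo"]),
    DocRecord(3, "pinecone docs typo",
              {"month": 2, "day": 3, "state": "closed"},
              ["fix pinecone typo"]),
]

hits = hierarchical_retrieve(docs, "pinecone client", {"month": 1, "day": 11})
for doc_id, chunk in hits:
    print(doc_id, chunk)
```

In this sketch, document 3 mentions Pinecone but is never considered because its metadata fails the date filter. The same early exclusion is what the metadata filters provide in the real pipeline.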
