Question: How to display references or citations in a Streamlit application?
Background:
I have created an Azure AI Search service, built an index from Azure Blob Storage, deployed a web application, and set up a chat playground using Azure OpenAI.
Similarly, I have built a Streamlit application in VS Code. In the application I upload a document, ask a query, and get an answer based on the uploaded document using the Azure AI Search index and Azure OpenAI. The one thing I still want is for the reference/citation to be displayed below the answer.
It should be the page/source information from which the answer is extracted.
The fields in my index are:
- content: Full text content of the document.
- metadata_storage_path: Path where the document is stored.
- metadata_author: Author of the document.
- metadata_title: Title of the document.
- metadata_creation_date: Creation date of the document.
- language: Language of the document.
- split_text: Segmented parts of the document text.
- keywords: Keywords extracted from the document.
- summary: Summary of the document content.
- section_titles: Titles of sections within the document.
- metadata_file_type: File type of the document (e.g., PDF, DOCX).
- merged_content: Combined content from different parts of the document.
- text: Main text of the document.
- layoutText: Text layout information of the document.
My code is here:
import os
import streamlit as st
from openai import AzureOpenAI
from azure.identity import AzureCliCredential
from azure.core.credentials import AccessToken
# Environment variables
endpoint = os.getenv("ENDPOINT_URL", "https://****************.azure.com/")
deployment = os.getenv("DEPLOYMENT_NAME", "openai-gpt-35-1106")
search_endpoint = os.getenv("SEARCH_ENDPOINT", "https://****************windows.net")
search_key = os.getenv("SEARCH_KEY", "********************************")
search_index = os.getenv("SEARCH_INDEX_NAME", "azureblob-index")
# Setup Azure OpenAI client
credential = AzureCliCredential()
def get_bearer_token() -> str:
    token = credential.get_token("https://****************windows.net")
    return token.token
client = AzureOpenAI(
    azure_endpoint=endpoint,
    azure_ad_token_provider=get_bearer_token,
    api_version="2024-05-01-preview"
)
# Streamlit UI
st.title("Document Uploader and Query Tool")
# File upload
uploaded_file = st.file_uploader("Upload a document", type=["pdf", "docx", "pptx", "xlsx", "txt"])
if uploaded_file is not None:
    file_content = uploaded_file.read()
    st.write("Document uploaded successfully!")
# Send query to Azure AI Search and OpenAI
query = st.text_input("Enter your query:")
if st.button("Get Answer"):
    if query:
        try:
            completion = client.chat.completions.create(
                model=deployment,
                messages=[
                    {
                        "role": "user",
                        "content": query
                    }
                ],
                max_tokens=800,
                temperature=0,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0,
                stop=None,
                stream=False,
                extra_body={
                    "data_sources": [{
                        "type": "azure_search",
                        "parameters": {
                            "endpoint": search_endpoint,
                            "index_name": search_index,
                            "semantic_configuration": "docs_test",
                            "query_type": "semantic",
                            "fields_mapping": {
                                "content_fields_separator": "\n",
                                "content_fields": ["content", "merged_content"]
                            },
                            "in_scope": True,
                            "role_information": "You are an AI assistant that helps people find information. The information should be small and crisp. It should be accurate.",
                            "authentication": {
                                "type": "api_key",
                                "key": search_key
                            }
                        }
                    }]
                }
            )
            response = completion.to_dict()
            answer = response["choices"][0]["message"]["content"]
            references = []
            if "references" in response["choices"][0]["message"]:
                references = response["choices"][0]["message"]["references"]
            st.write("Response from OpenAI:")
            st.write(answer)
            if references:
                st.write("References:")
                for i, ref in enumerate(references):
                    st.write(f"{i + 1}. {ref['title']} ({ref['url']})")
        except Exception as e:
            st.error(f"Error: {e}")
    else:
        st.warning("Please enter a query.")
The answer that I am getting looks like this:
The primary purpose of Tesla's existence, as stated in the 2019 Impact Report, is to accelerate the world's transition to sustainable energy [doc1]. This mission is aimed at minimizing the environmental impact of products and their components, particularly in the product-use phase, by providing information on both the manufacturing and consumer-use aspects of Tesla products [doc1].
Here [doc1] is a placeholder for the source information. But I want it to look like this:
Reference: the source information/page from which the answer is extracted.
Can you help? Thanks in advance!
Solution:
You can use the code below to extract the title and URL from the references.
The [doc1] marker in the content of the message object is itself the reference: doc1 means the first document in the citations of the message context.
So the code below helps you extract it.
First, find the unique references.
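import re
# simple_res is the chat completion response (named completion in the question's code)
pattern = r'\[(.*?)\]'
text = simple_res.choices[0].message.content
matches = re.findall(pattern, text)
# Keep only the docN markers, de-duplicated and sorted for a stable order
documents = sorted(set(match for match in matches if match.startswith('doc')))
print(documents)
Output: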
['doc1']
Next, build a dictionary of citations. The citations are mapped in increasing order: doc1 is the first citation, doc2 is the second, and so on.
references = {}
# Map doc1, doc2, ... to the citations in the order they are returned
for i, citation in enumerate(simple_res.choices[0].message.context['citations']):
    references[f"doc{i+1}"] = citation
Now fetch the title and URL.
if references:
    print("References:")
    for i, ref in enumerate(documents):
        # Look up each cited marker in the citation dictionary
        print(f"{i + 1}. {references[ref]['title']} ({references[ref]['url']})")
Output:
References:
1. 78782543_7_23_24.html (https://xxxxx.blob.core.windows.net/data/pdf/78782543_7_23_24.html)
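The three steps above can also be folded into one helper. This is only a sketch, not the official way to do it: the function name render_with_references is made up for illustration, and it assumes each citation is a dict with title and url keys, as in the output above. It rewrites every [docN] marker in the answer as a plain number and returns the matching reference lines, skipping markers that point past the end of the citations list.

```python
import re

# Matches markers such as [doc1], [doc12]
DOC_MARKER = re.compile(r"\[doc(\d+)\]")


def render_with_references(answer, citations):
    """Return (answer_text, reference_lines) with [docN] rewritten as [k].

    `citations` is assumed to be a list of dicts with "title" and "url" keys.
    """
    numbering = {}  # citation index (0-based) -> display number, in order of first use

    def renumber(match):
        idx = int(match.group(1)) - 1  # [doc1] refers to citations[0]
        if not 0 <= idx < len(citations):
            return match.group(0)  # leave markers without a citation untouched
        number = numbering.setdefault(idx, len(numbering) + 1)
        return f"[{number}]"

    text = DOC_MARKER.sub(renumber, answer)

    lines = []
    for idx, number in numbering.items():
        citation = citations[idx]
        title = citation.get("title") or "Untitled"
        url = citation.get("url") or ""
        lines.append(f"{number}. {title} ({url})" if url else f"{number}. {title}")
    return text, lines
```

In the Streamlit app from the question, this could be called with answer and response["choices"][0]["message"]["context"]["citations"], then the returned text and each reference line passed to st.write. That wiring is untested here and assumes the response carries a context with citations.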