Hands-On Programming with Local LLMs (22): Building a Question-Answering System over SQL Data with langchain (1)

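Enabling an LLM (large language model) system to query structured data can be qualitatively different from querying unstructured text. With unstructured text, the usual approach is to generate text that can be searched in a vector database; with structured data, the usual approach is to have the LLM write and execute queries in a DSL such as SQL.
We will walk through a basic way to build a question-answering system over data in a structured SQLite database using a langchain chain. Once the system is in place, we can ask questions about the data in natural language and get natural-language answers back.
Later we will implement similar functionality with an agent. The main difference between the two is that an agent can query the database in a loop as many times as it needs to answer the question.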

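Implementing this takes the following steps:

Convert the question into a DSL query: the model turns the user input into an SQL query;
Execute the SQL query;
Answer the question: the model uses the query result to respond to the user input.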
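
The three steps above can be sketched with the standard library alone. This is only an illustration of the data flow: `fake_llm_write_sql` and `fake_llm_answer` are hypothetical stand-ins for the model calls, and the in-memory table is toy data, not the Chinook database.

```python
import sqlite3


def fake_llm_write_sql(question: str) -> str:
    """Hypothetical stand-in for step 1: a real system would ask the model."""
    return "SELECT COUNT(*) FROM Employee;"


def fake_llm_answer(question: str, rows: list) -> str:
    """Hypothetical stand-in for step 3: a real system would ask the model."""
    return f"There are {rows[0][0]} employees."


def answer_question(conn: sqlite3.Connection, question: str) -> str:
    sql = fake_llm_write_sql(question)       # 1. question -> SQL
    rows = conn.execute(sql).fetchall()      # 2. execute the SQL
    return fake_llm_answer(question, rows)   # 3. rows -> natural language


# Toy in-memory table, not the Chinook data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, LastName TEXT)")
conn.executemany("INSERT INTO Employee (LastName) VALUES (?)", [("Adams",), ("Edwards",)])

answer = answer_question(conn, "How many Employees are there?")
print(answer)
```

In the rest of the article, langchain and a local model take the place of the two stand-in functions.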

The experiments use qwen2.5, deepseek, and llama3.1.

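Preparation

Before we start writing code, we need to set up the programming environment.

Computer
All the code in this article can run in an environment without GPU memory. My machine is configured as follows:
- CPU: Intel i5-8400 2.80GHz
- Memory: 16GB
Visual Studio Code and venv
This is a popular development tool, and the code in this series can be developed and debugged in Visual Studio Code. We use Python's venv to create a virtual environment. For details, see:
Configuring venv in Visual Studio Code.
Ollama
Deploying local LLMs on the Ollama platform is very convenient. With it, langchain can use local models such as llama3.1, qwen2.5, and deepseek. For details, see:
Using a locally deployed llama3.1 model in langchain.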

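Preparing the `SQLite` database

SQLite is a lightweight, embedded relational database management system. It needs no separate server process, and all data is stored in a single file. It supports most of the SQL standard and suits mobile apps, local storage, and small projects.

We will practice with the Chinook database. The database file, Chinook.db, is in the assert folder of the code repository mentioned at the end of this article.
The figure below shows the structure of the database:
[Figure: SQLite data structure]

SQLiteStudio (sqlitestudio) is a visual management tool for SQLite that you can download, and Sample Databases for SQLite describes this database in detail.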

Creating the database instance

from langchain_community.utilities import SQLDatabase

# db_file_path must hold the path to the Chinook.db file described above
db = SQLDatabase.from_uri(f"sqlite:///{db_file_path}")

Testing the database

print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT * FROM Artist LIMIT 1;"))

sqlite
['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']
[(1, 'AC/DC')]

This output shows that SQLite is working properly.
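
As an independent cross-check, the same kind of sanity test can be run with Python's built-in sqlite3 module, without langchain. The sketch below uses a toy in-memory database as a stand-in for Chinook.db, and `list_tables` is a hypothetical helper written for this example.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list:
    """Return user table names, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Toy in-memory database standing in for Chinook.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)")
conn.execute("INSERT INTO Artist (Name) VALUES ('AC/DC')")

tables = list_tables(conn)
first_artist = conn.execute("SELECT * FROM Artist LIMIT 1").fetchall()
print(tables)
print(first_artist)
```

To run it against the real file, replace ":memory:" with the path to Chinook.db and drop the CREATE and INSERT statements.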

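Converting the question into SQL

In langchain, create_sql_query_chain makes it easy to convert a question into SQL, and the db.run method executes the SQL. Building on these two, the function below converts a question into SQL and executes it: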

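# Import paths below are assumptions; adjust them to your installed packages
from langchain.chains import create_sql_query_chain
from langchain_ollama import ChatOllama

def execute_query(llm_model_name, question: str):
    """Convert the question into an SQL statement and execute it."""

    llm = ChatOllama(model=llm_model_name, temperature=0, verbose=True)
    chain = create_sql_query_chain(llm, db)
    # chain.get_prompts()[0].pretty_print()

    # Convert the question into SQL
    response = chain.invoke({"question": question})
    print(f'response SQL is:\n{response}')

    # Execute the SQL
    result = db.run(response)
    print(f'result is:\n{result}')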
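
Because the SQL here is written by a model and then executed directly, it may be worth checking it before execution. The following is a minimal sketch of such a guard, not part of langchain: `is_single_select` is a hypothetical helper, and its keyword list is deliberately conservative and incomplete (it will also reject harmless queries that merely contain one of these words in a string literal).

```python
import re

# Keywords that modify data or schema; a conservative, incomplete list
_FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|REPLACE|ATTACH|PRAGMA)\b",
    re.IGNORECASE,
)


def is_single_select(sql: str) -> bool:
    """Return True only for one SELECT statement with no write keywords."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:                       # more than one statement
        return False
    if not statement.upper().startswith("SELECT"):
        return False
    return _FORBIDDEN.search(statement) is None


print(is_single_select("SELECT COUNT(*) FROM Employee;"))
print(is_single_select("SELECT 1; DROP TABLE Employee"))
```

One possible use is to call it on the generated SQL just before db.run and skip execution when it returns False.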

Let's ask a few questions and run a simple test with the three models to see how they do.

Question 1: “How many Employees are there?”

llama3.1

response SQL is:
SELECT COUNT(*) FROM Employee;
result is:
[(8,)]

llama3.1 generated the correct SQL and returned the correct result.
Perfect!

qwen2.5

To find out how many employees there are, you can use the following SQL query:\n\n
```sql\nSELECT COUNT(*) AS EmployeeCount\nFROM Employee;\n```
...

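qwen2.5 inferred the correct SQL, but the SQL is embedded in a block of explanatory text, so executing the response as-is fails with an "SQL syntax error".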
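
One possible workaround for this kind of reply is to pull the SQL out of the markdown code fence before executing it. The sketch below is an assumption about how that could be done, not something langchain provides here: `extract_sql` is a hypothetical helper, and the sample reply is modeled on the qwen2.5 output above with real newlines.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable
# A fence, an optional language tag, a newline, then the block body
_BLOCK = re.compile(FENCE + r"\w*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_sql(text: str) -> str:
    """Return the body of the first fenced code block, or the text itself."""
    match = _BLOCK.search(text)
    return (match.group(1) if match else text).strip()


reply = (
    "To find out how many employees there are, you can use the following SQL query:\n\n"
    + FENCE + "sql\nSELECT COUNT(*) AS EmployeeCount\nFROM Employee;\n" + FENCE
)
sql = extract_sql(reply)
print(sql)
```

A reply that is already plain SQL passes through unchanged, so the helper could sit between the chain output and db.run for any of the models.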

deepseek-r1

It returned a long reasoning trace but did not produce an SQL statement.

Question 2: “Which country’s customers spent the most?”

llama3.1

response SQL is:
SELECT T2.Country FROM Invoice AS T1 INNER JOIN Customer AS T2 ON T1.CustomerId = T2.CustomerId GROUP BY T2.Country ORDER BY SUM(T1.Total) DESC LIMIT 1;
result is:
[('USA',)]

Perfect!

Question 3: “Describe the PlaylistTrack table.”

response SQL is:
SELECT * FROM `PlaylistTrack`
result is:
[(1, 3402), (1, 3389), (1, 3390), ...

Not ideal: the query returns the rows of the table rather than a description of its structure.
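
For comparison, a question about table structure is better served by schema metadata than by row data. In SQLite, one way to get it is PRAGMA table_info. The sketch below runs it on a simplified stand-in for the PlaylistTrack table (an assumption for illustration, not the exact Chinook definition).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the Chinook PlaylistTrack table
conn.execute(
    "CREATE TABLE PlaylistTrack ("
    "PlaylistId INTEGER NOT NULL, TrackId INTEGER NOT NULL, "
    "PRIMARY KEY (PlaylistId, TrackId))"
)

# Each row: (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(PlaylistTrack)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(f"{name}: {col_type}, not null={bool(notnull)}")
```

The column names and types returned this way are the kind of information a good answer to Question 3 would be built from.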

To explore further what create_sql_query_chain does, we can run the following right after creating the chain:

chain.get_prompts()[0].pretty_print()

The printed prompt is:

You are a SQLite expert. Given an input question, first create a syntactically correct SQLite query to run, then look at the results of the query and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most 5 results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Pay attention to use date('now') function to get the current date, if the question involves "today".

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Question: {input}

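As we can see, creating this chain amounts to running the prompt above. We may need to modify the prompt or build a custom chain to get qwen2.5 and deepseek-r1 to work properly.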
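
As a starting point for such a modified prompt, here is a hypothetical stricter template that asks for the SQL only. It is an untested assumption that qwen2.5 or deepseek-r1 would comply with it. The placeholders {input}, {table_info}, and {top_k} are the ones create_sql_query_chain expects a custom prompt to declare; the sketch only renders the text with plain str.format and toy values so we can inspect what the model would receive.

```python
# Hypothetical stricter prompt; {input}, {table_info} and {top_k} are the
# placeholders the chain is expected to fill in.
STRICT_SQL_PROMPT = """You are a SQLite expert. Given an input question, write one syntactically correct SQLite query that answers it.
Return at most {top_k} rows unless the question asks for a specific number.
Respond with the SQL query only: no explanation, no markdown code fences.

Only use the following tables:
{table_info}

Question: {input}
SQLQuery: """

# Render the template with toy values to inspect the final prompt text
rendered = STRICT_SQL_PROMPT.format(
    top_k=5,
    table_info="CREATE TABLE Employee (EmployeeId INTEGER, LastName TEXT)",
    input="How many Employees are there?",
)
print(rendered)
```

To try it with langchain, the template text could be wrapped with PromptTemplate.from_template and passed through the prompt argument of create_sql_query_chain.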

We can also convert the question into SQL and execute it with the more elegant code below:

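# Import path is an assumption; adjust it to your installed packages
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool

def execute_query_2(llm_model_name, question: str):
    """Convert the question into an SQL statement and execute it."""

    llm = ChatOllama(model=llm_model_name, temperature=0, verbose=True)
    execute_query = QuerySQLDataBaseTool(db=db)
    write_query = create_sql_query_chain(llm, db)
    # The chain writes the query and runs it, so the response is the query result
    chain = write_query | execute_query
    response = chain.invoke({"question": question})
    print(f'result is:\n{response}')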

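Answering the question

Now that we can generate and execute queries automatically, we only need to combine the original question with the SQL query result to produce the final answer. We do this by passing the question and the result to the LLM once more: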

from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

def ask(llm_model_name, question: str):
    """Answer the question in natural language from the SQL query result."""
    answer_prompt = PromptTemplate.from_template(
    """Given the following user question, corresponding SQL query, and SQL result, answer the user question.

    Question: {question}
    SQL Query: {query}
    SQL Result: {result}
    Answer: """
    )

    llm = ChatOllama(model=llm_model_name, temperature=0, verbose=True)
    execute_query = QuerySQLDataBaseTool(db=db)
    write_query = create_sql_query_chain(llm, db)
    chain = (
        RunnablePassthrough.assign(query=write_query).assign(
            result=itemgetter("query") | execute_query
        )
        | answer_prompt
        | llm
        | StrOutputParser()
    )

    response = chain.invoke({"question": question})
    print(f'Answer is:\n{response}')

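Let's look at what happens in the LCEL (LangChain Expression Language) code above:

LangChain Expression Language (LCEL) takes a declarative approach to building new Runnables from existing ones. Its most common operator is | , which chains two Runnable components together: the output of the component on the left is passed as the input of the component on the right.
For more, see: LangChain Expression Language (LCEL) .

After the first RunnablePassthrough.assign, the data flowing through the chain is a dict with two elements:
{"question": question, "query": write_query.invoke(question)}
where write_query generates an SQL query to answer the question.
After the second assign, a third element, result, is added. Its value is produced by execute_query.invoke(query), where query is the value computed in the previous step.
These three elements are formatted into the prompt and passed to the LLM.
StrOutputParser() extracts the string content of the output message.
Note: we are composing an LLM, a tool, a prompt, and other chains, but because they all implement the Runnable interface, their inputs and outputs can be wired together: the output of one serves as the input of the next.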
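
To make the data flow concrete, here is a miniature imitation of the pattern in plain Python. It is not how langchain implements Runnable: `Step` and `assign` are hypothetical names invented for this sketch, and the lambdas stand in for write_query, execute_query, and the prompt, LLM, and parser stages.

```python
class Step:
    """Hypothetical miniature of a Runnable: wraps a function, composes with |."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b: feed the output of a into b
        return Step(lambda value: other.invoke(self.invoke(value)))


def assign(**steps):
    """Return a Step that copies the input dict and adds one computed key per step."""
    def run(data):
        return {**data, **{key: step.invoke(data) for key, step in steps.items()}}
    return Step(run)


# Stand-ins for write_query, execute_query, and the prompt + LLM + parser stages
write_query = Step(lambda d: "SELECT COUNT(*) FROM Employee;")
execute_query = Step(lambda sql: "[(8,)]")
answer = Step(lambda d: f"Q: {d['question']} | SQL: {d['query']} | Result: {d['result']}")

chain = (
    assign(query=write_query)
    | assign(result=Step(lambda d: d["query"]) | execute_query)
    | answer
)
output = chain.invoke({"question": "How many Employees are there?"})
print(output)
```

The dict grows from one key to three as it moves through the two assign steps, which mirrors the description above.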

Next, using llama3.1, let's look at what the ask function outputs for the questions.

Question 1: “How many Employees are there?”

There are 8 employees.

Question 2: “Which country’s customers spent the most?”

The country whose customers spent the most is the USA.

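Summary

This walkthrough shows that, with the langchain framework, an LLM (large language model) can generate SQL statements. We can ask a question in plain language, and the computer automatically queries SQLite and reports the result the way a person would.
Unfortunately, qwen2.5 and deepseek-r1 did not integrate well with langchain for this task.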