Phi-3-Mini-128K for Java Development: Integrating an Intelligent Q&A Assistant with SpringBoot in Practice

news2026/3/17 19:28:15

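Recently I have been helping a friend's company with a technical upgrade. They want to give their internal customer service system a "smart brain" so that it can answer common questions automatically and take some pressure off the human agents. The requirements were clear: it has to fit into the existing Java stack, the model cannot be too heavy, responses have to be fast, and it needs to remember some conversation context.

After weighing the options we settled on Phi-3-Mini-128K. The name sounds intimidating, but put simply it is a "small but capable" player. The 128K context window means it can keep a long conversation history in view, and being lightweight means it runs comfortably on ordinary server resources. Most importantly, its inference speed is good, which matters a lot for Q&A scenarios that need quick responses.

In this article I'll walk through how to fit this "smart brain" into a familiar SpringBoot microservice, end to end, and build an enterprise-grade Q&A assistant that can actually be used. No abstract theory here: straight to the code.

1. Why Phi-3-Mini-128K and SpringBoot

When choosing technology, we first have to be clear about what we actually need. The customer service system at my friend's company handles a large volume of repetitive questions every day, such as "How do I check my order?", "What is the return process?" or "How long is the warranty?". Answering these by hand is slow and tiring, and tired agents make mistakes.

What we need is a component that plugs into the existing Java backend, understands natural language, and returns accurate replies quickly. SpringBoot is close to the de facto standard for Java microservices, with a mature ecosystem and simple deployment. On the model side, large models with tens of billions of parameters are powerful, but they demand a lot of compute and respond slowly, which makes them a poor fit for a high-concurrency online service.

Phi-3-Mini-128K fits well here. Its relatively small parameter count means faster inference and a lower memory footprint. The 128K context length is enough to hold the content of a multi-turn conversation and keep the exchange coherent. You can think of it as an "expert assistant" that reacts quickly and has a good memory but a fairly focused range of knowledge, which matches a vertical scenario like enterprise knowledge Q&A.

Combining the two means using SpringBoot to build a solid, extensible service framework and embedding Phi-3-Mini-128K as the core inference engine. SpringBoot handles network requests, business logic and persistence, while the model concentrates on "thinking" and "answering".

2. Project Setup and Core Dependencies

Getting started is usually the hard part, but SpringBoot makes it easy. Let's create a basic SpringBoot project first. If you use IDEA you can create it directly through Spring Initializr; the command line works fine too.

We assume Maven for dependency management. Besides the basic SpringBoot dependencies, pom.xml needs a few key libraries for talking to the AI model, plus some utilities.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>phi3-qa-assistant</artifactId>
    <version>1.0.0</version>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.1.5</version> <!-- a recent stable version -->
    </parent>

    <properties>
        <java.version>17</java.version>
    </properties>

    <dependencies>
        <!-- SpringBoot Web -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!-- Parameter validation -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-validation</artifactId>
        </dependency>
        <!-- We assume the model service is called over an HTTP API (for example Ollama) -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-webflux</artifactId>
        </dependency>
        <!-- JSON processing -->
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
        </dependency>
        <!-- Fault tolerance (circuit breaker, retry) for model service calls.
             SpringBoot 3.x needs the spring-boot3 starter, not spring-boot2. -->
        <dependency>
            <groupId>io.github.resilience4j</groupId>
            <artifactId>resilience4j-spring-boot3</artifactId>
            <version>2.1.0</version>
        </dependency>
        <!-- Lets the Resilience4j annotations decorate Mono/Flux return types -->
        <dependency>
            <groupId>io.github.resilience4j</groupId>
            <artifactId>resilience4j-reactor</artifactId>
            <version>2.1.0</version>
        </dependency>
        <!-- The Resilience4j annotations are implemented with Spring AOP -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-aop</artifactId>
        </dependency>
        <!-- In-memory cache used in section 5.3 -->
        <dependency>
            <groupId>com.github.ben-manes.caffeine</groupId>
            <artifactId>caffeine</artifactId>
        </dependency>
        <!-- Utilities -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>
```

A few key points:

spring-boot-starter-webflux: we don't have to load the model into the Java process. The more common approach is to run the model in a separate service (deployed with Ollama or vLLM, for example) and have the SpringBoot application call it over an HTTP API. WebFlux provides WebClient, a reactive, non-blocking HTTP client that holds up better under load.

resilience4j: this one is worth having. When the model service occasionally misbehaves or slows down, it provides circuit breaking, retries and rate limiting so that our Q&A service is not dragged down with it. Because the project is on SpringBoot 3 and returns Mono, the spring-boot3 starter, resilience4j-reactor and spring-boot-starter-aop go together.

lombok: trims boilerplate such as getters and setters.

Next we need a configuration file, application.yml, to manage the model service address, timeout and similar parameters.

```yaml
server:
  port: 8080

phi3:
  model:
    # Assumes Phi-3-Mini-128K is served by Ollama on local port 8700
    # (Ollama's default port is 11434; adjust to your deployment)
    base-url: http://localhost:8700
    # Name of the model in Ollama
    model-name: phi3:mini-128k
    # Call timeout in milliseconds
    timeout: 30000
    # Maximum context length (in tokens), used for truncation on the service side
    max-context-length: 120000
  # Thread pool settings for concurrent requests, shown for reference;
  # AsyncConfig in section 5.1 currently hardcodes the same values
  task:
    executor:
      core-pool-size: 10
      max-pool-size: 50
      queue-capacity: 100
      thread-name-prefix: qa-async-
```

2.1 Model Service Connection Layer

With the configuration in place, let's write the first piece of core code: a client responsible for "talking" to the model service. We'll call it ModelServiceClient.

```java
import com.fasterxml.jackson.databind.JsonNode;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.HttpStatusCode;
import org.springframework.http.MediaType;
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

import java.time.Duration;
import java.util.Map;

@Slf4j
@Component
@RequiredArgsConstructor
public class ModelServiceClient {

    private final WebClient.Builder webClientBuilder;

    @Value("${phi3.model.base-url}")
    private String modelBaseUrl;

    @Value("${phi3.model.model-name}")
    private String modelName;

    @Value("${phi3.model.timeout}")
    private long timeoutMillis;

    /**
     * Calls the model's generate API.
     *
     * @param prompt the input prompt text
     * @return the reply generated by the model
     */
    public Mono<String> generateResponse(String prompt) {
        // Request body in the format of Ollama's /api/generate endpoint.
        // Passing a Map lets Jackson do the JSON escaping; hand-built JSON
        // breaks as soon as the prompt contains quotes or newlines.
        Map<String, Object> requestBody = Map.of(
                "model", modelName,
                "prompt", prompt,
                "stream", false);

        WebClient client = webClientBuilder.baseUrl(modelBaseUrl).build();

        return client.post()
                .uri("/api/generate")
                .contentType(MediaType.APPLICATION_JSON)
                .bodyValue(requestBody)
                .retrieve()
                .onStatus(HttpStatusCode::isError, response -> {
                    log.error("Model service call failed, status code: {}", response.statusCode());
                    return Mono.error(new RuntimeException("Model service error"));
                })
                .bodyToMono(JsonNode.class)
                .timeout(Duration.ofMillis(timeoutMillis))
                .map(jsonNode -> {
                    // Parse the JSON returned by Ollama and read the "response" field
                    if (jsonNode.has("response")) {
                        return jsonNode.get("response").asText();
                    } else {
                        log.warn("Unexpected model response format: {}", jsonNode);
                        return "Sorry, I can't handle this question right now.";
                    }
                })
                .doOnError(e -> log.error("Error while calling the model service", e));
    }
}
```

What does this code do? It uses WebClient to send an HTTP POST to the model service. The request body tells the model "please generate a reply for this prompt"; the client waits for the result and then pulls the generated text out of the returned JSON. @Slf4j and log are there for logging, which helps when something goes wrong.

3. Designing the RESTful API for Q&A

Now that the backend can talk to the model, it needs a "window" so the outside world can reach it. That is the job of the RESTful API. We design two core endpoints: one for single-shot Q&A and one for continuous conversation with context.

First, define the request and response data structures. This keeps the interface consistent.

```java
// ChatRequest.java
import jakarta.validation.constraints.NotBlank;
import lombok.Data;

@Data
public class ChatRequest {

    @NotBlank(message = "User message must not be empty")
    private String message;

    // Session ID used to link multi-turn conversations.
    // If empty, the message is handled as a fresh conversation with no history.
    private String sessionId;
}

// ApiResponse.java
import lombok.Data;

@Data
public class ApiResponse<T> {

    private int code;
    private String msg;
    private T data;

    public static <T> ApiResponse<T> success(T data) {
        ApiResponse<T> response = new ApiResponse<>();
        response.setCode(200);
        response.setMsg("success");
        response.setData(data);
        return response;
    }

    // Generic so that an error can be returned wherever ApiResponse<String> is expected
    public static <T> ApiResponse<T> error(int code, String msg) {
        ApiResponse<T> response = new ApiResponse<>();
        response.setCode(code);
        response.setMsg(msg);
        return response;
    }
}
```

Then create the controller. This is the entry point for HTTP requests in SpringBoot.

```java
import jakarta.validation.Valid;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import reactor.core.publisher.Mono;

@Slf4j
@RestController
@RequestMapping("/api/v1/chat")
@RequiredArgsConstructor
public class ChatController {

    private final ChatService chatService;

    /**
     * Single-shot Q&A endpoint
     */
    @PostMapping("/single")
    public Mono<ResponseEntity<ApiResponse<String>>> singleChat(@Valid @RequestBody ChatRequest request) {
        log.info("Received single Q&A request, message: {}", request.getMessage());
        return chatService.handleSingleMessage(request.getMessage())
                .map(response -> ResponseEntity.ok(ApiResponse.success(response)))
                .onErrorResume(e -> {
                    log.error("Failed to handle single Q&A request", e);
                    return Mono.just(ResponseEntity
                            .internalServerError()
                            .body(ApiResponse.<String>error(500, "Internal server error")));
                });
    }

    /**
     * Continuous conversation endpoint with context
     */
    @PostMapping("/with-context")
    public Mono<ResponseEntity<ApiResponse<String>>> chatWithContext(@Valid @RequestBody ChatRequest request) {
        log.info("Received chat request, session ID: {}, message: {}",
                request.getSessionId(), request.getMessage());
        return chatService.handleMessageWithContext(request.getMessage(), request.getSessionId())
                .map(response -> ResponseEntity.ok(ApiResponse.success(response)))
                .onErrorResume(e -> {
                    log.error("Failed to handle chat request", e);
                    return Mono.just(ResponseEntity
                            .internalServerError()
                            .body(ApiResponse.<String>error(500, "Internal server error")));
                });
    }
}
```

The controller is deliberately thin. Its job is to receive the request, validate the parameters, hand the work to the real business class ChatService, and wrap the result in a uniform format for the frontend. Responsibilities stay clear this way: the controller only does "reception", and the business logic lives in the service.

4. Context Memory and Conversation Management

Single-shot Q&A is simple: pass the user's question straight to the model. A real conversation, though, needs "memory", meaning the assistant has to know what the user said earlier. That is context management.

With Phi-3-Mini-128K we can concatenate the conversation history into one long text and make it part of the new prompt. We still have to manage that history: it cannot grow without bound (the 128K context is the limit), and different users' conversations have to be kept apart by session ID.

Here is a simple ConversationManager. Building the prompt and recording a finished exchange are two separate methods, so the user message is stored once per turn and never ends up in the prompt twice.

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

@Component
public class ConversationManager {

    // Conversation history: key is the sessionId, value is the transcript so far
    private final Map<String, StringBuilder> sessionHistories = new ConcurrentHashMap<>();
    private final ReentrantLock lock = new ReentrantLock();

    @Value("${phi3.model.max-context-length:120000}")
    private int maxContextLength;

    /**
     * Builds the full prompt for a new user message: stored history, the new
     * question, then the "Assistant:" cue. Does not modify the history.
     */
    public String buildPrompt(String sessionId, String userMessage) {
        String currentTurn = "User: " + userMessage + "\nAssistant:";
        if (sessionId == null || sessionId.isBlank()) {
            // No sessionId: treat as a brand-new conversation, return only the current question
            return currentTurn;
        }
        lock.lock();
        try {
            StringBuilder history = sessionHistories.get(sessionId);
            return (history == null || history.length() == 0)
                    ? currentTurn
                    : history + "\n" + currentTurn;
        } finally {
            lock.unlock();
        }
    }

    /**
     * Appends one completed user/assistant exchange to the session history.
     */
    public void appendTurn(String sessionId, String userMessage, String assistantMessage) {
        if (sessionId == null || sessionId.isBlank()) {
            return;
        }
        String newTurn = "User: " + userMessage + "\nAssistant: " + assistantMessage;
        lock.lock();
        try {
            StringBuilder history = sessionHistories.computeIfAbsent(sessionId, k -> new StringBuilder());
            // Crude length control: if the new turn would push the history over the limit,
            // drop the oldest characters. This counts characters, not tokens; a better
            // strategy keeps a list of turns and removes whole turns from the front.
            int overflow = history.length() + 1 + newTurn.length() - maxContextLength;
            if (overflow > 0) {
                if (overflow < history.length()) {
                    history.delete(0, overflow);
                } else {
                    history.setLength(0); // the history alone is too long: clear it
                }
            }
            if (history.length() > 0) {
                history.append('\n');
            }
            history.append(newTurn);
        } finally {
            lock.unlock();
        }
    }

    /**
     * Clears the history of a session
     */
    public void clearHistory(String sessionId) {
        sessionHistories.remove(sessionId);
    }
}
```

The manager does a few things:

It stores the history of each session in a ConcurrentHashMap, which keeps the map itself thread-safe.
After each exchange, it appends the new "user question / assistant answer" pair to the end of the history.
It takes a ReentrantLock so that simultaneous changes to a session's history do not corrupt it.
It has a simple length control: when the history is about to exceed the model's limit, it discards some of the oldest content. This is a very basic implementation. In production you will probably want something more precise, such as removing whole turns or counting actual tokens.

Now ChatService can use the manager to handle conversations with context.

```java
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;

@Slf4j
@Service
@RequiredArgsConstructor
public class ChatService {

    private final ModelServiceClient modelClient;
    private final ConversationManager conversationManager;

    public Mono<String> handleSingleMessage(String userMessage) {
        // Single-shot Q&A: call the model directly
        String prompt = "User: " + userMessage + "\nAssistant:";
        return modelClient.generateResponse(prompt);
    }

    public Mono<String> handleMessageWithContext(String userMessage, String sessionId) {
        // 1. Build the prompt from the stored history plus the new user message
        return Mono.fromCallable(() -> conversationManager.buildPrompt(sessionId, userMessage))
                // 2. Call the model
                .flatMap(modelClient::generateResponse)
                // 3. Record the completed exchange in the session history.
                // Note: concurrent requests on the same sessionId can still interleave
                // between steps 1 and 3. A stricter design makes the three steps atomic
                // per session or uses a more elaborate conversation state manager.
                // Kept simple here for clarity.
                .doOnNext(reply -> conversationManager.appendTurn(sessionId, userMessage, reply));
    }
}
```

5. Performance and Stability Under High Concurrency

Picture the customer service system at peak time, with hundreds or thousands of users asking questions at once. The service must not fall over under load, so some optimization is needed.

5.1 Asynchronous Processing and Thread Pools

If every request waits synchronously for the model's reply, and inference can take a few seconds, the threads fill up quickly and new requests cannot get in. We need asynchronous processing.

SpringBoot provides the @Async annotation, which makes asynchronous methods easy. First we configure a thread pool.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

import java.util.concurrent.Executor;

@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean(name = "qaTaskExecutor")
    public Executor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        // Core pool size: threads kept alive even when idle
        executor.setCorePoolSize(10);
        // Max pool size: upper bound on threads created once the queue is full
        executor.setMaxPoolSize(50);
        // Queue capacity: tasks beyond the core threads wait here
        executor.setQueueCapacity(100);
        executor.setThreadNamePrefix("qa-async-");
        executor.initialize();
        return executor;
    }
}
```

We could then modify ChatService so that the slow model call runs on an async thread and the web container thread is released quickly.

```java
// Requires: org.springframework.scheduling.annotation.Async
//           java.util.concurrent.CompletableFuture
@Async("qaTaskExecutor")
public CompletableFuture<String> handleMessageWithContextAsync(String userMessage, String sessionId) {
    // Convert the original Mono to a CompletableFuture to fit @Async
    return handleMessageWithContext(userMessage, sessionId).toFuture();
}
```

The controller would also need a small change to call this async method and return a DeferredResult, or to use the reactive Mono type from WebFlux. Since we already use Mono, which is non-blocking by nature, the cleaner approach is to keep the service layer returning Mono and let the controller return it directly. WebClient is non-blocking as well. So the ChatService written above already handles concurrency reasonably well and does not need @Async on top. The thread pool is more useful for other blocking operations inside the service.

5.2 Circuit Breaking, Fallback and Retry

The model service may be unstable. We can wrap the model call in a protective shell with Resilience4j. First add the configuration to application.yml:

```yaml
resilience4j.circuitbreaker:
  instances:
    modelService:
      sliding-window-size: 10
      failure-rate-threshold: 50
      wait-duration-in-open-state: 10s
      permitted-number-of-calls-in-half-open-state: 3
      automatic-transition-from-open-to-half-open-enabled: true

resilience4j.retry:
  instances:
    modelService:
      max-attempts: 3
      wait-duration: 1s
```

Then add the annotations to the call method in ModelServiceClient:

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;

@Slf4j
@Component
public class ModelServiceClient {
    // ...

    // With Resilience4j's default aspect order, Retry wraps CircuitBreaker, so the
    // fallback is attached to the outer @Retry. A fallback on @CircuitBreaker would
    // swallow the error before a retry could happen.
    @CircuitBreaker(name = "modelService")
    @Retry(name = "modelService", fallbackMethod = "generateResponseFallback")
    public Mono<String> generateResponse(String prompt) {
        // ... the original WebClient call logic
    }

    // Fallback: when the model service is unavailable, return a friendly default reply
    public Mono<String> generateResponseFallback(String prompt, Throwable t) {
        log.warn("Model service unavailable, using fallback reply. Question: {}", prompt, t);
        return Mono.just("Hello, the assistant is taking a break right now. Please try again later.");
    }
}
```

When the failure rate of model calls reaches the threshold, the circuit breaker opens and requests go straight to the fallback, which keeps requests from piling up and causing a cascade failure. Failed calls are also retried a few times first, which raises the success rate of individual requests.

5.3 Caching Hot Questions

In a customer service system many users ask similar questions, such as "How do I return an item?". We can cache the answers to common questions and return them directly without bothering the model at all, which is very fast.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.stereotype.Component;

import java.util.concurrent.TimeUnit;

@Component
public class QaCache {

    // Caffeine cache: key is the question, value is the answer
    private final Cache<String, String> cache = Caffeine.newBuilder()
            .maximumSize(1000)                       // cache at most 1000 questions
            .expireAfterWrite(10, TimeUnit.MINUTES)  // expire 10 minutes after write
            .build();

    public String get(String question) {
        return cache.getIfPresent(question);
    }

    public void put(String question, String answer) {
        cache.put(question, answer);
    }
}
```

In ChatService, check the cache before handling a question (this assumes a `private final QaCache qaCache;` field has been added to the class):

```java
public Mono<String> handleSingleMessage(String userMessage) {
    // Check the cache first
    String cachedAnswer = qaCache.get(userMessage.trim());
    if (cachedAnswer != null) {
        log.info("Cache hit: {}", userMessage);
        return Mono.just(cachedAnswer);
    }

    String prompt = "User: " + userMessage + "\nAssistant:";
    return modelClient.generateResponse(prompt)
            // Put the answer into the cache
            .doOnNext(answer -> qaCache.put(userMessage.trim(), answer));
}
```

One caveat: as written, fallback and "can't handle this" replies are cached as well, so in practice you would skip caching those.

6. Connecting to Business Systems in Practice

A Q&A assistant does not live in isolation. To answer accurately it needs information from the company's knowledge base, CRM and order system. If a user asks "Where is my order 123456?", the assistant has to query the order system.

6.1 Designing a "Skill" Router

We can modularize the assistant's capabilities. Start with a Skill interface.

```java
import reactor.core.publisher.Mono;

public interface Skill {

    /**
     * Whether this skill can handle the current question
     */
    boolean canHandle(String userMessage);

    /**
     * Executes the skill and returns the result
     */
    Mono<String> handle(String userMessage, String sessionId);
}
```

Then implement a few concrete skills. Each one carries an @Order so that the injected list checks specific skills first and the catch-all last.

General Q&A skill: calls the Phi-3 model to handle general questions.

```java
import org.springframework.core.Ordered;
import org.springframework.core.annotation.Order;

@Component
@Order(Ordered.LOWEST_PRECEDENCE) // default skill: must be checked last, since it accepts everything
@RequiredArgsConstructor
public class GeneralQaSkill implements Skill {

    private final ModelServiceClient modelClient;

    @Override
    public boolean canHandle(String userMessage) {
        // The default skill always returns true, as the catch-all
        return true;
    }

    @Override
    public Mono<String> handle(String userMessage, String sessionId) {
        // ... call the model
    }
}
```

Order query skill: matches a specific intent and calls the order system API.

```java
@Component
@Order(1)
@RequiredArgsConstructor
public class OrderQuerySkill implements Skill {

    private final OrderServiceClient orderServiceClient; // hypothetical order service client

    @Override
    public boolean canHandle(String userMessage) {
        // Simple keyword matching; real systems can use proper NLP intent recognition
        return userMessage.contains("order")
                && (userMessage.contains("where") || userMessage.contains("status"));
    }

    @Override
    public Mono<String> handle(String userMessage, String sessionId) {
        // 1. Extract the order number from the message (simplified here; a regex works in practice)
        // 2. Call orderServiceClient to query the order status
        // 3. Turn the result into a natural-language reply
        return orderServiceClient.getOrderStatus("123456") // hardcoded placeholder order number
                .map(status -> "Your order's current status is: " + status);
    }
}
```

Knowledge base skill: searches a vector knowledge base for relevant documents first, then gives them to the model as context.

```java
@Component
@Order(2)
@RequiredArgsConstructor
public class KnowledgeBaseSkill implements Skill {

    private final KnowledgeBaseService kbService; // knowledge base service
    private final ModelServiceClient modelClient;

    @Override
    public boolean canHandle(String userMessage) {
        // Decide whether this is a product or policy question
        return userMessage.contains("how") || userMessage.contains("policy");
    }

    @Override
    public Mono<String> handle(String userMessage, String sessionId) {
        // 1. Search the knowledge base for relevant document snippets
        return kbService.searchRelevantDocs(userMessage, 3)
                .flatMap(docs -> {
                    // 2. Use the documents as context to build a more precise prompt
                    String context = "Please answer the question using the following information:\n"
                            + String.join("\n", docs);
                    String enhancedPrompt = context + "\n\nUser: " + userMessage + "\nAssistant:";
                    // 3. Call the model
                    return modelClient.generateResponse(enhancedPrompt);
                });
    }
}
```

Finally, for use from ChatService, we create a SkillRouter to manage and select skills:

```java
import java.util.List;

@Component
@RequiredArgsConstructor
public class SkillRouter {

    // Spring injects all Skill implementations, sorted by their @Order values
    private final List<Skill> skills;

    public Skill route(String userMessage) {
        // Check in order; the first skill that can handle the message wins
        for (Skill skill : skills) {
            if (skill.canHandle(userMessage)) {
                return skill;
            }
        }
        // GeneralQaSkill should always be there as the catch-all
        return skills.stream().filter(s -> s instanceof GeneralQaSkill).findFirst().orElseThrow();
    }
}
```

With this in place the assistant becomes "smarter". It recognizes the user's intent: order lookups go straight to the business system, product questions go to the knowledge base first and then to the model for a summary, and only the remaining questions are left to the model's own judgment. The system as a whole becomes noticeably more useful and more accurate.

7. Summary

After walking through the whole process, we have a first version of an intelligent Q&A assistant built on SpringBoot and Phi-3-Mini-128K. We went beyond a bare model call and did real work around enterprise requirements: a clean API, context-aware conversation management, performance and stability under high concurrency, and a "skill" architecture for integrating with business systems.

In practice this setup ran steadily in the test environment at my friend's company. For common customer service questions, response time was mostly within 2 to 3 seconds, and accuracy was higher than expected. Once the order query skill was connected, it took over a good share of the repetitive work from human agents.

There is still plenty to dig into and optimize. Context management can be finer-grained (trimming by token count), intent recognition in the skill router can move to a dedicated NLU model, and the caching strategy can be smarter about hot versus cold data. Either way, we now have a solid, extensible framework, and later optimizations can be made step by step inside it.

If you are thinking about adding AI capabilities to your Java application, especially where fast responses and integration with enterprise data matter, this combination is worth a try. Start with one small feature, such as a simple Q&A endpoint, then layer on context, skills and caching. Making a traditional application "smart" turns out to be less difficult than it looks.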
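The optimization notes mention trimming conversation context more carefully, by whole turns rather than by raw characters. A minimal sketch of that idea follows. It is an illustration only and not part of the article's code: the class name TurnBasedHistory and its methods are hypothetical, the "User:" / "Assistant:" labels are an assumed prompt format, and the budget is counted in characters as a rough stand-in for tokens.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper: stores a conversation as whole turns and evicts the
 * oldest turns first, so the prompt never starts in the middle of a message.
 */
final class TurnBasedHistory {

    // Each element is one rendered exchange: "User: ...\nAssistant: ...\n"
    private final Deque<String> turns = new ArrayDeque<>();
    private final int maxChars;   // character budget (assumption: approximates a token budget)
    private int usedChars = 0;

    TurnBasedHistory(int maxChars) {
        this.maxChars = maxChars;
    }

    /**
     * Records a completed exchange, then drops the oldest turns until the
     * budget is met. A single turn larger than the budget is dropped too.
     */
    synchronized void addTurn(String user, String assistant) {
        String rendered = "User: " + user + "\nAssistant: " + assistant + "\n";
        turns.addLast(rendered);
        usedChars += rendered.length();
        while (usedChars > maxChars && !turns.isEmpty()) {
            usedChars -= turns.removeFirst().length();
        }
    }

    /** Builds the prompt: retained turns, the new question, then the reply cue. */
    synchronized String buildPrompt(String userMessage) {
        StringBuilder sb = new StringBuilder();
        for (String turn : turns) {
            sb.append(turn);
        }
        return sb.append("User: ").append(userMessage).append("\nAssistant:").toString();
    }

    /** Number of turns currently retained. */
    synchronized int size() {
        return turns.size();
    }
}
```

One instance per session could sit behind a map keyed by session ID. Swapping the character count for a real tokenizer would only change how each turn's cost is computed; the eviction loop stays the same.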
