Speeding up agentic workflows with WebSockets in the Responses API

news · 2026/5/5 7:04:43
https://openai.com/index/speeding-up-agentic-workflows-with-websockets/

When you ask Codex to fix a bug, it scans through your codebase for relevant files, reads them to build context, makes edits, and runs tests to verify the fix worked. Under the hood, that means dozens of back-and-forth Responses API requests: determine the model’s next action, run a tool on your computer, send the tool output back to the API, and repeat.

All of these requests can add up to minutes that users spend waiting for Codex to complete complex tasks. From a latency perspective, the Codex agent loop spends most of its time in three main stages: working in the API services (to validate and process requests), model inference, and client-side time (running tools and building model context). Inference is the stage where the model runs on GPUs to generate new tokens. In the past, running LLM inference on GPUs was the slowest part of the agentic loop, so API service overhead was easy to hide. As inference gets faster, the cumulative API overhead from an agentic rollout becomes much more notable.

In this post, we’ll explain how we made agent loops using the API 40% faster end-to-end, letting users experience the jump in inference speed from 65 to nearly 1,000 tokens per second.
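The agent loop described above can be sketched in a few lines of Python. The model call and tool execution are stubbed out here; `responses_create` and `run_tool` are illustrative names, not the actual Codex internals or SDK functions.

```python
# Illustrative sketch of a Codex-style agent loop over the Responses API.
# Each iteration of the while loop corresponds to one full API round trip,
# so every iteration repays the API's fixed per-request overhead.

def responses_create(input_items):
    """Stub standing in for one synchronous Responses API request."""
    # A real request would be validated, rendered, and run through inference.
    if any(item.get("type") == "tool_output" for item in input_items):
        return {"type": "message", "content": "bug fixed"}
    return {"type": "tool_call", "name": "run_tests", "args": {}}

def run_tool(call):
    """Stub standing in for running a tool on the user's machine."""
    return {"type": "tool_output", "output": f"{call['name']} passed"}

def agent_loop(task):
    context = [{"type": "message", "content": task}]
    while True:
        action = responses_create(context)   # one full API round trip
        if action["type"] != "tool_call":
            return action["content"]         # model decided it is done
        context.append(action)
        context.append(run_tool(action))     # client-side tool time

print(agent_loop("fix the failing test"))
```

With real requests, each trip through the loop also re-sends and re-processes the whole `context`, which is the structural cost the rest of the post addresses.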
We approached this through caching, eliminating unnecessary network hops, improving our safety stack to quickly flag issues, and—most importantly—building a way to create a persistent connection to the Responses API, instead of having to make a series of synchronous API calls.

When the API became the bottleneck

In the Responses API, previous flagship models like GPT‑5 and GPT‑5.2 ran at roughly 65 tokens per second (TPS). For the launch of GPT‑5.3‑Codex‑Spark, a fast coding model, our goal was an order of magnitude faster: over 1,000 TPS, enabled by specialized Cerebras hardware optimized for LLM inference.
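A back-of-the-envelope calculation shows why faster inference exposes API overhead. The 300 ms per-request overhead and 500-token response length below are illustrative assumptions, not figures from the post; only the 65 and 1,000 TPS numbers come from the text.

```python
# Share of one request spent in API overhead, before and after fast inference.
# overhead_s (fixed CPU-side work per request) and tokens are assumed values.

def overhead_share(tokens, tps, overhead_s):
    inference_s = tokens / tps            # time the GPUs spend generating
    return overhead_s / (inference_s + overhead_s)

tokens, overhead_s = 500, 0.3
slow = overhead_share(tokens, 65, overhead_s)     # GPT-5-class speeds
fast = overhead_share(tokens, 1000, overhead_s)   # Codex-Spark-class speeds

print(f"at 65 TPS, overhead is {slow:.0%} of the request")
print(f"at 1,000 TPS, overhead is {fast:.0%} of the request")
```

Under these assumptions the same fixed overhead goes from a few percent of each request to more than a third of it, which is why the CPU-side API work had to shrink before the faster model could feel fast.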
To make sure users could experience the true speed of this new model, we had to reduce API overhead. Around November of 2025, we launched a performance sprint on the Responses API, landing many optimizations to the critical-path latency for a single request:

- Caching rendered tokens and model configuration in memory to skip expensive tokenization and network calls for multi-turn responses
- Reducing network hop latency by eliminating calls to intermediate services (for example, image processing resolution) and directly calling the inference service itself
- Improving our safety stack so we could run certain classifiers to flag conversations faster

With these improvements, we saw close to a 45% improvement in time to first token (TTFT)—which reflects how responsive the API feels—but that was still not fast enough for GPT‑5.3‑Codex‑Spark. Responses API overhead remained too large relative to the speed of the model—that is, users had to wait for the CPUs running our API before they could use the GPUs serving the model.

The deeper issue was structural: we treated each Codex request as independent, processing conversation state and other reusable context in every follow-up request. Even when most of the conversation hadn’t changed, we still paid for work tied to the full history.
As conversations got longer, that repeated processing became more expensive.

Building a persistent connection

To tighten up the design, we rethought the transport protocol: could we keep a persistent connection and cache state, rather than establishing a new connection over HTTP and sending the full conversation history for each follow-up request? The idea was to send only the new information requiring validation and processing, and to cache reusable state in memory for the lifetime of the connection. This would reduce overhead from redundant work.

We considered a few different approaches, including WebSockets and gRPC bidirectional streaming. We landed on WebSockets because, as a simple message transport protocol, users wouldn’t have to change their Responses API input and output shapes. It was developer-friendly and fit our existing architecture with little disruption.

The first WebSocket prototype changed what we thought was possible for Responses API latency. An engineer on the Codex team with deep expertise across the API stack pulled together a prototype by running a Codex agent overnight.

In that prototype, agentic rollouts were modeled as a single long-running Response. Using asyncio features, the Responses API would asynchronously block in the sampling loop after a tool call was sampled, and the Responses API would send a response.done event back to the client.
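The blocking handoff in that prototype can be simulated end to end with asyncio queues standing in for the two directions of the WebSocket. The response.done and response.append event names come from the post; the terminal response.completed event and all payload shapes here are simplified illustrations, not the actual wire format.

```python
import asyncio

# Sketch of the prototype flow: the server's sampling loop blocks on a queue
# whenever the model emits a tool call, and resumes when the client appends
# the tool result. Preinference work would happen once, before the loop.

async def server(to_client, to_server):
    for step in range(2):  # pretend the model samples two tool calls
        await to_client.put({"type": "response.done", "tool_call": f"tool_{step}"})
        await to_server.get()  # block until the client appends the result
    await to_client.put({"type": "response.completed", "output": "all tools ran"})

async def client(to_client, to_server):
    while True:
        event = await to_client.get()
        if event["type"] == "response.completed":
            return event["output"]
        # Run the tool locally, then unblock the server's sampling loop.
        result = f"ran {event['tool_call']}"
        await to_server.put({"type": "response.append", "tool_result": result})

async def main():
    to_client, to_server = asyncio.Queue(), asyncio.Queue()
    server_task = asyncio.create_task(server(to_client, to_server))
    output = await client(to_client, to_server)
    await server_task
    return output

print(asyncio.run(main()))
```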
After executing the tool call, clients would send back a response.append event with the tool result, which unblocked the sampling loop and let the model continue.

An analogy here is treating the local tool call as a hosted tool call. When the model calls web search, the inference loop blocks, calls a web search service, and puts the service response in the model context. In our design, we did the same thing; but instead of calling a remote service, we sent the model’s tool call back to the client over the WebSocket. When the client responded, we put the client’s tool call response into the context and continued to sample.

This design was extremely effective because it eliminated repeated API work across an agent rollout. We could do preinference work once, pause for tool execution, and do postinference work once at the end.

Unfortunately, this came at the cost of a less familiar and more complicated API shape. We wanted developers to be able to drop in WebSocket support without having to rewrite their API integration around a new interaction mode.

Keeping the API familiar while making the stack incremental

For the version we launched, we switched back to a familiar shape: keep using response.create with the same body, and use previous_response_id to continue the conversation context from the previous response’s state. On a WebSocket connection, the server keeps a connection-scoped, in-memory cache of previous response state.
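A toy version of such a connection-scoped cache can be sketched as a dict keyed by response id, where each follow-up appends only its new input on top of the cached state. The field names, id scheme, and token-rendering step here are illustrative, not the server’s actual data structures.

```python
# Sketch of a connection-scoped response-state cache. On a follow-up
# request carrying previous_response_id, cached items and rendered tokens
# are reused instead of rebuilding the whole conversation.

class ConnectionState:
    def __init__(self):
        self.cache = {}    # response_id -> state; lives as long as the socket
        self.rebuilds = 0  # counts fallbacks to full reconstruction

    def handle_create(self, request):
        cached = self.cache.get(request.get("previous_response_id"))
        if cached is None:
            self.rebuilds += 1  # cache miss: rebuild from scratch
            cached = {"items": [], "rendered_tokens": []}
        new_tokens = [f"<tok:{item['content']}>" for item in request["input"]]
        state = {  # only the new input is validated, rendered, and appended
            "items": cached["items"] + request["input"],
            "rendered_tokens": cached["rendered_tokens"] + new_tokens,
        }
        response_id = f"resp_{len(self.cache)}"
        self.cache[response_id] = state
        return response_id

conn = ConnectionState()
r1 = conn.handle_create({"input": [{"content": "fix the bug"}]})
r2 = conn.handle_create({"input": [{"content": "tests pass"}],
                         "previous_response_id": r1})
print(conn.rebuilds, len(conn.cache[r2]["items"]))  # one rebuild, two items
```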
When a follow-up response.create includes previous_response_id, we fetch that state from the cache instead of rebuilding the full conversation from scratch.

That cached state includes:

- The previous response object
- Prior input and output items
- Tool definitions and namespaces
- Reusable sampling artifacts, like previously rendered tokens

By reusing the in-memory previous response state, we were able to land several major optimizations:

- Making some of our safety classifiers and request validators process only new input, not the full history every time
- Keeping an in-memory cache of rendered tokens that we append to so we can skip unnecessary tokenization
- Reusing our successful model resolution/routing logic across requests
- Overlapping non-blocking postinference work like billing with subsequent requests

The goal was to get as close as possible to the minimal-overhead prototype but with an API shape developers already understood and built around.

Setting a new bar for speed

After a two-month sprint building WebSocket mode, we launched an alpha with key coding agent startups so they could integrate it into their infrastructure and safely ramp up traffic. Alpha users loved it, reporting up to 40% improvements in their agentic workflows. Given the positive alpha feedback, we were ready to launch.

The launch results were immediate. Codex quickly ramped up the majority of their Responses API traffic onto WebSocket mode, seeing significant latency improvements.
For GPT‑5.3‑Codex‑Spark, we hit our 1,000 TPS target and saw bursts up to 4,000 TPS, showing that the Responses API could keep up with much faster inference in real production traffic. The impact showed up quickly in the developer community too:

- Codex quickly ramped the majority of their traffic onto WebSockets. Codex users running the latest models such as GPT‑5.3‑Codex, GPT‑5.4, and beyond all benefit from WebSocket mode’s speedup.
- Vercel integrated WebSocket mode into the AI SDK and saw latency decrease by up to 40%.
- Cline’s multi-file workflows are 39% faster.
- OpenAI models in Cursor became up to 30% faster.

WebSocket mode is one of the most significant new capabilities in the Responses API since its launch in March 2025. We went from idea to running in production in just a few weeks through close collaboration between OpenAI’s API and Codex teams. It not only dramatically improves agent rollout latency but also supports a growing need for builders: as model inference gets faster, the services and systems that surround inference also need to speed up to transfer these gains to users.
