Speeding up agentic workflows with WebSockets in the Responses API
https://openai.com/index/speeding-up-agentic-workflows-with-websockets/

When you ask Codex to fix a bug, it scans through your codebase for relevant files, reads them to build context, makes edits, and runs tests to verify the fix worked. Under the hood, that means dozens of back-and-forth Responses API requests: determine the model's next action, run a tool on your computer, send the tool output back to the API, and repeat.

All of these requests can add up to minutes that users spend waiting for Codex to complete complex tasks. From a latency perspective, the Codex agent loop spends most of its time in three main stages: working in the API services (to validate and process requests), model inference, and client-side time (running tools and building model context). Inference is the stage where the model runs on GPUs to generate new tokens. In the past, running LLM inference on GPUs was the slowest part of the agentic loop, so API service overhead was easy to hide. As inference gets faster, the cumulative API overhead from an agentic rollout becomes much more noticeable.

In this post, we'll explain how we made agent loops using the API 40% faster end-to-end, letting users experience the jump in inference speed from 65 to nearly 1,000 tokens per second.
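To make the cost concrete, here is a minimal sketch of the request-per-turn agent loop described above. The `fake_responses_api` and `run_tool` functions are stand-ins for the real API and a local tool runner; the point is the shape of the loop, where every model turn is a separate synchronous request carrying the full accumulated conversation.

```python
# Sketch of the agentic loop: one synchronous Responses API call per model
# turn, resending the accumulated conversation each time. The stub "server"
# asks for a tool until it has two results, then returns a final answer.

def fake_responses_api(conversation):
    """Stand-in for the API: decide the model's next action from the history."""
    tool_results = [item for item in conversation if item["type"] == "tool_result"]
    if len(tool_results) < 2:
        return {"type": "tool_call", "name": "read_file",
                "args": {"path": f"f{len(tool_results)}.py"}}
    return {"type": "message", "text": "done: bug fixed"}

def run_tool(call):
    """Stand-in for running a tool locally on the user's machine."""
    return {"type": "tool_result", "output": f"contents of {call['args']['path']}"}

def agent_loop(task):
    conversation = [{"type": "user", "text": task}]
    requests_made = 0
    while True:
        requests_made += 1
        item = fake_responses_api(conversation)   # full history sent every time
        conversation.append(item)
        if item["type"] == "message":             # model is finished
            return item["text"], requests_made
        conversation.append(run_tool(item))       # run tool, send result back

print(agent_loop("fix the bug"))  # → ('done: bug fixed', 3)
```

Every iteration pays the fixed API-service overhead again, which is why that overhead dominates once inference itself gets fast.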
We approached this through caching, eliminating unnecessary network hops, improving our safety stack to quickly flag issues, and, most importantly, building a way to create a persistent connection to the Responses API instead of having to make a series of synchronous API calls.

When the API became the bottleneck

In the Responses API, previous flagship models like GPT-5 and GPT-5.2 ran at roughly 65 tokens per second (TPS). For the launch of GPT-5.3-Codex-Spark, a fast coding model, our goal was an order of magnitude faster: over 1,000 TPS, enabled by specialized Cerebras hardware optimized for LLM inference.
To make sure users could experience the true speed of this new model, we had to reduce API overhead.

Around November of 2025, we launched a performance sprint on the Responses API, landing many optimizations to the critical-path latency for a single request:

- Caching rendered tokens and model configuration in memory to skip expensive tokenization and network calls for multi-turn responses
- Reducing network hop latency by eliminating calls to intermediate services (for example, image processing resolution) and directly calling the inference service itself
- Improving our safety stack so we could run certain classifiers to flag conversations faster

With these improvements, we saw close to a 45% improvement in time to first token (TTFT), which reflects how responsive the API feels, but this was still not fast enough for GPT-5.3-Codex-Spark. Even with these improvements, Responses API overhead was too large relative to the speed of the model: users had to wait for the CPUs running our API before they could use the GPUs serving the model.

The deeper issue was structural: we treated each Codex request as independent, processing conversation state and other reusable context in every follow-up request. Even when most of the conversation hadn't changed, we still paid for work tied to the full history.
As conversations got longer, that repeated processing became more expensive.

Building a persistent connection

To tighten up the design, we rethought the transport protocol: could we keep a persistent connection and cache state, rather than establishing a new connection over HTTP and sending the full conversation history for each follow-up request? The idea was to send only new information requiring validation and processing, and to cache reusable state in memory for the lifetime of the connection. This would reduce overhead from redundant work.

We considered a few different approaches, including WebSockets and gRPC bidirectional streaming. We landed on WebSockets because, as a simple message transport protocol, it meant users wouldn't have to change their Responses API input and output shapes. It was developer-friendly and fit our existing architecture with little disruption.

The first WebSocket prototype changed what we thought was possible for Responses API latency. An engineer on the Codex team with deep expertise across the API stack pulled together a prototype by running a Codex agent overnight.

In that prototype, agentic rollouts were modeled as a single long-running Response. Using asyncio features, the Responses API would asynchronously block in the sampling loop after a tool call was sampled and send a response.done event back to the client.
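A minimal asyncio sketch of this blocking mechanism: the sampling loop parks on a future after emitting a tool call, and the client's tool result resolves it. The response.done and response.append event names come from the post; everything else (the class, the fake sampler, the event log) is illustrative.

```python
import asyncio

# Sketch of the prototype: the sampling loop runs as one long-lived task and
# blocks on a Future whenever a tool call is sampled. The client's later
# "append" resolves the future and lets sampling continue in the same context.

class RolloutSession:
    def __init__(self):
        self.context = []          # model context, grows over the rollout
        self.pending_tool = None   # future awaiting the client's tool result
        self.events = []           # events "sent" to the client

    async def sampling_loop(self):
        # The fake model samples one tool call: block until the client responds.
        self.context.append({"type": "tool_call", "name": "run_tests"})
        self.pending_tool = asyncio.get_running_loop().create_future()
        self.events.append("response.done")   # notify client a tool call is ready
        result = await self.pending_tool      # asynchronously block the loop
        self.context.append(result)
        # With the tool result in context, sampling continues to a final message.
        self.context.append({"type": "message", "text": "all tests pass"})
        return self.context[-1]["text"]

    def append(self, tool_result):
        # Client's response.append: unblock the sampling loop with the result.
        self.events.append("response.append")
        self.pending_tool.set_result(tool_result)

async def demo():
    session = RolloutSession()
    task = asyncio.create_task(session.sampling_loop())
    await asyncio.sleep(0)   # let the loop run until it blocks on the tool call
    session.append({"type": "tool_result", "output": "3 passed"})
    return await task, session.events

print(asyncio.run(demo()))
```

The preinference and postinference work happens once per rollout, outside the loop shown here, rather than once per request.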
After executing the tool call, clients would send back a response.append event with the tool result, which unblocked the sampling loop and let the model continue.

An analogy here is treating the local tool call as a hosted tool call. When the model calls web search, the inference loop blocks, calls a web search service, and puts the service response in the model context. In our design, we did the same thing, but instead of calling a remote service, we sent the model's tool call back to the client over the WebSocket. When the client responded, we put the client's tool call response into the context and continued to sample.

This design was extremely effective because it eliminated repeated API work across an agent rollout. We could do preinference work once, pause for tool execution, and do postinference work once at the end.

Unfortunately, this came at the cost of a less familiar and more complicated API shape. We wanted developers to be able to drop in WebSocket support without having to rewrite their API integration around a new interaction mode.

Keeping the API familiar while making the stack incremental

For the version we launched, we switched back to a familiar shape: keep using response.create with the same body, and use previous_response_id to continue the conversation context from the previous response's state.

On a WebSocket connection, the server keeps a connection-scoped, in-memory cache of previous response state.
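A toy sketch of that connection-scoped cache, not the actual server code: state keyed by response id lives for the life of the socket, so a follow-up that passes previous_response_id pays only for its new input items. The class and field names are illustrative, and "items processed" stands in for real work like validation and tokenization.

```python
# Illustrative connection-scoped response-state cache. Each WebSocket
# connection owns one ConnectionSession; cached state survives between turns.

class ConnectionSession:
    def __init__(self):
        self._cache = {}      # response_id -> accumulated conversation items
        self._next_id = 0

    def create(self, new_items, previous_response_id=None):
        """Handle a response.create message; return (response_id, items_processed)."""
        prior = self._cache.get(previous_response_id, [])
        processed = len(new_items)            # only new input is validated/tokenized
        state = prior + list(new_items) + [{"type": "message", "text": "ok"}]
        self._next_id += 1
        response_id = f"resp_{self._next_id}"
        self._cache[response_id] = state      # kept for the life of the connection
        return response_id, processed

# Two turns over one connection: the follow-up reuses cached state, so it
# processes 1 new item instead of re-processing the whole history.
session = ConnectionSession()
rid1, work1 = session.create([{"type": "user", "text": "fix the bug"}])
rid2, work2 = session.create([{"type": "tool_result", "output": "3 passed"}],
                             previous_response_id=rid1)
print(work1, work2, len(session._cache[rid2]))  # → 1 1 4
```

Over plain HTTP the second turn would resend and reprocess all four items; here it reprocesses one.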
When a follow-up response.create includes previous_response_id, we fetch that state from the cache instead of rebuilding the full conversation from scratch.

That cached state includes:

- The previous response object
- Prior input and output items
- Tool definitions and namespaces
- Reusable sampling artifacts, like previously rendered tokens

By reusing the in-memory previous response state, we were able to land several major optimizations:

- Making some of our safety classifiers and request validators process only new input, not the full history every time
- Keeping an in-memory cache of rendered tokens that we append to, so we can skip unnecessary tokenization
- Reusing model resolution/routing logic across requests
- Overlapping non-blocking postinference work, like billing, with subsequent requests

The goal was to get as close as possible to the minimal-overhead prototype, but with an API shape developers already understood and built around.

Setting a new bar for speed

After a two-month sprint building WebSocket mode, we launched an alpha with key coding agent startups so they could integrate it into their infrastructure and safely ramp up traffic. Alpha users loved it, reporting up to 40% improvements in their agentic workflows. Given the positive alpha feedback, we were ready to launch.

The launch results were immediate. Codex quickly ramped up the majority of their Responses API traffic onto WebSocket mode, seeing significant latency improvements.
For GPT-5.3-Codex-Spark, we hit our 1,000 TPS target and saw bursts up to 4,000 TPS, showing that the Responses API could keep up with much faster inference in real production traffic. The impact showed up quickly in the developer community too:

- Codex quickly ramped the majority of their traffic onto WebSockets. Codex users running the latest models, such as GPT-5.3-Codex, GPT-5.4, and beyond, all benefit from WebSocket mode's speedup.
- Vercel integrated WebSocket mode into the AI SDK and saw latency decrease by up to 40%.
- Cline's multi-file workflows are 39% faster.
- OpenAI models in Cursor became up to 30% faster.

WebSocket mode is one of the most significant new capabilities in the Responses API since its launch in March 2025. We went from idea to running in production in just a few weeks through close collaboration between OpenAI's API and Codex teams. It not only dramatically improves agent rollout latency but also addresses a growing need for builders: as model inference gets faster, the services and systems that surround inference also need to speed up to transfer these gains to users.