LLMs / Benchmarks: Translation and Commentary on "ProgramBench: Can Language Models Rebuild Programs From Scratch?"

news2026/5/11 5:31:01

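Overview: ProgramBench moves the evaluation of software engineering agents from "local patching" to "rebuilding a program from scratch". Using program documentation, behavior-level tests, and agent-driven fuzzing, it defines a benchmark that is closer to real development. The results show that current models remain far from fully rebuilding complex software, but the benchmark already exposes their systematic weaknesses in architectural decisions, code organization, and test quality.

Background and pain points

● Existing software engineering benchmarks lean toward "local patching" and do not test building from scratch: The paper points out that existing benchmarks mostly measure a single bug fix or a single feature, not the completion of a whole software project from nothing. Real agent scenarios look more like making architecture, module-decomposition, and implementation decisions with no existing codebase.
● Writing a program from scratch requires high-level design decisions, which lack systematic evaluation: The authors stress that before writing any code a developer must decide on the language, build system, code organization, core data structures, and error-handling mechanism, and that these software design decisions have long been under-studied.
● Static tasks can be completed by "knowing what" rather than "knowing why": If only local tasks are examined, a model may get by with patch-style changes, which does not show that it can infer a complete implementation from a behavioral specification. This calls for a new evaluation centered on observable behavior.
● Evaluation is easily held hostage by implementation details: If tests depend directly on source code structure, models are forced to imitate the human implementation. The paper argues for checking only observable behavior, so that one implementation path is not mistaken for the only correct answer.

Solution

● ProgramBench, a benchmark for "rebuilding programs from scratch": Given a gold-standard executable and its documentation, the agent must write the source code and build script itself so that the new program's behavior matches the original as closely as possible.
● Behavior-level, implementation-agnostic evaluation: The evaluation does not inspect what the source code looks like. It uses automatically generated behavioral tests to compare the candidate and gold programs on externally observable behavior such as output, exit codes, and filesystem side effects.
● Tests generated automatically by agent-driven fuzzing: Rather than writing tests by hand, the paper has a SWE-agent explore the program, documentation, source code, and existing tests, then encode the observed behavior as assertions, forming a large behavioral test suite.
● Tasks built semi-automatically from open-source repositories: The authors filter GitHub for compilable projects, have an agent compile the gold-standard executable, then strip the source code and tests, keeping only the program and documentation as task input.
● A strict inference environment: The task environment has no network access. The gold executable is injected separately into Docker and set to execute-only (not readable), so that models cannot peek at implementation details through decompilation or build caches.

Core steps

● Step 1, select repositories suited to "rebuilding from scratch": Prefer open-source repositories that produce a standalone executable, especially compiled-language projects in C/C++, Go, Rust, and Java.
● Step 2, generate the gold-standard executable: A SWE-agent first compiles the repository successfully and writes the commands that reproduce the build into a build script, which serves as the baseline for later comparison.
● Step 3, build the behavioral test suite: The agent then explores the program, existing tests, and documentation, keeps filling coverage gaps, and gradually generates a more complete set of behavioral tests. A test that checks only the exit code, a short substring, or some other overly weak condition is judged low quality and must be rewritten.
● Step 4, filter tests that are too strong or too weak: A test that does not pass stably on the gold binary, or that a dummy binary can also pass, is discarded, reducing false positives and room for overfitting.
● Step 5, compare candidate programs as black boxes: The task worker sees only the program and documentation and has no access to the test suite. At evaluation time the behavioral tests are run to decide whether the candidate reproduces the gold program's external behavior.

Advantages

● Closer to real software development: ProgramBench evaluates the ability to take a complete software project from design to implementation rather than single-point patching, which is closer to how agents actually work on repository-level tasks.
● Implementation-agnostic, with more room to generalize: As long as behavior matches, a model may use different algorithms, different abstractions, or even a different language, and is not locked into the structure of the original human code.
● Large test suites that capture core functionality: The authors report that the generated suites are typically large and reliably cover core functionality. Coverage statistics also indicate that they come close to the native test suites in coverage.
● Broad task coverage: The 200 tasks span small CLI tools and large real-world projects, including FFmpeg, SQLite, the PHP interpreter, DuckDB, ripgrep, fzf, and jq, giving a fairly complete picture of model ability across software of different scales.
● Reveals models' structural preferences: The analysis shows that model-generated codebases tend toward a "monolithic" structure of one or a few files, in clear contrast to the modular organization of human projects. This preference is itself an important research signal.
● Empirical advantage of the test generation strategy: The coverage-driven iterative strategy clearly outperforms the one-shot monolithic strategy, with a higher average test count, higher coverage, and better discrimination between models.

Paper conclusions and viewpoints (with a focus on lessons and recommendations)

● Lesson 1, current models are still clearly short of "rebuilding complete software from scratch": None of the 9 models fully solves any task. The best model, Opus 4.7, reaches a 95% test pass rate on only 3% of tasks.
● Lesson 2, even when a model produces a "roughly correct" program, it does not necessarily organize code the way humans do: The analysis shows that models prefer to pile logic into a single file or a few root-level files, and the resulting structure clearly departs from human practice.
● Lesson 3, coverage-driven test generation is more reliable: Compared with the monolithic and decomposed strategies, coverage-guided iterative generation significantly increases test count and average coverage, making it the most worthwhile test generation approach.
● Lesson 4, assertion quality matters more than test count: After static lint and gold/dummy feedback were introduced, the dummy pass rate dropped from 18.5% to 3.7%, showing that weak assertions seriously contaminate evaluation quality.
● Lesson 5, generated tests should start from "externally visible behavior": The authors explicitly recommend writing only assertions that can be derived from the documentation and observable output, and keeping implementation internals, unexposed flags, and brittle string matching out of the tests.
● Recommendation 1, move the evaluation of automated software development from "fixing bugs" to "rebuilding projects": The paper argues that only this kind of benchmark can truly measure an agent's architectural, modularization, and end-to-end development abilities.
● Recommendation 2, extend the evaluation to non-functional metrics: ProgramBench currently looks mainly at input-output equivalence and does not yet measure non-functional properties such as speed, memory usage, or disk footprint. The authors see these as the next testing targets.
● Recommendation 3, improve test coverage and the expression of constraints: Because any finite test suite is only a lower bound on the full specification, the authors recommend improving test generation strategies so that system constraints and richer behavior patterns are included in the evaluation.

Contents

"ProgramBench: Can Language Models Rebuild Programs From Scratch?" Translation and Commentary
Abstract
1. Introduction
7. Conclusion

"ProgramBench: Can Language Models Rebuild Programs From Scratch?" Translation and Commentary

Paper: https://arxiv.org/abs/2605.03546
Date: May 5, 2026
Authors: Meta FAIR, Meta TBD, Stanford University, Harvard University

Abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable’s behavior.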
End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.

Figure 1: ProgramBench evaluates models on their ability to write software projects from scratch. Given a software program (e.g., executable) and its documentation, a software engineering agent (SWE-agent) is tasked with producing source code and a build script that reconstructs the original program’s behavior.

1. Introduction

Language Models (LMs) are increasingly being used to turn ideas expressed in natural language into full-fledged code repositories (Carlini, 2026; Lin, 2026; Replit, 2026). Unlike smaller scope tasks such as function generation (Hendrycks et al., 2021) or GitHub issue resolution (Jimenez et al., 2024), which typically demand understanding a pre-existing codebase well enough to make localized changes, building a functional application from scratch requires models to engage heavily with software design (Jansen and Bosch, 2005).

To understand what this entails, consider how a human programmer approaches the same task.
Before a single line is written, she asks herself a series of important questions: What programming language and build system should be used? How should the codebase be organized? What data structures should represent the program’s core entities? How should errors be detected and communicated? Such requisite questions, which developers constantly revisit throughout the development lifecycle, lead to pivotal design decisions that shape the codebase far more profoundly than any individual code change. Although we are progressively entrusting LMs to similarly build software from the ground up, the ability of LMs to make such architectural decisions, choose abstractions, and decompose a system into coherent modules has not been studied extensively.

To bridge this gap, we introduce ProgramBench, a benchmark that challenges software engineering (SWE) agents to produce code that recovers the functionality of a software program (e.g., executables, .dmg’s, .pkg’s). Given a program and documentation, a SWE-agent, defined as an LM equipped with an agent scaffold to interact with a terminal environment (Yang et al., 2024a), must write source code and a compile script that reproduces the original program’s behavior. Every software design decision is entirely the model’s to make.

We synthesize ProgramBench tasks from open-source GitHub repositories.
First, we identify repositories written in compiled languages (e.g., C/C++, Golang, Rust, Java) that build a program. Next, to convert a repository into a task instance, we compile the program, then strip away all source code and tests, leaving only the program and its documentation as the task’s starting point.

To evaluate a model’s solution, we generate behavioral tests by prompting a SWE-agent to systematically probe the original program with varied inputs and codify the observed input-output behavior into assertions that a candidate reconstruction must satisfy. Crucially, these tests are never revealed to the task worker. Since tests target executable behavior rather than source code, evaluation is entirely implementation agnostic; a model may use different algorithms, abstractions, or even programming languages than the original codebase, and still pass as long as the input-output behavior matches. While any test suite necessarily under-approximates an executable’s full specification, we empirically demonstrate that our test generation pipeline creates large suites that reliably capture core functionality.

Using our pipeline, we collect 200 task instances, ranging from compact CLI tools to complex, widely used software including language interpreters (PHP, Lua, tinycc), databases (DuckDB, SQLite), media and compression utilities (FFmpeg, zstd, xz), and developer tools (ripgrep, fzf, jq). We evaluate 9 language models equipped with mini-SWE-agent, a widely adopted coding agent scaffold for open source SWE-agent research. The results resoundingly confirm ProgramBench’s difficulty for today’s models; no task instance is fully resolved. However, test pass rates are significantly different between models. The best model, Opus 4.7, manages to pass 95% of tests for 3% of task instances. Further analysis reveals that model-written codebases diverge significantly from human-written ones, favoring monolithic file structures with longer functions.
Our trajectory analyses showcase how models vary in the length and makeup of the way they develop software.

We open source ProgramBench to enable the community to reproduce and build upon our investigations.

7. Conclusion

Limitations. ProgramBench relies on a finite set of behavioral tests, which under-approximates each executable’s full specification. Evaluation therefore is a “lower bound” on correctness: solutions that fail are definitively incorrect, while those that pass may still diverge from the original on untested inputs. ProgramBench tests also currently focus exclusively on input-output equivalence. Non-functional properties like execution speed, memory usage, or disk footprint are not captured. Therefore, it is possible a model reproduces behavior with an implementation orders of magnitude slower or more resource intensive than the original. Developing richer test generation strategies to improve coverage and incorporate system constraints is a promising direction.

Future work.
Several technical reports and blogs have suggested the effectiveness of applying multiple SWE-agents towards long horizon coding tasks (Lin, 2026; Carlini, 2026; Geng and Neubig, 2026; Mishra-Sharma, 2026). ProgramBench can serve as a testbed for such works. Our work uses a single SWE-agent as the baseline; this design reflects prior benchmark evidence, notably SWE-bench, where well-tuned single-agent systems have performed competitively, and multi-agent variants have not consistently shown clear advantages. We are excited to use ProgramBench to delineate the benefits of multi-agent approaches. Similarly, ProgramBench could further exploration into human-centered coding agents, where a developer, given the executable, iteratively guides the agent through design decisions (Liu et al., 2025; Baumann et al., 2026; Wang et al., 2026).

Conclusion. We introduce ProgramBench, a benchmark for measuring the ability of software engineering agents to develop, from scratch, programs that match a given executable’s behavior. Existing models struggle substantially, and none fully resolve any task. However, via fine-grained metrics, we find that models achieve meaningful partial progress, with stark differences in how models expend turns and the final form of their codebases. Our analyses reveal meaningful gaps in models’ decision making in architecting, developing and testing software.
We hope that ProgramBench could serve as a testbed for efforts focused on end-to-end autonomous software development.
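
To make the evaluation idea more concrete, here is a minimal sketch of the test filtering (keep a test only if the gold executable passes it stably and a dummy binary fails it) and the scoring (per-task pass rate, then the share of tasks at or above 95%). This is not the paper's code. All names (`Observation`, `BehaviorTest`, `observe`, `filter_tests`, `pass_rate`, `summarize`) are hypothetical, the repeat count of 3 is an assumption, and the sketch compares only stdout and exit code, while the benchmark as described above also looks at filesystem side effects.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Observation:
    """Externally visible behavior of one run (assumption: stdout + exit code only)."""
    stdout: str
    exit_code: int


@dataclass(frozen=True)
class BehaviorTest:
    """One behavioral test: an input plus the behavior recorded from the gold program."""
    args: tuple[str, ...]
    stdin: str
    expected: Observation


# A runner maps a test to what some executable actually did on that input.
Runner = Callable[[BehaviorTest], Observation]


def observe(executable: Sequence[str], args: Sequence[str], stdin: str,
            timeout: float = 10.0) -> Observation:
    """Run an executable as a black box and record its observable behavior."""
    proc = subprocess.run([*executable, *args], input=stdin,
                          capture_output=True, text=True, timeout=timeout)
    return Observation(proc.stdout, proc.returncode)


def passes(run: Runner, test: BehaviorTest) -> bool:
    return run(test) == test.expected


def filter_tests(tests: Sequence[BehaviorTest], gold: Runner, dummy: Runner,
                 repeats: int = 3) -> list[BehaviorTest]:
    """Drop tests that are flaky on the gold program or that a dummy program passes."""
    kept = []
    for test in tests:
        stable_on_gold = all(passes(gold, test) for _ in range(repeats))
        if stable_on_gold and not passes(dummy, test):
            kept.append(test)
    return kept


def pass_rate(tests: Sequence[BehaviorTest], candidate: Runner) -> float:
    """Fraction of tests the candidate reconstruction passes (0.0 for an empty suite)."""
    if not tests:
        return 0.0
    return sum(passes(candidate, t) for t in tests) / len(tests)


def summarize(rates: Sequence[float], threshold: float = 0.95) -> dict[str, float]:
    """Share of tasks fully resolved, and share with pass rate >= threshold."""
    if not rates:
        return {"fully_resolved": 0.0, "near_resolved": 0.0}
    n = len(rates)
    return {
        "fully_resolved": sum(r == 1.0 for r in rates) / n,
        "near_resolved": sum(r >= threshold for r in rates) / n,
    }
```

In practice a runner would wrap `observe`, for example `lambda t: observe(["./gold_binary"], t.args, t.stdin)`, where `./gold_binary` is a placeholder path. Keeping the runners as plain callables means the filtering and scoring logic can be exercised with in-memory fakes before any real executable is involved.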
