Harness design for long-running application development

Harness design is key to performance at the frontier of agentic coding. Here's how we pushed Claude further in frontend design and long-running autonomous software engineering.

Published Mar 24, 2026


Written by Prithvi Rajasekaran, a member of our Labs team.

Over the past several months I've been working on two interconnected problems: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention. This work originated with earlier efforts on our frontend design skill and long-running coding agent harness, where my colleagues and I were able to improve Claude's performance well above baseline through prompt engineering and harness design—but both eventually hit ceilings.

To break through, I sought out novel AI engineering approaches that held across two quite different domains, one defined by subjective taste, the other by verifiable correctness and usability. Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent. Building an evaluator that graded outputs reliably—and with taste—meant first developing a set of criteria that could turn subjective judgments like "is this design good?" into concrete, gradable terms.

I then applied these techniques to long-running autonomous coding, carrying over two lessons from our earlier harness work: decomposing the build into tractable chunks, and using structured artifacts to hand off context between sessions. The final result was a three-agent architecture—planner, generator, and evaluator—that produced rich full-stack applications over multi-hour autonomous coding sessions.


Why naive implementations fall short

We've previously shown that harness design has a substantial impact on the effectiveness of long-running agentic coding. In an earlier experiment, we used an initializer agent to decompose a product spec into a task list, and a coding agent that implemented the tasks one feature at a time before handing off artifacts to carry context across sessions. The broader developer community has converged on similar insights, with approaches like the "Ralph Wiggum" method using hooks or scripts to keep agents in continuous iteration cycles.

But some problems persisted. For more complex tasks, the agent still tends to go off the rails over time. In decomposing this issue, we observed two common failure modes in agents executing these sorts of tasks.

First is that models tend to lose coherence on lengthy tasks as the context window fills (see our post on context engineering). Some models also exhibit "context anxiety," in which they begin wrapping up work prematurely as they approach what they believe is their context limit. Context resets—clearing the context window entirely and starting a fresh agent, combined with a structured handoff that carries the previous agent's state and the next steps—address both of these issues.

This differs from compaction, where earlier parts of the conversation are summarized in place so the same agent can keep going on a shortened history. While compaction preserves continuity, it doesn't give the agent a clean slate, which means context anxiety can still persist. A reset provides a clean slate, at the cost of requiring a handoff artifact with enough state for the next agent to pick up the work cleanly. In our earlier testing, we found Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn't sufficient to enable strong long-task performance, so context resets became essential to the harness design. This solves the core issue, but adds orchestration complexity, token overhead, and latency to each harness run.
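
In pseudocode, the reset loop itself is simple. This is a minimal sketch: the handoff schema and the run_fresh_session helper are illustrative stand-ins, not the exact harness internals.

```python
import json
from pathlib import Path

HANDOFF = Path("handoff.json")  # structured state passed between sessions

def run_fresh_session(task: str, handoff: dict | None) -> dict:
    # Stand-in: launch a brand-new agent with an empty context window, give it
    # the task plus the previous session's handoff, and have it return an
    # updated handoff (completed work, known issues, next steps) when it stops.
    raise NotImplementedError

def run_with_resets(task: str, max_sessions: int = 20) -> None:
    handoff = json.loads(HANDOFF.read_text()) if HANDOFF.exists() else None
    for _ in range(max_sessions):
        handoff = run_fresh_session(task, handoff)  # clean slate each time
        HANDOFF.write_text(json.dumps(handoff, indent=2))
        if handoff.get("status") == "complete":
            break
```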

A second issue, which we haven't previously addressed, is self-evaluation. When asked to evaluate work they've produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre. This problem is particularly pronounced for subjective tasks like design, where there is no binary check equivalent to a verifiable software test. Whether a layout feels polished or generic is a judgment call, and agents reliably skew positive when grading their own work.

However, even on tasks that do have verifiable outcomes, agents still sometimes exhibit poor judgment that impedes their performance while completing the task. Separating the agent doing the work from the agent judging it proves to be a strong lever to address this issue. The separation doesn't immediately eliminate that leniency on its own; the evaluator is still an LLM that is inclined to be generous towards LLM-generated outputs. But tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work, and once that external feedback exists, the generator has something concrete to iterate against.


Frontend design: making subjective quality gradable

I started by experimenting on frontend design, where the self-evaluation issue was most visible. Absent any intervention, Claude normally gravitates toward safe, predictable layouts that are technically functional but visually unremarkable.

Two insights shaped the harness I built for frontend design. First, while aesthetics can't be fully reduced to a score—and individual tastes will always vary—they can be improved with grading criteria that encode design principles and preferences. "Is this design beautiful?" is hard to answer consistently, but "does this follow our principles for good design?" gives Claude something concrete to grade against. Second, by separating frontend generation from frontend grading, we can create a feedback loop that drives the generator toward stronger outputs.

With this in mind, I wrote four grading criteria that I gave to both the generator and evaluator agents in their prompts:

Design quality: Does the design feel like a coherent whole rather than a collection of parts? Strong work here means the colors, typography, layout, imagery, and other details combine to create a distinct mood and identity.

Originality: Is there evidence of custom decisions, or is this just template layouts, library defaults, and AI-generated patterns? A human designer should recognize deliberate creative choices. Unmodified stock components—or telltale signs of AI generation like purple gradients over white cards—fail here.

Craft: Technical execution, covering typography hierarchy, spacing consistency, color harmony, and contrast ratios. This is a competence check rather than a creativity check. Most reasonable implementations do fine here by default; failing means broken fundamentals.

Functionality: Usability independent of aesthetics. Can users understand what the interface does, find primary actions, and complete tasks without guessing?

I emphasized design quality and originality over craft and functionality. Claude already scored well on craft and functionality by default, as the required technical competence tended to come naturally to the model. But on design and originality, Claude often produced outputs that were bland at best. The criteria explicitly penalized highly generic "AI slop" patterns, and weighting design and originality more heavily pushed the model toward greater aesthetic risk-taking.

I calibrated the evaluator using few-shot examples with detailed score breakdowns. This ensured the evaluator's judgment aligned with my preferences, and reduced score drift across iterations.
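
Concretely, the criteria, their weights, and the calibration examples can be carried as plain data rendered into both agents' prompts. The sketch below is illustrative; the weights, scores, and wording are placeholders rather than the exact rubric from these experiments.

```python
# Placeholder rubric for illustration; the real weights and wording differed.
RUBRIC = {
    "design_quality": {"weight": 0.35, "question": "Does the design feel like a coherent whole rather than a collection of parts?"},
    "originality":    {"weight": 0.35, "question": "Are there deliberate creative choices, or just template layouts and library defaults?"},
    "craft":          {"weight": 0.15, "question": "Are typography hierarchy, spacing, color harmony, and contrast technically sound?"},
    "functionality":  {"weight": 0.15, "question": "Can users find primary actions and complete tasks without guessing?"},
}

# Few-shot calibration: scored examples with breakdowns, appended to the
# evaluator's prompt to keep its judgments anchored across iterations.
CALIBRATION_EXAMPLES = [
    {
        "output": "Generic SaaS landing page: white cards, purple-gradient hero, stock icons.",
        "scores": {"design_quality": 4, "originality": 2, "craft": 7, "functionality": 8},
        "notes": "Technically competent but reads as an unmodified template.",
    },
]

def weighted_score(scores: dict[str, float]) -> float:
    return sum(RUBRIC[c]["weight"] * scores[c] for c in RUBRIC)
```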

I built the loop on the Claude Agent SDK, which kept the orchestration straightforward. A generator agent first created an HTML/CSS/JS frontend based on a user prompt. I gave the evaluator the Playwright MCP, which let it interact with the live page directly before scoring each criterion and writing a detailed critique. In practice, the evaluator would navigate the page on its own, screenshotting and carefully studying the implementation before producing its assessment. That feedback flowed back to the generator as input for the next iteration. I ran 5 to 15 iterations per generation, with each iteration typically pushing the generator in a more distinctive direction as it responded to the evaluator's critique. Because the evaluator was actively navigating the page rather than scoring a static screenshot, each cycle took real wall-clock time. Full runs stretched up to four hours. I also instructed the generator to make a strategic decision after each evaluation: refine the current direction if scores were trending well, or pivot to an entirely different aesthetic if the approach wasn't working.
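
Compressed to its essentials, the loop looks roughly like the sketch below. It assumes the Python claude_agent_sdk query interface; the prompts are heavily abbreviated, and the option names and Playwright MCP configuration should be checked against the SDK and MCP docs before use.

```python
import anyio
from claude_agent_sdk import ClaudeAgentOptions, ResultMessage, query

GENERATOR_PROMPT = "..."  # generator persona plus the four grading criteria
EVALUATOR_PROMPT = "..."  # skeptical evaluator persona plus calibration examples
PLAYWRIGHT_MCP = {"type": "stdio", "command": "npx", "args": ["@playwright/mcp@latest"]}

async def run(prompt: str, options: ClaudeAgentOptions) -> str:
    """Run one agent to completion and return its final text output."""
    output = ""
    async for message in query(prompt=prompt, options=options):
        if isinstance(message, ResultMessage):
            output = message.result or ""
    return output

async def design_loop(user_prompt: str, iterations: int = 10) -> None:
    generator = ClaudeAgentOptions(system_prompt=GENERATOR_PROMPT, cwd="./site")
    evaluator = ClaudeAgentOptions(
        system_prompt=EVALUATOR_PROMPT,
        mcp_servers={"playwright": PLAYWRIGHT_MCP},  # lets it drive the live page
        cwd="./site",
    )
    feedback = "None yet; this is the first iteration."
    for _ in range(iterations):
        await run(
            f"Build or revise the frontend for: {user_prompt}\n"
            f"Latest evaluator critique:\n{feedback}",
            generator,
        )
        feedback = await run(
            "Open the page, interact with it, score each criterion, and write a detailed critique.",
            evaluator,
        )

anyio.run(design_loop, "A website for a Dutch art museum")
```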

Across runs, the evaluator's assessments improved over iterations before plateauing, with headroom still remaining. Some generations refined incrementally. Others took sharp aesthetic turns between iterations.

The wording of the criteria steered the generator in ways I didn't fully anticipate. Including phrases like "the best designs are museum quality" pushed designs toward a particular visual convergence, suggesting that the prompting associated with the criteria directly shaped the character of the output.

While scores generally improved over iterations, the pattern was not always cleanly linear. Later implementations tended to be better as a whole, but I regularly saw cases where I preferred a middle iteration over the last one. Implementation complexity also tended to increase across rounds, with the generator reaching for more ambitious solutions in response to the evaluator's feedback. Even on the first iteration, outputs were noticeably better than a baseline with no prompting at all, suggesting the criteria and associated language themselves steered the model away from generic defaults before any evaluator feedback led to further refinement.

In one notable example, I prompted the model to create a website for a Dutch art museum. By the ninth iteration, it had produced a clean, dark-themed landing page for a fictional museum. The page was visually polished but largely in line with my expectations. Then, on the tenth cycle, it scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on the walls in free-form positions, and doorway-based navigation between gallery rooms instead of scroll or click. It was the kind of creative leap that I hadn't seen before from a single-pass generation.


Scaling to full-stack coding

With these findings in hand, I applied this GAN-inspired pattern to full-stack development. The generator-evaluator loop maps naturally onto the software development lifecycle, where code review and QA serve the same structural role as the design evaluator.

The architecture

In our earlier long-running harness, we had solved for coherent multi-session coding with an initializer agent, a coding agent that worked one feature at a time, and context resets between sessions. Context resets were a key unlock: the harness used Sonnet 4.5, which exhibited the "context anxiety" tendency mentioned earlier. Creating a harness that worked well across context resets was key to keeping the model on task. Opus 4.5 largely removed that behavior on its own, so I was able to drop context resets from this harness entirely. The agents were run as one continuous session across the whole build, with the Claude Agent SDK's automatic compaction handling context growth along the way.

For this work I built on the foundation from the original harness with a three-agent system, with each agent addressing a specific gap I'd observed in prior runs. The system contained the following agent personas:

Planner: Our previous long-running harness required the user to provide a detailed spec upfront. I wanted to automate that step, so I created a planner agent that took a simple 1-4 sentence prompt and expanded it into a full product spec. I prompted it to be ambitious about scope and to stay focused on product context and high-level technical design rather than detailed technical implementation. This emphasis was due to the concern that if the planner tried to specify granular technical details upfront and got something wrong, the errors in the spec would cascade into the downstream implementation. It seemed smarter to pin down the deliverables and let the agents figure out the path as they worked. I also asked the planner to find opportunities to weave AI features into the product specs. (See example in the Appendix at the bottom.)

Generator: The one-feature-at-a-time approach from the earlier harness worked well for scope management. I applied a similar model here, instructing the generator to work in sprints, picking up one feature at a time from the spec. Each sprint built on a React, Vite, FastAPI, and SQLite (later PostgreSQL) stack, and the generator was instructed to self-evaluate its work at the end of each sprint before handing off to QA. The generator also had git for version control.

Evaluator: Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them. To catch these, the evaluator used the Playwright MCP to click through the running application the way a user would, testing UI features, API endpoints, and database states. It then graded each sprint against both the bugs it had found and a set of criteria modeled on the frontend experiment, adapted here to cover product depth, functionality, visual design, and code quality. Each criterion had a hard threshold, and if any one fell below it, the sprint failed and the generator got detailed feedback on what went wrong.
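
The hard-threshold gate itself reduces to a few lines; the criterion names and floor values below are placeholders for illustration, not the production values.

```python
# Hypothetical per-criterion floors on a 1-10 scale, for illustration only.
THRESHOLDS = {"product_depth": 6, "functionality": 7, "visual_design": 6, "code_quality": 6}

def sprint_verdict(scores: dict[str, int]) -> tuple[bool, list[str]]:
    """A sprint fails if any single criterion falls below its floor."""
    failures = [c for c, floor in THRESHOLDS.items() if scores.get(c, 0) < floor]
    return (not failures, failures)
```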

Before each sprint, the generator and evaluator negotiated a sprint contract: agreeing on what "done" looked like for that chunk of work before any code was written. This existed because the product spec was intentionally high-level, and I wanted a step to bridge the gap between user stories and testable implementation. The generator proposed what it would build and how success would be verified, and the evaluator reviewed that proposal to make sure the generator was building the right thing. The two iterated until they agreed.

Communication was handled via files: one agent would write a file, another agent would read it and respond either within that file or with a new file that the previous agent would read in turn. The generator then built against the agreed-upon contract before handing the work off to QA. This kept the work faithful to the spec without over-specifying implementation too early.
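
A stripped-down version of that file-based handshake might look like the following; the filenames and contract schema here are illustrative, not the exact format the agents used.

```python
import json
from pathlib import Path

CONTRACT = Path("sprint_contract.json")  # written by the generator
REVIEW = Path("contract_review.json")    # written by the evaluator

def propose_contract(feature: str, criteria: list[str]) -> None:
    """Generator proposes what it will build and how success will be verified."""
    CONTRACT.write_text(json.dumps(
        {"feature": feature, "test_criteria": criteria, "status": "proposed"}, indent=2))

def review_contract(approved: bool, comments: list[str]) -> None:
    """Evaluator responds in its own file, which the generator reads in turn."""
    REVIEW.write_text(json.dumps({"approved": approved, "comments": comments}, indent=2))

def contract_agreed() -> bool:
    return REVIEW.exists() and json.loads(REVIEW.read_text()).get("approved", False)
```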

Running the harness

For the first version of this harness, I used Claude Opus 4.5, which was our best coding model when I began these experiments, and ran user prompts against both the full harness and a single-agent system for comparison.

I wrote the following prompt to generate a retro video game maker:

Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode.

The table below shows each harness type, how long it ran, and its total cost.

Harness        Duration   Cost
Solo           20 min     $9
Full harness   6 hr       $200

The harness was over 20x more expensive, but the difference in output quality was immediately apparent.

I was expecting an interface where I could construct a level and its component parts (sprites, entities, tile layout) then hit play to actually play the level. I started by opening the solo run's output, and the initial application seemed in line with those expectations.

As I clicked through, however, issues started to emerge. The layout wasted space, with fixed-height panels leaving most of the viewport empty. The workflow was rigid. Trying to populate a level prompted me to create sprites and entities first, but nothing in the UI guided me toward that sequence. More to the point, the actual game was broken. My entities appeared on screen but nothing responded to input. Digging into the code revealed that the wiring between entity definitions and the game runtime was broken, with no surface indication of where.

[Screenshots: opening screen, sprite editor, game play]

Initial screen when opening the app created by the solo harness.

After evaluating the solo run, I turned my attention to the harness run. This run started from the same one-sentence prompt, but the planner step expanded that prompt into a 16-feature spec spread across ten sprints. It went well beyond what the solo run attempted. In addition to the core editors and play mode, the spec called for a sprite animation system, behavior templates, sound effects and music, an AI-assisted sprite generator and level designer, and game export with shareable links. I gave the planner access to our frontend design skill, which it read and used to create a visual design language for the app as part of the spec. For each sprint, the generator and evaluator negotiated a contract defining the sprint's specific implementation details and the testable behaviors that would verify completion.

The app immediately showed more polish and smoothness than the solo run. The canvas used the full viewport, the panels were sized sensibly, and the interface had a consistent visual identity that tracked the design direction from the spec. Some of the clunkiness I'd seen in the solo run did remain—the workflow still didn't make it clear that you should build sprites and entities before trying to populate a level, and I had to figure that out by poking around. This read as a gap in the base model's product intuition rather than something the harness was designed to address, though it did suggest a place where targeted iteration inside the harness could help to further improve output quality.

Working through the editors, the new run's advantages over solo became more apparent. The sprite editor was richer and more fully featured, with cleaner tool palettes, a better color picker, and more usable zoom controls.

Because I'd asked the planner to weave AI features into its specs, the app also came with a built-in Claude integration that let me generate different parts of the game through prompting. This significantly sped up the workflow.

[Screenshots: opening screen, sprite editor, AI game design, game play]

Initial screen: creating a new game, in the app built with the full harness.

The biggest difference was in play mode. I was actually able to move my entity and play the game. The physics had some rough edges—my character jumped onto a platform but ended up overlapping with it, which felt intuitively wrong—but the core thing worked, which the solo run did not manage. After moving around a bit, I did hit some limitations with the AI's game level construction. There was a large wall that I wasn't able to jump past, so I was stuck. This suggested there were some common sense improvements and edge cases that the harness could handle to further refine the app.

Reading through the logs, it was clear that the evaluator kept the implementation in line with the spec. Each sprint, it walked through the sprint contract's test criteria and exercised the running application through Playwright, filing bugs against anything that diverged from expected behavior. The contracts were granular—Sprint 3 alone had 27 criteria covering the level editor—and the evaluator's findings were specific enough to act on without extra investigation. The table below shows several examples of issues our evaluator identified:

Contract criterion: Rectangle fill tool allows click-drag to fill a rectangular area with selected tile
Evaluator finding: FAIL — Tool only places tiles at drag start/end points instead of filling the region. fillRectangle function exists but isn't triggered properly on mouseUp.

Contract criterion: User can select and delete placed entity spawn points
Evaluator finding: FAIL — Delete key handler at LevelEditor.tsx:892 requires both selection and selectedEntityId to be set, but clicking an entity only sets selectedEntityId. Condition should be selection || (selectedEntityId && activeLayer === 'entity').

Contract criterion: User can reorder animation frames via API
Evaluator finding: FAIL — PUT /frames/reorder route defined after /{frame_id} routes. FastAPI matches 'reorder' as a frame_id integer and returns 422: "unable to parse string as an integer."

Getting the evaluator to perform at this level took work. Out of the box, Claude is a poor QA agent. In early runs, I watched it identify legitimate issues, then talk itself into deciding they weren't a big deal and approve the work anyway. It also tended to test superficially rather than probing edge cases, so subtler bugs often slipped through. The tuning loop was to read the evaluator's logs, find examples where its judgment diverged from mine, and update the QA prompt to address those issues. It took several rounds of this development loop before the evaluator was grading in a way that I found reasonable. Even then, the harness output showed the limits of the model's QA capabilities: small layout issues, interactions that felt unintuitive in places, and undiscovered bugs in more deeply nested features that the evaluator hadn't exercised thoroughly. There was clearly more verification headroom to capture with further tuning. But compared to the solo run, where the central feature of the application simply didn't work, the lift was obvious.


Iterating on the harness

The first set of harness results was encouraging, but the harness itself was bulky, slow, and expensive. The logical next step was to find ways to simplify the harness without degrading its performance. This was partly common sense and partly a function of a more general principle: every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress-testing, both because they may be incorrect and because they can quickly go stale as models improve. Our blog post Building Effective Agents frames the underlying idea as "find the simplest solution possible, and only increase complexity when needed," and it's a pattern that shows up consistently for anyone maintaining an agent harness.

In my first attempt to simplify, I cut the harness back radically and tried a few creative new ideas, but I wasn't able to replicate the performance of the original. It also became difficult to tell which pieces of the harness design were actually load-bearing, and in what ways. Based on that experience, I moved to a more methodical approach, removing one component at a time and reviewing what impact it had on the final result.

As I was going through these iteration cycles, we also released Opus 4.6, which provided further motivation to reduce harness complexity. There was good reason to expect 4.6 would need less scaffolding than 4.5 did. From our launch blog: "[Opus 4.6] plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes." It also improved substantially on long-context retrieval. These were all capabilities the harness had been built to supplement.

Removing the sprint construct

I started by removing the sprint construct entirely. The sprint structure had helped decompose the work into chunks the model could handle coherently. Given the improvements in Opus 4.6, there was good reason to believe the model could natively handle the job without this sort of decomposition.

I kept both the planner and evaluator, as each continued to add obvious value. Without the planner, the generator under-scoped: given the raw prompt, it would start building without first speccing its work, and end up with a less feature-rich application than it produced when working from the planner's spec.

With the sprint construct removed, I moved the evaluator to a single pass at the end of the run rather than grading per sprint. Since the model was much more capable, it changed how load-bearing the evaluator was for certain runs, with its usefulness depending on where the task sat relative to what the model could do reliably on its own. On 4.5, that boundary was close: our builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues across the build. On 4.6, the model's raw capability increased, so the boundary moved outward. Tasks that used to need the evaluator's check to be implemented coherently were now often within what the generator handled well on its own, and for tasks within that boundary, the evaluator became unnecessary overhead. But for the parts of the build that were still at the edge of the generator's capabilities, the evaluator continued to give real lift.

The practical implication is that the evaluator is not a fixed yes-or-no decision. It is worth the cost when the task sits beyond what the current model does reliably solo.

Alongside the structural simplification, I also added prompting to improve how the harness built AI features into each app, specifically getting the generator to build a proper agent that could drive the app's own functionality through tools. That took real iteration, since the relevant knowledge is recent enough that Claude's training data covers it thinly. But with enough tuning, the generator was building agents correctly.
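
To make "a proper agent that drives the app through tools" concrete: the generated code ends up looking roughly like a tool-use loop over the app's own backend functions. The sketch below shows a single turn against the Anthropic Messages API; the set_tempo tool, the dispatch helper, and the model id are all illustrative, and a generated app would run this as a multi-turn loop that feeds tool results back to the model.

```python
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "set_tempo",  # hypothetical tool; a real app would expose many of these
    "description": "Set the project tempo in beats per minute.",
    "input_schema": {
        "type": "object",
        "properties": {"bpm": {"type": "number"}},
        "required": ["bpm"],
    },
}]

def dispatch(name: str, args: dict) -> None:
    # Hypothetical bridge into the app's own backend functions.
    print(f"calling {name} with {args}")

response = client.messages.create(
    model="claude-opus-4-5",  # substitute a current model id
    max_tokens=1024,
    tools=TOOLS,
    messages=[{"role": "user", "content": "Start a new track at 120 BPM."}],
)
for block in response.content:
    if block.type == "tool_use":
        dispatch(block.name, block.input)
```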

Results from the updated harness

To put the updated harness to the test, I used the following prompt to generate a Digital Audio Workstation (DAW), a music production program for composing, recording, and mixing songs:

Build a fully featured DAW in the browser using the Web Audio API.

The run was still lengthy and expensive, at about 4 hours and $124 in token costs.

Most of the time went to the builder, which ran coherently for over two hours without the sprint decomposition that Opus 4.5 had needed.

Agent & Phase      Duration      Cost
Planner            4.7 min       $0.46
Build (Round 1)    2 hr 7 min    $71.08
QA (Round 1)       8.8 min       $3.24
Build (Round 2)    1 hr 2 min    $36.89
QA (Round 2)       6.8 min       $3.09
Build (Round 3)    10.9 min      $5.88
QA (Round 3)       9.6 min       $4.06
Total V2 Harness   3 hr 50 min   $124.70

As with the previous harness, the planner expanded the one-line prompt into a full spec. From the logs, I could see the generator model did a good job planning the app and the agent design, wiring the agent up, and testing it before handing off to QA.

That being said, the QA agent still caught real gaps. In its first-round feedback, it noted:

This is a strong app with excellent design fidelity, solid AI agent, and good backend. The main failure point is Feature Completeness — while the app looks impressive and the AI integration works well, several core DAW features are display-only without interactive depth: clips can't be dragged/moved on the timeline, there are no instrument UI panels (synth knobs, drum pads), and no visual effect editors (EQ curves, compressor meters). These aren't edge cases — they're the core interactions that make a DAW usable, and the spec explicitly calls for them.

In its second round feedback, it again caught several functionality gaps:

Remaining gaps:

  • Audio recording is still stub-only (button toggles but no mic capture)
  • Clip resize by edge drag and clip split not implemented
  • Effect visualizations are numeric sliders, not graphical (no EQ curve)

The generator was still liable to miss details or stub out features when left to its own devices, and the QA agent still added value in catching those last-mile issues for the generator to fix.

Based on the prompt, I was expecting a program where I could create melodies, harmonies, and drum patterns, arrange them into a song, and get help from an integrated agent along the way. The video below shows the result.

The app is far from a professional music production program, and the agent's song composition skills could clearly use a lot of work. Additionally, Claude can't actually hear, which made the QA feedback loop less effective with respect to musical taste.

But the final app had all the core pieces of a functional music production program: a working arrangement view, mixer, and transport running in the browser. Beyond that, I was able to put together a short song snippet entirely through prompting: the agent set the tempo and key, laid down a melody, built a drum track, adjusted mixer levels, and added reverb. The core primitives for song composition were present, and the agent could drive them autonomously, using tools to create a simple production from end to end. You might say it's not pitch-perfect yet—but it's getting there.


What comes next

As models continue to improve, we can roughly expect them to be capable of working for longer, and on more complex tasks. In some cases, that will mean the scaffold surrounding the model matters less over time, and developers can wait for the next model and see certain problems solve themselves. On the other hand, the better the models get, the more space there is to develop harnesses that can achieve complex tasks beyond what the model can do at baseline.

With this in mind, there are a few lessons from this work worth carrying forward. It is always good practice to experiment with the model you're building against, read its traces on realistic problems, and tune its performance to achieve your desired outcomes. When working on more complex tasks, there is sometimes headroom from decomposing the task and applying specialized agents to each aspect of the problem. And when a new model lands, it is generally good practice to re-examine a harness, stripping away pieces that are no longer load-bearing to performance and adding new pieces to achieve greater capability that may not have been possible before.

From this work, my conviction is that the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.


Acknowledgements

Special thanks to Mike Krieger, Michael Agaby, Justin Young, Jeremy Hadfield, David Hershey, Julius Tarng, Xiaoyi Zhang, Barry Zhang, Orowa Sidker, Michael Tingley, Ibrahim Madha, Martina Long, and Canyon Robbins for their contributions to this work.

Thanks also to Jake Eaton, Alyssa Leonard, and Stef Sequeira for their help shaping the post.
