Harness engineering: leveraging Codex in an agent-first world
Over the past five months, OpenAI's Harness team built and shipped an internal beta with zero manually-written code using Codex agents—roughly 1M lines across ~1,500 PRs at ~1/10th the development time. This post shares what they learned about designing environments, specifying intent, and building feedback loops for agents.
Published Feb 11, 2026
By Ryan Lopopolo, Member of the Technical Staff
Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.
The product has internal daily users and external alpha testers. It ships, deploys, breaks, and gets fixed. What's different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.
Humans steer. Agents execute.
We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude. We had weeks to ship what ended up being a million lines of code. To do that, we needed to understand what changes when a software engineering team's primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work.
This post is about what we learned by building a brand new product with a team of agents—what broke, what compounded, and how to maximize our one truly scarce resource: human time and attention.
We started with an empty git repository
The first commit to an empty repository landed in late August 2025.
The initial scaffold (repository structure, CI configuration, formatting rules, package manager setup, and application framework) was generated by Codex CLI using GPT‑5, guided by a small set of existing templates. Even the initial AGENTS.md file that tells agents how to work in the repository was itself written by Codex.
There was no pre-existing human-written code to anchor the system. From the beginning, the repository was shaped by the agent.
Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged with a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly, throughput has increased as the team has grown to seven engineers. Importantly, this wasn't output for output's sake: the product has been used by hundreds of users internally, including daily internal power users.
Throughout the development process, humans never directly contributed any code. This became a core philosophy for the team: no manually-written code.
Redefining the role of the engineer
The lack of hands-on human coding introduced a different kind of engineering work, focused on systems, scaffolding, and leverage.
Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified. The agent lacked the tools, abstractions, and internal structure required to make progress toward high-level goals. The primary job of our engineering team became enabling the agents to do useful work.
In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test, etc.), prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never "try harder." Because the only way to make progress was to get Codex to do the work, human engineers always stepped into the task and asked: "What capability is missing, and how do we make it both legible and enforceable for the agent?"
Humans interact with the system almost entirely through prompts: an engineer describes a task, runs the agent, and allows it to open a pull request. To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human- or agent-given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively, this is a Ralph Wiggum Loop). Codex uses our standard development tools directly (gh, local scripts, and repository-embedded skills) to gather context without humans copying and pasting into the CLI.
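The review loop described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual tooling: `Change`, `Reviewer`, `drive_to_completion`, and the toy reviewers are all hypothetical names, and a real loop would call out to agent reviewers rather than to local functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Change:
    """Hypothetical stand-in for a pull request under review."""
    description: str
    revisions: List[str] = field(default_factory=list)


# A reviewer returns feedback notes; an empty list means it is satisfied.
Reviewer = Callable[[Change], List[str]]


def drive_to_completion(
    change: Change,
    reviewers: List[Reviewer],
    revise: Callable[[Change, List[str]], None],
    max_rounds: int = 10,
) -> int:
    """Loop review -> revise until no reviewer has feedback.

    Returns the number of review rounds that were run.
    """
    for round_number in range(1, max_rounds + 1):
        feedback = [note for review in reviewers for note in review(change)]
        if not feedback:
            return round_number
        revise(change, feedback)
    raise RuntimeError(f"reviewers still unsatisfied after {max_rounds} rounds")


# Toy reviewers: each one asks for a single thing until it has been done.
def wants_tests(change: Change) -> List[str]:
    return [] if "add tests" in change.revisions else ["add tests"]


def wants_docs(change: Change) -> List[str]:
    return [] if "update docs" in change.revisions else ["update docs"]


def apply_feedback(change: Change, feedback: List[str]) -> None:
    change.revisions.extend(feedback)


change = Change("fix login redirect")
rounds = drive_to_completion(change, [wants_tests, wants_docs], apply_feedback)
```

The `max_rounds` cap is an assumption added for safety in the sketch; the source does not say whether the real loop is bounded.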
Humans may review pull requests, but aren't required to. Over time, we've pushed almost all review effort towards being handled agent-to-agent.
Increasing application legibility
As code throughput increased, our bottleneck became human QA capacity. Because the fixed constraint has been human time and attention, we've worked to add more capabilities to the agent by making things like the application UI, logs, and app metrics themselves directly legible to Codex.
For example, we made the app bootable per git worktree, so Codex could launch and drive one instance per change. We also wired the Chrome DevTools Protocol into the agent runtime and created skills for working with DOM snapshots, screenshots, and navigation. This enabled Codex to reproduce bugs, validate fixes, and reason about UI behavior directly.
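[Figure: Codex drives the app through the Chrome DevTools MCP to validate its work. Codex picks a target, snapshots state before and after triggering a UI path, observes runtime events through Chrome DevTools, applies a fix, restarts, and reruns validation in a loop until the app state is clean.]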
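One way to make an app bootable per worktree is to derive its local resources deterministically from the worktree path, so parallel instances do not fight over a port. The sketch below is an assumption about how that could look, not a description of the team's setup: `port_for_worktree` and the port range are hypothetical.

```python
import hashlib
from pathlib import Path

# Assumed range of local ports reserved for per-worktree app instances.
PORT_RANGE_START = 20000
PORT_RANGE_SIZE = 10000


def port_for_worktree(worktree: str) -> int:
    """Map a worktree path to a stable port in [20000, 30000)."""
    normalized = str(Path(worktree))  # drops a trailing slash, if any
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    return PORT_RANGE_START + int.from_bytes(digest[:4], "big") % PORT_RANGE_SIZE


port = port_for_worktree("/tmp/worktrees/fix-login-redirect")
```

Hashing keeps the mapping stable across restarts, but two worktrees can still collide on the same port, so a real launcher would need to detect that and fall back to another one.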
We did the same for observability tooling. Logs, metrics, and traces are exposed to Codex via a local observability stack that's ephemeral for any given worktree. Codex works on a fully isolated version of that app—including its logs and metrics, which get torn down once that task is complete. Agents can query logs with LogQL and metrics with PromQL. With this context available, prompts like "ensure service startup completes in under 800ms" or "no span in these four critical user journeys exceeds two seconds" become tractable.
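[Figure: A full observability stack for Codex in local development. The app sends logs, metrics, and traces to Vector, which fans the data out to an observability stack containing Victoria Logs, Metrics, and Traces, queried through LogQL, PromQL, or TraceQL APIs respectively. Codex uses these signals to query, correlate, and reason, then implements fixes in the codebase, restarts the app, reruns the workload, tests UI journeys, and repeats in a feedback loop.]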
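A prompt such as "no span in these critical user journeys exceeds two seconds" becomes checkable once span data is available to the agent. The sketch below works on in-memory records to show the shape of such a check; the `Span` type, the journey names, and the sample durations are hypothetical, and in practice the data would come from querying the trace store.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified view of one trace span."""
    journey: str
    name: str
    duration_ms: float


def spans_over_budget(
    spans: Iterable[Span], journeys: Set[str], budget_ms: float
) -> List[Span]:
    """Return spans in the given journeys whose duration exceeds the budget."""
    return [
        span
        for span in spans
        if span.journey in journeys and span.duration_ms > budget_ms
    ]


# Made-up sample data for illustration only.
spans = [
    Span("onboarding", "load_profile", 420.0),
    Span("onboarding", "render_welcome", 2350.0),
    Span("settings", "save", 180.0),
    Span("background_sync", "full_scan", 9000.0),  # not a critical journey
]
violations = spans_over_budget(spans, {"onboarding", "settings"}, budget_ms=2000.0)
```

An empty `violations` list would mean the budget holds; a non-empty one gives the agent concrete spans to investigate.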
We regularly see single Codex runs work on a single task for upwards of six hours (often while the humans are sleeping).
We made repository knowledge the system of record
Context management is one of the biggest challenges in making agents effective at large and complex tasks. One of the earliest lessons we learned was simple: give Codex a map, not a 1,000-page instruction manual.
We tried the "one big AGENTS.md" approach. It failed in predictable ways:
- Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs—so the agent either misses key constraints or starts optimizing for the wrong ones.
- Too much guidance becomes non-guidance. When everything is "important," nothing is. Agents end up pattern-matching locally instead of navigating intentionally.
- It rots instantly. A monolithic manual turns into a graveyard of stale rules. Agents can't tell what's still true, humans stop maintaining it, and the file quietly becomes an attractive nuisance.
- It's hard to verify. A single blob doesn't lend itself to mechanical checks (coverage, freshness, ownership, cross-links), so drift is inevitable.
So instead of treating AGENTS.md as the encyclopedia, we treat it as the table of contents.
The repository's knowledge base lives in a structured docs/ directory treated as the system of record. A short AGENTS.md (roughly 100 lines) is injected into context and serves primarily as a map, with pointers to deeper sources of truth elsewhere.
AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│ ├── index.md
│ ├── core-beliefs.md
│ └── ...
├── exec-plans/
│ ├── active/
│ ├── completed/
│ └── tech-debt-tracker.md
├── generated/
│ └── db-schema.md
├── product-specs/
│ ├── index.md
│ ├── new-user-onboarding.md
│ └── ...
├── references/
│ ├── design-system-reference-llms.txt
│ ├── nixpacks-llms.txt
│ ├── uv-llms.txt
│ └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
[Figure: Directory layout of the in-repository knowledge base.]
Design documentation is catalogued and indexed, including verification status and a set of core beliefs that define agent-first operating principles. Architecture documentation provides a top-level map of domains and package layering. A quality document grades each product domain and architectural layer, tracking gaps over time.
Plans are treated as first-class artifacts. Ephemeral lightweight plans are used for small changes, while complex work is captured in execution plans with progress and decision logs that are checked into the repository. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.
This enables progressive disclosure: agents start with a small, stable entry point and are taught where to look next, rather than being overwhelmed up front.
We enforce this mechanically. Dedicated linters and CI jobs validate that the knowledge base is up to date, cross-linked, and structured correctly. A recurring "doc-gardening" agent scans for stale or obsolete documentation that does not reflect the real code behavior and opens fix-up pull requests.
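As a rough illustration of what a mechanical knowledge-base check might look like, the sketch below finds relative Markdown links that point at files that do not exist. It covers only the cross-linking part of the checks mentioned above; `find_broken_links` and the link pattern are hypothetical, and the team's real linters are not shown in the source.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Matches the target of inline Markdown links such as [text](target).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(docs_root: Path) -> List[Tuple[str, str]]:
    """Return (document, link target) pairs for relative links that do not resolve."""
    broken = []
    for doc in sorted(docs_root.rglob("*.md")):
        text = doc.read_text(encoding="utf-8")
        for target in LINK_PATTERN.findall(text):
            # External links and in-page anchors are out of scope for this check.
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue
            path_part = target.split("#", 1)[0]
            if not (doc.parent / path_part).exists():
                broken.append((doc.relative_to(docs_root).as_posix(), target))
    return broken
```

Run in CI against `docs/`, a non-empty result would fail the job and name the exact document and link to repair.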
Agent legibility is the goal
As the codebase evolved, Codex's framework for design decisions needed to evolve, too.
Because the repository is entirely agent-generated, it's optimized first for Codex's legibility. In the same way teams aim to improve navigability of their code for new engineering hires, our human engineers' goal was making it possible for an agent to reason about the full business domain directly from the repository itself.
From the agent's point of view, anything it can't access in-context while running effectively doesn't exist. Knowledge that lives in Google Docs, chat threads, or people's heads is not accessible to the system. Repository-local, versioned artifacts (e.g., code, markdown, schemas, executable plans) are all it can see.
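[Figure: The boundary of agent knowledge: what Codex can't see doesn't exist. Codex's knowledge is shown as a bounded bubble. Below it are examples of invisible knowledge: Google Docs, Slack messages, and tacit human knowledge. Arrows indicate that to make this information visible to Codex, it must be encoded as Markdown in the codebase.]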
We learned that we needed to push more and more context into the repo over time. That Slack discussion that aligned the team on an architectural pattern? If it isn't discoverable to the agent, it's illegible in the same way it would be unknown to a new hire joining three months later.
Giving Codex more context means organizing and exposing the right information so the agent can reason over it, rather than overwhelming it with ad-hoc instructions. In the same way you would onboard a new teammate on product principles, engineering norms, and team culture (emoji preferences included), giving the agent this information leads to better-aligned output.
This framing clarified many tradeoffs. We favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies often described as "boring" tend to be easier for agents to model due to composability, API stability, and representation in the training set. In some cases, it was cheaper to have the agent reimplement subsets of functionality than to work around opaque upstream behavior from public libraries. For example, rather than pulling in a generic p-limit-style package, we implemented our own map-with-concurrency helper: it's tightly integrated with our OpenTelemetry instrumentation, has 100% test coverage, and behaves exactly the way our runtime expects.
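To give a sense of how small such a helper can be, here is a minimal map-with-concurrency sketch. It is not the team's implementation: the name `map_with_concurrency` and its signature are assumptions, the OpenTelemetry integration mentioned above is omitted, and Python's asyncio is used purely for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_with_concurrency(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int,
) -> List[R]:
    """Apply `worker` to every item, running at most `limit` calls at once.

    Results are returned in the same order as the input items.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))


async def demo():
    active = 0
    peak = 0

    async def double(n: int) -> int:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)  # track how many workers overlap
        await asyncio.sleep(0.01)
        active -= 1
        return n * 2

    results = await map_with_concurrency(range(6), double, limit=2)
    return results, peak


results, peak = asyncio.run(demo())
```

Owning a helper this small means its behavior (ordering, error propagation, instrumentation hooks) lives in the repository where the agent can read and test it.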
Pulling more of the system into a form the agent can inspect, validate, and modify directly increases leverage—not just for Codex, but for other agents (e.g. Aardvark) that are working on the codebase as well.
Enforcing architecture and taste
Documentation alone doesn't keep a fully agent-generated codebase coherent. By enforcing invariants, not micromanaging implementations, we let agents ship fast without undermining the foundation. For example, we require Codex to parse data shapes at the boundary, but are not prescriptive on how that happens (the model seems to like Zod, but we didn't mandate that library).
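"Parse at the boundary" means untrusted input is turned into a typed value once, at the edge, so inner code never handles raw shapes. The sketch below shows the idea with a plain dataclass; `AppSettings`, its fields, and `parse_app_settings` are hypothetical, and a schema library could do the same job.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class AppSettings:
    """Typed shape that inner layers can rely on (fields are illustrative)."""
    theme: str
    notifications_enabled: bool


def parse_app_settings(raw: Mapping[str, Any]) -> AppSettings:
    """Validate untrusted input once, then hand back a typed value."""
    theme = raw.get("theme")
    if theme not in ("light", "dark"):
        raise ValueError(f"theme must be 'light' or 'dark', got {theme!r}")
    enabled = raw.get("notifications_enabled")
    if not isinstance(enabled, bool):
        raise ValueError(
            f"notifications_enabled must be a boolean, got {enabled!r}"
        )
    return AppSettings(theme=theme, notifications_enabled=enabled)


settings = parse_app_settings({"theme": "dark", "notifications_enabled": True})
```

Code past this point receives an `AppSettings` or an exception, never a dictionary of guessed shape.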
Agents are most effective in environments with strict boundaries and predictable structure, so we built the application around a rigid architectural model. Each business domain is divided into a fixed set of layers, with strictly validated dependency directions and a limited set of permissible edges. These constraints are enforced mechanically via custom linters (Codex-generated, of course!) and structural tests.
The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend "forward" through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.
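[Figure: Layered domain architecture with explicit cross-cutting boundaries. Inside the business logic domain are the modules Types → Config → Repo and Providers → Service → Runtime → UI, with App Wiring + UI at the bottom. A Utils module sits outside the boundary and feeds into Providers.]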
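A structural check for the layering rule can be quite small. The sketch below makes one interpretive assumption that the source does not spell out: a layer may import from itself or from layers earlier in the chain (so UI may use Runtime, but Types may not use Service), and any layer may import Providers. The function names and the remediation wording are hypothetical, and the dependencies of Providers itself are out of scope here.

```python
from typing import Iterable, List, Optional, Tuple

# Layer order taken from the rule described above.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}
CROSS_CUTTING_ENTRY = "providers"


def check_edge(importer: str, imported: str) -> Optional[str]:
    """Return an error message for a disallowed dependency, or None if it is fine."""
    if imported == CROSS_CUTTING_ENTRY:
        return None  # assumed: every layer may reach cross-cutting concerns here
    if importer not in RANK or imported not in RANK:
        return f"{importer} -> {imported}: unknown layer; expected one of {LAYERS}"
    if RANK[imported] > RANK[importer]:
        return (
            f"{importer} -> {imported}: '{importer}' may not depend on the later "
            f"layer '{imported}'. Move the shared code into '{importer}' or an "
            f"earlier layer, or expose it through '{CROSS_CUTTING_ENTRY}'."
        )
    return None


def find_violations(edges: Iterable[Tuple[str, str]]) -> List[str]:
    """Check (importer, imported) layer pairs and collect all error messages."""
    messages = []
    for importer, imported in edges:
        message = check_edge(importer, imported)
        if message is not None:
            messages.append(message)
    return messages


violations = find_violations(
    [("ui", "runtime"), ("types", "service"), ("service", "providers")]
)
```

A real linter would first extract the edges from import statements; that step depends on the language and repository layout and is left out.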
This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it's an early prerequisite: the constraints are what allow speed without decay or architectural drift.
In practice, we enforce these rules with custom linters and structural tests, plus a small set of "taste invariants." For example, we statically enforce structured logging, naming conventions for schemas and types, file size limits, and platform-specific reliability requirements with custom lints. Because the lints are custom, we write the error messages to inject remediation instructions into agent context.
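The point about error messages can be illustrated with a file size lint whose output tells the agent what to do next, not just what is wrong. This is a sketch under assumptions: the 400-line limit, the function name, and the remediation text are all made up for the example.

```python
from typing import Dict, List

MAX_LINES = 400  # hypothetical limit; the source does not state the real one


def lint_file_sizes(files: Dict[str, str], max_lines: int = MAX_LINES) -> List[str]:
    """Return one message per oversized file, including remediation guidance."""
    messages = []
    for path, content in sorted(files.items()):
        line_count = len(content.splitlines())
        if line_count > max_lines:
            messages.append(
                f"{path}: {line_count} lines exceeds the {max_lines}-line limit. "
                "Remediation: split the file along its existing responsibilities "
                "into smaller modules in the same layer, and keep public names "
                "stable so callers do not change."
            )
    return messages


messages = lint_file_sizes(
    {"service/billing.py": "x = 1\n" * 5, "types/user.py": "y = 2\n" * 2},
    max_lines=3,
)
```

Because the message lands in the agent's context when the lint fails, the remediation sentence acts as a targeted instruction at the moment it is needed.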
In a human-first workflow, these rules might feel pedantic or constraining. With agents, they become multipliers: once encoded, they apply everywhere at once.
At the same time, we're explicit about where constraints matter and where they do not. This resembles leading a large engineering platform organization: enforce boundaries centrally, allow autonomy locally. You care deeply about boundaries, correctness, and reproducibility. Within those boundaries, you allow teams—or agents—significant freedom in how solutions are expressed.
The resulting code does not always match human stylistic preferences, and that's okay. As long as the output is correct, maintainable, and legible to future agent runs, it meets the bar.
Human taste is fed back into the system continuously. Review comments, refactoring pull requests, and user-facing bugs are captured as documentation updates or encoded directly into tooling. When documentation falls short, we promote the rule into code.
Throughput changes the merge philosophy
As Codex's throughput increased, many conventional engineering norms became counterproductive.
The repository operates with minimal blocking merge gates. Pull requests are short-lived. Test flakes are often addressed with follow-up runs rather than blocking progress indefinitely. In a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive.
This would be irresponsible in a low-throughput environment. Here, it's often the right tradeoff.
What "agent-generated" actually means
When we say the codebase is generated by Codex agents, we mean everything in the codebase.
Agents produce:
- Product code and tests
- CI configuration and release tooling
- Internal developer tools
- Documentation and design history
- Evaluation harnesses
- Review comments and responses
- Scripts that manage the repository itself
- Production dashboard definition files
Humans always remain in the loop, but work at a different layer of abstraction than we used to. We prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, we treat it as a signal: identify what is missing—tools, guardrails, documentation—and feed it back into the repository, always by having Codex itself write the fix.
Agents use our standard development tools directly. They pull review feedback, respond inline, push updates, and often squash and merge their own pull requests.
Increasing levels of autonomy
As more of the development loop was encoded directly into the system (testing, validation, review, feedback handling, and recovery), the repository recently crossed a meaningful threshold: Codex can now drive a new feature end to end.
Given a single prompt, the agent can now:
- Validate the current state of the codebase
- Reproduce a reported bug
- Record a video demonstrating the failure
- Implement a fix
- Validate the fix by driving the application
- Record a second video demonstrating the resolution
- Open a pull request
- Respond to agent and human feedback
- Detect and remediate build failures
- Escalate to a human only when judgment is required
- Merge the change
This behavior depends heavily on the specific structure and tooling of this repository and should not be assumed to generalize without similar investment—at least, not yet.
Entropy and garbage collection
Full agent autonomy also introduces novel problems. Codex replicates patterns that already exist in the repository—even uneven or suboptimal ones. Over time, this inevitably leads to drift.
Initially, humans addressed this manually. Our team used to spend every Friday (20% of the week) cleaning up "AI slop." Unsurprisingly, that didn't scale.
Instead, we started encoding what we call "golden principles" directly into the repository and built a recurring cleanup process. These principles are opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs. For example: (1) we prefer shared utility packages over hand-rolled helpers to keep invariants centralized, and (2) we don't probe data "YOLO-style"—we validate boundaries or rely on typed SDKs so the agent can't accidentally build on guessed shapes. On a regular cadence, we have a set of background Codex tasks that scan for deviations, update quality grades, and open targeted refactoring pull requests. Most of these can be reviewed in under a minute and automerged.
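A background scan for the first principle (shared utilities over hand-rolled helpers) could start from something as simple as flagging functions that are defined in more than one module. The sketch below is a crude proxy under that assumption: `find_duplicate_helpers` is a hypothetical name, matching by function name alone will produce false positives, and Python's `ast` module stands in for whatever parser fits the real codebase.

```python
import ast
from collections import defaultdict
from typing import Dict, List


def find_duplicate_helpers(sources: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each top-level function name defined in several modules to those modules."""
    definitions = defaultdict(list)
    for module_path, source in sorted(sources.items()):
        for node in ast.parse(source).body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                definitions[node.name].append(module_path)
    return {name: paths for name, paths in definitions.items() if len(paths) > 1}


duplicates = find_duplicate_helpers(
    {
        "billing/client.py": "def retry(fn):\n    return fn()\n",
        "search/client.py": "def retry(fn):\n    return fn()\n\ndef query():\n    pass\n",
    }
)
```

Each entry in `duplicates` is a candidate for a small, targeted refactoring pull request that moves the helper into a shared package.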
This functions like garbage collection. Technical debt is like a high-interest loan: it's almost always better to pay it down continuously in small increments than to let it compound and tackle it in painful bursts. Human taste is captured once, then enforced continuously on every line of code. This also lets us catch and resolve bad patterns on a daily basis, rather than letting them spread in the codebase for days or weeks.
What we're still learning
This strategy has so far worked well up through internal launch and adoption at OpenAI. Building a real product for real users helped anchor our investments in reality and guide us towards long-term maintainability.
What we don't yet know is how architectural coherence evolves over years in a fully agent-generated system. We're still learning where human judgment adds the most leverage and how to encode that judgment so it compounds. We also don't know how this system will evolve as models continue to become more capable over time.
What's become clear: building software still demands discipline, but the discipline shows up more in the scaffolding than in the code. The tooling, abstractions, and feedback loops that keep the codebase coherent are increasingly important.
Our most difficult challenges now center on designing environments, feedback loops, and control systems that help agents accomplish our goal: build and maintain complex, reliable software at scale.
As agents like Codex take on larger portions of the software lifecycle, these questions will matter even more. We hope that sharing some early lessons helps you reason about where to invest your effort so you can just build things.