AGENTS.md Outperforms Skills in Our Agent Evals

We expected skills to solve framework knowledge for coding agents. After building evals for Next.js 16, we found a compressed 8KB docs index in AGENTS.md achieved 100% pass rate, while skills maxed out at 79%.

Published Jan 27, 2026

We expected skills to be the solution for teaching coding agents framework-specific knowledge. After building evals focused on Next.js 16 APIs, we found something unexpected.

A compressed 8KB docs index embedded directly in AGENTS.md achieved a 100% pass rate, while skills maxed out at 79% even with explicit instructions telling the agent to use them. Without those instructions, skills performed no better than having no documentation at all.


The problem we were trying to solve

AI coding agents rely on training data that becomes outdated. Next.js 16 introduces APIs like use cache, connection(), and forbidden() that aren't in current model training data. When agents don't know these APIs, they generate incorrect code or fall back to older patterns.

The reverse can also happen: you're running an older Next.js version, and the model suggests newer APIs that don't exist in your project yet. We wanted to fix this by giving agents access to version-matched documentation.
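In sketch form, version matching can be keyed off the installed major version. The function and bundle names below are illustrative assumptions, not part of any shipped tooling:

```typescript
// Hypothetical sketch: pick the doc bundle matching the project's
// installed Next.js version (bundle names like "next-16" are invented).
function pickDocs(installed: string, bundles: string[]): string | undefined {
  // Strip semver range prefixes like ^ or ~, then take the major version.
  const major = installed.replace(/^[\^~]/, '').split('.')[0]
  return bundles.find((b) => b === `next-${major}`)
}

console.log(pickDocs('^16.0.1', ['next-14', 'next-15', 'next-16'])) // next-16
```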


Two approaches for teaching agents framework knowledge

  • Skills are an open standard for packaging domain knowledge that coding agents can use. A skill bundles prompts, tools, and documentation that an agent can invoke on demand. The idea is that the agent recognizes when it needs framework-specific help, invokes the skill, and gets access to relevant docs.

  • AGENTS.md is a markdown file in your project root that provides persistent context to coding agents. Whatever you put in AGENTS.md is available to the agent on every turn, without the agent needing to decide to load it. Claude Code uses CLAUDE.md for the same purpose.


We started by betting on skills

Skills seemed like the right abstraction. You package your framework docs into a skill, the agent invokes it when working on Next.js tasks, and you get correct code. Clean separation of concerns, minimal context overhead, and the agent only loads what it needs.


Skills weren't being triggered reliably

In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it. Adding the skill produced no improvement over baseline:

| Configuration | Pass Rate | vs Baseline |
| --- | --- | --- |
| Baseline (no docs) | 53% | |
| Skill (default behavior) | 53% | +0pp |

Zero improvement. The skill existed, the agent could use it, and the agent chose not to. On the detailed Build/Lint/Test breakdown, the skill actually performed worse than baseline on some metrics (58% vs 63% on tests), suggesting that an unused skill in the environment may introduce noise or distraction.


Explicit instructions helped, but wording was fragile

We tried adding explicit instructions to AGENTS.md telling the agent to use the skill. This improved the trigger rate to 95%+ and boosted the pass rate to 79%.

| Configuration | Pass Rate | vs Baseline |
| --- | --- | --- |
| Baseline (no docs) | 53% | |
| Skill (default behavior) | 53% | +0pp |
| Skill with explicit instructions | 79% | +26pp |

Different wordings produced dramatically different results:

  • "You MUST invoke the skill" → Reads docs first, anchors on doc patterns → Misses project context

  • "Explore project first, then invoke skill" → Builds mental model first, uses docs as reference → Better results

This fragility concerned us. If small wording tweaks produce large behavioral swings, the approach feels brittle for production use.


Building evals we could trust

Before drawing conclusions, we needed evals we could trust. Our initial test suite had ambiguous prompts, tests that validated implementation details rather than observable behavior, and a focus on APIs already in model training data.

We hardened the eval suite by removing test leakage, resolving contradictions, and shifting to behavior-based assertions. Most importantly, we added tests targeting Next.js 16 APIs not in model training data: connection(), use cache, cacheLife(), cacheTag(), forbidden(), unauthorized(), proxy.ts, async cookies() and headers(), after(), updateTag(), refresh().
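To illustrate the shift to behavior-based assertions (the eval suite itself isn't shown in the post, so this is a hypothetical sketch), a check on an agent-generated helper should assert observable output, not how the output was produced:

```typescript
// Hypothetical agent-generated helper under evaluation.
function formatPrice(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`
}

// Implementation-detail assertion (brittle): fails on any equally correct
// implementation that happens not to call toFixed:
//   assert(source.includes('.toFixed('))

// Behavior-based assertions: any correct implementation passes.
console.assert(formatPrice(1999) === '$19.99')
console.assert(formatPrice(5) === '$0.05')
```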


The hunch that paid off

What if we removed the decision entirely? Instead of hoping agents would invoke a skill, we could embed a docs index directly in AGENTS.md. Not the full documentation, just an index that tells the agent where to find specific doc files that match your project's Next.js version.

We added a key instruction: "IMPORTANT: Prefer retrieval-led reasoning over pre-training-led reasoning for any Next.js tasks."
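Put together, the AGENTS.md addition might look roughly like this. This is a hypothetical sketch: the section layout and the file names under .next-docs/ are illustrative, and the codemod's actual output differs.

```markdown
## Next.js 16 docs index

IMPORTANT: Prefer retrieval-led reasoning over pre-training-led reasoning
for any Next.js tasks.

Version-matched docs live under `.next-docs/`. Read the relevant file
before writing code that touches these areas:

- Caching (`use cache`, `cacheLife()`, `cacheTag()`): `.next-docs/caching.md`
- Request APIs (`connection()`, async `cookies()`/`headers()`): `.next-docs/request-apis.md`
- Auth errors (`forbidden()`, `unauthorized()`): `.next-docs/errors.md`
```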


The results surprised us

Final pass rates:

| Configuration | Pass Rate | vs Baseline |
| --- | --- | --- |
| Baseline (no docs) | 53% | |
| Skill (default behavior) | 53% | +0pp |
| Skill with explicit instructions | 79% | +26pp |
| AGENTS.md docs index | 100% | +47pp |

Detailed breakdown:

| Configuration | Build | Lint | Test |
| --- | --- | --- | --- |
| Baseline | 84% | 95% | 63% |
| Skill (default) | 84% | 89% | 58% |
| Skill with instructions | 95% | 100% | 84% |
| AGENTS.md | 100% | 100% | 100% |

Why does passive context beat active retrieval? Three factors:

  1. No decision point — with AGENTS.md, there's no moment where the agent must decide "should I look this up?"

  2. Consistent availability — Skills load asynchronously and only when invoked. AGENTS.md content is in the system prompt for every turn.

  3. No ordering issues — Skills create sequencing decisions (read docs first vs. explore project first). Passive context avoids this entirely.


Addressing the context bloat concern

The initial docs injection was ~40KB. We compressed it to 8KB (80% reduction) while maintaining 100% pass rate. The compressed format uses pipe-delimited structure that packs the docs index into minimal space. The agent knows where to find docs without full content in context. When it needs specific information, it reads the relevant file from .next-docs/.
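As a sketch of how a pipe-delimited index can stay tiny yet remain resolvable — the row layout and file names here are invented for illustration, not the codemod's actual format:

```typescript
// Hypothetical compressed docs index: topic|api;api;api|file.
// Each row maps a topic and its key APIs to a file under .next-docs/.
const index = `
caching|use cache;cacheLife;cacheTag|.next-docs/caching.md
request|connection;cookies;headers|.next-docs/request-apis.md
errors|forbidden;unauthorized|.next-docs/errors.md
`.trim()

// Resolve which doc file covers a given API name.
function docFor(api: string): string | undefined {
  for (const row of index.split('\n')) {
    const [, apis, file] = row.split('|')
    if (apis.split(';').includes(api)) return file
  }
  return undefined
}

console.log(docFor('cacheTag')) // .next-docs/caching.md
```

Only the index lives in context; the agent reads the resolved file on demand, which is how the 40KB payload shrinks to 8KB without losing coverage.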


Try it yourself

One command:

npx @next/codemod@canary agents-md

What this means for framework authors

Skills aren't useless. AGENTS.md provides broad, horizontal improvements. Skills work better for vertical, action-specific workflows that users explicitly trigger.

Practical recommendations:

  • Don't wait for skills to improve. The gap may close as models get better at tool use, but results matter now.

  • Compress aggressively. An index pointing to retrievable files works just as well as full documentation in context.

  • Test with evals. Build evals targeting APIs not in training data.

  • Design for retrieval. Structure docs so agents can find and read specific files.

The goal is to shift agents from pre-training-led reasoning to retrieval-led reasoning. AGENTS.md turns out to be the most reliable way to make that happen.


Research and evals by Jude Gao.