Context Engineering for AI Agents: Lessons from Building Manus
Practical lessons from building the Manus AI agent, covering context engineering principles including KV-cache optimization, action space management, file-system-as-context, attention manipulation, error retention, and few-shot pattern avoidance.
2025/7/18 — Yichao 'Peak' Ji
At the very beginning of the Manus project, my team and I faced a key decision: should we train an end-to-end agentic model using open-source foundations, or build an agent on top of the in-context learning abilities of frontier models?
Back in my first decade in NLP, we didn't have the luxury of that choice. In the distant days of BERT (yes, it's been seven years), models had to be fine-tuned—and evaluated—before they could transfer to a new task. That process often took weeks per iteration, even though the models were tiny compared to today's LLMs. For fast-moving applications, especially pre–PMF, such slow feedback loops are a deal-breaker. That was a bitter lesson from my last startup, where I trained models from scratch for open information extraction and semantic search. Then came GPT-3 and Flan-T5, and my in-house models became irrelevant overnight. Ironically, those same models marked the beginning of in-context learning—and a whole new path forward.
That hard-earned lesson made the choice clear: Manus would bet on context engineering. This allowed us to ship improvements in hours instead of weeks, and kept our product orthogonal to the underlying models: if model progress is the rising tide, we want Manus to be the boat, not the pillar stuck to the seabed.
Still, context engineering turned out to be anything but straightforward. It's an experimental science—and we've rebuilt our agent framework four times, each time after discovering a better way to shape context. We affectionately refer to this manual process of architecture searching, prompt fiddling, and empirical guesswork as "Stochastic Graduate Descent". It's not elegant, but it works.
This post shares the local optima we arrived at through our own "SGD". If you're building your own AI agent, I hope these principles help you converge faster.
Design Around the KV-Cache
If I had to choose just one metric, I'd argue that the KV-cache hit rate is the single most important metric for a production-stage AI agent. It directly affects both latency and cost. To understand why, let's look at how a typical agent operates:
After receiving a user input, the agent proceeds through a chain of tool uses to complete the task. In each iteration, the model selects an action from a predefined action space based on the current context. That action is then executed in the environment (e.g., Manus's virtual machine sandbox) to produce an observation. The action and observation are appended to the context, forming the input for the next iteration. This loop continues until the task is complete.
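To make that loop concrete, here is a minimal sketch; `select_action`, `execute`, and `is_done` are placeholders for the model call, the sandbox, and the stop condition, not Manus internals.

```python
from typing import Callable

Action = dict
Observation = dict

def agent_loop(
    user_input: str,
    select_action: Callable[[list], Action],   # model picks from the action space
    execute: Callable[[Action], Observation],  # runs in the VM sandbox
    is_done: Callable[[Action], bool],         # e.g., a terminal "finish" action
) -> list:
    # The context is append-only: every action and observation is added at
    # the end, forming the input for the next iteration.
    context: list = [{"role": "user", "content": user_input}]
    while True:
        action = select_action(context)
        context.append({"role": "assistant", "action": action})
        observation = execute(action)
        context.append({"role": "tool", "observation": observation})
        if is_done(action):
            return context
```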
As you can imagine, the context grows with every step, while the output—usually a structured function call—remains relatively short. This makes the ratio between prefilling and decoding highly skewed in agents compared to chatbots. In Manus, for example, the average input-to-output token ratio is around 100:1.
Fortunately, contexts with identical prefixes can take advantage of the KV-cache, which drastically reduces time-to-first-token (TTFT) and inference cost—whether you're using a self-hosted model or calling an inference API. And we're not talking about small savings: with Claude Sonnet, for instance, cached input tokens cost 0.30 USD/MTok, while uncached ones cost 3 USD/MTok—a 10x difference.
From a context engineering perspective, improving KV-cache hit rate involves a few key practices:
- Keep your prompt prefix stable. Due to the autoregressive nature of LLMs, even a single-token difference can invalidate the cache from that token onward. A common mistake is including a timestamp—especially one precise to the second—at the beginning of the system prompt. Sure, it lets the model tell you the current time, but it also kills your cache hit rate.
- Make your context append-only. Avoid modifying previous actions or observations. Ensure your serialization is deterministic. Many programming languages and libraries don't guarantee stable key ordering when serializing JSON objects, which can silently break the cache. (A minimal sketch of these first two practices follows this list.)
- Mark cache breakpoints explicitly when needed. Some model providers or inference frameworks don't support automatic incremental prefix caching, and instead require manual insertion of cache breakpoints in the context. When assigning these, account for potential cache expiration and, at a minimum, ensure the breakpoint includes the end of the system prompt.
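Here is the promised sketch of the first two practices, assuming a JSON-serialized history; the helper names are illustrative, not Manus internals.

```python
import json

SYSTEM_PROMPT = "You are an agent..."  # static: no timestamps or request IDs

def serialize(obj: dict) -> str:
    # sort_keys gives deterministic key order; fixed separators avoid
    # whitespace drift. Both keep the prefix byte-identical across turns.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

def build_context(history: list) -> str:
    # Append-only: earlier actions and observations are never rewritten,
    # so each request extends the previous one and stays cache-friendly.
    return SYSTEM_PROMPT + "\n" + "\n".join(serialize(turn) for turn in history)
```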
Additionally, if you're self-hosting models using frameworks like vLLM, make sure prefix/prompt caching is enabled, and that you're using techniques like session IDs to route requests consistently across distributed workers.
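With vLLM's offline engine, for instance, prefix caching is a single flag. The model name, worker URLs, and hashing scheme below are illustrative assumptions for sticky session routing, not a prescribed setup.

```python
import hashlib

from vllm import LLM

# Prefix caching lets requests that share a prompt prefix reuse KV blocks.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)

# With multiple replicas, pin each session to one worker so its cache stays
# warm instead of being rebuilt on every hop. Worker URLs are hypothetical.
WORKERS = ["http://worker-0:8000", "http://worker-1:8000"]

def route(session_id: str) -> str:
    digest = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return WORKERS[digest % len(WORKERS)]
```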
Mask, Don't Remove
As your agent takes on more capabilities, its action space naturally grows more complex—in plain terms, the number of tools explodes. The recent popularity of MCP only adds fuel to the fire. If you allow user-configurable tools, trust me: someone will inevitably plug hundreds of mysterious tools into your carefully curated action space. As a result, the model is more likely to select the wrong action or take an inefficient path. In short, your heavily armed agent gets dumber.
A natural reaction is to design a dynamic action space—perhaps loading tools on demand using something RAG-like. We tried that in Manus too. But our experiments suggest a clear rule: unless absolutely necessary, avoid dynamically adding or removing tools mid-iteration. There are two main reasons for this:
- In most LLMs, tool definitions live near the front of the context after serialization, typically before or after the system prompt. So any change will invalidate the KV-cache for all subsequent actions and observations.
- When previous actions and observations still refer to tools that are no longer defined in the current context, the model gets confused. Without constrained decoding, this often leads to schema violations or hallucinated actions.
To solve this while still improving action selection, Manus uses a context-aware state machine to manage tool availability. Rather than removing tools, it masks the token logits during decoding to prevent (or enforce) the selection of certain actions based on the current context.
In practice, most model providers and inference frameworks support some form of response prefill, which allows you to constrain the action space without modifying the tool definitions. There are generally three modes of function calling (we'll use the Hermes format from NousResearch as an example):
- Auto – The model may choose to call a function or not. Implemented by prefilling only the reply prefix:
  <|im_start|>assistant
- Required – The model must call a function, but the choice is unconstrained. Implemented by prefilling up to the tool call token:
  <|im_start|>assistant<tool_call>
- Specified – The model must call a function from a specific subset. Implemented by prefilling up to the beginning of the function name:
  <|im_start|>assistant<tool_call>{"name": "browser_
Using this, we constrain action selection by masking token logits directly. For example, when the user provides a new input, Manus must reply immediately instead of taking an action. We've also deliberately designed action names with consistent prefixes—e.g., all browser-related tools start with browser_, and command-line tools with shell_. This allows us to easily enforce that the agent only chooses from a certain group of tools at a given state without using stateful logits processors.
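As a hedged sketch, the three modes plus prefix-grouped tool names can be driven by a small state machine; the state names and mapping below are illustrative, not the actual Manus design.

```python
from enum import Enum

class AgentState(Enum):
    REPLY_TO_USER = "reply"   # answer directly; pair with a logit mask on
                              # <tool_call> to actually forbid tool use
    ANY_TOOL = "any"          # must call some tool ("Required" mode)
    BROWSING = "browser"      # must call a browser_* tool ("Specified" mode)

def response_prefill(state: AgentState) -> str:
    base = "<|im_start|>assistant"
    if state is AgentState.REPLY_TO_USER:
        return base                   # "Auto"-style reply prefix
    if state is AgentState.ANY_TOOL:
        return base + "<tool_call>"   # force a call, choice unconstrained
    # Force a call whose function name starts with the state's prefix;
    # this is where consistent naming (browser_, shell_) pays off.
    return base + '<tool_call>{"name": "' + state.value + "_"
```

For example, `response_prefill(AgentState.BROWSING)` yields `<|im_start|>assistant<tool_call>{"name": "browser_`, so decoding can only continue with a browser tool.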
These designs help ensure that the Manus agent loop remains stable—even under a model-driven architecture.
Use the File System as Context
Modern frontier LLMs now offer context windows of 128K tokens or more. But in real-world agentic scenarios, that's often not enough, and sometimes even a liability. There are three common pain points:
- Observations can be huge, especially when agents interact with unstructured data like web pages or PDFs. It's easy to blow past the context limit.
- Model performance tends to degrade beyond a certain context length, even if the window technically supports it.
- Long inputs are expensive, even with prefix caching. You're still paying to transmit and prefill every token.
To deal with this, many agent systems implement context truncation or compression strategies. But overly aggressive compression inevitably leads to information loss. The problem is fundamental: an agent, by nature, must predict the next action based on all prior state—and you can't reliably predict which observation might become critical ten steps later. From a logical standpoint, any irreversible compression carries risk.
That's why we treat the file system as the ultimate context in Manus: unlimited in size, persistent by nature, and directly operable by the agent itself. The model learns to write to and read from files on demand—using the file system not just as storage, but as structured, externalized memory.
Our compression strategies are always designed to be restorable. For instance, the content of a web page can be dropped from the context as long as the URL is preserved, and a document's contents can be omitted if its path remains available in the sandbox. This allows Manus to shrink context length without permanently losing information.
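A minimal sketch of restorable compression along these lines; the observation schema and sandbox path are illustrative assumptions.

```python
from pathlib import Path

SANDBOX = Path("/tmp/agent_sandbox")  # hypothetical sandbox mount

def compress_observation(obs: dict) -> dict:
    if obs.get("type") == "web_page":
        # Drop the body but keep the URL: the page can be re-fetched.
        return {"type": "web_page", "url": obs["url"],
                "note": "content dropped; re-fetch by URL if needed"}
    if obs.get("type") == "document" and len(obs.get("text", "")) > 2000:
        # Externalize the content to disk and keep only the path.
        path = SANDBOX / f"{obs['doc_id']}.txt"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(obs["text"])
        return {"type": "document", "path": str(path),
                "note": "content on disk; read the file to restore"}
    return obs  # small observations stay inline
```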
While developing this feature, I found myself imagining what it would take for a State Space Model (SSM) to work effectively in an agentic setting. Unlike Transformers, SSMs lack full attention and struggle with long-range backward dependencies. But if they could master file-based memory—externalizing long-term state instead of holding it in context—then their speed and efficiency might unlock a new class of agents. Agentic SSMs could be the real successors to Neural Turing Machines.
Manipulate Attention Through Recitation
If you've worked with Manus, you've probably noticed something curious: when handling complex tasks, it tends to create a todo.md file—and update it step-by-step as the task progresses, checking off completed items.
That's not just cute behavior—it's a deliberate mechanism to manipulate attention.
A typical task in Manus requires around 50 tool calls on average. That's a long loop—and since Manus relies on LLMs for decision-making, it's vulnerable to drifting off-topic or forgetting earlier goals, especially in long contexts or complicated tasks.
By constantly rewriting the todo list, Manus is reciting its objectives into the end of the context. This pushes the global plan into the model's recent attention span, avoiding "lost-in-the-middle" issues and reducing goal misalignment. In effect, it's using natural language to bias its own focus toward the task objective—without needing special architectural changes.
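A hedged sketch of the mechanism: after each step, the plan is rewritten to disk and recited at the tail of the context. The file name follows the post; everything else is illustrative.

```python
from pathlib import Path

TODO = Path("todo.md")

def recite(context: str, items: list) -> str:
    """items: (description, done) pairs for the current plan."""
    lines = ["# Task plan"]
    for text, done in items:
        lines.append(f"- [{'x' if done else ' '}] {text}")
    plan = "\n".join(lines)
    TODO.write_text(plan)           # persist the plan in the sandbox
    return context + "\n\n" + plan  # recite it into recent attention span
```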
Keep the Wrong Stuff In
Agents make mistakes. That's not a bug—it's reality. Language models hallucinate, environments return errors, external tools misbehave, and unexpected edge cases show up all the time. In multi-step tasks, failure is not the exception; it's part of the loop.
And yet, a common impulse is to hide these errors: clean up the trace, retry the action, or reset the model's state and leave it to the magical "temperature". That feels safer, more controlled. But it comes at a cost: Erasing failure removes evidence. And without evidence, the model can't adapt.
In our experience, one of the most effective ways to improve agent behavior is deceptively simple: leave the wrong turns in the context. When the model sees a failed action—and the resulting observation or stack trace—it implicitly updates its internal beliefs. This shifts its prior away from similar actions, reducing the chance of repeating the same mistake.
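In pseudocode terms, the principle is simply to append the failure instead of scrubbing it; a sketch, with `execute` standing in for the sandbox.

```python
import traceback
from typing import Callable

def run_step(context: list, action: dict,
             execute: Callable[[dict], dict]) -> list:
    context.append({"role": "assistant", "action": action})
    try:
        observation = execute(action)
        context.append({"role": "tool", "observation": observation})
    except Exception:
        # Leave the wrong turn in: the stack trace is evidence the model
        # conditions on, shifting its prior away from repeating the mistake.
        context.append({"role": "tool", "error": traceback.format_exc()})
    return context
```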
In fact, we believe error recovery is one of the clearest indicators of true agentic behavior. Yet it's still underrepresented in most academic work and public benchmarks, which often focus on task success under ideal conditions.
Don't Get Few-Shotted
Few-shot prompting is a common technique for improving LLM outputs. But in agent systems, it can backfire in subtle ways.
Language models are excellent mimics; they imitate the pattern of behavior in the context. If your context is full of similar past action-observation pairs, the model will tend to follow that pattern, even when it's no longer optimal.
This can be dangerous in tasks that involve repetitive decisions or actions. For example, when using Manus to help review a batch of 20 resumes, the agent often falls into a rhythm—repeating similar actions simply because that's what it sees in the context. This leads to drift, overgeneralization, or sometimes hallucination.
The fix is to increase diversity. Manus introduces small amounts of structured variation in actions and observations—different serialization templates, alternate phrasing, minor noise in order or formatting. This controlled randomness helps break the pattern and tweaks the model's attention.
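A minimal sketch of that idea, assuming steps are rendered to text; the templates are illustrative, not Manus's actual serializers.

```python
import random

# Semantically equivalent renderings of the same step: surface variation
# only, so no information is lost.
TEMPLATES = [
    "Action: {name}({args}) -> {result}",
    "Called {name} with {args}; observed: {result}",
    "[{name}] args={args} | result={result}",
]

def render_step(name: str, args: str, result: str,
                rng: random.Random) -> str:
    return rng.choice(TEMPLATES).format(name=name, args=args, result=result)
```

To reconcile this with the earlier KV-cache advice, one approach is to apply the variation once, at write time, and then freeze each rendered step, so the append-only prefix stays byte-stable across turns.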
In other words, don't few-shot yourself into a rut. The more uniform your context, the more brittle your agent becomes.
Conclusion
Context engineering is still an emerging science—but for agent systems, it's already essential. Models may be getting stronger, faster, and cheaper, but no amount of raw capability replaces the need for memory, environment, and feedback. How you shape the context ultimately defines how your agent behaves: how fast it runs, how well it recovers, and how far it scales.
At Manus, we've learned these lessons through repeated rewrites, dead ends, and real-world testing across millions of users. None of what we've shared here is universal truth—but these are the patterns that worked for us. If they help you avoid even one painful iteration, then this post did its job.
The agentic future will be built one context at a time. Engineer them well.