Building effective agents
Published Dec 19, 2024
Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.
In this post, we share what we’ve learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents.
What are agents?
"Agent" can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:
- Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
- Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.
Below, we will explore both types of agentic systems in detail. In Appendix 1 (“Agents in Practice”), we describe two domains where customers have found particular value in using these kinds of systems.
When (and when not) to use agents
When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense.
When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale. For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough.
When and how to use frameworks
There are many frameworks that make agentic systems easier to implement, including:
- The Claude Agent SDK;
- Strands Agents SDK by AWS;
- Rivet, a drag and drop GUI LLM workflow builder; and
- Vellum, another GUI tool for building and testing complex workflows.
These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice.
We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what's under the hood are a common source of customer error.
See our cookbook for some sample implementations.
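To keep the later pattern sketches concrete, here is a minimal example of calling the model directly with the Anthropic Python SDK. The `llm_call` helper and the model name are assumptions of these sketches rather than part of any framework; substitute whatever model and defaults you actually use.

```python
# A minimal direct-API helper, assuming the Anthropic Python SDK
# (`pip install anthropic`) and an ANTHROPIC_API_KEY in the environment.
# The model name is a placeholder; later sketches reuse this llm_call helper.
import anthropic

client = anthropic.Anthropic()

def llm_call(prompt: str, system: str | None = None, model: str = "claude-sonnet-4-5") -> str:
    """Make a single LLM call and return the text of the response."""
    kwargs = {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    if system:
        kwargs["system"] = system
    response = client.messages.create(**kwargs)
    return response.content[0].text
```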
Building blocks, workflows, and agents
In this section, we’ll explore the common patterns for agentic systems we’ve seen in production. We'll start with our foundational building block—the augmented LLM—and progressively increase complexity, from simple compositional workflows to autonomous agents.
Building block: The augmented LLM
The basic building block of agentic systems is an LLM enhanced with augmentations such as retrieval, tools, and memory. Our current models can actively use these capabilities—generating their own search queries, selecting appropriate tools, and determining what information to retain.
We recommend focusing on two key aspects of the implementation: tailoring these capabilities to your specific use case and ensuring they provide an easy, well-documented interface for your LLM. While there are many ways to implement these augmentations, one approach is through our recently released Model Context Protocol, which allows developers to integrate with a growing ecosystem of third-party tools with a simple client implementation.
For the remainder of this post, we'll assume each LLM call has access to these augmented capabilities.
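As one illustration of an augmented LLM, the sketch below exposes a single tool to the model through the Messages API and lets the model decide whether to call it. The `get_weather` tool, its schema, and the model name are hypothetical; only the `tools` parameter and the `tool_use` content blocks come from the API itself.

```python
# Sketch of an augmented LLM call: the model can choose to use a tool.
# The get_weather tool is illustrative; the model name is a placeholder.
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city. "
                   "Use when the user asks about weather; pass a city name.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name, e.g. 'Paris'"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder model name
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Should I pack an umbrella for Paris this week?"}],
)

# The model generates its own tool call (or answers directly) as it sees fit.
for block in response.content:
    if block.type == "tool_use":
        print("Tool requested:", block.name, block.input)
```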
Workflow: Prompt chaining
Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks (a "gate") on any intermediate steps to ensure that the process is still on track.
When to use this workflow: This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.
Examples where prompt chaining is useful:
- Generating marketing copy, then translating it into a different language.
- Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
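A minimal sketch of the outline-then-document example above, assuming the `llm_call` helper from earlier; the line-count gate stands in for whatever programmatic check fits your task.

```python
# Prompt chaining: outline -> programmatic gate -> full document.
def write_document(topic: str) -> str:
    outline = llm_call(f"Write a bullet-point outline for a document about: {topic}")

    # Gate: a cheap deterministic check that the chain is still on track.
    if len([line for line in outline.splitlines() if line.strip()]) < 3:
        raise ValueError("Outline looks too thin; stopping before the expensive step.")

    return llm_call(
        "Write the full document, following this outline closely:\n\n" + outline
    )
```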
Workflow: Routing
Routing classifies an input and directs it to a specialized followup task. This workflow allows for separation of concerns, and building more specialized prompts. Without this workflow, optimizing for one kind of input can hurt performance on other inputs.
When to use this workflow: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.
Examples where routing is useful:
- Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
- Routing easy/common questions to smaller, cost-efficient models like Claude Haiku 4.5 and hard/unusual questions to more capable models like Claude Sonnet 4.5 to optimize for both cost and performance.
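A sketch of the customer-support routing example, again assuming the `llm_call` helper; the category names, system prompts, and model identifiers are placeholders to adapt to your own setup.

```python
# Routing: a cheap classification call picks the downstream prompt and model.
ROUTES = {
    "general":   {"model": "claude-haiku-4-5",  "system": "Answer general product questions concisely."},
    "refund":    {"model": "claude-haiku-4-5",  "system": "Handle refund requests according to policy."},
    "technical": {"model": "claude-sonnet-4-5", "system": "You are a senior technical support engineer."},
}

def handle_query(query: str) -> str:
    label = llm_call(
        "Classify this customer query as exactly one of: general, refund, technical.\n"
        f"Query: {query}\nReply with only the label."
    ).strip().lower()
    route = ROUTES.get(label, ROUTES["general"])  # fall back if the label is unexpected
    return llm_call(query, system=route["system"], model=route["model"])
```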
Workflow: Parallelization
LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:
- Sectioning: Breaking a task into independent subtasks run in parallel.
- Voting: Running the same task multiple times to get diverse outputs.
When to use this workflow: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.
Examples where parallelization is useful:
- Sectioning:
  - Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.
  - Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt.
- Voting:
  - Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem.
  - Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.
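As a sketch of the voting variant, the code below runs the vulnerability-review example in parallel and aggregates the votes programmatically; it assumes the `llm_call` helper from earlier, and the vote threshold is arbitrary.

```python
# Parallelization (voting): several identical reviews run concurrently,
# and the code is flagged if enough of them report a problem.
from concurrent.futures import ThreadPoolExecutor

def flag_vulnerabilities(code: str, n_votes: int = 3, threshold: int = 2) -> bool:
    prompt = ("Review the following code for security vulnerabilities. "
              "Reply with only VULNERABLE or SAFE.\n\n" + code)
    with ThreadPoolExecutor(max_workers=n_votes) as pool:
        votes = list(pool.map(lambda _: llm_call(prompt), range(n_votes)))
    return sum("VULNERABLE" in vote.upper() for vote in votes) >= threshold
```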
Workflow: Orchestrator-workers
In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.
When to use this workflow: This workflow is well-suited for complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task). Whereas it’s topographically similar, the key difference from parallelization is its flexibility—subtasks aren't pre-defined, but determined by the orchestrator based on the specific input.
Examples where orchestrator-workers is useful:
- Coding products that make complex changes to multiple files each time.
- Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information.
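A sketch of the orchestrator-workers pattern using the same `llm_call` helper: the orchestrator decides the subtasks at run time, workers execute them, and a final call synthesizes the results. The JSON plan format is an assumption of this sketch, not a fixed protocol.

```python
# Orchestrator-workers: the subtasks are chosen by the model, not hardcoded.
import json

def orchestrate(task: str) -> str:
    plan = llm_call(
        "Break the following task into independent subtasks. "
        'Respond with only a JSON list of strings, e.g. ["subtask 1", "subtask 2"].\n\n'
        f"Task: {task}"
    )
    subtasks = json.loads(plan)  # in practice, validate the plan and retry on malformed JSON

    results = [
        llm_call(f"Complete this subtask:\n{sub}\n\n(It is part of the larger task: {task})")
        for sub in subtasks
    ]
    return llm_call(
        f"Synthesize these subtask results into one coherent answer to '{task}':\n\n"
        + "\n\n".join(results)
    )
```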
Workflow: Evaluator-optimizer
In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.
When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.
Examples where evaluator-optimizer is useful:
- Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
- Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.
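A sketch of the evaluator-optimizer loop with the `llm_call` helper: one call drafts, a second critiques, and the draft is revised until the evaluator approves or an iteration cap is reached. The APPROVED convention and the cap are assumptions of this sketch.

```python
# Evaluator-optimizer: generate, critique, revise, repeat.
def refine(task: str, max_rounds: int = 3) -> str:
    draft = llm_call(task)
    for _ in range(max_rounds):
        feedback = llm_call(
            f"Evaluate this response to the task '{task}'. "
            "If it fully meets the requirements, reply with only APPROVED; "
            f"otherwise list concrete improvements.\n\n{draft}"
        )
        if feedback.strip().startswith("APPROVED"):
            break
        draft = llm_call(
            f"Task: {task}\n\nPrevious attempt:\n{draft}\n\n"
            f"Feedback:\n{feedback}\n\nRewrite the response to address the feedback."
        )
    return draft
```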
Agents
Agents are emerging in production as LLMs mature in key capabilities—understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors. Agents begin their work with either a command from, or interactive discussion with, the human user. Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgement. During execution, it's crucial for agents to gain “ground truth” from the environment at each step (such as tool call results or code execution) to assess their progress. Agents can then pause for human feedback at checkpoints or when encountering blockers. The task often terminates upon completion, but it’s also common to include stopping conditions (such as a maximum number of iterations) to maintain control.
When to use agents: Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.
The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails.
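The sketch below shows one way the loop described above can look with the Messages API: the model keeps choosing tools, each tool result is fed back as ground truth, and a turn cap acts as the stopping condition. The single `read_file` tool and the model name are placeholders; a real agent would expose whatever tools its task needs.

```python
# A minimal agent loop: plan/act via tool calls, observe results, repeat.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "read_file",
    "description": "Read a text file and return its contents. Always pass an absolute path.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string", "description": "Absolute file path"}},
        "required": ["path"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    # Ground truth from the environment: the real file contents, or the real error.
    if name == "read_file":
        try:
            with open(args["path"]) as f:
                return f.read()
        except OSError as exc:
            return f"Error: {exc}"
    return f"Unknown tool: {name}"

def run_agent(task: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):  # stopping condition keeps the loop under control
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model name
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            return response.content[0].text  # the agent decided it is done
        tool_results = [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_tool(block.name, block.input),
        } for block in response.content if block.type == "tool_use"]
        messages.append({"role": "user", "content": tool_results})
    return "Stopped after reaching the maximum number of turns."
```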
Examples where agents are useful:
The following examples are from our own implementations:
- A coding agent to resolve SWE-bench tasks, which involve edits to many files based on a task description;
- Our “computer use” reference implementation, where Claude uses a computer to accomplish tasks.
Combining and customizing these patterns
These building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases. The key to success, as with any LLM features, is measuring performance and iterating on implementations. To repeat: you should consider adding complexity only when it demonstrably improves outcomes.
Summary
Success in the LLM space isn't about building the most sophisticated system. It's about building the right system for your needs. Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.
When implementing agents, we try to follow three core principles:
- Maintain simplicity in your agent's design.
- Prioritize transparency by explicitly showing the agent’s planning steps.
- Carefully craft your agent-computer interface (ACI) through thorough tool documentation and testing.
Frameworks can help you get started quickly, but don't hesitate to reduce abstraction layers and build with basic components as you move to production. By following these principles, you can create agents that are not only powerful but also reliable, maintainable, and trusted by their users.
Acknowledgements
Written by Erik Schluntz and Barry Zhang. This work draws upon our experiences building agents at Anthropic and the valuable insights shared by our customers, for which we're deeply grateful.
Appendix 1: Agents in practice
Our work with customers has revealed two particularly promising applications for AI agents that demonstrate the practical value of the patterns discussed above. Both applications illustrate how agents add the most value for tasks that require both conversation and action, have clear success criteria, enable feedback loops, and integrate meaningful human oversight.
A. Customer support
Customer support combines familiar chatbot interfaces with enhanced capabilities through tool integration. This is a natural fit for more open-ended agents because:
- Support interactions naturally follow a conversation flow while requiring access to external information and actions;
- Tools can be integrated to pull customer data, order history, and knowledge base articles;
- Actions such as issuing refunds or updating tickets can be handled programmatically; and
- Success can be clearly measured through user-defined resolutions.
Several companies have demonstrated the viability of this approach through usage-based pricing models that charge only for successful resolutions, showing confidence in their agents' effectiveness.
B. Coding agents
The software development space has shown remarkable potential for LLM features, with capabilities evolving from code completion to autonomous problem-solving. Agents are particularly effective because:
- Code solutions are verifiable through automated tests;
- Agents can iterate on solutions using test results as feedback;
- The problem space is well-defined and structured; and
- Output quality can be measured objectively.
In our own implementation, agents can now solve real GitHub issues in the SWE-bench Verified benchmark based on the pull request description alone. However, whereas automated testing helps verify functionality, human review remains crucial for ensuring solutions align with broader system requirements.
Appendix 2: Prompt engineering your tools
No matter which agentic system you're building, tools will likely be an important part of your agent. Tools enable Claude to interact with external services and APIs by specifying their exact structure and definition in our API. When Claude responds, it will include a tool use block in the API response if it plans to invoke a tool. Tool definitions and specifications should be given just as much prompt engineering attention as your overall prompts. In this brief appendix, we describe how to prompt engineer your tools.
There are often several ways to specify the same action. For instance, you can specify a file edit by writing a diff, or by rewriting the entire file. For structured output, you can return code inside markdown or inside JSON. In software engineering, differences like these are cosmetic and can be converted losslessly from one to the other. However, some formats are much more difficult for an LLM to write than others. Writing a diff requires knowing how many lines are changing in the chunk header before the new code is written. Writing code inside JSON (compared to markdown) requires extra escaping of newlines and quotes.
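As a quick illustration of that overhead, the snippet below prints the same two-line patch as plain text (what the model writes inside a markdown block) and as a JSON string value (what it would have to write inside JSON).

```python
import json

patch = 'print("hello")\nprint("world")'
print(patch)              # plain text, as inside a markdown code block
print(json.dumps(patch))  # JSON form: quotes and the newline must be escaped
```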
Our suggestions for deciding on tool formats are the following:
- Give the model enough tokens to "think" before it writes itself into a corner.
- Keep the format close to what the model has seen naturally occurring in text on the internet.
- Make sure there's no formatting "overhead" such as having to keep an accurate count of thousands of lines of code, or string-escaping any code it writes.
One rule of thumb is to think about how much effort goes into human-computer interfaces (HCI), and plan to invest just as much effort in creating good agent-computer interfaces (ACI). Here are some thoughts on how to do so:
- Put yourself in the model's shoes. Is it obvious how to use this tool, based on the description and parameters, or would you need to think carefully about it? If so, then it’s probably also true for the model. A good tool definition often includes example usage, edge cases, input format requirements, and clear boundaries from other tools.
- How can you change parameter names or descriptions to make things more obvious? Think of this as writing a great docstring for a junior developer on your team. This is especially important when using many similar tools.
- Test how the model uses your tools: Run many example inputs in our workbench to see what mistakes the model makes, and iterate.
- Poka-yoke your tools. Change the arguments so that it is harder to make mistakes.
While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt. For example, we found that the model would make mistakes with tools using relative filepaths after the agent had moved out of the root directory. To fix this, we changed the tool to always require absolute filepaths—and we found that the model used this method flawlessly.
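As a sketch of what that kind of change can look like, the tool definition below makes absolute paths the path of least resistance: the parameter name, the description, and a hard check all point the same way. The `edit_file` tool is illustrative, not part of any SDK.

```python
# Poka-yoke applied to a tool: name, description, and validation all agree.
import os

EDIT_FILE_TOOL = {
    "name": "edit_file",
    "description": (
        "Replace the full contents of a file. absolute_path MUST be an absolute "
        "path (for example /repo/src/main.py); relative paths are rejected."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "absolute_path": {"type": "string", "description": "Absolute path to the file"},
            "new_contents": {"type": "string", "description": "Complete new file contents"},
        },
        "required": ["absolute_path", "new_contents"],
    },
}

def edit_file(absolute_path: str, new_contents: str) -> str:
    # The hard check mirrors the description, so mistakes fail loudly and early.
    if not os.path.isabs(absolute_path):
        return "Error: absolute_path must be an absolute path."
    with open(absolute_path, "w") as f:
        f.write(new_contents)
    return f"Wrote {len(new_contents)} characters to {absolute_path}."
```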