Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

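What makes this paper most compelling is that it distills the local lessons in agent trajectories into transferable skills, rather than patching online after every single trajectory. It shows that skill documents can themselves be a long-term asset independent of model parameters.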

Agent Skills | arXiv | Ni Jingwei, Liu Yihao, Liu Xinpeng, Sun Yutao, Zhou Mengyu, Cheng Pengyu, Wang Dexin, Zhao Erchao, Jiang Xiaoxi, Jiang Guanjun

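Note: this entry was compiled by cross-checking the arXiv abstract, text extracted from the first and last pages of the PDF, and the alphaXiv overview. It is meant as a quick research note for this site, not a formal review that reproduces the appendix page by page.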

Section 0 — Metadata

Field	Value
Title	Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
Authors & affiliations	Ni Jingwei, Liu Yihao, Liu Xinpeng, Sun Yutao, Zhou Mengyu, Cheng Pengyu, Wang Dexin, Zhao Erchao, et al.
Venue / status	arXiv preprint, March 2026
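Code / data available	Only the arXiv and alphaXiv pages were confirmed in the information retrieved so far; any additional code repository has not been individually verified.
Reproducibility signals	[paper] The paper gives the task setup and the main experimental results; [inferred] random seeds, significance tests, and full implementation details still need to be checked item by item against the main text and appendix.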

Section 1 — Problem and motivation

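[paper] For complex agents to work reliably, they usually need high-quality, domain-specific skill documents; but manual maintenance scales poorly, and automatically generated skills tend to be fragmented and to memorize only individual trajectories. [paper]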

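[paper] The authors argue that the problem is not whether skills are written automatically, but that most methods update sequentially, one trajectory at a time. This hardens local experience into local rules too early and ends in a pile of mutually conflicting tricks. [paper]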

[paper] The abstract states the direct goal as follows: Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It extracts trajectory-specific lessons and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains, such as spreadsheet, VisionQA and math reasoning, show that Trace2Skill significantly improves upon strong baselines, including Anthropic's official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills -- requiring no parameter updates, no external retrieval modules, and utilizing open-source models as small as 35B parameters.

Section 2 — Technical method

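[paper] Core contribution: Trace2Skill follows a process closer to how human experts write documentation. It dispatches a group of sub-agents in parallel to analyze many execution trajectories, first extracts trajectory-local lessons, and then hierarchically consolidates them into a unified, conflict-free skill directory. [paper]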
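
The analyze-then-consolidate flow described above can be sketched as a small map/reduce pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: `Lesson`, `analyze_trajectory`, `merge_group`, and `build_skill_directory` are hypothetical names, the sub-agent is a stub that reads pre-annotated lessons instead of calling an LLM, and the hierarchy is reduced to two levels (per-topic merge, then a directory keyed by topic).

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    topic: str   # hypothetical grouping key, e.g. "formulas"
    advice: str  # one trajectory-local lesson
    source: str  # id of the trajectory the lesson came from


def analyze_trajectory(trajectory: dict) -> list[Lesson]:
    """Stand-in for one sub-agent. A real system would call an LLM here;
    this stub only reads lessons already annotated on the trajectory."""
    return [Lesson(topic, advice, trajectory["id"])
            for topic, advice in trajectory["lessons"]]


def merge_group(lessons: list[Lesson]) -> list[str]:
    """Stand-in for one consolidation step: drop duplicate advice
    (case-insensitive), keeping first-seen order. Conflict resolution,
    which the paper attributes to inductive reasoning, is not modeled."""
    seen, merged = set(), []
    for lesson in lessons:
        key = lesson.advice.strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append(lesson.advice.strip())
    return merged


def build_skill_directory(trajectories: list[dict], workers: int = 4) -> dict[str, list[str]]:
    # Stage 1 (map): analyze every trajectory in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_trajectory = list(pool.map(analyze_trajectory, trajectories))
    # Stage 2 (reduce): group lessons by topic, then merge each group.
    groups: dict[str, list[Lesson]] = {}
    for lessons in per_trajectory:
        for lesson in lessons:
            groups.setdefault(lesson.topic, []).append(lesson)
    return {topic: merge_group(group) for topic, group in sorted(groups.items())}


if __name__ == "__main__":
    # Hypothetical trajectories, invented for illustration only.
    demo = [
        {"id": "t1", "lessons": [("formulas", "Recalculate after writing formulas.")]},
        {"id": "t2", "lessons": [("formulas", "recalculate after writing formulas."),
                                 ("io", "Close the workbook before re-reading it.")]},
    ]
    for topic, advice in build_skill_directory(demo).items():
        print(topic, advice)
```

The point of the shape is that no lesson is written into the skill directory until every trajectory has been analyzed, so a single odd trajectory cannot rewrite a rule on its own.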

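[paper] Logical increment: it shifts experience distillation from online patching to offline synthesis aimed at the skill document. The point is not only to produce new skills but to make them transferable across models and tasks. [paper]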
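
One way to see why the order of updates matters is to compare a last-write-wins patcher with a pool-level synthesizer on the same observations. The sketch below is an illustration only: the function names and the data are hypothetical, and a simple majority vote stands in for the inductive consolidation the paper describes.

```python
from collections import Counter


def patch_sequentially(observations):
    """Online patching: each new trajectory overwrites the current rule,
    so the result depends on the order in which trajectories arrive."""
    skill = {}
    for rule, value in observations:
        skill[rule] = value
    return skill


def synthesize_offline(observations):
    """Offline synthesis: look at the whole pool first, then keep the
    majority value per rule. Ties are broken alphabetically so the
    result does not depend on arrival order."""
    votes = {}
    for rule, value in observations:
        votes.setdefault(rule, Counter())[value] += 1
    return {rule: min(counts, key=lambda v: (-counts[v], v))
            for rule, counts in votes.items()}


if __name__ == "__main__":
    # Hypothetical (rule, recommended value) pairs, one per trajectory.
    pool = [("header_row", "skip"), ("header_row", "skip"), ("header_row", "keep")]
    print(patch_sequentially(pool))        # follows the last trajectory
    print(patch_sequentially(pool[::-1]))  # different answer after reordering
    print(synthesize_offline(pool))        # same answer for any ordering
```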

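[inferred] Complexity and engineering implications: the main cost of this kind of method does not necessarily show up in FLOPs. It is more likely to show up as a longer pipeline, more rollout and verification steps, or more complex system orchestration. Whether it is worth adopting depends on whether you are optimizing for benchmark score, stability, or deployability.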

Section 3 — Experimental evidence

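[paper] Key evidence: the paper reports significant gains over strong baselines on spreadsheet, VisionQA, and math reasoning, including Anthropic's official xlsx skills. The more striking example is that skills evolved by Qwen3.5-35B also improve a Qwen3.5-122B agent on WikiTableQuestions by up to 57.65 absolute percentage points. [paper]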

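[paper] Evidence quality: the most convincing part is transfer. The authors look not only at gains on the original model but also at whether the skills transfer across LLM scales and to OOD settings. [paper] This makes the work look less like pure task-specific prompt engineering.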

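[inferred] If you use this paper as a basis for technology selection, the thing to revisit is not the single best number in the abstract but whether the same trend holds across datasets, model scales, and budget settings.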
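
A check of this kind can be sketched as a small summary over per-setting results. Everything here is an assumption for illustration: `trend_summary` is a hypothetical helper, and the scores in the demo are placeholders, not numbers from the paper. Fill in the real per-setting results from the paper's tables before drawing any conclusion.

```python
def trend_summary(results):
    """results maps a setting, e.g. (dataset, model), to
    (baseline_score, score_with_skill), both in percentage points."""
    if not results:
        raise ValueError("results must not be empty")
    # Absolute gain per setting, in percentage points.
    deltas = {setting: round(with_skill - baseline, 2)
              for setting, (baseline, with_skill) in results.items()}
    wins = sum(delta > 0 for delta in deltas.values())
    return {
        "deltas": deltas,
        "win_rate": wins / len(deltas),          # share of settings that improve
        "worst": min(deltas, key=deltas.get),    # setting with the smallest gain
        "best": max(deltas, key=deltas.get),     # setting with the largest gain
    }


if __name__ == "__main__":
    # Placeholder values, NOT taken from the paper.
    placeholder = {
        ("dataset_a", "small"): (40.0, 55.0),
        ("dataset_a", "large"): (60.0, 58.0),
        ("dataset_b", "small"): (30.0, 45.5),
    }
    print(trend_summary(placeholder))
```

Reading the worst setting next to the best one is the useful part: a headline "up to" figure says nothing about how often the gain is small or negative.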

Section 4 — Critical assessment

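[inferred] Main concern: the quality of skill consolidation still depends heavily on trajectory coverage and on the consolidation heuristics. If the experience pool is itself narrow, the resulting skill may be merely a "high-quality but narrow" document. [inferred] In addition, once the skill library keeps growing, retrieval and version management will become problems again. [inferred]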
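
The coverage concern can be made concrete with a rough pre-check on the trajectory pool before any consolidation runs. This is a sketch under assumptions: it presumes each trajectory has been tagged with task categories, which the paper is not confirmed to do, and `pool_coverage` and the category names are hypothetical.

```python
def pool_coverage(trajectory_tags, target_categories):
    """trajectory_tags: one collection of category tags per trajectory.
    target_categories: the categories the finished skill should cover."""
    targets = set(target_categories)
    if not targets:
        raise ValueError("target_categories must not be empty")
    seen = set()
    for tags in trajectory_tags:
        seen.update(tags)
    covered = seen & targets
    return {
        "coverage": len(covered) / len(targets),  # fraction of targets seen at least once
        "missing": sorted(targets - covered),     # categories with no trajectory at all
    }


if __name__ == "__main__":
    # Hypothetical tags for a spreadsheet-style skill.
    tags = [{"pivot", "formula"}, {"formula"}]
    print(pool_coverage(tags, ["pivot", "formula", "chart", "merge"]))
```

A low coverage value does not prove the skill will be narrow, but a category with zero trajectories cannot contribute any lesson to it.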

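[inferred] Another practical issue is that the most effective recipe in a paper is often also the heaviest. Before deployment, you need to judge whether the gains cover the extra cost in compute budget, inference latency, and maintenance complexity.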

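[paper] The real strength of this work is that it does not stop at intuition; it breaks a concrete bottleneck down into a method design that can be verified, compared, and reused.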

Section 5 — Synthesis

TL;DR

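The value of Trace2Skill lies in turning agent experience into a reusable asset that is independent of model parameters. It does not make the model "remember a bit more"; it lets experience be reused more widely in the form of skill documents.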

Innovation classification

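Method advance. [inferred] The value of this work lies mainly not in claiming a brand-new paradigm but in systematically fixing a key weakness of an existing direction, with reasonably credible engineering and experimental support.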

Deployment readiness

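[inferred] If your work closely matches the task setting of this paper, it is already a sufficient direct reference for designing your next round of experiments. Before production or high-risk use, it still needs stronger robustness checks, a budget analysis, and validation on failure cases.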

Open problems

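How to keep the skill library from fragmenting again under continuous online updates
How to measure the real boundary of a skill's transferability
How to unify textual skills with tool-call schemas and retrieval strategies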