2026-03-26

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

This paper revisits OPD: rather than compressing teacher/student matching into a single-token signal, it switches to a truncated reverse-KL over the teacher's top-K local support. For long-chain reasoning and agent training, the stability gain matters more than simply running a few more steps.

Post-Training · arXiv · Fu Yuqian, Huang Haohuan, Jiang Kaiwen, Zhu Yuanheng, Zhao Dongbin

Note: this entry is compiled by cross-checking the arXiv abstract, first/last-page PDF extraction, and the alphaXiv overview. It is positioned as a quick on-site research note, not a formal review that reproduces the appendix page by page.

Section 0 — Metadata

Title: Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
Authors & affiliations: Fu Yuqian, Huang Haohuan, Jiang Kaiwen, Zhu Yuanheng, Zhao Dongbin
Venue / status: arXiv preprint, March 2026
Code / data available: Only the arXiv and alphaXiv pages are confirmed in the current crawl; additional code repositories have not been individually verified.
Reproducibility signals: [paper] The paper reports task setups and the main experimental results; [inferred] random seeds, significance tests, and full implementation details still need to be checked item by item against the main text/appendix.

Section 1 — Problem and motivation

[paper] The appeal of on-policy distillation is that it evaluates student rollouts directly, instead of merely imitating fixed teacher traces. [paper] But as rollouts grow longer, the common sampled-token variant degrades into an extremely brittle one-step supervision signal.
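To make the "one-token signal" concrete, here is the sequence-level reverse KL that OPD targets and the estimator it is usually reduced to, written in standard notation rather than the paper's own:

$$
\mathrm{KL}\big(\pi_s \,\|\, \pi_t\big) \;=\; \mathbb{E}_{y \sim \pi_s}\Big[\,\sum_{t} \log \frac{\pi_s(y_t \mid x, y_{<t})}{\pi_t(y_t \mid x, y_{<t})}\,\Big].
$$

The sampled-token variant treats each summand as a standalone per-position loss and drops the coupling to future terms: per the abstract, this is biased relative to the sequence-level objective but has a much tighter worst-case variance bound, and every position of a long rollout is then supervised by a single scalar log-ratio at the one token actually sampled.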

[paper] The authors identify three failure modes: a severely imbalanced one-token signal; teacher guidance that becomes unreliable once the student drifts away from prefixes the teacher commonly visits; and distortions amplified by tokenizer / special-token mismatch. [paper] All three directly hurt the stability of long-horizon training.

[paper] The stated objective, quoting the abstract directly: On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.

Section 2 — Technical method

[paper] Core contribution: the paper recasts OPD as an approximation to the sequence-level reverse-KL and proposes teacher top-K local support matching, stabilizing the estimator with truncated reverse-KL, top-p rollout sampling, and special-token masking.
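A minimal PyTorch sketch of what such an objective could look like; the function name, the default K, and the exact renormalization below are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def truncated_reverse_kl(student_logits, teacher_logits, k=32, special_token_ids=()):
    """Reverse KL restricted to the teacher's top-K support at each position.

    Sketch only: both logit tensors are [batch, seq, vocab] over a shared
    vocabulary; `k` and the renormalization scheme are assumptions.
    """
    # Special-token masking: drop control tokens from both distributions
    # before matching, so tokenizer/special-token mismatch cannot leak in.
    if special_token_ids:
        ids = torch.tensor(list(special_token_ids), device=teacher_logits.device)
        teacher_logits = teacher_logits.index_fill(-1, ids, float("-inf"))
        student_logits = student_logits.index_fill(-1, ids, float("-inf"))

    # Teacher's top-K local support at each position.
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)

    # Renormalize both distributions over the K retained tokens.
    log_p_s = F.log_softmax(student_logits.gather(-1, topk_idx), dim=-1)
    log_p_t = F.log_softmax(topk_vals, dim=-1)

    # Reverse KL: expectation under the (truncated) student distribution.
    return (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1).mean()
```

Restricting both distributions to the teacher's top-K tokens turns every position into a K-way matching problem instead of a one-token signal, which is the stabilizing idea behind local support matching.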

[paper] The logical increment: this is not just renaming a loss. The paper first explains why sampled-token OPD breaks in long-horizon settings, then patches it at the level of the local support set.
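On the rollout side, top-p (nucleus) sampling keeps student generations inside a high-probability region where teacher guidance stays meaningful. A standard nucleus-filtering sketch, assumed here rather than taken from the paper:

```python
import torch

def top_p_filter(logits: torch.Tensor, p: float = 0.95) -> torch.Tensor:
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability exceeds p; everything else is set to -inf."""
    sorted_logits, sorted_idx = logits.sort(dim=-1, descending=True)
    probs = sorted_logits.softmax(dim=-1)
    cum = probs.cumsum(dim=-1)
    # Drop a token if the probability mass strictly before it already exceeds p.
    filtered = sorted_logits.masked_fill(cum - probs > p, float("-inf"))
    # Scatter the filtered values back into the original vocabulary order.
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, filtered)

# Rollout tokens are then drawn from the filtered distribution, e.g. for
# logits of shape [batch, vocab]:
# next_token = torch.multinomial(top_p_filter(logits).softmax(dim=-1), 1)
```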

[inferred] Complexity and engineering implications: the key cost of methods like this does not necessarily show up in FLOPs; it is more likely to appear as a longer training pipeline, more rollout / verification steps, or more complex system orchestration. Whether adoption is worth it depends on whether you are optimizing for benchmark score, stability, or deployability.

Section 3 — Experimental evidence

[paper] Key evidence: the abstract states explicitly that, in both single-task math reasoning and multi-task agentic-plus-math training, the objective optimizes more stably than sampled-token OPD and yields better downstream performance.

[paper] Evidence quality: the authors give both theoretical and empirical support: theoretically, a bias-variance tradeoff analysis; empirically, a toy study showing that stronger future-reward coupling produces higher gradient variance and less stable learning.

[inferred] If you use this paper as a basis for method selection, what most deserves a second look is not the single best number in the abstract, but whether the same trend holds across datasets, model scales, and budget settings.

Section 4 — Critical assessment

[inferred] Main concern: although the method is more stable, it also introduces more implementation details and hyperparameters; in particular, how the top-K, top-p, and masking choices transfer to different teacher/student pairings may still be sensitive. [inferred] Moreover, it remains a local-support approximation, not full sequence matching.

[inferred] Another practical issue: the most effective recipe in a paper is often also the heaviest. Before real adoption, you need to judge whether the gains cover the added cost in training budget, inference latency, and maintenance complexity.

[paper] The real strength of this work is that it does not stop at intuition: it decomposes a concrete bottleneck into a method design that is verifiable, comparable, and reusable.

Section 5 — Synthesis

TL;DR

The contribution of this work is to turn OPD's fragility from empirical intuition into a problem that can be analyzed and repaired. For teams doing reasoning or agent post-training, it offers a more stable distillation foundation, not a single benchmark trick.

Innovation classification

Method advance. [inferred] The value here lies less in claiming a brand-new paradigm than in systematically closing a key gap in an existing direction, backed by reasonably credible engineering and experimental evidence.

Deployment readiness

[inferred] If your work closely matches this paper's task setting, it is already a solid direct reference for your next round of experiment design; but before production or high-stakes use, you still need stronger robustness checks, budget analysis, and failure-case validation.

Open problems

  • How to adaptively choose K and the rollout sampling strategy
  • How to extend local support matching to stronger teacher ensembles
  • How to validate the stability gains on larger-scale agent trajectories