Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

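This paper dissects the counterintuitive pattern in which self-distillation on mathematical reasoning makes responses "shorter with every round, and worse the shorter they get". Its core diagnosis is that epistemic verbalization is suppressed. It is a useful caution for any post-training work that targets shorter CoT.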

Reasoning | arXiv | Kim Jeonghye, Luo Xufang, Kim Minbeom, Lee Sangmook, Kim Dohyung, Jeon Jiwon, Li Dongsheng, Yang Yuqing

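Note: this entry was compiled by cross-checking the arXiv abstract, text extracted from the first and last pages of the PDF, and the alphaXiv overview. It is intended as a quick on-site research note, not a formal review that reproduces the appendix page by page.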

Section 0 — Metadata

Field	Value
Title	Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
Authors & affiliations	Kim Jeonghye, Luo Xufang, Kim Minbeom, Lee Sangmook, Kim Dohyung, Jeon Jiwon, Li Dongsheng, Yang Yuqing
Venue / status	arXiv preprint, March 2026
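Code / data available	Only the arXiv and alphaXiv pages are confirmed in the currently retrieved information; additional code repositories have not been individually verified.
Reproducibility signals	[paper] The paper provides the task setup and main experimental results; [inferred] random seeds, significance tests, and full implementation details still need to be checked item by item against the main text and appendix.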

Section 1 — Problem and motivation

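[paper] Self-distillation makes model responses shorter and cleaner on many tasks, but the authors find that in mathematical reasoning "shorter" does not reliably mean "stronger". [paper] On OOD math problems in particular, the performance drop coincides with compression of the reasoning chain.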

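[paper] The authors attribute the problem to epistemic verbalization, the linguistic traces of uncertainty, trial, and revision that the model exposes while thinking. [paper] If training suppresses these traces too aggressively, the model may get faster in-domain yet lose room to explore on hard examples. [inferred]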

[paper] The abstract states the paper's direct aim: Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

Section 2 — Technical method

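[paper] Core contribution: by controlling conditioning context richness and task coverage, the paper compares self-distillation against settings such as GRPO and tracks how response length, accuracy, and uncertainty tokens change. [paper] It does not propose a new large model; it performs a "mechanism diagnosis" of the post-training process.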

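[paper] Logical increment: the novelty lies not in the distillation technique itself but in a finer-grained explanation of the failure mechanism: the problem is not just that responses "got shorter", but that the intermediate language expressing uncertainty was squeezed out. [paper]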
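
[inferred] To make "uncertainty expression" concrete, here is a minimal sketch of one way to measure it in a single response. The marker list and the name `epistemic_rate` are assumptions made for illustration; the paper's actual token set and counting procedure are not reproduced in this note.

```python
import re

# Hypothetical marker list, chosen for illustration only.
# The paper's real set of epistemic tokens may differ.
EPISTEMIC_MARKERS = (
    "wait", "hmm", "maybe", "perhaps",
    "not sure", "let me check", "alternatively",
)


def epistemic_rate(text: str, markers=EPISTEMIC_MARKERS) -> float:
    """Return marker occurrences per whitespace-delimited word.

    Returns 0.0 for empty text so callers never divide by zero.
    """
    words = text.split()
    if not words:
        return 0.0
    lowered = text.lower()
    # Whole-phrase, case-insensitive matches for each marker.
    hits = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", lowered))
        for marker in markers
    )
    return hits / len(words)
```

A falling `epistemic_rate` alongside falling response length would be the surface signature the note describes, although a lexical count like this is only a rough proxy.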

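[inferred] Complexity and engineering implications: the key cost of this kind of method need not show up in FLOPs; it is more likely to show up as a longer training pipeline, more rollout / verification steps, or more complex system orchestration. Whether it is worth adopting depends on whether you are optimizing for benchmark score, stability, or deployability.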

Section 3 — Experimental evidence

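[paper] Key evidence: according to the text and figures, SDPO steadily compresses response length during training but shows worse OOD performance than GRPO on AMC23 and AIME24; meanwhile, use of epistemic tokens drops markedly. [paper]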
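
[inferred] The three quantities in this comparison (response length, accuracy, epistemic token use) can be tabulated per training method with a few lines of code. The sketch below assumes you already have per-response records from your own evaluation runs; the `Rollout` fields and the `summarize` helper are hypothetical names, not part of the paper's code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Rollout:
    method: str       # training method label, e.g. "SDPO" or "GRPO"
    n_tokens: int     # response length in tokens
    correct: bool     # whether the final answer was judged correct
    n_epistemic: int  # count of epistemic markers found in the response


def summarize(rollouts):
    """Group rollouts by method and report length, accuracy, marker density."""
    groups = defaultdict(list)
    for rollout in rollouts:
        groups[rollout.method].append(rollout)

    summary = {}
    for method, items in groups.items():
        total_tokens = sum(r.n_tokens for r in items)
        total_markers = sum(r.n_epistemic for r in items)
        summary[method] = {
            "mean_length": total_tokens / len(items),
            "accuracy": sum(r.correct for r in items) / len(items),
            # Normalize by tokens so shorter outputs are not penalized twice.
            "epistemic_per_1k_tokens": (
                1000 * total_markers / total_tokens if total_tokens else 0.0
            ),
        }
    return summary
```

Reporting marker density per 1k tokens rather than a raw count is a deliberate choice here: it separates "fewer markers because the response is shorter" from "fewer markers per unit of reasoning".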

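[paper] Evidence quality: this is a diagnostic study whose experiments revolve around the relationship among length, accuracy, and uncertainty expression. [paper] From the information currently visible, however, the causal chain rests mainly on controlled experiments and correlational analysis rather than a fully closed-loop mechanistic proof. [inferred]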

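[inferred] If you use this paper as a basis for choosing a method, the thing to re-examine is not the single best number in the abstract but whether the same trend holds across datasets, model scales, and budget settings.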

Section 4 — Critical assessment

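[inferred] Main concerns: the biggest open question is whether epistemic verbalization is merely an "observable proxy variable" rather than the root cause that determines generalization. [inferred] In addition, the evidence centers on mathematical reasoning, and the same conclusion may not hold when transferred to code, agent, or multimodal tasks. [inferred]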

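[inferred] Another practical issue is that the most effective recipe in a paper is often also the "heaviest". Before real deployment, you need to judge whether the gains cover the extra cost in training budget, inference latency, and maintenance complexity.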

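[paper] The real strength of this work is that it does not stop at intuition; it breaks a concrete bottleneck down into a method design that can be verified, compared, and reused.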

Section 5 — Synthesis

TL;DR

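The paper's most important contribution is to pour cold water on the belief that "shorter CoT is always better". It shows that what distillation erases may not be redundancy but the model's ability to handle uncertainty. The next generation of reasoning distillation will likely need to protect this kind of expression explicitly.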

Innovation classification

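Method advance. [inferred] The value of this work lies not in "claiming a brand-new paradigm" but in systematically filling a key gap in an existing direction, with reasonably credible engineering and experimental support.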

Deployment readiness

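[inferred] If your work closely matches the task setup in this paper, it is already enough to serve as a direct reference for designing your next round of experiments; before entering production or high-risk settings, however, you still need stronger robustness checks, budget analysis, and failure-case validation.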

Open problems

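Can uncertainty expression be regularized explicitly instead of only being observed after the fact?
Should the optimal reasoning length differ across task types?
How can the efficiency gains of distillation and OOD stability be preserved together?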
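
[inferred] As a thought experiment for the first open problem, explicit regularization could take the form of a shaped reward that penalizes responses whose uncertainty expression falls below a floor. The sketch below is purely illustrative and is not a method from the paper: `shaped_reward`, `target_rate`, and `penalty_weight` are hypothetical, and whether such a penalty actually helps OOD performance is exactly what remains open.

```python
def shaped_reward(
    correct: bool,
    epistemic_rate: float,
    target_rate: float = 0.01,    # hypothetical floor on marker density
    penalty_weight: float = 1.0,  # hypothetical strength of the regularizer
) -> float:
    """Correctness reward minus a penalty for too little uncertainty expression.

    The penalty is zero once epistemic_rate reaches target_rate, so the
    model is not rewarded for adding ever more hedging.
    """
    if target_rate <= 0:
        raise ValueError("target_rate must be positive")
    base = 1.0 if correct else 0.0
    # Relative shortfall in [0, 1]: 0 at or above the floor, 1 at zero markers.
    shortfall = max(0.0, target_rate - epistemic_rate) / target_rate
    return base - penalty_weight * shortfall
```

The one-sided penalty is a design choice: it protects a minimum level of uncertainty expression without pushing the model toward longer, more hedged outputs than the task needs.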