Everyday Paper
2026-03-30

Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

PRCO splits perception and reasoning into two roles, an Observer and a Solver, and uses different rewards for credit assignment. For multimodal RL, this is a genuine structural upgrade over "only checking whether the final answer is right."

Multimodal Reasoning · arXiv
Miao Ziqi, Jia Haonan, Li Lijun, Qian Chen, Xiong Yuan, Yan Wenting, Shao Jing

Note: this entry is compiled by cross-referencing the arXiv abstract, first/last-page PDF extraction, and the alphaXiv overview. It is meant as a quick on-site research note, not a formal review that reproduces the appendix page by page.

Section 0 — Metadata

Title: Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
Authors & affiliations: Miao Ziqi, Jia Haonan, Li Lijun, Qian Chen, Xiong Yuan, Yan Wenting, Shao Jing
Venue / status: arXiv preprint, March 2026
Code / data available: only the arXiv and alphaXiv pages are confirmed from the crawled information; any additional code repositories have not been individually verified.
Reproducibility signals: [paper] the paper gives the task setup and main experimental results; [inferred] random seeds, significance tests, and full implementation details still need to be checked item by item against the body and appendix.

Section 1 — Problem and motivation

[paper] Multimodal RLVR has recently delivered substantial gains in MLLM reasoning, but the common recipe assigns one shared reward based solely on whether the final answer is correct. [paper] This blurs the credit assignment between perception and reasoning, so upstream visual evidence extraction improves only marginally.

[paper] The authors summarize this as the perception bottleneck: the model may learn to "think" better without necessarily learning to "see" better. [paper] On questions that require fine-grained visual evidence, this becomes a direct ceiling on accuracy.

[paper] The abstract states the direct objective as follows: "Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines."

Section 2 — Technical method

[paper] Core contribution: PRCO designs two cooperative roles. An Observer first generates an evidence caption aligned with the question, and a Solver then answers based on that caption. The two roles share one policy but receive different rewards: the Solver is trained with a verifiable outcome reward, the Observer with a downstream utility reward.
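
To make the role split concrete, here is a minimal sketch of one PRCO-style rollout under a shared policy. The `policy.generate` interface, the prompt templates, and the `verifier` callable are illustrative assumptions, not the paper's actual API; only the reward routing (outcome reward to the Solver, downstream utility to the Observer) is taken from the abstract.

```python
# Minimal sketch of a PRCO-style two-role rollout with one shared policy.
# `policy.generate`, the prompt templates, and `verifier` are illustrative
# assumptions; only the reward routing follows the paper's abstract.

def prco_rollout(policy, image, question, verifier):
    # Role 1: Observer sees the image and writes an evidence caption
    # tailored to the question (the perception step).
    observer_prompt = f"Describe the visual evidence needed to answer: {question}"
    caption = policy.generate(observer_prompt, image=image)

    # Role 2: Solver predicts the final answer from the caption alone
    # (the reasoning step); per the abstract it answers "based on this
    # caption" rather than re-reading the image.
    solver_prompt = f"Evidence: {caption}\nQuestion: {question}\nAnswer:"
    answer = policy.generate(solver_prompt)

    # Role-specific credit assignment:
    #   Solver   <- verifiable outcome reward on the final answer.
    #   Observer <- utility reward derived from the Solver's success.
    solver_reward = verifier(answer)   # e.g. exact match -> {0.0, 1.0}
    observer_reward = solver_reward    # simplest possible utility proxy

    return caption, answer, observer_reward, solver_reward
```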

[paper] Logical increment: the novelty lies in decomposing final-answer supervision into role-specific training signals, so that the perception component is no longer updated only as a side effect but has an explicit optimization target of its own.
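
If the underlying optimizer is group-based (a GRPO-style setup is a plausible but unverified assumption here), the role split could translate into per-role advantage normalization, so the Observer segment and the Solver segment of each trajectory are credited independently:

```python
import numpy as np

def role_specific_advantages(observer_rewards, solver_rewards, eps=1e-8):
    """Normalize each role's rewards within its own group so perception
    and reasoning tokens receive distinct learning signals. GRPO-style
    group normalization is our assumption, not confirmed by the paper."""
    def group_norm(rewards):
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    return group_norm(observer_rewards), group_norm(solver_rewards)
```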

[inferred] Complexity and engineering implications: the main cost of methods like this does not necessarily show up in FLOPs. It is more likely to show up as a longer training pipeline, more rollout and verification steps, or heavier system orchestration. Whether it is worth adopting depends on whether you are optimizing for benchmark score, stability, or deployability.

Section 3 — Experimental evidence

[paper] Key evidence: across eight challenging multimodal reasoning benchmarks, the paper reports an average accuracy gain of more than 7 points over the base model, outperforming existing open-source RL-tuned baselines.

[paper] Evidence quality: the abstract and case studies suggest the authors track not only aggregate scores but also a diagnostic separation of perception errors from reasoning errors. [paper] This makes the source of the improvement more interpretable.

[inferred] If you treat this paper as a basis for model selection, the thing to revisit is not the single best number in the abstract, but whether the same trend holds across datasets, model scales, and budget settings.

Section 4 — Critical assessment

[inferred] Main concerns: the two-stage Observer/Solver design adds inference overhead and an interface bottleneck, and if the captions carry a systematic bias, the downstream Solver will still be led astray. [inferred] The noise in the utility reward is also worth continued attention.
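
One standard way to dampen that noise, offered here purely as an illustration rather than a claim about the paper, is to estimate a caption's utility by averaging the Solver's outcome reward over several samples instead of trusting a single rollout:

```python
def observer_utility(policy, caption, question, verifier, k=4):
    """Monte Carlo estimate of a caption's downstream utility: sample the
    Solver k times and average the verifiable outcome reward. Averaging
    reduces the variance a single Solver rollout would inject into the
    Observer's reward. The k-sample scheme (and every name here) is our
    illustrative assumption, not necessarily what the paper implements."""
    solver_prompt = f"Evidence: {caption}\nQuestion: {question}\nAnswer:"
    rewards = [verifier(policy.generate(solver_prompt)) for _ in range(k)]
    return sum(rewards) / k
```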

[inferred] Another practical issue: the most effective recipe in a paper is often also the heaviest. Before adopting it in production, you need to judge whether the gains cover the extra training budget, inference latency, and maintenance complexity.

[paper] The real strength of this work is that it does not stop at intuition: it decomposes a concrete bottleneck into a method design that is verifiable, comparable, and reusable.

Section 5 — Synthesis

TL;DR

What matters about PRCO is that it moves multimodal RL from outcome-driven fine-tuning toward structured credit assignment. If future MLLMs are to genuinely improve visual evidence extraction, this kind of role division may well become the default design.

Innovation classification

Method advance. [inferred] The value of this work lies less in claiming a brand-new paradigm than in systematically patching a key weakness of an existing direction, backed by fairly credible engineering and experimental support.

Deployment readiness

[inferred] If your work closely matches this paper's task shape, it is already a solid direct reference for designing your next round of experiments; for production or high-stakes settings, you still need stronger robustness checks, budget analysis, and failure-case validation.

Open problems

  • How to refine the perception reward down to the region or token level
  • How to reduce the extra cost of dual-role inference
  • How to transfer the framework to video and interactive multimodal tasks