Everyday Paper
2026-03-30

Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design

The point of this deep research agent is not "search longer" but moving verification forward into three layers: data synthesis, trajectory construction, and test-time scaling. For research-style agents, this is more practical than simply stacking tool-call budget.

Research Agents · arXiv · Zhu Bin, Jia Qianghuai, Lan Tian, Ren Junyang, Gu Feng, Jiang Feihu, Wang Longyue, Xu Zhao, Luo Weihua

Note: this entry is compiled by cross-checking the arXiv abstract, first/last-page PDF extraction, and the alphaXiv overview. It is a quick in-site research note, not a formal review that reproduces the appendix page by page.

Section 0 — Metadata

Title: Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
Authors & affiliations: Zhu Bin, Jia Qianghuai, Lan Tian, Ren Junyang, Gu Feng, Jiang Feihu, Wang Longyue, Xu Zhao, et al.
Venue / status: arXiv preprint, March 2026
Code / data available: only the arXiv and alphaXiv pages are confirmed in the crawled information; additional code repositories have not been individually verified.
Reproducibility signals: [paper] the paper provides the task setup and main experimental results; [inferred] random seeds, significance tests, and full implementation details still need to be checked item by item against the body/appendix.

Section 1 — Problem and motivation

[paper] On open-ended investigation tasks, deep research agents easily accumulate errors layer by layer across data synthesis, trajectory training, and inference-time scaling. [paper] Once verification mechanisms are absent, these errors keep propagating downstream.

[paper] The authors argue that the current bottleneck is not "can't search" but the lack of explicit verification: answers to synthesized questions are not necessarily unique, critique patterns in training trajectories are weak, and reliable self-checking is missing at inference time.

[paper] The abstract states the goal directly: "Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade the overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA Data Synthesis: We introduce verification mechanisms to graph-based and agent-based QA synthesis to control question difficulty while ensuring answers are unique and correct; (2) Trajectory Construction: We design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3) Test-time scaling: We use Marco DeepResearch itself as a verifier at inference time and effectively improve performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, like Tongyi DeepResearch-30B."
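The first level, ensuring synthesized QA pairs have unique and correct answers, can be pictured as an agreement filter over independent solvers. This is a minimal illustration under my own assumptions (hypothetical `solvers` callables), not the paper's actual graph-based or agent-based pipeline:

```python
from collections import Counter
from typing import Callable, List


def keep_qa_pair(
    question: str,
    gold_answer: str,
    solvers: List[Callable[[str], str]],
) -> bool:
    """Keep a synthesized QA pair only if all independent solvers
    converge on the gold answer: a cheap proxy for 'unique and correct'."""
    answers = [solve(question) for solve in solvers]
    counts = Counter(answers)
    # Uniqueness: every solver produced the same answer.
    # Correctness: that single answer matches the gold answer.
    return len(counts) == 1 and answers[0] == gold_answer
```

A pair where two solvers disagree (the question admits multiple plausible answers) is dropped, which is how difficulty control and answer verification interact in this sketch.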

Section 2 — Technical method

[paper] Core contribution: Marco DeepResearch introduces verification-centric design at three levels: the QA data synthesis stage controls difficulty while verifying that answers are unique and correct; the trajectory construction stage injects explicit verification patterns; and the test-time scaling stage has the agent itself act as verifier to improve performance on hard questions.

[paper] Logical increment: the approach here is closer to "put the verifier up front and thread it through the whole pipeline" than to bolting a reranker or self-consistency layer on after the final answer.

[inferred] Complexity and engineering implications: the key cost of this kind of method is not necessarily in FLOPs, but more likely in a longer training pipeline, more rollout/verification steps, or more complex system orchestration. Whether it is worth adopting depends on whether you are optimizing benchmark score, stability, or deployability.

Section 3 — Experimental evidence

[paper] Key evidence: the paper reports significant gains over 8B deep research agents on hard benchmarks such as BrowseComp and BrowseComp-ZH; under a budget of at most 600 tool calls, it even approaches or surpasses some 30B-scale systems.
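The 600-call cap reported here is, operationally, a hard tool-call budget shared by retrieval and verification. A minimal accounting sketch (class and method names are illustrative, not from the paper):

```python
class BudgetExceeded(RuntimeError):
    """Raised when a tool call would push usage past the hard cap."""


class ToolBudget:
    """Track tool calls against a hard cap (the paper's experiments cap at 600)."""

    def __init__(self, max_calls: int = 600):
        self.max_calls = max_calls
        self.used = 0

    def charge(self, n: int = 1) -> None:
        # Refuse the call before executing it, so the cap is never breached.
        if self.used + n > self.max_calls:
            raise BudgetExceeded(f"{self.used + n} > {self.max_calls}")
        self.used += n

    @property
    def remaining(self) -> int:
        return self.max_calls - self.used
```

An agent loop would call `budget.charge()` before each search or verification step and treat `BudgetExceeded` as the signal to stop and answer with what it has, which is why the budget-vs-gain curve flagged below matters.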

[paper] Evidence quality: the experiments cover verification strategies in both training and inference, suggesting the gains do not come from a single-point patch. [paper] At the abstract level, however, the gain curves under different budgets and the distribution of failure modes remain invisible.

[inferred] If you use this paper as a basis for method selection, what deserves the closest re-reading is not the single best number in the abstract, but whether the same trend holds across datasets, model scales, and budget settings.

Section 4 — Critical assessment

[inferred] Main concern: letting the agent act as its own verifier invites confirmation bias; without sufficiently heterogeneous checkers, errors may be self-amplified. [inferred] Meanwhile, a budget on the order of 600 tool calls is not cheap in a real product.

[inferred] Another practical issue: the most effective recipe in a paper is often also the "heaviest". Before deploying, you need to judge whether the gains cover the extra costs in training budget, inference latency, and maintenance complexity.

[paper] The real strength of this work is that it does not stop at intuition: it decomposes a concrete bottleneck into method designs that are verifiable, comparable, and reusable.

Section 5 — Synthesis

TL;DR

Marco DeepResearch argues that the key for research-style agents is not blindly lengthening trajectories but making verification the backbone of both training and inference. It reads more like a quality-control framework for long-horizon tasks.

Innovation classification

Method advance. [inferred] The value of this work lies less in claiming a brand-new paradigm and more in systematically patching the key weaknesses of an existing direction, with fairly credible engineering and experimental support.

Deployment readiness

[inferred] If your work closely matches this paper's task shape, it already serves as a direct reference for your next round of experiment design; but moving into production or high-stakes settings still requires stronger robustness checks, budget analysis, and failure-case validation.

Open problems

  • How to bring in external verifiers to mitigate the self-certification loop
  • How to adaptively decide, under a given budget, when to stop retrieval and verification
  • How to establish more credible quality-evaluation standards for open-ended research