Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving

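The paper's core contribution is to replace the "predict first, plan afterwards" pipeline with interleaved, closed-loop generation of world predictions and plans. The driving agent looks into the future as it reasons, step by step, rather than imagining a complete world rollout before it starts to drive.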

Autonomous Driving · arXiv · Liu Qiqi, Xu Huan, Li Jingyu, Sun Bin, Hao Zhihui, She Dangen, Zhu Xiatian, Zhang Li

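Note: this entry was compiled by cross-checking the arXiv abstract, text extracted from the first and last pages of the PDF, and the alphaXiv overview. It is meant as a quick research note for this site, not a formal review that reproduces the appendix page by page.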

Section 0 — Metadata

Field	Value
Title	Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving
Authors & affiliations	Liu Qiqi, Xu Huan, Li Jingyu, Sun Bin, Hao Zhihui, She Dangen, Zhu Xiatian, Zhang Li
Venue / status	arXiv preprint, March 2026
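Code / data available	Only the arXiv and alphaXiv pages are confirmed in the information retrieved so far; additional code repositories have not been individually verified.
Reproducibility signals	[paper] The paper specifies the task setup and the main experimental results; [inferred] random seeds, significance tests, and full implementation details still need to be checked item by item against the main text and appendix.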

Section 1 — Problem and motivation

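[paper] Autonomous driving requires both understanding how the environment evolves and planning actions accordingly. [paper] Conventional world-model approaches typically predict the full future first and then plan on top of that prediction; this open-loop imagination can easily drift from the actual decision process.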

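[paper] Once world prediction and control are decoupled, the "future" seen by the planner behaves like a static input rather than a dynamic process that current decisions keep influencing. [inferred] In complex traffic flow this is especially likely to cause drift.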

[paper] The abstract states the goal directly: Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches typically predict future scenes first and plan afterwards, resulting in open-loop imagination that may drift from the actual decision process. In this paper, we present Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, we incorporate monocular depth information into frames to provide stronger geometric cues for world modeling, improving long-horizon scene prediction. Experiments on the NAVSIM benchmark show that our approach achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.

Section 2 — Technical method

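[paper] Core contribution: Uni-World VLA uses a single vision-language-action model to generate future frame predictions and ego trajectory planning in alternation: predict one step of the future, emit one step of action, then keep rolling forward. [paper] It also incorporates monocular depth as an additional geometric cue to strengthen world modeling.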

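[paper] Logical increment: compared with predict-then-plan, it tightly couples world prediction and action generation into closed-loop interleaving; compared with a plain VLA, it explicitly brings the imagined future into the loop as a planning condition.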
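The difference between the two generation orders can be sketched with a toy car-following example. This is only an illustration under loud assumptions, not the paper's model: a "frame" is reduced to a single number (the gap to a lead vehicle), an "action" is the ego speed for one step, and `world_step`, `plan_step`, `V_LEAD`, and `TARGET_GAP` are hypothetical stand-ins for the learned world model and planner. The sketch measures how far the imagined frames drift from the frames obtained when the planned actions are actually executed.

```python
from typing import List, Tuple

Frame = float   # toy "frame": gap to the lead vehicle (hypothetical stand-in for an image)
Action = float  # toy "action": ego speed for one step (hypothetical stand-in for a waypoint)

V_LEAD = 1.0      # assumed constant lead-vehicle speed
TARGET_GAP = 5.0  # gap the toy planner tries to hold


def world_step(frame: Frame, action: Action) -> Frame:
    """Toy world model: next gap given the current gap and the ego speed."""
    return frame + V_LEAD - action


def plan_step(frame: Frame) -> Action:
    """Toy planner: speed up when the gap is large, clamp speed to [0, 2]."""
    return min(2.0, max(0.0, V_LEAD + 0.5 * (frame - TARGET_GAP)))


def predict_then_plan(frame0: Frame, action0: Action, horizon: int) -> Tuple[List[Frame], List[Action]]:
    """Imagine the whole rollout first (ego keeps action0), then plan on it."""
    frames, f = [], frame0
    for _ in range(horizon):
        f = world_step(f, action0)
        frames.append(f)
    actions = [plan_step(f) for f in frames]
    return frames, actions


def interleaved(frame0: Frame, action0: Action, horizon: int) -> Tuple[List[Frame], List[Action]]:
    """Alternate: imagine one frame, plan one action, feed the action back."""
    frames, actions = [], []
    f, a = frame0, action0
    for _ in range(horizon):
        f = world_step(f, a)  # frame conditioned on the latest action
        a = plan_step(f)      # action conditioned on the imagined frame
        frames.append(f)
        actions.append(a)
    return frames, actions


def replay(frame0: Frame, action0: Action, actions: List[Action]) -> List[Frame]:
    """Frames that result when the planned actions are actually executed."""
    frames, f, a = [], frame0, action0
    for nxt in actions:
        f = world_step(f, a)
        frames.append(f)
        a = nxt
    return frames


def drift(imagined: List[Frame], realized: List[Frame]) -> float:
    """Largest gap between what was imagined and what the plan produces."""
    return max(abs(i - r) for i, r in zip(imagined, realized))


if __name__ == "__main__":
    for name, fn in [("predict-then-plan", predict_then_plan), ("interleaved", interleaved)]:
        frames, actions = fn(9.0, 1.0, 4)
        print(name, drift(frames, replay(9.0, 1.0, actions)))
```

In this toy setting the interleaved rollout has zero drift by construction, because every imagined frame already accounts for the previous action, while the predict-then-plan rollout plans against frames that ignore its own decisions. With a learned world model the drift would not vanish, but the conditioning structure is the same.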

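[inferred] Complexity and engineering implications: the key cost of this kind of method does not necessarily show up in FLOPs alone; it is more likely to appear as a longer training pipeline, more rollout and validation steps, or more complex system orchestration. Whether it is worth adopting depends on whether you are optimizing for benchmark score, stability, or deployability.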

Section 3 — Experimental evidence

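[paper] Key evidence: on NAVSIM, the paper reports competitive closed-loop planning performance while also producing high-fidelity future frame predictions. [paper] The visualizations also indicate that the model maintains fairly consistent future evolution in representative complex scenes.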

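[paper] Evidence quality: judging from the abstract, the authors report results on both the planning side and the frame prediction side, which is more complete than reporting only a control score or only video quality. [inferred] However, real-road deployment, rare events, and robustness to sensor noise are not yet well covered in the material reviewed here.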

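[inferred] If you treat this paper as a basis for method selection, the thing to revisit is not the single best number in the abstract but whether the same trend holds across different datasets, model scales, and budget settings.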

Section 4 — Critical assessment

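[inferred] Main concerns: interleaved generation has a longer compute path, so on-vehicle latency may become the bottleneck. [inferred] In addition, monocular depth is cheap, but its errors can themselves be amplified over the rollout.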
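Both concerns can be made concrete with back-of-the-envelope arithmetic. The sketch below is hypothetical throughout: it assumes the rollout is strictly sequential, that each frame and each action costs a fixed number of decoder passes, and that per-step error follows a simple linear recurrence. None of the numbers or parameter names come from the paper.

```python
from typing import List


def rollout_latency_ms(horizon: int, frame_passes: int, action_passes: int, ms_per_pass: float) -> float:
    """Latency of a strictly sequential interleaved rollout.

    Assumption: every step needs `frame_passes` decoder passes for the frame
    and `action_passes` for the action, and nothing runs in parallel.
    """
    return horizon * (frame_passes + action_passes) * ms_per_pass


def compounded_error(per_step_error: float, gain: float, horizon: int) -> List[float]:
    """Toy error recurrence: err[k+1] = gain * err[k] + per_step_error.

    Assumption: `gain` > 1 models a rollout that amplifies earlier errors,
    and `per_step_error` models fresh depth error injected at each step.
    """
    errs, err = [], 0.0
    for _ in range(horizon):
        err = gain * err + per_step_error
        errs.append(err)
    return errs


if __name__ == "__main__":
    # All values below are made-up placeholders for illustration.
    budget_ms = 100.0
    latency = rollout_latency_ms(horizon=4, frame_passes=8, action_passes=1, ms_per_pass=5.0)
    print("latency (ms):", latency, "within budget:", latency <= budget_ms)
    print("error per step:", compounded_error(per_step_error=0.1, gain=1.2, horizon=4))
```

The point of the sketch is the scaling, not the numbers: latency grows linearly with the horizon and with the passes needed per frame, while error grows geometrically whenever the gain exceeds 1. Measured per-pass latency and an empirical error curve would be needed to replace the placeholders.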

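[inferred] Another practical issue is that the most effective recipe in a paper is often also the heaviest. Before deployment, one first has to judge whether the gains outweigh the extra cost in training budget, inference latency, and maintenance complexity.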

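[paper] The real strength of this work is that it does not stop at intuition: it breaks a concrete bottleneck down into a method design that is verifiable, comparable, and reusable.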

Section 5 — Synthesis

TL;DR

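This work moves driving VLA from "world model assists planning" to "world model and planning roll forward together". For anyone aiming at a more unified end-to-end driving agent, this kind of closed-loop interleaving is well worth continued investment.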

Innovation classification

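Method advance. [inferred] The value of this work lies less in claiming a brand-new paradigm than in systematically closing a key gap in an existing direction, with reasonably credible engineering and experimental support.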

Deployment readiness

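[inferred] If your work closely matches the task setting of this paper, it is already sufficient as a direct reference for designing the next round of experiments; moving into production or high-risk scenarios, however, still requires stronger robustness validation, budget analysis, and failure-case analysis.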

Open problems

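How to bring interleaved planning down to automotive-grade real-time latency
How to keep the closed loop stable under rare events and sensor anomalies
How to integrate further with BEV, multi-camera input, and map priors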