Note: this entry is compiled by cross-referencing the arXiv abstract, first/last-page PDF extraction, and the alphaXiv overview. It is intended as a quick in-site research note, not a formal review that reproduces the appendix page by page.
Section 0 — Metadata
| Field | Value |
|---|---|
| Title | VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward |
| Authors & affiliations | An Zhaochong, Kupyn Orest, Uscidda Théo, Colaco Andrea, Ahuja Karan, Belongie Serge, Gonzalez-Franco Mar, Gazulla Marta Tintore |
| Venue / status | arXiv preprint, March 2026 |
| Code / data available | Only the arXiv and alphaXiv pages are confirmed from the crawled information; any additional code repository has not been individually verified. |
| Reproducibility signals | [paper] The paper reports the task setup and main experimental results; [inferred] random seeds, significance tests, and full implementation details still need to be checked item by item against the body text and appendix. |
Section 1 — Problem and motivation
[paper] Large-scale video diffusion models can already generate visually impressive videos, but geometric consistency remains a weak point: camera jitter, structural drift, and cross-frame geometric incoherence all break the sense that the generated world is real and continuous. [paper]
[paper] Existing approaches typically take one of two routes: modifying the generator architecture, or applying geometry-aware alignment in RGB space. [paper] The former tends to hurt the generalization of internet-scale pretrained models, while the latter relies on repeated VAE decoding, which is computationally expensive and handles dynamic scenes poorly.
[paper] The direct goal, as stated in the paper's abstract:

> Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
Section 2 — Technical method
[paper] Core contribution: VGGRPO keeps the original generator intact and only introduces a Latent Geometry Model (LGM) that maps video diffusion latents directly to a geometry foundation model; GRPO is then performed in latent space, using a camera motion smoothness reward and a geometry reprojection consistency reward for post-training. [paper]
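The GRPO step above can be sketched as follows. This is a minimal illustration of GRPO's group-relative advantage computation, not the paper's implementation; how rewards are produced (by the LGM-based reward functions) and how advantages feed into the policy update are outside this sketch.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each rollout's reward is normalized
    against the mean/std of its own group of rollouts for the same
    prompt -- the core idea of Group Relative Policy Optimization."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four video rollouts for one prompt, each scored by a
# latent-space geometry reward. Above-average rollouts get positive
# advantages, below-average ones negative.
adv = grpo_advantages([0.2, 0.5, 0.9, 0.4])
```

The normalization means only the relative ranking within a group matters, which makes the scale of the geometry rewards less critical than their ordering.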
[paper] Logical increment: there are two key deltas. First, reward computation moves from RGB space into latent space; second, by building the LGM on a geometry model with 4D reconstruction capability, the consistency constraint extends naturally to dynamic scenes. [paper]
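The two rewards named above can be given a plausible minimal form. Both function bodies here are hypothetical sketches: the paper does not publish these exact formulas, and the input shapes (camera positions and per-frame world-space point maps decoded by the LGM) are assumptions for illustration.

```python
import numpy as np

def camera_smoothness_reward(positions):
    """Penalize jittery trajectories via second differences (discrete
    acceleration) of the camera path. A perfectly linear camera move
    scores 0; jitter makes the reward more negative. Hypothetical form."""
    p = np.asarray(positions, dtype=np.float64)       # shape (T, 3)
    accel = p[2:] - 2.0 * p[1:-1] + p[:-2]            # second differences
    return -np.linalg.norm(accel, axis=1).mean()

def reprojection_consistency_reward(world_points):
    """Cross-view geometric coherence: the same scene points, expressed
    in world coordinates from each frame's reconstruction, should agree
    across frames. Scored as negative mean deviation from the per-point
    temporal mean. Hypothetical form."""
    w = np.asarray(world_points, dtype=np.float64)    # shape (T, N, 3)
    dev = w - w.mean(axis=0, keepdims=True)
    return -np.linalg.norm(dev, axis=2).mean()

smooth = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]]
jitter = [[0, 0, 0], [1, 0.3, 0], [2, -0.3, 0], [3, 0.2, 0]]
r_smooth = camera_smoothness_reward(smooth)
r_jitter = camera_smoothness_reward(jitter)
```

The design point is that both rewards can be computed from geometry decoded directly out of the latent space, so no VAE decode to RGB is needed inside the reward loop.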
[inferred] Complexity and engineering implications: the dominant cost of methods like this does not necessarily show up in FLOPs; it is more likely to appear as a longer training pipeline, more rollout/verification steps, or more complex system orchestration. Whether adoption is worthwhile depends on whether you are optimizing for benchmark score, stability, or deployability.
Section 3 — Experimental evidence
[paper] Key evidence: the paper reports simultaneous gains in camera stability, geometry consistency, and overall quality on both static and dynamic benchmarks, while eliminating costly VAE decoding. [paper] Figures further show the latent geometry model is more robust to noise perturbations than RGB-based geometry models. [paper]
[paper] Evidence quality: work of this kind often looks good only on a single visual metric, whereas this paper at least covers both static/dynamic scenes and both geometry/motion objectives. [paper] That said, at the abstract level there is no visibility into finer-grained significance testing or failure-case statistics. [paper]
[inferred] If this paper is used as a basis for method selection, what most deserves a second look is not the single best number in the abstract, but whether the same trends hold across datasets, model scales, and budget settings.
Section 4 — Critical assessment
[inferred] Main concerns: the method's ceiling depends on the quality of the LGM itself; if the latent-to-geometry mapping is unstable, the entire reward signal can be skewed. [inferred] In addition, over-emphasizing geometric consistency may suppress some generative diversity or stylistic freedom. [inferred]
[inferred] Another practical issue is that the most effective recipe in a paper is often also the heaviest. For real deployment, one must first judge whether the gains cover the extra cost in training budget, inference latency, and maintenance complexity.
[paper] The genuine strength of this work is that it does not stop at intuition: it decomposes a concrete bottleneck into a method design that is verifiable, comparable, and reusable.
Section 5 — Synthesis
TL;DR
This paper shows that "world consistency" in video generation can be coaxed out more efficiently with latent-space RL, without heavily modifying the main model architecture. It acts more like a geometric calibration layer than a rebuilt video model.
Innovation classification
Method advance. [inferred] The value of this work lies less in claiming a brand-new paradigm than in systematically closing key gaps in an existing direction, backed by fairly credible engineering and experimental support.
Deployment readiness
[inferred] If your work closely matches this paper's task setting, it already serves as a direct reference for the next round of experiment design; but before moving into production or high-stakes scenarios, stronger robustness checks, budget analysis, and failure-case validation are still needed.
Open problems
- How to jointly optimize geometric consistency and appearance diversity
- How to extend the 4D reward to longer videos and complex interaction scenes
- How to couple the latent geometry reward with downstream embodied tasks