Note: this entry is compiled by cross-referencing the arXiv abstract, first/last-page PDF extraction, and the alphaXiv overview. It is intended as a quick in-site research note, not a formal review that reproduces the appendix page by page.
Section 0 — Metadata
| Field | Value |
|---|---|
| Title | VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward |
| Authors & affiliations | An Zhaochong, Kupyn Orest, Uscidda Théo, Colaco Andrea, Ahuja Karan, Belongie Serge, Gonzalez-Franco Mar, Gazulla Marta Tintore |
| Venue / status | arXiv preprint, March 2026 |
| Code / data available | Only the arXiv and alphaXiv pages are confirmed from the crawled information; any additional code repository has not been individually verified. |
| Reproducibility signals | [paper] The paper reports the task setup and main experimental results; [inferred] random seeds, significance tests, and full implementation details still need to be checked item by item against the body text and appendix. |
Section 1 — Problem and motivation
[paper] Large-scale video diffusion models can already generate visually impressive videos, but geometric consistency remains a weak point: camera jitter, structural drift, and cross-frame geometric incoherence all break the sense that the generated world is real and continuous. [paper]
[paper] Existing approaches typically take one of two routes: modifying the generator architecture, or applying geometry-aware alignment in RGB space. [paper] The former tends to hurt the generalization of internet-scale pretrained models, while the latter relies on repeated VAE decoding, which is computationally expensive and handles dynamic scenes poorly.
[paper] The direct goal, as stated in the paper's abstract:

> Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
Section 2 — Technical method
[paper] Core contribution: VGGRPO keeps the original generator intact and only introduces a Latent Geometry Model (LGM) that maps video diffusion latents directly to a geometry foundation model; GRPO is then performed in latent space, using a camera motion smoothness reward and a geometry reprojection consistency reward for post-training. [paper]
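The GRPO step above can be sketched as follows. This is a minimal illustration of GRPO's group-relative advantage computation, not the paper's implementation; how rewards are produced (by the LGM-based reward functions) and how advantages feed into the policy update are outside this sketch.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each rollout's reward is normalized
    against the mean/std of its own group of rollouts for the same
    prompt -- the core idea of Group Relative Policy Optimization."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four video rollouts for one prompt, each scored by a
# latent-space geometry reward. Above-average rollouts get positive
# advantages, below-average ones negative.
adv = grpo_advantages([0.2, 0.5, 0.9, 0.4])
```

The normalization means only the relative ranking within a group matters, which makes the scale of the geometry rewards less critical than their ordering.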
[paper] Logical increment: there are two key deltas. First, reward computation moves from RGB space into latent space; second, by building the LGM on a geometry model with 4D reconstruction capability, the consistency constraint extends naturally to dynamic scenes. [paper]
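The two rewards named above can be given a plausible minimal form. Both function bodies here are hypothetical sketches: the paper does not publish these exact formulas, and the input shapes (camera positions and per-frame world-space point maps decoded by the LGM) are assumptions for illustration.

```python
import numpy as np

def camera_smoothness_reward(positions):
    """Penalize jittery trajectories via second differences (discrete
    acceleration) of the camera path. A perfectly linear camera move
    scores 0; jitter makes the reward more negative. Hypothetical form."""
    p = np.asarray(positions, dtype=np.float64)       # shape (T, 3)
    accel = p[2:] - 2.0 * p[1:-1] + p[:-2]            # second differences
    return -np.linalg.norm(accel, axis=1).mean()

def reprojection_consistency_reward(world_points):
    """Cross-view geometric coherence: the same scene points, expressed
    in world coordinates from each frame's reconstruction, should agree
    across frames. Scored as negative mean deviation from the per-point
    temporal mean. Hypothetical form."""
    w = np.asarray(world_points, dtype=np.float64)    # shape (T, N, 3)
    dev = w - w.mean(axis=0, keepdims=True)
    return -np.linalg.norm(dev, axis=2).mean()

smooth = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]]
jitter = [[0, 0, 0], [1, 0.3, 0], [2, -0.3, 0], [3, 0.2, 0]]
r_smooth = camera_smoothness_reward(smooth)
r_jitter = camera_smoothness_reward(jitter)
```

The design point is that both rewards can be computed from geometry decoded directly out of the latent space, so no VAE decode to RGB is needed inside the reward loop.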
[inferred] Complexity and engineering implications: the dominant cost of methods like this does not necessarily show up in FLOPs; it is more likely to appear as a longer training pipeline, more rollout/verification steps, or more complex system orchestration. Whether adoption is worthwhile depends on whether you are optimizing for benchmark score, stability, or deployability.
Section 3 — Experimental evidence
[paper] Key evidence: the paper reports simultaneous gains in camera stability, geometry consistency, and overall quality on both static and dynamic benchmarks, while eliminating costly VAE decoding. [paper] Figures further show the latent geometry model is more robust to noise perturbations than RGB-based geometry models. [paper]
[paper] Evidence quality: work of this kind often looks good only on a single visual metric, whereas this paper at least covers both static/dynamic scenes and both geometry/motion objectives. [paper] That said, at the abstract level there is no visibility into finer-grained significance testing or failure-case statistics. [paper]
[inferred] If this paper is used as a basis for method selection, what most deserves a second look is not the single best number in the abstract, but whether the same trends hold across datasets, model scales, and budget settings.
Section 4 — Critical assessment
[inferred] Main concerns: the method's ceiling depends on the quality of the LGM itself; if the latent-to-geometry mapping is unstable, the entire reward signal can be skewed. [inferred] In addition, over-emphasizing geometric consistency may suppress some generative diversity or stylistic freedom. [inferred]
[inferred] Another practical issue is that the most effective recipe in a paper is often also the heaviest. For real deployment, one must first judge whether the gains cover the extra cost in training budget, inference latency, and maintenance complexity.
[paper] The genuine strength of this work is that it does not stop at intuition: it decomposes a concrete bottleneck into a method design that is verifiable, comparable, and reusable.
Section 5 — Synthesis
TL;DR
This paper shows that "world consistency" in video generation can be coaxed out more efficiently with latent-space RL, without heavily modifying the main model architecture. It acts more like a geometric calibration layer than a rebuilt video model.
Innovation classification
Method advance. [inferred] The value of this work lies less in claiming a brand-new paradigm than in systematically closing key gaps in an existing direction, backed by fairly credible engineering and experimental support.
Deployment readiness
[inferred] If your work closely matches this paper's task setting, it already serves as a direct reference for the next round of experiment design; but before moving into production or high-stakes scenarios, stronger robustness checks, budget analysis, and failure-case validation are still needed.
Open problems
- How to jointly optimize geometric consistency and appearance diversity
- How to extend the 4D reward to longer videos and complex interaction scenes
- How to couple the latent geometry reward with downstream embodied tasks