2026-04-07

Vision-Language Models

Models that jointly encode images and text, enabling zero-shot transfer, visual question answering, and image-conditioned generation.


Overview

Vision-Language Models (VLMs) bridge visual and linguistic representations. The core challenge is aligning heterogeneous modalities — pixel patches and token sequences — into a shared embedding space.

Alignment Strategies

Contrastive pre-training (CLIP) pulls matching image–text pairs together while pushing mismatched pairs apart. This creates semantically aligned embeddings without requiring manually annotated class labels.
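The contrastive objective can be sketched as a symmetric cross-entropy over a batch similarity matrix, where each image's matching caption sits on the diagonal. A minimal NumPy version (temperature value is illustrative; CLIP learns it):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch); matches on the diagonal
    n = logits.shape[0]

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pushes diagonal (matched) similarities up and off-diagonal (mismatched) similarities down, which is what yields the aligned embedding space.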

Projector-based fusion (LLaVA family) encodes images with a frozen vision encoder, then projects the resulting patch tokens into a language model's token space via a lightweight MLP or cross-attention adapter.
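The projector itself is small. A sketch of a LLaVA-1.5-style two-layer MLP with GELU, using made-up dimensions (1024-d ViT patch tokens, 4096-d LLM embeddings) and random untrained weights for illustration:

```python
import numpy as np

# Hypothetical dimensions for illustration only
VIT_DIM, LLM_DIM, HIDDEN = 1024, 4096, 4096

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(VIT_DIM, HIDDEN))
W2 = rng.normal(scale=0.02, size=(HIDDEN, LLM_DIM))

def project_patches(patch_tokens):
    """Map frozen-encoder patch tokens into the LLM's embedding space.

    patch_tokens: (num_patches, VIT_DIM) array from the vision encoder.
    Returns (num_patches, LLM_DIM) vectors ready to splice into the
    language model's input sequence.
    """
    h = patch_tokens @ W1
    # tanh approximation of GELU
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ W2

patches = rng.normal(size=(576, VIT_DIM))  # e.g. a 24x24 patch grid
print(project_patches(patches).shape)      # (576, 4096)
```

In training, only W1 and W2 (and usually the LLM) are updated; the vision encoder stays frozen, which is what keeps this approach cheap.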

Unified sequence modeling (Flamingo, GPT-4V) interleaves visual tokens directly into the autoregressive stream, enabling rich few-shot visual reasoning.
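The interleaving step can be sketched as replacing image-placeholder tokens in the text stream with that image's projected patch embeddings. The placeholder id and helper names below are invented for illustration:

```python
IMAGE_TOKEN = -1  # hypothetical placeholder id standing in for one image

def interleave(text_ids, image_embs, embed_text):
    """Build the mixed input sequence for an autoregressive VLM.

    text_ids:   list of int token ids, with IMAGE_TOKEN where images go
    image_embs: list of (num_patches, dim) arrays, one per placeholder
    embed_text: function mapping a token id to a (dim,) embedding vector
    Returns a flat list of per-position embedding vectors.
    """
    seq, img_iter = [], iter(image_embs)
    for tok in text_ids:
        if tok == IMAGE_TOKEN:
            seq.extend(next(img_iter))  # visual tokens enter the stream here
        else:
            seq.append(embed_text(tok))
    return seq
```

Because visual and textual positions share one sequence, the model can attend from a question back to specific image regions, which is what enables few-shot visual reasoning from interleaved prompts.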

Representative Models

Model      Strategy            Year
CLIP       Contrastive         2021
Flamingo   Cross-attention     2022
LLaVA      MLP projector       2023
InternVL   Hybrid              2024
Qwen2-VL   Dynamic resolution  2024

Dynamic Resolution

Early VLMs resized all images to a fixed resolution, discarding fine-grained spatial detail. Dynamic resolution methods (Qwen2-VL, InternVL2) tile or slice images at their native resolution and encode each tile independently, dramatically improving OCR and dense prediction tasks.
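The tiling step can be sketched as ceil-dividing the image into a grid and zero-padding the borders so every tile has the encoder's expected size (the 448-pixel tile size here is illustrative):

```python
import numpy as np

def tile_image(img, tile=448):
    """Split an image into fixed-size tiles at native resolution.

    img: (H, W, C) array. Borders are zero-padded so each tile is complete.
    Returns a list of (tile, tile, C) arrays and the (rows, cols) grid shape.
    """
    H, W, C = img.shape
    rows, cols = -(-H // tile), -(-W // tile)  # ceil division
    padded = np.zeros((rows * tile, cols * tile, C), dtype=img.dtype)
    padded[:H, :W] = img
    tiles = [padded[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
             for r in range(rows) for c in range(cols)]
    return tiles, (rows, cols)
```

Each tile is then encoded independently (often alongside a downscaled global thumbnail for context), so small text and fine structure survive instead of being blurred away by a single global resize.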

Evaluation Dimensions

  • VQA — visual question answering accuracy
  • OCR / document understanding — reading text in images
  • Spatial reasoning — object counting, relation grounding
  • Hallucination rate — fraction of plausible-sounding but incorrect statements
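Two of these dimensions reduce to simple ratios once per-item judgments exist. A sketch, assuming exact-match scoring for VQA (real VQA benchmarks average agreement across annotators) and externally supplied support judgments for hallucination (from human raters or an automated checker):

```python
def vqa_accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answer
    (case- and whitespace-insensitive; a simplification of VQA scoring)."""
    return sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds)) / len(golds)

def hallucination_rate(judgments):
    """judgments: one bool per generated claim, True if the claim is
    actually supported by the image. Returns the unsupported fraction."""
    return sum(not ok for ok in judgments) / len(judgments)
```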

Open Challenges

  • Faithful grounding: models often generate text unsupported by the image.
  • Long-context vision: attending over thousands of patch tokens efficiently.
  • Video understanding: temporal reasoning across frames.