## Overview
Vision-Language Models (VLMs) bridge visual and linguistic representations. The core challenge is aligning heterogeneous modalities — pixel patches and token sequences — into a shared embedding space.
## Alignment Strategies
Contrastive pre-training (CLIP) pulls matching image–text pairs together in embedding space while pushing mismatched pairs apart. This yields semantically aligned embeddings from natural-language supervision alone, without hand-annotated class labels.
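A minimal numpy sketch of the symmetric InfoNCE objective behind this idea (the fixed 0.07 temperature is illustrative; real models typically learn it):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matching pair.
    """
    # L2-normalize so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(logits))      # matching pairs sit on the diagonal

    def cross_entropy(l):
        # Row-wise softmax cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Correctly paired batches should score a much lower loss than shuffled ones, which is exactly the signal that aligns the two encoders during training.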
Projector-based fusion (LLaVA family) encodes images with a frozen vision encoder, then maps the resulting patch tokens into the language model's embedding space via a lightweight projector, typically a one- or two-layer MLP.
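A rough sketch of such a projector, using randomly initialized toy weights and illustrative LLaVA-1.5-like dimensions (576 patch tokens, 1024-dim vision features, 4096-dim LM embeddings; none of these values come from the text above):

```python
import numpy as np

def mlp_projector(patch_tokens, w1, b1, w2, b2):
    """Two-layer MLP that maps vision-encoder patch tokens (N, d_vis)
    into the language model's embedding space (N, d_lm)."""
    hidden = np.maximum(patch_tokens @ w1 + b1, 0.0)  # real models use GELU; ReLU keeps this short
    return hidden @ w2 + b2

# Hypothetical sizes: 576 patch tokens of width 1024, projected into a
# 4096-dim LM embedding space through a 2048-dim hidden layer.
rng = np.random.default_rng(0)
d_vis, d_hidden, d_lm = 1024, 2048, 4096
patches = rng.normal(size=(576, d_vis))
w1, b1 = rng.normal(size=(d_vis, d_hidden)) * 0.02, np.zeros(d_hidden)
w2, b2 = rng.normal(size=(d_hidden, d_lm)) * 0.02, np.zeros(d_lm)

visual_tokens = mlp_projector(patches, w1, b1, w2, b2)  # shape (576, 4096)
```

Because only the projector is trained in the first stage, this adapter is cheap to fit while the vision encoder and LM stay frozen.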
Unified sequence modeling (Flamingo, GPT-4V) conditions the autoregressive language stream on interleaved image–text input, in Flamingo's case via gated cross-attention layers into a frozen LM, enabling rich few-shot visual reasoning.
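One way to picture the interleaved input is as a sketch that splices visual token blocks between text-embedding segments (the segment shapes are invented for illustration; Flamingo itself injects visual context through cross-attention rather than literal concatenation):

```python
import numpy as np

def interleave_tokens(text_segments, image_token_blocks):
    """Build one sequence alternating text segments and visual token
    blocks: t0, v0, t1, v1, ..., t_k, so the model attends over both
    modalities in a single autoregressive stream.
    Expects len(text_segments) == len(image_token_blocks) + 1."""
    pieces = []
    for i, seg in enumerate(text_segments):
        pieces.append(seg)
        if i < len(image_token_blocks):
            pieces.append(image_token_blocks[i])
    return np.concatenate(pieces, axis=0)

# Toy example: two text segments around one image's visual tokens,
# all with an 8-dim embedding width.
text = [np.zeros((3, 8)), np.zeros((2, 8))]
image = [np.ones((4, 8))]
sequence = interleave_tokens(text, image)  # shape (9, 8)
```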
## Representative Models
| Model | Strategy | Year |
|---|---|---|
| CLIP | Contrastive | 2021 |
| Flamingo | Cross-attention | 2022 |
| LLaVA | MLP projector | 2023 |
| InternVL | Hybrid | 2024 |
| Qwen2-VL | Dynamic resolution | 2024 |
## Dynamic Resolution
Early VLMs resized all images to a fixed resolution, discarding fine-grained spatial detail. Dynamic resolution methods (Qwen2-VL, InternVL2) tile or slice images at their native resolution and encode each tile independently, dramatically improving OCR and dense prediction tasks.
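A simplified tiling sketch, assuming a hypothetical 448-pixel tile size and zero-padded borders (actual systems vary in tile size, padding strategy, and thumbnail handling):

```python
import numpy as np

def tile_image(img, tile=448):
    """Slice an (H, W, C) image into fixed-size tiles at native resolution,
    zero-padding the right/bottom borders instead of downscaling the image."""
    H, W, C = img.shape
    pad_h, pad_w = (-H) % tile, (-W) % tile
    padded = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)))
    gh, gw = padded.shape[0] // tile, padded.shape[1] // tile
    tiles = (padded.reshape(gh, tile, gw, tile, C)
                   .swapaxes(1, 2)
                   .reshape(gh * gw, tile, tile, C))
    return tiles, (gh, gw)
```

Each tile is then encoded independently, so small text and fine structure survive at full resolution instead of being averaged away by a global resize.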
## Evaluation Dimensions
- VQA — visual question answering accuracy
- OCR / document understanding — reading text in images
- Spatial reasoning — object counting, relation grounding
- Hallucination rate — fraction of plausible-sounding but incorrect statements
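The hallucination metric in particular reduces to a simple ratio once each generated claim has been judged as grounded or not; the judging step (human annotation or a reference-based check) is the hard part and is elided in this sketch:

```python
def hallucination_rate(claims):
    """claims: list of (statement, grounded) pairs, where `grounded` marks
    whether the statement is actually supported by the image.
    Returns the fraction of ungrounded statements."""
    if not claims:
        return 0.0
    return sum(1 for _, grounded in claims if not grounded) / len(claims)

# Hypothetical judged output for one image caption.
judged = [
    ("a red car", True),
    ("two pedestrians", True),
    ("a stop sign", False),      # not present in the image
    ("rainy weather", False),    # invented detail
]
rate = hallucination_rate(judged)  # 0.5
```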
## Open Challenges
- Faithful grounding: models often generate text unsupported by the image.
- Long-context vision: attending over thousands of patch tokens efficiently.
- Video understanding: temporal reasoning across frames.