A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation
WACV 2026 Algorithms Track
Anonymous Submission
Abstract
Evaluating image captions requires jointly assessing visual semantics and language pragmatics, which most existing metrics capture only in part. We introduce Redemption Score (RS), a novel hybrid framework that ranks image captions by triangulating three complementary signals: (1) Mutual Information Divergence (MID) for global image-text distributional alignment, (2) DINO-based perceptual similarity of cycle-generated images for visual grounding, and (3) LLM text embeddings for contextual text similarity against human references. A calibrated fusion of these signals allows RS to offer a more holistic assessment. On the Flickr8k benchmark, RS achieves a Kendall-τ of 58.42, outperforming most prior methods in correlation with human judgments without requiring task-specific training.
Method Overview
MID Component
Captures distributional alignment using Gaussian-assumed CLIP embeddings for global image-text relationships. Provides statistical grounding for image-caption pair evaluation.
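To make the Gaussian assumption concrete, the following is a minimal sketch of mutual information between two sets of embeddings when they are modeled as jointly Gaussian. It is a dataset-level simplification, not the full MID metric, and the function name `gaussian_mutual_information`, the regularizer `eps`, and the use of plain arrays in place of real CLIP embeddings are all assumptions made for illustration.

```python
import numpy as np


def gaussian_mutual_information(x, y, eps=1e-6):
    """Mutual information (in nats) between x and y under a joint Gaussian model.

    x: array of shape (n_samples, dx), e.g. image embeddings (assumed input).
    y: array of shape (n_samples, dy), e.g. text embeddings (assumed input).
    eps: small ridge added to the covariance for numerical stability (assumption).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x.shape[1]

    # Covariance of the stacked (image, text) vector.
    joint = np.hstack([x, y])
    cov = np.cov(joint, rowvar=False) + eps * np.eye(joint.shape[1])

    # Marginal covariances are the diagonal blocks of the joint covariance.
    cov_x = cov[:dx, :dx]
    cov_y = cov[dx:, dx:]

    # For Gaussians: I(X; Y) = 0.5 * (log|Sx| + log|Sy| - log|Sxy|).
    _, logdet_x = np.linalg.slogdet(cov_x)
    _, logdet_y = np.linalg.slogdet(cov_y)
    _, logdet_xy = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_xy)
```

Embeddings that move together yield a large value, while unrelated embeddings yield a value near zero. Using `slogdet` rather than `det` avoids overflow when the embedding dimension is large.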
DINO Similarity
Measures visual grounding through cycle-generated images using self-supervised features. Evaluates visual consistency and object-level alignment.
GTE Embeddings
Evaluates linguistic fidelity with contextual text similarity, handling paraphrases effectively. Captures semantic meaning beyond surface-level text matching.
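The DINO and GTE components both compare embedding vectors, and the three signals are then fused into one score. The sketch below assumes cosine similarity for the embedding comparison and a weighted mean for the fusion; neither choice is confirmed by this page, and the names `cosine_similarity` and `fuse_signals`, the default weights, and the requirement that inputs are already scaled to [0, 1] are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (assumed similarity measure)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return float(np.dot(a, b) / denom)


def fuse_signals(mid, dino_sim, text_sim, weights=(1.0, 1.0, 1.0)):
    """Hypothetical fusion: weighted mean of three signals pre-scaled to [0, 1].

    The actual calibrated fusion used by RS is not specified here; the weights
    below are placeholders, not tuned values.
    """
    signals = np.array([mid, dino_sim, text_sim], dtype=float)
    w = np.array(weights, dtype=float)
    if w.sum() <= 0:
        raise ValueError("weights must have a positive sum")
    return float(np.dot(w, signals) / w.sum())
```

With equal weights the fused score is the plain average of the three signals; calibrating the weights on a held-out set is one way such a fusion could be tuned.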
Key Results
58.42
Kendall-τ on Flickr8k
+2.0%
Improvement over Polos
±0.43%
Standard Deviation
Training-Free
No Task-Specific Training
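For readers unfamiliar with the headline statistic, here is a minimal sketch of Kendall's τ between metric scores and human ratings. It implements the tie-adjusted τ-b variant as an assumption, since this page does not state which variant is used; a figure such as 58.42 is presumably τ multiplied by 100, because τ itself lies in [-1, 1]. The function name `kendall_tau_b` is illustrative.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (O(n^2) pairwise version)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx = x1 - x2
        dy = y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from both denominator terms
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # the pair is ordered the same way in both rankings
        else:
            discordant += 1  # the pair is ordered in opposite ways

    untied_in_x = concordant + discordant + ties_y_only
    untied_in_y = concordant + discordant + ties_x_only
    denom = math.sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else 0.0
```

A value of 1 means the metric orders every caption pair the same way as the human ratings, and -1 means it reverses every pair.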
Cross-Dataset Generalization
The framework demonstrates robust transferability across datasets without parameter retuning:
Conceptual Captions: Consistent performance with fixed parameters optimized on Flickr8k
MS-COCO: Maintained ranking trends across different captioning systems
Multiple Models: Evaluated on BLIP, BLIP-2, MS-GIT, ViT-GPT-2, and Qwen 2.5-VL 7B
Parameter Stability: Single parameter set works across diverse datasets and domains
Qualitative Analysis
Statistical Robustness
A 1000-run bootstrap analysis on Flickr8k indicates that the reported correlation is stable under resampling.
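A bootstrap analysis of this kind can be sketched as follows: resample the evaluation pairs with replacement, recompute the correlation on each resample, and report the spread. This is a minimal sketch under assumptions; the function name `bootstrap_statistic`, the seed, and the pluggable `stat_fn` argument are illustrative and not taken from the paper.

```python
import numpy as np


def bootstrap_statistic(human, metric, stat_fn, n_runs=1000, seed=0):
    """Bootstrap mean and standard deviation of a correlation statistic.

    human, metric: equal-length 1-D arrays of human ratings and metric scores.
    stat_fn: any function (human, metric) -> float, e.g. a Kendall tau routine.
    n_runs: number of bootstrap resamples (1000 matches the analysis described).
    """
    human = np.asarray(human)
    metric = np.asarray(metric)
    if human.shape[0] != metric.shape[0]:
        raise ValueError("human and metric must have the same length")

    rng = np.random.default_rng(seed)
    n = human.shape[0]
    stats = np.empty(n_runs)
    for i in range(n_runs):
        # Draw n indices with replacement so each resample keeps pairs aligned.
        idx = rng.integers(0, n, size=n)
        stats[i] = stat_fn(human[idx], metric[idx])

    # ddof=1 gives the sample standard deviation across resamples.
    return float(stats.mean()), float(stats.std(ddof=1))
```

A small standard deviation across resamples suggests that the measured correlation does not hinge on a handful of caption pairs.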