Redemption Score

A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation

WACV 2026 Algorithms Track
Anonymous Submission

Abstract

Evaluating image captions requires a cohesive assessment of both visual semantics and language pragmatics, which most existing metrics capture only partially. We introduce Redemption Score (RS), a hybrid framework that ranks image captions by triangulating three complementary signals: (1) Mutual Information Divergence (MID) for global image-text distributional alignment, (2) DINO-based perceptual similarity of cycle-generated images for visual grounding, and (3) LLM text embeddings for contextual similarity against human references. A calibrated fusion of these signals allows RS to offer a more holistic assessment. On the Flickr8k benchmark, RS achieves a Kendall-τ of 58.42, outperforming most prior methods and demonstrating superior correlation with human judgments without requiring task-specific training.

Method Overview

Overview of Redemption Score calculation showing the three components (MID, DINO, GTEScore) and their fusion
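
The calibrated fusion itself is detailed in the paper; as a minimal illustration only, a linear combination of the three normalized signals might look like the sketch below (the uniform weights w_mid, w_dino, and w_gte are placeholders, not the paper's calibrated values):

```python
import numpy as np

def redemption_score(mid, dino_sim, gte_sim,
                     w_mid=1/3, w_dino=1/3, w_gte=1/3):
    """Linear fusion of the three signals into one caption score.

    Inputs are assumed pre-normalized to a comparable range (e.g.,
    min-max scaled over the candidate pool). The uniform weights are
    placeholders, not the paper's calibrated values.
    """
    return w_mid * mid + w_dino * dino_sim + w_gte * gte_sim

# Rank candidate captions from best to worst by fused score.
def rank_captions(scores):
    return np.argsort(scores)[::-1]
```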

MID Component

Captures distributional alignment using Gaussian-assumed CLIP embeddings for global image-text relationships. Provides statistical grounding for image-caption pair evaluation.
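
MID builds on mutual information between images and text under a Gaussian assumption on CLIP embeddings. As a rough, simplified sketch of that idea (not the metric's exact pointwise formulation), the Gaussian mutual-information estimate over paired CLIP embeddings can be computed as follows:

```python
import numpy as np

def gaussian_mutual_information(img_emb, txt_emb, eps=1e-4):
    """Gaussian mutual-information estimate over paired embeddings.

    img_emb, txt_emb: (N, d) arrays of paired CLIP image/text
    embeddings, with N comfortably larger than 2*d. Under a joint
    Gaussian assumption, I(X; Y) = 0.5 * (log det Sx + log det Sy
    - log det Sxy). eps regularizes the covariance estimate.
    """
    d = img_emb.shape[1]
    joint = np.concatenate([img_emb, txt_emb], axis=1)    # (N, 2d)
    cov = np.cov(joint, rowvar=False) + eps * np.eye(2 * d)
    _, logdet_x = np.linalg.slogdet(cov[:d, :d])          # image block
    _, logdet_y = np.linalg.slogdet(cov[d:, d:])          # text block
    _, logdet_joint = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_joint)
```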

DINO Similarity

Measures visual grounding through cycle-generated images using self-supervised features. Evaluates visual consistency and object-level alignment.
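
A minimal sketch of the similarity step, assuming the candidate caption has already been passed through some text-to-image generator to produce generated_img (the generator choice and the DINO ViT-S/16 backbone here are assumptions, not necessarily the paper's configuration):

```python
import torch
from PIL import Image
from torchvision import transforms

# Self-supervised DINO ViT-S/16 backbone from the official repo.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406),
                         (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def dino_similarity(real_img: Image.Image, generated_img: Image.Image) -> float:
    """Cosine similarity between DINO features of the original image
    and an image generated back from the candidate caption."""
    batch = torch.stack([preprocess(real_img), preprocess(generated_img)])
    feats = torch.nn.functional.normalize(dino(batch), dim=-1)
    return float(feats[0] @ feats[1])
```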

GTE Embeddings

Evaluates linguistic fidelity with contextual text similarity, handling paraphrases effectively. Captures semantic meaning beyond surface-level text matching.
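
A minimal sketch of reference-based text similarity with a GTE-family embedding model (the thenlper/gte-large checkpoint and the max-over-references aggregation are assumptions here, not necessarily the paper's choices):

```python
from sentence_transformers import SentenceTransformer, util

# GTE-family embedding model; checkpoint choice is an assumption.
model = SentenceTransformer("thenlper/gte-large")

def gte_score(candidate: str, references: list[str]) -> float:
    """Max cosine similarity between the candidate caption and the
    human reference captions in embedding space."""
    embs = model.encode([candidate] + references,
                        convert_to_tensor=True,
                        normalize_embeddings=True)
    sims = util.cos_sim(embs[:1], embs[1:])   # (1, num_references)
    return float(sims.max())
```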

Key Results

Comparison with other metrics
  • 58.42: Kendall-τ on Flickr8k
  • +2.0%: Improvement over Polos
  • ±0.43%: Standard deviation
  • Training-free: No task-specific training

Cross-Dataset Generalization

The framework demonstrates robust transferability across datasets without parameter retuning:

  • Conceptual Captions: Consistent performance with fixed parameters optimized on Flickr8k
  • MS-COCO: Maintained ranking trends across different captioning systems
  • Multiple Models: Evaluated on BLIP, BLIP-2, MS-GIT, ViT-GPT-2, and Qwen 2.5-VL 7B
  • Parameter Stability: Single parameter set works across diverse datasets and domains

Qualitative Analysis

Complementary failure modes showing how individual components fail while RS succeeds

Statistical Robustness

A 1000-run bootstrap analysis on Flickr8k demonstrates strong stability and reliability (a minimal resampling sketch follows the statistics below):

  • 58.4%: Mean Kendall-τ
  • 0.43%: Standard deviation
  • [57.6, 59.2]: 95% confidence interval
  • Most stable: Among all metrics
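
A minimal sketch of how such a bootstrap over metric-human score pairs might be run (resampling pairs with replacement is an assumption about the protocol):

```python
import numpy as np
from scipy.stats import kendalltau

def bootstrap_kendall_tau(metric_scores, human_scores,
                          n_runs=1000, seed=0):
    """Bootstrap mean, std, and 95% CI of Kendall-tau between a
    metric's scores and paired human judgments."""
    rng = np.random.default_rng(seed)
    metric_scores = np.asarray(metric_scores)
    human_scores = np.asarray(human_scores)
    n = len(metric_scores)
    taus = np.empty(n_runs)
    for i in range(n_runs):
        idx = rng.integers(0, n, size=n)   # resample pairs with replacement
        taus[i], _ = kendalltau(metric_scores[idx], human_scores[idx])
    lo, hi = np.percentile(taus, [2.5, 97.5])
    return taus.mean(), taus.std(), (lo, hi)
```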