Seek-and-Solve: Benchmarking MLLMs for
Visual Clue-Driven Reasoning in Daily Scenarios

Xiaomin Li1,*, Tala Wang1,*, Zichen Zhong1,*, Ying Zhang2, Zirui Zheng1, Takashi Isobe3, Dezhuang Li1, Huchuan Lu1, You He3, Xu Jia1,†
1Dalian University of Technology, 2WeChat, Tencent Inc., 3Tsinghua University
*Equal contribution.
†Corresponding author.
DailyClue Overview

DailyClue is a benchmark designed for visual clue-driven reasoning in daily scenarios.

Abstract

Daily scenarios are visually rich, requiring Multimodal Large Language Models (MLLMs) to filter out noise and identify the decisive visual clues needed for accurate reasoning. Yet current benchmarks predominantly evaluate MLLMs' pre-existing knowledge or perceptual understanding, often neglecting this critical reasoning capability.

To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks.

Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.

Data Construction Pipeline

We employ a carefully controlled generation and filtering pipeline to ensure high-quality, reasoning-intensive questions. The process involves candidate triplet generation by top-tier MLLMs, followed by multi-model verification and rigorous human cross-checks.

[Figure: DailyClue data construction pipeline]
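To make the flow above concrete, the sketch below shows how a generate-then-verify loop of this kind could be orchestrated. It is an illustration under stated assumptions only: `query_mllm`, the prompt wording, and the two-vote threshold are hypothetical placeholders, not the actual construction code.

```python
# Minimal sketch of a generate-then-verify pipeline (illustrative only).
# `query_mllm` stands in for a real MLLM API call; function names, prompts,
# and thresholds are hypothetical, not the benchmark's implementation.
from dataclasses import dataclass

@dataclass
class Triplet:
    image_id: str
    question: str
    answer: str

def query_mllm(model: str, prompt: str, image_id: str) -> str:
    """Placeholder for an actual MLLM API call."""
    raise NotImplementedError

def generate_candidate(image_id: str, generator: str = "top-tier-mllm") -> Triplet:
    prompt = (
        "Write a question about this daily-life image that cannot be answered "
        "by surface-level recognition alone, plus its answer. "
        "Format: QUESTION: ... ANSWER: ..."
    )
    raw = query_mllm(generator, prompt, image_id)
    q, a = raw.split("ANSWER:", 1)
    return Triplet(image_id, q.replace("QUESTION:", "").strip(), a.strip())

def multi_model_verify(t: Triplet, verifiers: list[str], min_agree: int = 2) -> bool:
    """Keep a candidate only if enough independent verifier models accept it."""
    votes = 0
    for v in verifiers:
        verdict = query_mllm(
            v,
            f"Question: {t.question}\nProposed answer: {t.answer}\n"
            "Is the answer correct, and does the question require reasoning "
            "over visual clues rather than simple recognition? Reply YES or NO.",
            t.image_id,
        )
        votes += verdict.strip().upper().startswith("YES")
    return votes >= min_agree

def build_dataset(image_ids: list[str], verifiers: list[str]) -> list[Triplet]:
    kept = []
    for image_id in image_ids:
        candidate = generate_candidate(image_id)
        if multi_model_verify(candidate, verifiers):
            kept.append(candidate)  # surviving triplets still go to human cross-checking
    return kept
```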

Reasoning Examples

DailyClue features four daily-life scenarios across 16 reasoning subtasks.

Performance Comparison

Performance comparison on the DailyClue benchmark. All metrics are accuracy (%). Medal icons mark the three best models by overall accuracy.

| Model | Overall | Location Identification | Spatial Relationship | Daily Commonsense | Scientific Commonsense |
|---|---|---|---|---|---|
| Open-source MLLMs | | | | | |
| LLaVA-OneVision-7B | 24.47 | 10.50 | 34.97 | 25.56 | 31.71 |
| LLaVA-OneVision-72B | 33.18 | 15.50 | 47.85 | 33.33 | 42.28 |
| LLaVA-OneVision-1.5-8B-Instruct | 29.43 | 10.50 | 47.85 | 27.78 | 38.21 |
| InternVL3-8B | 31.08 | 13.50 | 31.67 | 31.67 | 41.46 |
| InternVL3-38B | 36.94 | 17.00 | 47.85 | 47.22 | 39.84 |
| InternVL3-78B | 40.84 | 18.00 | 54.60 | 52.78 | 42.28 |
| InternVL-3.5-38B | 36.91 | 14.00 | 49.69 | 43.33 | 43.90 |
| Qwen2.5-VL-7B | 30.63 | 15.00 | 39.88 | 37.22 | 34.15 |
| Qwen2.5-VL-32B | 35.59 | 21.50 | 42.94 | 42.78 | 38.21 |
| Qwen2.5-VL-72B | 40.84 | 24.50 | 47.85 | 48.33 | 47.15 |
| Qwen3-VL-235B-A22B-Thinking | 44.59 | 23.00 | 49.08 | 56.67 | 56.10 |
| Qwen3-VL-235B-A22B-Instruct | 40.69 | 22.50 | 46.63 | 50.00 | 48.78 |
| Closed-source MLLMs | | | | | |
| 🥉 Gemini-2.5-Flash | 50.00 | 32.50 | 55.83 | 59.44 | 56.91 |
| 🥇 Gemini-2.5-Pro | 56.90 | 41.50 | 61.35 | 62.77 | 67.48 |
| Claude-3.7-Sonnet | 41.14 | 18.50 | 57.06 | 47.22 | 47.97 |
| Claude-Sonnet-4 | 41.74 | 22.00 | 52.15 | 48.89 | 49.59 |
| Claude-Sonnet-4.5 | 41.74 | 21.00 | 53.99 | 49.44 | 47.97 |
| o4-mini | 47.00 | 25.50 | 58.28 | 58.33 | 50.41 |
| 🥈 GPT-5 | 50.90 | 38.00 | 57.67 | 51.67 | 61.79 |
| Agentic Models | | | | | |
| DeepEyes-7B | 30.93 | 18.50 | 44.17 | 30.00 | 34.96 |
| VLM-R3 | 33.18 | 19.00 | 42.33 | 36.11 | 39.84 |
| TreeVGR-7B | 27.78 | 14.00 | 40.49 | 27.18 | 33.33 |
| REVPT | 25.83 | 6.50 | 38.04 | 32.22 | 31.71 |
| Thyme | 46.25 | 69.00 | 42.33 | 29.44 | 39.02 |
| PyVision | 39.48 | 18.50 | 47.23 | 48.33 | 50.40 |
| Human Baseline | | | | | |
| Human Baseline | 45.50 | 19.33 | 70.67 | 40.00 | 52.00 |
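For reference, accuracy here can be read as the percentage of questions answered correctly, aggregated per subtask and overall. The snippet below is a generic scoring sketch with hypothetical record fields (task, pred, gold), not the benchmark's official evaluation script.

```python
# Generic accuracy aggregation (illustrative; record fields are hypothetical).
from collections import defaultdict

def score(records: list[dict]) -> dict[str, float]:
    """records: [{"task": "Location Identification", "pred": "B", "gold": "B"}, ...]"""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        hit = r["pred"].strip().upper() == r["gold"].strip().upper()
        for key in (r["task"], "Overall"):
            correct[key] += hit
            total[key] += 1
    # Return accuracy in percent for each subtask and the overall split.
    return {k: 100.0 * correct[k] / total[k] for k in total}
```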

Further Insights: The Power of Visual Clues

Insight 1: Probing MLLMs with Explicit Clues

Despite strong reasoning capabilities, MLLMs' performance is heavily bottlenecked by the failure to capture critical visual semantics. Providing Ground Truth (GT) clues yields substantial gains. However, models are vulnerable to visual sycophancy—misleading textual clues can override their own visual judgment and induce hallucinations.

| Evaluated Model | Clue from Qwen2.5-VL-72B | Clue from Claude-3.7 | Clue from Gemini-2.5-Pro | GT Clue | No Clue |
|---|---|---|---|---|---|
| Qwen2.5-VL-72B | 40.09 (-0.75) | 41.29 (+0.45) | 48.80 (+7.96) | 51.50 (+10.66) | 40.84 |
| Claude-3.7 | 42.49 (+1.35) | 43.39 (+2.25) | 51.95 (+10.81) | 56.00 (+14.86) | 41.14 |
| Gemini-2.5-Pro | 52.85 (-4.05) | 53.15 (-3.75) | 55.26 (-1.64) | 58.55 (+1.65) | 56.90 |

Parenthesized values are changes relative to the No Clue baseline.
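As a minimal sketch of how the clue-source conditions above might be realized, the helper below prepends a textual clue (produced by another model or taken from the GT annotation) to the question prompt. The `build_prompt` helper and its wording are assumptions for illustration, not the exact probing prompt used in the study.

```python
# Illustrative clue-injection prompt builder (hypothetical wording).
def build_prompt(question: str, clue: str | None = None) -> str:
    parts = []
    if clue is not None:
        # The clue may come from another MLLM or from the GT annotation;
        # a misleading clue is what can trigger "visual sycophancy".
        parts.append(f"Hint: a possibly relevant visual clue is: {clue}")
    parts.append(f"Question: {question}")
    parts.append("Answer based on the image; verify any hint against what you actually see.")
    return "\n".join(parts)

# Example: a "GT Clue"-style condition vs. the "No Clue" baseline.
print(build_prompt("Which burner was used most recently?", clue="the left pot is still steaming"))
print(build_prompt("Which burner was used most recently?"))
```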

Insight 2: Efficacy of Visual Clue-Driven Reasoning

Mandating explicit visual clues within the Chain-of-Thought (CoT) acts as a critical anchor that mitigates reasoning drift. Across model scales, explicitly prompting models to actively seek visual clues (our clue-guided CoT) consistently outperforms vanilla reasoning baselines.

[Figure 6: Clue-guided CoT vs. vanilla reasoning baselines across model scales]
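The exact instruction behind the clue-guided CoT is not reproduced here; the template below only illustrates the idea of forcing the model to enumerate visual clues before reasoning. The wording is a hypothetical stand-in.

```python
# Hypothetical clue-guided CoT instruction vs. a vanilla CoT baseline.
CLUE_GUIDED_COT = (
    "First, list the visual clues in the image that are relevant to the question, "
    "noting where each clue appears. Then reason step by step using only those clues, "
    "and finally state the answer."
)

VANILLA_COT = "Think step by step, then state the answer."

def wrap(question: str, instruction: str) -> str:
    """Attach the chosen reasoning instruction to a benchmark question."""
    return f"{instruction}\n\nQuestion: {question}"
```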

Insight 3: Rigorous Evaluation — Right Answer, Wrong Reason?

Superficial accuracy may belie a disconnect between reasoning and prediction. Under our Rigorous Evaluation Protocol, which requires a semantic intersection between predicted clues and GT clues, models such as Qwen2.5-VL-72B and Claude-3.7 show noticeable accuracy drops, indicating occasional reliance on illusory or irrelevant clues. Gemini-2.5-Pro demonstrates the highest reasoning fidelity, with a negligible drop of only 0.44%.

[Figure 7: Accuracy under the Rigorous Evaluation Protocol vs. standard evaluation]
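A rough sketch of the protocol's decision rule is given below: a prediction counts only if the answer is correct and at least one predicted clue semantically matches a GT clue. The `clues_match` placeholder (for example, an LLM judge or embedding similarity) is an assumption, not the paper's implementation.

```python
# Sketch of the rigorous protocol: an answer is credited only when it is correct
# AND at least one predicted clue semantically overlaps a ground-truth clue.
def clues_match(predicted: str, gold: str) -> bool:
    # Placeholder matcher (assumption): could be an LLM judge or
    # cosine similarity over sentence embeddings with a threshold.
    raise NotImplementedError

def rigorous_correct(answer_ok: bool, predicted_clues: list[str], gt_clues: list[str]) -> bool:
    has_overlap = any(clues_match(p, g) for p in predicted_clues for g in gt_clues)
    return answer_ok and has_overlap
```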

BibTeX

@article{dailyclue2026,
  title={Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios},
  author={Anonymous Authors},
  year={2026}
}