Grounded Vision-Language Models Reduce Visual Hallucination (demo paper)
Vision-language models answer more reliably when they ground text in image regions, OCR evidence, and uncertainty.
multimodal · vision-language / grounding · source: mock · faithfulness 88% · halluc 12%
Tier 1 · Understand
The paper studies how multimodal models can avoid plausible but unsupported answers by tying language outputs to visual regions, explicit OCR evidence, and calibrated refusal thresholds. A weak VLM guesses from scene priors; the grounded model cites evidence or refuses when no visual support exists.
1. A VLM should not just give a likely answer; it should point to the pixels or text that support it.
2. Object questions need region grounding so the model does not follow a nearby false lead.
3. Receipt and document questions need OCR because the important evidence is written text.
4. If the image does not contain the requested object, a grounded model should refuse instead of hallucinating.
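The cite-or-refuse behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Evidence` record, the `grounded_answer` function, the substring matching rule, and the 0.6 threshold are all assumptions introduced here. The detector and OCR engine that would produce the evidence list are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Evidence:
    """One piece of visual support (hypothetical record, not from the paper)."""
    kind: str                        # "region" for a detected object, "ocr" for read text
    label: str                       # object label or recognized text
    confidence: float                # detector / OCR confidence in [0, 1]
    box: Tuple[int, int, int, int]   # pixel box (x1, y1, x2, y2) that can be cited


@dataclass
class Verdict:
    answer: Optional[str]            # None when the model refuses
    refused: bool
    cited: Optional[Evidence]        # the evidence backing the answer, if any


def grounded_answer(target: str, evidence: List[Evidence],
                    threshold: float = 0.6) -> Verdict:
    """Answer only when some evidence matches the target with enough confidence.

    The threshold value is an assumed placeholder; a real system would
    calibrate it on held-out data.
    """
    # Keep only evidence whose label mentions the requested target.
    matches = [e for e in evidence if target.lower() in e.label.lower()]
    best = max(matches, key=lambda e: e.confidence, default=None)

    # No support, or support that is too weak: refuse instead of guessing.
    if best is None or best.confidence < threshold:
        return Verdict(answer=None, refused=True, cited=None)

    # Supported: answer and cite the region or text that backs it.
    return Verdict(answer=best.label, refused=False, cited=best)
```

Given a confident detection such as `Evidence("region", "cardboard box", 0.91, (120, 200, 340, 420))`, a query for `"box"` returns that label with the box attached as the citation, while a query for `"dog"` finds no matching evidence and is refused.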
Tier 2 & 3 · Play / Prove
template: vision-detective · conf 96%
Vision Detective
VLM evidence lab
Object Case
Identify the object blocking the doorway.
Compute: 0 / 20
Score: 0 pts
Hidden target: cardboard box
Questions asked: 0 / 12
What is blocking the doorway?
Visual Guess Who Interrogation
Ask targeted yes/no questions, then submit a grounded VLM verdict.
Start here: ask one or two clue questions, enable the tools that match the evidence, then press Submit verdict. Better scores use fewer tools and fewer questions.
Compute budget: 0 used / 20 max. 20 compute left.
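The budget rules above can be sketched as a small tracker. Only the two limits (20 compute, 12 questions) come from the page. The tool names, their per-tool costs, and the assumption that questions do not consume compute are hypothetical, and no scoring formula is modeled because the page does not state one.

```python
from dataclasses import dataclass, field
from typing import List

MAX_COMPUTE = 20     # from the page: "20 max"
MAX_QUESTIONS = 12   # from the page: "0/12 questions asked"

# Hypothetical tool names and costs; the page does not list them.
TOOL_COST = {"region_grounding": 3, "ocr": 2, "refusal_check": 1}


@dataclass
class Investigation:
    compute_used: int = 0
    questions_asked: int = 0
    tools: List[str] = field(default_factory=list)

    @property
    def compute_left(self) -> int:
        return MAX_COMPUTE - self.compute_used

    def ask(self) -> bool:
        """Record one yes/no question; False once the question cap is hit."""
        if self.questions_asked >= MAX_QUESTIONS:
            return False
        self.questions_asked += 1
        return True

    def enable(self, tool: str) -> bool:
        """Enable a tool if the remaining compute covers its (assumed) cost."""
        if tool in self.tools:
            return True                      # already enabled, no extra charge
        cost = TOOL_COST[tool]
        if cost > self.compute_left:
            return False                     # would exceed the 20-compute budget
        self.compute_used += cost
        self.tools.append(tool)
        return True
```

A run that asks one question and enables only `"ocr"` would leave 18 compute under these assumed costs, which matches the page's hint that using fewer tools and fewer questions is the better strategy.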
Model answer
Submit a verdict when you have enough evidence.