Grounded Vision-Language Models Reduce Visual Hallucination (demo paper)

Vision-language models answer more reliably when they ground text in image regions, OCR evidence, and uncertainty.

multimodal vision-language / grounding · source: mock · faithfulness 88% · halluc 12%

Tier 1 · Understand

The paper studies how multimodal models can avoid plausible but unsupported answers by tying language outputs to visual regions, explicit OCR evidence, and calibrated refusal thresholds. A weak VLM guesses from scene priors; the grounded model cites evidence or refuses when no visual support exists.

1. A VLM should not just give a likely answer; it should point to the pixels or text that support it.
2. Object questions need region grounding so the model does not follow a nearby false lead.
3. Receipt and document questions need OCR because the important evidence is written text.
4. If the image does not contain the requested object, a grounded model should refuse instead of hallucinating.
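
The points above can be sketched as a small decision rule: collect candidate evidence (image regions or OCR spans), answer with the best-supported candidate, and refuse when nothing clears a confidence threshold. This is a minimal illustration, not the paper's implementation. The names `Evidence`, `Verdict`, `grounded_answer`, and the default threshold of 0.6 are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Evidence:
    """One piece of visual support (hypothetical structure)."""
    kind: str                        # "region" for object boxes, "ocr" for text spans
    content: str                     # detected label or recognized text
    confidence: float                # detector or OCR confidence in [0, 1]
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels


@dataclass
class Verdict:
    answer: Optional[str]            # None when the model refuses
    evidence: Optional[Evidence]     # the evidence cited for the answer
    refused: bool


def grounded_answer(candidates: List[Evidence],
                    refusal_threshold: float = 0.6) -> Verdict:
    """Answer only when some evidence clears the threshold; otherwise refuse.

    The threshold value is an assumption; a calibrated system would tune it
    on held-out data.
    """
    if not candidates:
        # No visual support at all: refuse rather than guess from scene priors.
        return Verdict(answer=None, evidence=None, refused=True)

    best = max(candidates, key=lambda e: e.confidence)
    if best.confidence < refusal_threshold:
        # Weak support only: refusing avoids a plausible but unsupported answer.
        return Verdict(answer=None, evidence=None, refused=True)

    # Cite the supporting region or text alongside the answer.
    return Verdict(answer=best.content, evidence=best, refused=False)
```

Called with a high-confidence "cardboard box" region and a low-confidence "chair" region, the function returns the box together with its bounding box as the cited evidence. Called with an empty list, it refuses.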


Tier 2 & 3 · Play / Prove

template: vision-detective · conf 96%

Vision Detective

VLM evidence lab

Object Case

Identify the object blocking the doorway.

Compute: 0 / 20 · Score: 0 pts

Hidden target: cardboard box

0/12 questions asked

What is blocking the doorway?

Visual Guess Who Interrogation

Ask targeted yes/no questions, then submit a grounded VLM verdict.

Start here: ask one or two clue questions, enable the tools that match the evidence, then press Submit verdict. Better scores use fewer tools and fewer questions.

Compute budget: 0 used / 20 max. 20 compute left.
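
The budget rule can be sketched as a small tracker that charges compute for each question and each enabled tool, and blocks actions that would exceed the limits. Only the 20-compute budget and the 12-question cap come from the page. The per-tool costs, the cost of one compute per question, and the names `Session`, `TOOL_COSTS`, and `QUESTION_COST` are hypothetical, chosen only to make the example concrete.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical costs: the page states the totals, not the per-action prices.
TOOL_COSTS = {"region_grounding": 4, "ocr": 3, "refusal_check": 2}
QUESTION_COST = 1


@dataclass
class Session:
    max_compute: int = 20      # budget stated on the page
    max_questions: int = 12    # question cap stated on the page
    used: int = 0
    questions: int = 0
    tools: List[str] = field(default_factory=list)

    def _spend(self, cost: int) -> None:
        # Reject any action that would push usage past the budget.
        if self.used + cost > self.max_compute:
            raise ValueError("compute budget exceeded")
        self.used += cost

    def ask(self) -> None:
        """Record one yes/no clue question."""
        if self.questions >= self.max_questions:
            raise ValueError("question limit reached")
        self._spend(QUESTION_COST)
        self.questions += 1

    def enable(self, tool: str) -> None:
        """Enable a tool once; enabling it again costs nothing."""
        if tool in self.tools:
            return
        self._spend(TOOL_COSTS[tool])
        self.tools.append(tool)

    @property
    def remaining(self) -> int:
        return self.max_compute - self.used
```

Under these assumed costs, asking one question and enabling region grounding leaves most of the budget unspent, which matches the page's advice that better scores use fewer tools and fewer questions.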

Model answer

Submit a verdict when you have enough evidence.