Neural Arcade · Cal Hacks · Ddoski's Lab

Grounded Vision-Language Models Reduce Visual Hallucination (demo paper)

Vision-language models answer more reliably when they ground text in image regions, OCR evidence, and uncertainty.

multimodal vision-language / grounding · source: mock · faithfulness 88% · halluc 12%

Tier 1 · Understand

The paper studies how multimodal models can avoid plausible but unsupported answers by tying language outputs to visual regions, explicit OCR evidence, and calibrated refusal thresholds. A weak VLM guesses from scene priors; the grounded model cites evidence or refuses when no visual support exists.

  1. A VLM should not just say a likely answer; it should point to the pixels or text that support it.
  2. Object questions need region grounding so the model does not follow a nearby false lead.
  3. Receipt and document questions need OCR because the important evidence is written text.
  4. If the image does not contain the requested object, a grounded model should refuse instead of hallucinating.
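
The four points above can be read as a simple answer-or-refuse policy. The sketch below is illustrative only and is not taken from the paper: `Evidence`, `Verdict`, `grounded_answer`, and the 0.6 threshold are hypothetical names and values, and the support scores are assumed to come from some upstream region detector or OCR step.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Evidence:
    """One piece of visual support (hypothetical structure)."""
    kind: str                        # "region" for object evidence, "ocr" for written text
    content: str                     # object label or recognized text
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1) pixel box the claim points to
    score: float                     # assumed support confidence in [0, 1]


@dataclass
class Verdict:
    answer: Optional[str]
    evidence: Optional[Evidence]
    refused: bool


def grounded_answer(candidates: List[Evidence], threshold: float = 0.6) -> Verdict:
    """Answer only when some evidence clears the threshold; otherwise refuse.

    The threshold value is an assumption; a calibrated system would tune it.
    """
    best = max(candidates, key=lambda e: e.score, default=None)
    if best is None or best.score < threshold:
        # No visual support: refuse instead of guessing from scene priors.
        return Verdict(answer=None, evidence=None, refused=True)
    # Cite the supporting region or text alongside the answer.
    return Verdict(answer=best.content, evidence=best, refused=False)
```

For example, given a well-supported "cardboard box" region and a weakly supported "chair" region, the function returns the box together with its pixel box; given only the weak candidate, or no candidates at all, it refuses.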

Concept map

Vision-Language Model · Visual Regions · OCR Text · Uncertainty Threshold · Grounded Refusal · Visual Hallucination

Ask the paper

Grounded Q&A

💬 Ask the paper — answers are grounded in its text, with sections cited


Tier 2 & 3 · Play / Prove

template: vision-detective · conf 96%
Vision Detective
VLM evidence lab
Object Case
Identify the object blocking the doorway.
compute / score: 0 / 20 compute · 0 pts
Hidden target: cardboard box · 0/12 questions asked
What is blocking the doorway?

Visual Guess Who Interrogation

Ask targeted yes/no questions, then submit a grounded VLM verdict.

Start here: ask one or two clue questions, enable the tools that match the evidence, then press Submit verdict. Better scores use fewer tools and fewer questions.
Compute budget: 0 used / 20 max. 20 compute left.
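
The budget readout above can be modeled as a small tracker. This is a rough sketch under stated assumptions, not the game's actual logic: the limits of 20 compute and 12 questions come from the panel, but the tool names, the per-tool costs, and the rule that questions cost no compute are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set

MAX_COMPUTE = 20    # from the panel: "20 max"
MAX_QUESTIONS = 12  # from the panel: "0/12 questions asked"

# Hypothetical tool names and costs; the page does not state them.
TOOL_COST = {"region_grounding": 4, "ocr": 3, "uncertainty": 2}


@dataclass
class Budget:
    compute_used: int = 0
    questions_asked: int = 0
    tools: Set[str] = field(default_factory=set)

    @property
    def compute_left(self) -> int:
        return MAX_COMPUTE - self.compute_used

    def ask(self) -> bool:
        """Record one yes/no question; assumed to cost no compute."""
        if self.questions_asked >= MAX_QUESTIONS:
            return False
        self.questions_asked += 1
        return True

    def enable(self, tool: str) -> bool:
        """Enable a tool once, charging its cost if the budget allows."""
        if tool in self.tools:
            return True  # already enabled, no extra charge
        cost = TOOL_COST[tool]
        if cost > self.compute_left:
            return False
        self.tools.add(tool)
        self.compute_used += cost
        return True
```

A fresh `Budget()` reports 20 compute left, matching the readout; enabling the hypothetical `ocr` tool would leave 17 under the assumed costs.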
Model answer

Submit a verdict when you have enough evidence.