Knowing when to say “I don’t know”  /  embodied agents

Semantic Flip

Synthetic OOD generation for robust refusal in embodied question answering and spatial localization.

Dongbin Na*†   Chanwoo Kim*   Giyun Choi   Dooyoung Hong
RGA Inc.  ·  *equal contribution  ·  †corresponding

Embodied vision-language agents tend to answer confidently even when their visual memory does not actually support the query, which becomes dangerous the moment a wrong answer turns into a wrong physical action. Semantic Flip teaches the agent when to refuse. It synthesizes out-of-distribution (OOD) query–memory pairs from in-distribution data, with no external OOD annotation, and trains a small rejection module on top of a frozen VLM. The module drops into an existing pipeline without touching the underlying model.

Figure (Overview): Semantic Flip overview. From answerable training pairs, Q-Flip corrupts the query and V-Flip corrupts the video. A frozen VLM encodes all three distributions and only a small rejection gate is trained; the gate then plugs into both an EQA decoder and a navigation agent.
§1

Abstract

Detecting unanswerable user queries remains essential for the reliable deployment of real-world embodied agents. However, modern vision-language models (VLMs) often generate overly confident answers even when the available visual memory cannot support the query. Such overconfidence carries different costs across tasks: the agent may give misleading information in Embodied Question Answering, or pick an arbitrary coordinate and physically guide the user there in spatial reasoning for navigation. Despite these stakes, only a few prior studies address when and how an embodied VLM should respond with “I do not know.”

We propose Semantic Flip, a simple and effective framework that synthesizes auxiliary OOD samples for embodied refusal without external OOD annotation. The key idea is to independently transform the query and the video memory to construct auxiliary OOD pairs that lack sufficient visual grounding. These synthesized pairs train a lightweight rejection module on top of a frozen pretrained VLM, and the module attaches to any existing VLM-based pipeline without retraining the underlying model. Across two complementary benchmarks, Semantic Flip consistently outperforms strong prompting baselines. We also introduce SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory, where Semantic Flip reaches an F1 of 0.9559. All experiments use only open-source models, so the full pipeline is reproducible.

§2

Method

Semantic Flip builds two complementary OOD distributions from answerable training pairs by flipping exactly one side of an otherwise answerable pair.

Q-Flip  ·  corrupt the query

Rewrite the question into an ungroundable variant while keeping the video memory unchanged.

V-Flip  ·  corrupt the memory

Keep the question, but erase its referent from the memory through a parse → detect → inpaint pipeline (spaCy, Grounding-DINO, LaMa).
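
The parse → detect → inpaint chain can be read as three composable stages. The sketch below is a toy stand-in rather than the project's code: frames are modeled as lists of object labels, and `parse_referent`, `detect`, and `inpaint` are hypothetical placeholders for the roles that spaCy, Grounding-DINO, and LaMa play in the actual pipeline.

```python
from typing import List

Frame = List[str]  # toy frame: the labels of the objects visible in it


def parse_referent(query: str, vocabulary: List[str]) -> str:
    """Stand-in for the parse stage: pick the first known object noun in the query."""
    for word in query.lower().replace("?", "").split():
        if word in vocabulary:
            return word
    raise ValueError("no known referent in query")


def detect(frame: Frame, referent: str) -> List[int]:
    """Stand-in for the detect stage: indices of objects matching the referent."""
    return [i for i, label in enumerate(frame) if label == referent]


def inpaint(frame: Frame, hits: List[int]) -> Frame:
    """Stand-in for the inpaint stage: drop the detected objects from the frame."""
    return [label for i, label in enumerate(frame) if i not in hits]


def v_flip(query: str, video: List[Frame], vocabulary: List[str]) -> List[Frame]:
    """Erase the query's referent from every frame; the query itself is unchanged."""
    referent = parse_referent(query, vocabulary)
    return [inpaint(frame, detect(frame, referent)) for frame in video]


video = [["sofa", "mug", "lamp"], ["mug", "table"], ["lamp"]]
flipped = v_flip("Where is the mug?", video, vocabulary=["sofa", "mug", "lamp", "table"])
print(flipped)  # the mug no longer appears in any frame
```

Passing the stages as plain functions keeps the erase-from-every-frame logic independent of whichever parser, detector, or inpainter is plugged in.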

Grounded: ( query , memory )
the question is supported by what the robot saw → answer

FLIP ONE SIDE

Flipped: ( q̃ , memory )  or  ( query , ṽ )
one side no longer matches the other → abstain
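
The flip-one-side scheme can be sketched as a small dataset builder. This is a minimal illustration under stated assumptions: `build_training_set`, the toy flip functions, and the label encoding (0 = answer, 1 = abstain) are all hypothetical, and memory is represented as a plain string purely to keep the example runnable.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str, int]  # (query, memory, label); label 1 = abstain (assumed encoding)


def build_training_set(
    grounded: List[Tuple[str, str]],
    q_flip: Callable[[str], str],
    v_flip: Callable[[str, str], str],
) -> List[Sample]:
    """Emit one in-distribution sample and two flipped samples per grounded pair."""
    samples: List[Sample] = []
    for query, memory in grounded:
        samples.append((query, memory, 0))                 # grounded -> answer
        samples.append((q_flip(query), memory, 1))         # Q-Flip: only the query changes
        samples.append((query, v_flip(query, memory), 1))  # V-Flip: only the memory changes
    return samples


# Hypothetical toy flips, used only to make the sketch runnable.
def toy_q_flip(query: str) -> str:
    return query.replace("mug", "violin")


def toy_v_flip(query: str, memory: str) -> str:
    return memory.replace("mug", "")


data = build_training_set([("Where is the mug?", "sofa mug lamp")], toy_q_flip, toy_v_flip)
for sample in data:
    print(sample)
```

Each flipped sample differs from its grounded source on exactly one side, which is what lets the abstain label be assigned without any external OOD annotation.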

A frozen VLM encoder produces a joint embedding for each in-distribution, Q-Flip, and V-Flip sample, and only a small 3-layer MLP rejection gate is trained on top. Because the backbone stays frozen, the gate reuses the forward pass the agent already runs and adds essentially no extra inference cost.
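
To make the gate's data flow concrete, here is a dependency-free sketch of a 3-layer MLP that maps a joint embedding to an abstain probability. Only the layer count comes from the description above; the layer widths, ReLU activations, sigmoid output, 0.5 threshold, and the class name `RejectionGate` are assumptions, and the weights are random rather than trained.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class RejectionGate:
    """Three linear layers mapping a joint embedding to an abstain probability."""

    def __init__(self, dim_in: int, dim_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        sizes = [(dim_hidden, dim_in), (dim_hidden, dim_hidden), (1, dim_hidden)]
        # Untrained random weights: enough to show the data flow, not the learned gate.
        self.layers = [
            ([[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)], [0.0] * rows)
            for rows, cols in sizes
        ]

    def abstain_probability(self, embedding: Vector) -> float:
        hidden = embedding
        for weight, bias in self.layers[:-1]:
            hidden = relu(linear(hidden, weight, bias))
        weight, bias = self.layers[-1]
        logit = linear(hidden, weight, bias)[0]
        return 1.0 / (1.0 + math.exp(-logit))  # sigmoid

    def should_abstain(self, embedding: Vector, threshold: float = 0.5) -> bool:
        return self.abstain_probability(embedding) >= threshold


gate = RejectionGate(dim_in=8, dim_hidden=4)
embedding = [0.1 * i for i in range(8)]  # placeholder for the frozen VLM's joint embedding
p = gate.abstain_probability(embedding)
print(f"abstain probability: {p:.3f}")
```

In a real pipeline the embedding would come from the forward pass the agent already runs, so the gate adds only these few small matrix products on top.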

Figure (Method): Concrete examples of Q-Flip and V-Flip. Q-Flip keeps the video and rewrites the query into an ungroundable variant; V-Flip keeps the query and erases its referent from every frame, so the label flips to abstain.
§3

Results

With a frozen 7B encoder and a small head, Semantic Flip outperforms strong prompting baselines on both tasks. It also generalizes to abstention categories never seen during synthesis, e.g. 0.89 OOD recall on Information Unavailability, a category Q-Flip does not target.

AbstainEQA · HM3D-380: F1 0.7110 (embodied question answering)
SpaceReject: F1 0.9559 (spatial localization over long video memory)

AbstainEQA (HM3D-380)

Method                                            F1       Bal. Acc   Recall   Spec.
Qwen2.5-VL-32B prompting (Coarse / Fine / CoT)    Coming soon
Semantic Flip (ours)                              0.7110   0.6684     0.8158   0.5211

SpaceReject (spatial localization over long video memory)

Method                            Bal. Acc   F1       Recall   Spec.
C2 (Tool), Qwen3-8B prompting     0.8778     0.8874   0.9630   0.7926
Semantic Flip, Q-Flip only        0.9504     0.9494   0.9363   0.9644
Semantic Flip, Q-Flip + V-Flip    0.9563     0.9559   0.9467   0.9659
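
For readers comparing the columns, the sketch below spells out the standard definitions of the four metrics and checks that each reported balanced accuracy equals the mean of the reported recall and specificity. It assumes the positive class is "should abstain", in line with the OOD recall mentioned above; the confusion counts at the end are hypothetical and serve only to exercise the formulas.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of unanswerable queries that were correctly refused."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of answerable queries that were correctly answered."""
    return tn / (tn + fp)


def balanced_accuracy(rec: float, spec: float) -> float:
    return (rec + spec) / 2


def f1_score(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn)


# (reported Bal. Acc, Recall, Spec.) copied from the tables above.
reported = {
    "AbstainEQA, Semantic Flip": (0.6684, 0.8158, 0.5211),
    "SpaceReject, C2 (Tool) prompting": (0.8778, 0.9630, 0.7926),
    "SpaceReject, Q-Flip only": (0.9504, 0.9363, 0.9644),
    "SpaceReject, Q-Flip + V-Flip": (0.9563, 0.9467, 0.9659),
}
for name, (bal_acc, rec, spec) in reported.items():
    recomputed = balanced_accuracy(rec, spec)
    status = "consistent" if abs(recomputed - bal_acc) < 1e-4 else "mismatch"
    print(f"{name}: {status}")

# Hypothetical confusion counts, not taken from the paper.
tp, fn, fp, tn = 90, 10, 5, 95
print(recall(tp, fn), specificity(tn, fp), f1_score(tp, fp, fn))
```

Every row agrees to within rounding of the four-decimal values.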

The full appendix tables (threshold sweep, fill-operator ablation, pool-size sweep, per-category recall, per-LLM baselines) can be regenerated end to end with the reproduction notebook.

§4

Get started

A single 48 GB GPU (e.g. RTX A6000) is enough for the whole pipeline. All checkpoints are public and pulled from the Hugging Face Hub on first use.

# set up the environment
git clone https://github.com/ndb796/SemanticFlip.git
cd SemanticFlip
conda create -n semflip python=3.10 -y
conda activate semflip
pip install -r requirements.txt

# reproduce every AbstainEQA result, top to bottom
jupyter lab notebooks/reproduce.ipynb

Backbones are all open-source: Qwen2.5-VL-7B / 32B-AWQ, Qwen2.5-7B, and grounding-dino-tiny.

§5

Dataset

The SpaceReject and SpaceRejectExtra queries, videos, annotations, and trained models will be released on Hugging Face.

Status: coming soon. The Hugging Face release will be available at huggingface.co/datasets/ndb796/SpaceReject.
§6

Citation

@article{na2026semanticflip,
  title   = {Semantic Flip: Synthetic OOD Generation for Robust Refusal
             in Embodied Question Answering and Spatial Localization},
  author  = {Na, Dongbin and Kim, Chanwoo and Choi, Giyun and Hong, Dooyoung},
  year    = {2026}
}