Open-source spatial QA · service robots

Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models

A fully open-source agent that localizes a queried place by binary-searching the robot's own trajectory — running entirely onboard, with no closed-source API.

Dongbin Na*†, Chanwoo Kim*, Soonbin Rho, Giyun Choi, Gangbok Lee, Dooyoung Hong
RGA Inc.
* equal contribution  ·  † correspondence to dongbinna@postech.ac.kr and dooyoung@rgarobot.com
System overview. A representative GangnamLoop round trip — day (red) and night (blue) routes on one SLAM map, with offline memory construction and inference-time query examples.

Abstract

Spatial question answering lets a service robot turn a query such as “where can I find a dry cleaner on the way back home?” into a metric coordinate that navigation can act on. Prior approaches rely on retrieval-augmented agents built on closed-source models such as GPT-4o, but robots in the real world often cannot depend on online closed-source models due to network instability, communication latency, and deployment cost.

We present BinTrack, a simple yet effective, fully open-source spatial-localization agent that exploits the temporal ordering of a robot's trajectory: it performs a binary search over the segments between two anchor landmarks identified from the query. BinTrack improves overall accuracy by up to 22.8 percentage points over open-source implementations and matches the reported closed-source result on the global category of SpaceLocQA, while running more than 1.5× faster. Every component is open-source, so the full pipeline is reproducible without any API access. We also release GangnamLoop, a multi-trip outdoor benchmark recorded by a quadruped robot on public streets, revisiting the same locations under different conditions and pairing the robot's low viewpoint with the human owner's.

8 round-trip recordings  ·  360 spatial queries  ·  4 day / night pairs  ·  2 paired viewpoints  ·  221 min total recording  ·  383.8K RGB frames

How it works

One search primitive over the trajectory

BinTrack reads the route as a temporally ordered list of segments and localizes a place through a small set of components built around a single search primitive.

Binary Tracking: three steps of binary search over ordered trajectory segments between two anchors, ending in verification of a leaf interval.
Binary Tracking. Given two anchors X and Y from the query, the agent compares the semantic evidence of the left and right halves of the current interval, keeps the stronger half, and repeats until the interval shrinks to a small leaf — where the verifier selects the target segment and returns its coordinate.
01 Multi-view memory

A segment can be captioned three ways — full, center, detail — so retrieval can match the right view of a place.
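
As a rough illustration of the multi-view idea, the sketch below stores one caption per view for a segment and picks the view that best matches a query. The `Segment` fields, the example captions, and the token-overlap score are all hypothetical stand-ins chosen to keep the example self-contained; the pipeline described on this page uses a text encoder and a vector database for matching rather than token overlap.

```python
from dataclasses import dataclass

VIEWS = ("full", "center", "detail")  # the three caption views named in the text


@dataclass
class Segment:
    index: int                 # position in the temporally ordered route
    xy: tuple[float, float]    # map coordinate (hypothetical field)
    captions: dict[str, str]   # view name -> caption text


def overlap(query: str, caption: str) -> float:
    """Toy similarity: fraction of query tokens that appear in the caption."""
    q = set(query.lower().split())
    c = set(caption.lower().split())
    return len(q & c) / len(q) if q else 0.0


def best_view(segment: Segment, query: str) -> tuple[str, float]:
    """Return the best-matching view name and its score for this segment."""
    scored = [(v, overlap(query, segment.captions.get(v, ""))) for v in VIEWS]
    return max(scored, key=lambda pair: pair[1])


# Illustrative data only; these captions are made up for the example.
seg = Segment(
    index=12,
    xy=(34.0, -7.5),
    captions={
        "full": "a street with shops and parked cars",
        "center": "a glass storefront with a red awning",
        "detail": "a sign reading dry cleaner above the door",
    },
)
view, score = best_view(seg, "dry cleaner sign")
```

Here only the detail caption mentions the queried words, which is the situation the three views are meant to cover: a place that is invisible in the wide caption can still be retrieved through a tighter one.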

02 Binary Tracking

A VL-guided binary search over the segments between X and Y, halving the interval until it brackets the target.
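
The halving loop can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the released implementation: `evidence` and `verify` are hypothetical callables standing in for the model-based comparison and the verifier, the `leaf_size` of 4 is an arbitrary choice, and ties are broken toward the left half.

```python
from typing import Callable, Sequence


def binary_track(
    segments: Sequence[str],
    lo: int,
    hi: int,
    evidence: Callable[[Sequence[str]], float],
    verify: Callable[[Sequence[str]], int],
    leaf_size: int = 4,
) -> int:
    """Narrow the half-open interval [lo, hi) between two anchor indices.

    Each step keeps the half with the stronger evidence score. Once the
    interval is at most `leaf_size` segments long, `verify` picks an offset
    inside that leaf. Returns the index of the selected segment.
    """
    if leaf_size < 1:
        raise ValueError("leaf_size must be at least 1")
    while hi - lo > leaf_size:
        mid = (lo + hi) // 2
        left, right = segments[lo:mid], segments[mid:hi]
        if evidence(left) >= evidence(right):  # tie goes left (assumption)
            hi = mid
        else:
            lo = mid
    return lo + verify(segments[lo:hi])


# Toy route: 32 captions, with the queried place at index 21.
route = ["road"] * 32
route[21] = "dry cleaner storefront"


def count_hits(chunk: Sequence[str]) -> float:
    """Hypothetical evidence score: number of captions mentioning the target."""
    return float(sum("cleaner" in caption for caption in chunk))


def pick_first_hit(chunk: Sequence[str]) -> int:
    """Hypothetical verifier: offset of the first matching caption, else 0."""
    return next((i for i, c in enumerate(chunk) if "cleaner" in c), 0)


target = binary_track(route, 0, len(route), count_hits, pick_first_hit)
```

With 32 segments and a leaf of 4, the loop needs three comparisons before handing over to the verifier, so the number of evidence calls grows with the logarithm of the route length rather than with the route length itself.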

The planning agent is a text-only 32B model and the verifier is a separate 7B vision-language model; keeping the two roles apart avoids the vote-count hallucination of a single combined model.

Results

Open-source, competitive with closed-source

Under a fair open-source evaluation, BinTrack beats every open-source baseline across all categories, and ties the closed-source Meta-Memory on the hardest setting.

SpaceLocQA — success rate (%) @ τ = 15 m, 270 queries
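| Method | Backbone | Basic | Local | Global | Overall |
| --- | --- | --- | --- | --- | --- |
| Meta-Memory | closed-source | 67.8 | 61.8 | 62.2 | 63.9 |
| Meta-Memory | open-source | 50.0 | 51.1 | 32.6 | 44.6 |
| ReMEmbR | closed-source | 58.5 | 57.8 | 46.3 | 54.2 |
| ReMEmbR | open-source | 56.7 | 61.1 | 32.6 | 50.1 |
| BinTrack (ours) | open-source | 74.4 | 65.6 | 62.2 | 67.4 |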

Using only open-source models, BinTrack reaches the closed-source Meta-Memory's global score (62.2) and the best overall accuracy.
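
For readers who want to score their own predictions, the metric in the table header can be sketched as follows. This assumes that a query counts as a success when the Euclidean distance between the predicted and ground-truth map coordinates is at most τ; the page states only the threshold, so the distance definition and the function name `success_rate` are assumptions.

```python
import math


def success_rate(predictions, targets, tau=15.0):
    """Percentage of queries whose prediction lies within `tau` metres.

    `predictions` and `targets` are equal-length sequences of (x, y)
    coordinates in the same metric map frame (assumed Euclidean).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(math.dist(p, t) <= tau for p, t in zip(predictions, targets))
    return 100.0 * hits / len(predictions)


# Made-up coordinates: the errors are 5 m, 20 m and 0 m.
preds = [(0.0, 0.0), (10.0, 0.0), (30.0, 40.0)]
truth = [(3.0, 4.0), (10.0, 20.0), (30.0, 40.0)]
rate = success_rate(preds, truth)  # two of three fall within 15 m
```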

GangnamLoop — success rate (%) @ τ = 15 m, 360 queries, 32B agent
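| Method | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Meta-Memory (open) | 15.6 | 31.1 | 0.0 | 6.7 | 20.0 | 8.9 | 26.7 | 15.6 | 15.6 |
| ReMEmbR (open) | 24.4 | 33.3 | 4.4 | 8.9 | 24.4 | 11.1 | 31.1 | 6.7 | 18.0 |
| BinTrack (ours) | 55.6 | 60.0 | 24.4 | 40.0 | 60.0 | 31.1 | 57.8 | 33.3 | 45.3 |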
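
Each GangnamLoop recording contributes the same number of queries (45), so the Overall column should agree with the unweighted mean of the eight per-recording rates. The short check below copies the numbers from the table and confirms this within rounding; the helper name `macro_average` is ours.

```python
# Per-recording success rates (R1..R8), copied from the GangnamLoop table.
per_recording = {
    "Meta-Memory (open)": [15.6, 31.1, 0.0, 6.7, 20.0, 8.9, 26.7, 15.6],
    "ReMEmbR (open)": [24.4, 33.3, 4.4, 8.9, 24.4, 11.1, 31.1, 6.7],
    "BinTrack (ours)": [55.6, 60.0, 24.4, 40.0, 60.0, 31.1, 57.8, 33.3],
}
reported_overall = {
    "Meta-Memory (open)": 15.6,
    "ReMEmbR (open)": 18.0,
    "BinTrack (ours)": 45.3,
}


def macro_average(rates):
    """Unweighted mean; valid as an overall rate only for equal-size groups."""
    return sum(rates) / len(rates)


averages = {method: macro_average(r) for method, r in per_recording.items()}

for method, avg in averages.items():
    # The table entries are rounded to one decimal, so allow a small gap.
    gap = abs(avg - reported_overall[method])
    print(f"{method}: mean {avg:.2f}, reported {reported_overall[method]}, gap {gap:.2f}")
```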

Environment

Runs onboard, on a single GPU

Every role is filled by an open-source model, so the pipeline needs no external API.

| Role | Model |
| --- | --- |
| Captioner (memory build) | Qwen2.5-VL-7B-Instruct |
| Verifier (4-view ensemble) | Qwen2.5-VL-7B-Instruct (shared) |
| Planning agent | Qwen2.5-32B-Instruct-AWQ |
| Text encoder | mxbai-embed-large-v1 |
| Vector database | Milvus (Lite) |

# Python 3.10 (conda recommended)
conda create -n vln python=3.10 -y
conda activate vln

# Pinned versions verified for Qwen2.5-VL
pip install torch==2.4.1                 # CUDA 12.x build
pip install transformers==4.49.0         # newer releases break on torch.library.custom_op
pip install accelerate==1.0.1 "tokenizers>=0.21,<0.22" "setuptools<81"
pip install pymilvus sentence-transformers qwen-vl-utils autoawq pandas numpy
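
After installing, it can help to confirm that the environment actually resolved to the pinned versions. The script below is an optional sanity check, not part of the project: the `PINS` table simply repeats the exact pins from the commands above, and `check_pins` is a hypothetical helper built on the standard-library `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the install commands above.
PINS = {"torch": "2.4.1", "transformers": "4.49.0", "accelerate": "1.0.1"}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for each mismatch or missing package."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        # Some wheels carry a local build tag such as "+cu121";
        # compare only the public part of the version string.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package matches; any printed line points at a package to reinstall.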

Benchmark

GangnamLoop — the same streets, day and night

A quadruped robot walks four out-and-back routes through Gangnam, Seoul, then walks each one again under the opposite lighting. Day (red) and night (blue) trajectories are aligned on a shared SLAM map.

One coordinate frame. All eight recordings are registered to a single LiDAR-SLAM map, so revisits of the same place are directly comparable across day, night, and route.
| Recordings | Queries | Day/night pairs | Viewpoints | Total recording | RGB frames |
| --- | --- | --- | --- | --- | --- |
| 8 | 360 | 4 | 2 (robot + owner) | 221 min | 383,800 |

E → A → E: R1 day, R8 night
E → B → E: R2 day, R7 night
E → C → E: R3 night, R6 day
E → D → E: R4 night, R5 day

Recording schedule: 4 destinations × 2 lighting conditions (day/night), recorded over 2 days
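| # | Day | Condition | Round trip | Pair | Duration | RGB frames | Queries |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Day 1 | day | E → A → E | #8 | 14:43 | 25,545 | 45 |
| 2 | Day 1 | day | E → B → E | #7 | 12:37 | 22,020 | 45 |
| 3 | Day 1 | night | E → C → E | #6 | 32:47 | 56,883 | 45 |
| 4 | Day 1 | night | E → D → E | #5 | 45:12 | 78,507 | 45 |
| 5 | Day 2 | day | E → D → E | #4 | 49:40 | 86,208 | 45 |
| 6 | Day 2 | day | E → C → E | #3 | 35:19 | 61,397 | 45 |
| 7 | Day 2 | night | E → B → E | #2 | 13:41 | 23,753 | 45 |
| 8 | Day 2 | night | E → A → E | #1 | 17:03 | 29,487 | 45 |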

Recordings (1,8) (2,7) (3,6) (4,5) share a destination but differ in time of day, forming the day/night cross-domain evaluation set.
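
The pairing follows a simple pattern in the schedule: recording n is paired with recording 9 − n. The sketch below encodes the schedule, derives the pairs with that rule, and recomputes the headline totals. The `Recording` class and the `partner` helper are illustrative names, and the 9 − n rule is an observation from the table rather than a stated convention of the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    number: int
    condition: str    # "day" or "night"
    destination: str  # turnaround point of the E -> X -> E round trip
    duration: str     # "MM:SS"
    frames: int


# Values copied from the recording schedule.
SCHEDULE = [
    Recording(1, "day", "A", "14:43", 25_545),
    Recording(2, "day", "B", "12:37", 22_020),
    Recording(3, "night", "C", "32:47", 56_883),
    Recording(4, "night", "D", "45:12", 78_507),
    Recording(5, "day", "D", "49:40", 86_208),
    Recording(6, "day", "C", "35:19", 61_397),
    Recording(7, "night", "B", "13:41", 23_753),
    Recording(8, "night", "A", "17:03", 29_487),
]


def partner(number: int, total: int = 8) -> int:
    """Cross-condition partner under the observed n <-> (total + 1 - n) rule."""
    return total + 1 - number


def seconds(mmss: str) -> int:
    """Convert an "MM:SS" string to seconds."""
    minutes, secs = mmss.split(":")
    return int(minutes) * 60 + int(secs)


by_number = {r.number: r for r in SCHEDULE}
pairs = [(n, partner(n)) for n in range(1, 5)]
total_minutes = sum(seconds(r.duration) for r in SCHEDULE) // 60
total_frames = sum(r.frames for r in SCHEDULE)
```

The recomputed totals agree with the summary figures reported for the benchmark: 221 minutes of recording and 383,800 RGB frames.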

On the robot

Real-robot deployment

We are bringing BinTrack onboard our service-robot platform. Recorded runs, on-device latency, and a deployment guide will be added here.

Onboard run (coming soon)

Query to coordinate to navigation, recorded on the robot.

On-device latency (coming soon)

Per-query timing of retrieval, search, and verification on the robot.

Deployment guide (coming soon)

The steps to run BinTrack onboard a quadruped robot.

Data

The GangnamLoop benchmark

GangnamLoop will be released on Hugging Face. It contains 8 round-trip recordings (360 queries) over four routes, each recorded once during the day and once at night, with paired robot and human viewpoints.

Hugging Face release (coming soon)

A link will appear here once the dataset is published.

Cite

BibTeX

If this work is useful for your research, please cite our paper.

@article{na2026bintrack,
  title   = {Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models},
  author  = {Na, Dongbin and Kim, Chanwoo and Rho, Soonbin and Choi, Giyun and Lee, Gangbok and Hong, Dooyoung},
  year    = {2026}
}