Cue the Flow: Steering Flow-Matching Policies for Open-World Delivery Manipulation

A grounding module turns free-form delivery instructions into a spatial cue; a lightweight contrastive adapter uses that cue to steer a frozen flow-matching VLA toward novel targets — no policy fine-tuning required.

¹ University of Illinois Chicago   ² University of California, Los Angeles   (Equal advising)

Abstract

Open-world delivery requires mobile manipulators to follow free-form user instructions and manipulate potentially novel objects. Existing dual-system approaches use high-level grounding models to convert language into grounded visual prompts, but their low-level controllers can remain brittle under noisy perception, dynamic scenes, and contact-rich interactions.

We instead use a pretrained flow-matching vision-language-action model as the low-level control interface, leveraging its reactivity and robustness to environmental changes while treating the grounding output as a spatial cue for policy steering. Our key insight is that the pretrained VLA already provides a strong manipulation prior, while the spatial cue supplies the missing target information needed to guide actions under novel language–object mappings.

Concretely, we introduce a lightweight cue-conditioned adapter. The adapter is first trained with contrastive objectives to produce salient and spatially discriminative cue representations, and is then supervised to predict a diagonal affine transformation over the generated action chunk, aligning policy steering with the cued target. Across tabletop and mobile-base settings, our method improves instruction following and manipulation success on both in-domain and out-of-domain objects, raising average task success rate by up to nearly 2× with negligible inference overhead.

The open-world delivery challenge. (a.1) The policy is trained on in-context demonstrations collected in a fixed environment. (a.2) In real-world delivery, the robot must execute manipulation under changing environments and novel language–object mappings. (b) A grounding module converts the open-ended instruction into a spatial cue, which conditions a contrastive adapter to predict affine transformations over the frozen flow-matching policy's action output, steering the generated action chunk toward the target object.

Open-world delivery challenge and method overview

How It Works

Keep the VLA frozen. Ground the instruction once. Steer the action chunk.

Spatial Cue as the Interface

A zero-shot grounding module (Qwen3-VL + SAM2.1) converts free-form instructions like “load the bag with order #2” into a Gaussian heatmap at the target object — a sparse cue that is far easier to produce reliably than dense trajectory-level guidance.
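As a rough illustration of what such a cue can look like, the sketch below turns a binary object mask into a Gaussian heatmap. It uses only the Python standard library. Centering the Gaussian on the mask centroid, the `sigma` value, and the helper names `mask_centroid` and `gaussian_heatmap` are all assumptions made for illustration; the page does not specify how the grounding output is converted into the heatmap.

```python
import math

def mask_centroid(mask):
    """Centroid (x, y) of a binary mask given as a list of rows of 0/1."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        raise ValueError("mask is empty; the grounding step found no target")
    return sum_x / count, sum_y / count

def gaussian_heatmap(height, width, center_xy, sigma):
    """Grid of shape height x width with an isotropic Gaussian bump at center_xy."""
    cx, cy = center_xy
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

# Toy 6x8 mask standing in for a segmentation of the target object.
mask = [[0] * 8 for _ in range(6)]
for y in (2, 3):
    for x in (4, 5, 6):
        mask[y][x] = 1

center = mask_centroid(mask)                       # hypothetical choice: mask centroid
heatmap = gaussian_heatmap(6, 8, center, sigma=1.5)  # sigma is an illustrative value
```

The cue carries only a location and a spread, which is what makes it sparse compared with a full trajectory.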

Contrastive Cue Encoder

Spatial-shift negatives force the cue representation to encode where the target is rather than what the scene looks like, and a residual encoding against a zero heatmap isolates cue-induced features — blocking scene-specific shortcuts.
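The two ideas in this paragraph can be sketched in a few lines of standard-library Python. Everything concrete here is an assumption for illustration: `toy_encoder` is a hand-written stand-in for a learned encoder, the loss is an InfoNCE-style objective (the page only says "contrastive"), and the choice of positive (same cue, different scene) is one plausible pairing rather than the authors' stated recipe.

```python
import math

def shift_heatmap(heatmap, dx, dy):
    """Translate a heatmap by (dx, dy) pixels, zero-filling uncovered cells."""
    height, width = len(heatmap), len(heatmap[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = heatmap[src_y][src_x]
    return out

def toy_encoder(image, heatmap):
    """Hypothetical encoder: one scene feature plus cue mass and cue moments."""
    height, width = len(image), len(image[0])
    brightness = sum(map(sum, image)) / (height * width)
    mass = sum(map(sum, heatmap))
    moment_x = sum(heatmap[y][x] * x for y in range(height) for x in range(width))
    moment_y = sum(heatmap[y][x] * y for y in range(height) for x in range(width))
    return [brightness, mass, moment_x, moment_y]

def residual_cue_embedding(image, heatmap):
    """Encode with the cue, then subtract the encoding of a zero heatmap."""
    zero = [[0.0] * len(row) for row in heatmap]
    with_cue = toy_encoder(image, heatmap)
    without_cue = toy_encoder(image, zero)
    return [a - b for a, b in zip(with_cue, without_cue)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: cross-entropy with the positive as the correct class."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]

# Two scenes that differ in appearance, and one cue at pixel (x=2, y=1).
scene_a = [[0.2] * 6 for _ in range(4)]
scene_b = [[0.9] * 6 for _ in range(4)]
cue = [[0.0] * 6 for _ in range(4)]
cue[1][2] = 1.0

anchor = residual_cue_embedding(scene_a, cue)
positive = residual_cue_embedding(scene_b, cue)            # same cue, new scene
negatives = [residual_cue_embedding(scene_a, shift_heatmap(cue, dx, dy))
             for dx, dy in [(3, 0), (0, 2)]]               # same scene, moved cue
loss = info_nce(anchor, positive, negatives)
```

In this toy setup the residual cancels the scene-brightness feature exactly, so `anchor` and `positive` coincide even though the scenes differ, while the shifted cues produce different embeddings that the loss pushes away.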

Affine Action Steering

The adapter predicts a diagonal scale-and-shift (γ, β) over the frozen policy's action chunk during flow integration. It adds only ~0.02% extra parameters, and the full pipeline runs at 15 Hz on a single RTX 5080.
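A minimal sketch of the steering step, under stated assumptions: the frozen policy is replaced by a toy velocity field that flows noise toward a fixed prior chunk, integration uses plain Euler steps, (γ, β) are shared across the horizon with one value per action dimension, and the affine map is applied once to the integrated chunk. The page says the modulation happens during flow integration but does not give the exact placement, so that last choice in particular is illustrative only.

```python
import random

def euler_integrate(velocity, a0, num_steps=4):
    """Integrate da/ds = velocity(a, s) from s=0 to s=1 with Euler steps."""
    chunk = [row[:] for row in a0]
    ds = 1.0 / num_steps
    for k in range(num_steps):
        s = k * ds
        v = velocity(chunk, s)
        chunk = [[a + ds * dv for a, dv in zip(a_row, v_row)]
                 for a_row, v_row in zip(chunk, v)]
    return chunk

def diagonal_affine(chunk, gamma, beta):
    """Per-dimension scale and shift: a'[t][d] = gamma[d] * a[t][d] + beta[d]."""
    return [[g * a + b for a, g, b in zip(row, gamma, beta)] for row in chunk]

def make_toy_velocity(prior_chunk):
    """Toy stand-in for the frozen policy: a straight-line flow to prior_chunk."""
    def velocity(chunk, s):
        return [[(p - a) / (1.0 - s) for a, p in zip(a_row, p_row)]
                for a_row, p_row in zip(chunk, prior_chunk)]
    return velocity

# Horizon of 3 steps, 2 action dimensions (x, y); values are illustrative.
prior_chunk = [[0.1, 0.0], [0.2, 0.0], [0.3, 0.0]]
rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(3)]

unsteered = euler_integrate(make_toy_velocity(prior_chunk), noise)

# Hypothetical adapter output: leave x untouched, shift y toward the cued target.
gamma, beta = [1.0, 1.0], [0.0, 0.05]
steered = diagonal_affine(unsteered, gamma, beta)
```

Because the map is diagonal, the adapter only needs two numbers per action dimension, which is consistent with the very small parameter overhead reported above.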

Video Overview

Narrated supplementary video (3 min)

Experimental Results

Evaluation across three settings of increasing difficulty:

Setting I: Seen objects + paraphrased instructions
Setting II: Unseen objects + in-domain instructions
Setting III: Unseen objects + novel instructions
Success rates (%) under tabletop and mobile-base conditions. All learned methods share the same fine-tuned π0.5 backbone. Best results in each column in bold.
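| Method | Tabletop: Setting I | Tabletop: Setting II | Tabletop: Setting III | Tabletop: Average | Mobile Base: Setting I | Mobile Base: Setting II | Mobile Base: Setting III | Mobile Base: Average |
|---|---|---|---|---|---|---|---|---|
| Base (π0.5) | 70.8 | 45.8 | 0.0 | 38.9 | 33.3 | 50.0 | 8.3 | 30.5 |
| Base-L | 37.5 | 8.3 | 29.2 | 25.0 | 8.3 | 0.0 | 8.3 | 5.5 |
| MOKA | 20.8 | 25.0 | 8.3 | 18.0 | 16.7 | 8.3 | 0.0 | 8.3 |
| VP-VLA | 58.3 | 45.8 | 4.2 | 36.1 | 58.3 | **66.7** | 8.3 | 44.4 |
| Ours | **83.3** | **66.7** | **66.7** | **72.2** | **83.3** | **66.7** | **41.7** | **63.9** |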
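The averages in the table, and the size of the improvement, can be checked from the per-setting numbers. The short script below recomputes each average as the unweighted mean of the three settings and then takes the ratio of Ours to Base. Comparing against Base is an assumption made here; the page does not say which baseline the "nearly 2×" figure refers to.

```python
# Per-setting success rates (%) copied from the table: (Setting I, II, III).
RATES = {
    "Base":   {"tabletop": (70.8, 45.8, 0.0),  "mobile": (33.3, 50.0, 8.3)},
    "Base-L": {"tabletop": (37.5, 8.3, 29.2),  "mobile": (8.3, 0.0, 8.3)},
    "MOKA":   {"tabletop": (20.8, 25.0, 8.3),  "mobile": (16.7, 8.3, 0.0)},
    "VP-VLA": {"tabletop": (58.3, 45.8, 4.2),  "mobile": (58.3, 66.7, 8.3)},
    "Ours":   {"tabletop": (83.3, 66.7, 66.7), "mobile": (83.3, 66.7, 41.7)},
}

def mean(values):
    return sum(values) / len(values)

# Unweighted mean over the three settings for every method and condition.
averages = {
    method: {condition: mean(rates) for condition, rates in conditions.items()}
    for method, conditions in RATES.items()
}

# Improvement of Ours over the unsteered base policy (assumed reference).
ratios = {
    condition: averages["Ours"][condition] / averages["Base"][condition]
    for condition in ("tabletop", "mobile")
}

for method, by_condition in averages.items():
    print(f"{method:8s} tabletop={by_condition['tabletop']:.1f} "
          f"mobile={by_condition['mobile']:.1f}")
print({condition: round(r, 2) for condition, r in ratios.items()})
```

Under that assumption the ratio comes out at roughly 1.9× for the tabletop condition and roughly 2.1× for the mobile-base condition.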

Method Comparison

The same delivery task ("load the bag with order #2") executed by each method on a static mobile platform.

Base — untargeted behavior
Base-L — biased motion, unreliable grounding
VP-VLA — oscillation / wrong target
Ours — follows the cue and succeeds

Mobile-Base Demonstrations

Cue-conditioned manipulation across diverse instructions and scene configurations.

Demo 1
Demo 2
Demo 3
Demo 4
Demo 5
Demo 6

Towards a Complete Delivery Pipeline

Combining navigation with autonomous manipulation, indoors and outdoors.

Outdoor delivery 1 (2× speed)
Outdoor delivery 2 (2× speed)
Outdoor delivery 3