Looking Back to Move Forward:
Temporal Verification for Generative Robot Policies

An efficient, temporally aware verification framework for flow-matching VLAs.

1 University of Illinois Chicago | 2 University of California, Los Angeles | Equal advising

Abstract

Generative policies have emerged as a promising paradigm for robot learning, combining expressive generative action modeling with scalable imitation learning from large demonstration corpora. Yet when trained on heterogeneous demonstrations, their chunked generation process can produce suboptimal actions, causing errors to compound during execution and eventually driving the robot into out-of-distribution failure states. Action verification has recently emerged as a test-time scaling approach to mitigate this issue — sampling multiple action candidates and using an external verifier to select the best one. However, existing approaches remain temporally myopic: they evaluate candidates using only static-timestep observations, and often rely on large-scale verifiers and additional expert demonstrations.

We introduce Temporal Verification (TeV), an efficient, temporally aware action-verification framework for flow-matching VLAs. TeV first learns a temporal token that summarizes recent observation–action history, so that candidate chunks are evaluated as continuations of the robot's execution trajectory rather than as isolated predictions. Conditioned on this token, TeV constructs positive–negative sample pairs without additional expert data or preference annotations, and trains an energy-based verifier with a contrastive objective to favor higher-quality chunks that are temporally compatible with recent execution.

Beyond post-hoc ranking, TeV uses the learned energy score to steer intermediate samples toward lower-energy regions during flow integration. The verifier increases the parameter count of the frozen base policy by less than 0.15% and requires no additional expert demonstrations. Extensive experiments in simulation and real-world settings show that TeV provides reliable action candidate ranking, yields consistent task-success gains of 6%–18%, and produces smoother execution trajectories.

Paradigm comparison. (a) Conventional action verification selects among multiple sampled candidates at each decision step, relying only on the current observation. (b) Temporal-consistency methods refine the current prediction using previous action chunks to promote cross-chunk coherence. (c) Temporal Verification integrates both: conditioned on past observation–action context, the learned temporal verifier steers and selects candidates that are consistent with recent execution.

[Figure: comparison of the action verification, consistency promotion, and temporal verification paradigms]

How TeV Works

A lightweight add-on for flow-matching VLAs — no policy fine-tuning, no extra demonstrations.

Temporal Token

A lightweight temporal encoder compresses the recent observation–action history into a single compact token, trained with a dynamics-aware future-action prediction objective. It gives the verifier execution context beyond the current observation.
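
To make the idea of compressing history into one token concrete, here is a minimal sketch of the data flow only. It assumes a recency-weighted average over concatenated observation and action features followed by a linear projection; the function name `temporal_token`, the `decay` parameter, and the pooling scheme are illustrative assumptions, not the encoder architecture used by TeV, and the future-action prediction training is not shown.

```python
import random


def temporal_token(history, projection, decay=0.8):
    """Compress (observation, action) pairs into one fixed-size token.

    history: oldest-to-newest list of (obs_features, action) pairs,
             each a list of floats.
    projection: matrix with one row per token dimension (hypothetical
                stand-in for learned encoder weights).
    """
    n = len(history)
    # Recency weights: the newest step has weight 1, older steps decay.
    coeffs = [decay ** (n - 1 - t) for t in range(n)]
    total = sum(coeffs)
    # Concatenate observation and action features at every step.
    steps = [list(obs) + list(act) for obs, act in history]
    # Weighted average over time gives one pooled feature vector.
    pooled = [sum(c * s[i] for c, s in zip(coeffs, steps)) / total
              for i in range(len(steps[0]))]
    # Linear projection to the token dimension.
    return [sum(w * x for w, x in zip(row, pooled)) for row in projection]


rng = random.Random(0)
obs_dim, act_dim, token_dim = 3, 2, 4
history = [([rng.random() for _ in range(obs_dim)],
            [rng.random() for _ in range(act_dim)]) for _ in range(5)]
projection = [[rng.uniform(-1.0, 1.0) for _ in range(obs_dim + act_dim)]
              for _ in range(token_dim)]
token = temporal_token(history, projection)
print("token dimension:", len(token))
```

Whatever the real encoder looks like, the property this sketch shares with it is that the token has a fixed size regardless of how many history steps are supplied, so the verifier's input shape does not depend on the history length.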

Contrastive Energy Verifier

Conditioned on the temporal token, an energy model ranks candidate chunks. Training needs no extra expert data: temporally mismatched expert chunks and coarse-integration policy samples serve as free, informative negatives.
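
The sketch below illustrates one way the "free negatives" idea could be wired into a margin-based contrastive loss. It assumes a hinge formulation in which each negative should have energy at least `margin` above the positive; `toy_energy` is a hypothetical stand-in for the learned energy model, and the last executed action stands in for the temporal token. Coarse-integration negatives are omitted because they require a policy to sample from.

```python
def mismatched_negative(expert_chunks, t, offset):
    """Return the expert chunk from timestep t + offset (wrapping around).

    The chunk is valid expert motion, but it belongs to a different moment
    of the episode than the context at timestep t.
    """
    if offset % len(expert_chunks) == 0:
        raise ValueError("offset must point to a different timestep")
    return expert_chunks[(t + offset) % len(expert_chunks)]


def margin_loss(energy_pos, energies_neg, margin=1.0):
    """Hinge loss that reaches zero once every negative's energy exceeds
    the positive's energy by at least `margin`."""
    terms = [max(0.0, margin + energy_pos - e) for e in energies_neg]
    return sum(terms) / len(terms)


def toy_energy(last_action, chunk):
    """Hypothetical energy: squared jump from the last executed action
    to the first action of the candidate chunk."""
    return (chunk[0] - last_action) ** 2


# One expert episode split into consecutive two-action chunks.
expert_chunks = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
t = 1
last_action = expert_chunks[t - 1][-1]   # context ends where chunk t begins
positive = expert_chunks[t]              # temporally matched expert chunk
negative = mismatched_negative(expert_chunks, t, offset=2)

loss = margin_loss(toy_energy(last_action, positive),
                   [toy_energy(last_action, negative)],
                   margin=0.5)
print("margin loss:", loss)
```

The loss stays positive here because the toy energy does not yet separate the two chunks by the full margin; in training, that residual is what would push the verifier's parameters.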

Steering + Selection

TeV uses the differentiable energy score to guide intermediate samples toward lower-energy regions during flow integration, then executes the completed chunk with the lowest energy. The verifier adds less than 0.15% to the base policy's parameter count.
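
As a rough illustration of steering followed by selection, the sketch below runs Euler integration of a toy straight-line flow and subtracts a scaled energy gradient at every step, then picks the lowest-energy completed chunk. Everything here is an assumption made for illustration: the smoothness-style `energy` replaces the learned verifier, the `endpoints` replace what a real policy's velocity field would produce, and the `guidance` weight and update rule are one plausible form of energy guidance rather than TeV's actual procedure.

```python
import random


def energy(chunk, last_action):
    """Toy stand-in for the learned verifier (an assumption): penalize jumps
    from the last executed action and between consecutive actions."""
    seq = [last_action] + list(chunk)
    return sum((b - a) ** 2 for a, b in zip(seq, seq[1:]))


def energy_grad(chunk, last_action):
    """Analytic gradient of `energy` with respect to each action in the chunk."""
    seq = [last_action] + list(chunk)
    grad = []
    for i in range(1, len(seq)):
        g = 2.0 * (seq[i] - seq[i - 1])
        if i + 1 < len(seq):
            g -= 2.0 * (seq[i + 1] - seq[i])
        grad.append(g)
    return grad


def integrate(x0, endpoint, last_action, steps=10, guidance=0.0):
    """Euler-integrate a toy straight-line flow from noise `x0` toward
    `endpoint`, subtracting the scaled energy gradient at every step."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        tau = k / steps
        velocity = [(e - xi) / (1.0 - tau) for e, xi in zip(endpoint, x)]
        grad = energy_grad(x, last_action)
        x = [xi + dt * (v - guidance * g)
             for xi, v, g in zip(x, velocity, grad)]
    return x


def steer_and_select(starts, endpoints, last_action, guidance=0.5):
    """Steer every candidate during integration, then return the completed
    chunk with the lowest energy."""
    chunks = [integrate(x0, e, last_action, guidance=guidance)
              for x0, e in zip(starts, endpoints)]
    return min(chunks, key=lambda c: energy(c, last_action))


rng = random.Random(0)
last_action = 0.2
num_candidates, horizon = 4, 4   # M = 4 candidates, 4 actions per chunk
starts = [[rng.gauss(0.0, 1.0) for _ in range(horizon)]
          for _ in range(num_candidates)]
endpoints = [[0.2 + rng.gauss(0.0, 0.3) for _ in range(horizon)]
             for _ in range(num_candidates)]
best = steer_and_select(starts, endpoints, last_action)
print("selected chunk:", [round(a, 3) for a in best])
```

Setting `guidance=0.0` recovers plain best-of-M selection, which makes it easy to compare ranking alone against ranking plus steering in this toy setting.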

Overview of Temporal Verification. (a) Overall pipeline: observation–action history is encoded into a temporal token that conditions the energy-based contrastive verifier. At each flow-matching integration step, candidate representations are scored by the verifier; the resulting energies steer intermediate samples during integration and rank completed chunks after integration. (b) Verifier training: expert chunks serve as positives, while temporally mismatched chunks and policy-generated deviations serve as negatives. The verifier learns to assign positives lower energy than negatives by at least a specified margin.

[Figure: TeV pipeline with temporal encoder, contrastive verifier, and steering and selection]

Experimental Results

All methods share the same frozen π0.5 backbone and the same candidate budget (M = 4).

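Performance (%) on LIBERO-Plus across seven perturbation types. TACO* uses M = 50 candidates.

| Method | Camera | Robot | Language | Light | Background | Noise | Layout | Average |
|---|---|---|---|---|---|---|---|---|
| Base (π0.5) | 41.8 | 72.3 | 79.9 | 82.8 | 84.4 | 76.2 | 84.9 | 73.2 |
| Random | 43.9 | 70.2 | 86.7 | 81.8 | 85.5 | 79.3 | 81.1 | 74.3 |
| TE | 44.4 | 72.3 | 82.2 | 76.3 | 85.1 | 74.8 | 84.0 | 73.0 |
| BID | 44.2 | 73.3 | 76.2 | 81.4 | 80.3 | 76.6 | 81.7 | 72.2 |
| TACO | 44.9 | 77.1 | 90.1 | 77.4 | 86.2 | 77.1 | 87.2 | 76.0 |
| TACO* | 42.7 | 74.8 | 83.8 | 76.6 | 76.8 | 72.2 | 80.1 | 71.5 |
| TeV (Ours) | 48.4 | 75.6 | 91.9 | 84.7 | 95.8 | 80.0 | 92.6 | 79.8 |
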
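Performance (%) on RoboTwin 2.0 tasks.

| Method | Adjust Bottle | Pick Bottles | Place Container | Stack Bowls | Place Cup | Open Laptop | Press Stapler | Average |
|---|---|---|---|---|---|---|---|---|
| Base (π0.5) | 86.0 | 48.0 | 86.0 | 89.0 | 78.0 | 75.0 | 44.0 | 72.3 |
| Random | 85.0 | 52.0 | 90.0 | 88.0 | 79.0 | 72.0 | 50.0 | 73.7 |
| TE | 87.0 | 43.0 | 86.0 | 88.0 | 82.0 | 74.0 | 47.0 | 72.4 |
| BID | 87.0 | 59.0 | 87.0 | 91.0 | 76.0 | 77.0 | 47.0 | 74.9 |
| TACO | 81.0 | 46.0 | 87.0 | 85.0 | 80.0 | 62.0 | 49.0 | 70.0 |
| TACO* | 85.0 | 32.0 | 85.0 | 87.0 | 80.0 | 67.0 | 52.0 | 69.7 |
| TeV (Ours) | 95.0 | 64.0 | 94.0 | 93.0 | 88.0 | 81.0 | 54.0 | 81.3 |
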
Task completion score (%) on real-world tasks (mean ± std over 30 trials per setting).

| Method | Bottle to Bag | Cup Transport | Water Filling |
|---|---|---|---|
| Base (π0.5) | 60.7 ± 31.9 | 79.8 ± 27.5 | 40.5 ± 32.4 |
| BID | 59.8 ± 27.8 | 71.5 ± 25.9 | 51.3 ± 30.5 |
| TACO | 64.0 ± 30.5 | 75.2 ± 21.2 | 51.5 ± 29.7 |
| TeV (Ours) | 69.3 ± 26.7 | 88.3 ± 15.9 | 68.0 ± 12.8 |
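
For readers who want to check the size of the improvements, the short script below computes TeV's absolute gain over the base policy in percentage points. The numbers are copied from the Average columns of the two simulation tables and from the mean scores in the real-world table; the dictionary names are only illustrative.

```python
# Mean scores (%) copied from the result tables: the Average column for the
# two simulation benchmarks and the per-task means for the real-world tasks.
base = {
    "LIBERO-Plus (average)": 73.2,
    "RoboTwin 2.0 (average)": 72.3,
    "Bottle to Bag": 60.7,
    "Cup Transport": 79.8,
    "Water Filling": 40.5,
}
tev = {
    "LIBERO-Plus (average)": 79.8,
    "RoboTwin 2.0 (average)": 81.3,
    "Bottle to Bag": 69.3,
    "Cup Transport": 88.3,
    "Water Filling": 68.0,
}

# Absolute gain in percentage points, rounded to the tables' precision.
gains = {name: round(tev[name] - base[name], 1) for name in base}

for name, gain in gains.items():
    print(f"{name}: +{gain} points over the base policy")
```

These are differences of means only; the real-world standard deviations are large, so the per-task gains there should be read alongside the reported spread.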

Real-World Demonstrations

Real-world tasks require stable execution and smooth movement: the bottle is deformable, and two tasks involve transporting or pouring liquid. Baselines often exhibit abrupt motions and oscillations — leading to dropped bottles or spilled water — whereas TeV maintains a steadier trajectory and completes the task reliably.

Demonstration of TeV's effect

Baselines exhibit jerky, unstable motion and fail; TeV stays stable and smooth, and succeeds

Real-World Rollouts

Bottle to Bag — put the water bottle into the bag
Cup Transport — move the cup from the shelf to the table
Water Filling — pour water from the right cup into the left cup