Temporal Verification for Generative Robot Policies

Abstract

Generative policies have emerged as a promising paradigm for robot learning, combining expressive generative action modeling with scalable imitation learning from large demonstration corpora. Yet when trained on heterogeneous demonstrations, their chunked generation process can produce suboptimal actions, causing errors to compound during execution and eventually driving the robot into out-of-distribution failure states. Action verification has recently emerged as a test-time scaling approach to mitigate this issue: it samples multiple action candidates and uses an external verifier to select the best one. However, existing approaches remain temporally myopic: they evaluate candidates using only the observation at the current timestep, and they often rely on large-scale verifiers and additional expert demonstrations.

We introduce Temporal Verification (TeV), an efficient, temporally aware action verification framework for flow-matching vision-language-action models (VLAs). TeV first learns a temporal token that summarizes recent observation–action history, allowing candidate chunks to be evaluated as continuations of the robot's execution trajectory rather than as isolated predictions. Conditioned on this token, TeV constructs positive–negative sample pairs without additional expert data or preference annotations, and trains an energy-based verifier with a contrastive objective to favor higher-quality chunks that are temporally compatible with recent execution.

Beyond post-hoc ranking, TeV uses the learned energy score to steer intermediate samples toward lower-energy regions during flow integration. The verifier adds less than 0.15% additional parameters on top of the frozen base policy and requires no additional expert demonstrations. Extensive experiments in simulation and real-world settings show that TeV provides reliable action candidate ranking, yields consistent task-success gains of 6%–18%, and produces smoother execution trajectories.

Paradigm comparison. (a) Conventional action verification selects among multiple sampled candidates at each decision step, relying only on the current observation. (b) Temporal-consistency methods refine the current prediction using previous action chunks to promote cross-chunk coherence. (c) Temporal Verification integrates both: conditioned on past observation–action context, the learned temporal verifier steers and selects candidates that are consistent with recent execution.


How TeV Works

A lightweight add-on for flow-matching VLAs: no policy fine-tuning, no extra demonstrations.

Temporal Token

A lightweight temporal encoder compresses the recent observation–action history into a single compact token, trained with a dynamics-aware future-action prediction objective. It gives the verifier execution context beyond the current observation.
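As a rough illustration of what compressing recent history into a single token can look like, the sketch below keeps a fixed-length buffer of observation–action pairs and pools it into one vector. The class name `HistoryBuffer`, the `horizon` and `decay` parameters, and the recency-weighted average are all assumptions made for this example. TeV's actual encoder is learned with a future-action prediction objective, which is not reproduced here.

```python
from collections import deque


class HistoryBuffer:
    """Keeps the last `horizon` (observation, action) pairs (hypothetical helper)."""

    def __init__(self, horizon):
        # deque with maxlen silently drops the oldest step once the buffer is full.
        self.pairs = deque(maxlen=horizon)

    def push(self, obs_feat, action):
        # One step vector = observation features followed by the executed action.
        self.pairs.append(list(obs_feat) + list(action))

    def token(self, decay=0.5):
        """Pool the buffer into ONE vector, weighting newer steps more.

        This fixed pooling is only a stand-in for a learned temporal encoder.
        """
        if not self.pairs:
            raise ValueError("history is empty")
        n = len(self.pairs)
        # Newest step gets weight 1, the one before it `decay`, then decay**2, ...
        weights = [decay ** (n - 1 - i) for i in range(n)]
        total = sum(weights)
        dim = len(self.pairs[0])
        return [
            sum(w * step[d] for w, step in zip(weights, self.pairs)) / total
            for d in range(dim)
        ]


# Usage: three steps pushed into a buffer that only remembers two.
buf = HistoryBuffer(horizon=2)
buf.push([0.0], [0.0])
buf.push([1.0], [2.0])
buf.push([3.0], [4.0])
temporal_token = buf.token()  # a single 2-dimensional vector
```

Whatever the encoder, the point is the interface: the verifier receives one fixed-size vector regardless of how many past steps are summarized.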

Contrastive Energy Verifier

Conditioned on the temporal token, an energy model ranks candidate chunks. Training needs no extra expert data: temporally mismatched expert chunks and coarse-integration policy samples serve as free, informative negatives.
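One common way to express "positives should receive lower energy than negatives by a margin" is a hinge loss. The sketch below shows that loss together with one way to pick a temporally mismatched expert chunk. The function names, the `margin` and `min_gap` parameters, and the hinge form itself are assumptions for illustration; the exact objective and sampling rule used by TeV are not specified on this page.

```python
import random


def hinge_margin_loss(e_pos, e_negs, margin=1.0):
    """Mean over negatives of max(0, margin + E(pos) - E(neg)).

    The loss is zero once every negative sits at least `margin` above the positive.
    """
    return sum(max(0.0, margin + e_pos - e_neg) for e_neg in e_negs) / len(e_negs)


def sample_mismatched_index(t, num_chunks, min_gap=2, rng=random):
    """Pick an index t' with |t' - t| >= min_gap.

    The chunk at t' is still expert data, but it belongs to a different moment
    of the trajectory, so it is a negative for the history ending at t.
    """
    candidates = [i for i in range(num_chunks) if abs(i - t) >= min_gap]
    if not candidates:
        raise ValueError("trajectory too short for the requested gap")
    return rng.choice(candidates)


# Usage with made-up energies: the first negative already clears the margin,
# the second one does not and contributes 0.5 to the sum.
loss = hinge_margin_loss(e_pos=0.0, e_negs=[2.0, 0.5], margin=1.0)
neg_index = sample_mismatched_index(t=3, num_chunks=8, min_gap=2)
```

Policy samples produced with a coarse integration schedule would enter the same loss as additional entries in `e_negs`.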

Steering + Selection

The differentiable energy score guides intermediate samples toward lower-energy regions during flow integration; once integration finishes, the completed chunk with the lowest energy is selected for execution. The verifier adds less than 0.15% parameters to the base policy.
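The two stages can be sketched with a toy example: Euler integration of the flow with an extra energy-gradient term, followed by picking the lowest-energy result. Everything here is an assumption for illustration, including the function names, the `guidance` weight, the constant toy velocity field, and the quadratic toy energy with its hand-written gradient. A real system would query the policy's learned velocity field and differentiate the learned verifier automatically.

```python
def steered_euler(x0, velocity_fn, energy_grad_fn, steps=10, guidance=1.0):
    """Euler-integrate dx/dt = v(x, t) - guidance * dE/dx from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        g = energy_grad_fn(x)
        # Follow the flow, but lean toward lower energy at every step.
        x = [xi + dt * (vi - guidance * gi) for xi, vi, gi in zip(x, v, g)]
    return x


def select_lowest_energy(chunks, energy_fn):
    """Post-hoc selection: return the completed chunk with the smallest energy."""
    return min(chunks, key=energy_fn)


# --- Toy stand-ins (hypothetical) ---
TARGET = [0.5, 0.5]  # the chunk this toy verifier considers most compatible


def energy_fn(x):
    return sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))


def energy_grad_fn(x):
    # Analytic gradient of the quadratic toy energy.
    return [2.0 * (xi - ti) for xi, ti in zip(x, TARGET)]


def velocity_fn(x, t):
    # Constant field: without guidance, x(1) = x(0) + 1 in every dimension.
    return [1.0 for _ in x]


# M = 4 starting noise samples, steered during integration, then ranked.
starts = [[0.0, 0.0], [-1.0, 0.5], [0.3, -0.2], [1.0, 1.0]]
chunks = [steered_euler(x0, velocity_fn, energy_grad_fn) for x0 in starts]
best = select_lowest_energy(chunks, energy_fn)
```

Setting `guidance=0.0` recovers plain sampling followed by selection, which makes it easy to compare the effect of steering on the same starting noise.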

Overview of Temporal Verification. (a) Overall pipeline: observation–action history is encoded into a temporal token that conditions the energy-based contrastive verifier. At each flow-matching integration step, candidate representations are scored by the verifier; the resulting energies steer intermediate samples during integration and rank completed chunks after integration. (b) Verifier training: expert chunks serve as positives, while temporally mismatched chunks and policy-generated deviations serve as negatives. The verifier learns to assign positives lower energy than negatives by at least a specified margin.


Experimental Results

Unless otherwise noted, all methods share the same frozen π_0.5 backbone and the same candidate budget (M = 4).

Performance (%) on LIBERO-Plus across seven perturbation types. Best in bold. TACO^* uses M = 50 candidates.

Method	Camera	Robot	Language	Light	Background	Noise	Layout	Average
Base (π_0.5)	41.8	72.3	79.9	82.8	84.4	76.2	84.9	73.2
Random	43.9	70.2	86.7	81.8	85.5	79.3	81.1	74.3
TE	44.4	72.3	82.2	76.3	85.1	74.8	84.0	73.0
BID	44.2	73.3	76.2	81.4	80.3	76.6	81.7	72.2
TACO	44.9	77.1	90.1	77.4	86.2	77.1	87.2	76.0
TACO^*	42.7	74.8	83.8	76.6	76.8	72.2	80.1	71.5
TeV (Ours)	48.4	75.6	91.9	84.7	95.8	80.0	92.6	79.8

Performance (%) on RoboTwin 2.0 tasks. Best in bold.

Method	Adjust Bottle	Pick Bottles	Place Container	Stack Bowls	Place Cup	Open Laptop	Press Stapler	Average
Base (π_0.5)	86.0	48.0	86.0	89.0	78.0	75.0	44.0	72.3
Random	85.0	52.0	90.0	88.0	79.0	72.0	50.0	73.7
TE	87.0	43.0	86.0	88.0	82.0	74.0	47.0	72.4
BID	87.0	59.0	87.0	91.0	76.0	77.0	47.0	74.9
TACO	81.0	46.0	87.0	85.0	80.0	62.0	49.0	70.0
TACO^*	85.0	32.0	85.0	87.0	80.0	67.0	52.0	69.7
TeV (Ours)	95.0	64.0	94.0	93.0	88.0	81.0	54.0	81.3

Task completion score (%) on real-world tasks (mean ± std over 30 trials per setting). Best in bold.

Method	Bottle to Bag	Cup Transport	Water Filling
Base (π_0.5)	60.7±31.9	79.8±27.5	40.5±32.4
BID	59.8±27.8	71.5±25.9	51.3±30.5
TACO	64.0±30.5	75.2±21.2	51.5±29.7
TeV (Ours)	69.3±26.7	88.3±15.9	68.0±12.8

Real-World Demonstrations

The real-world tasks require stable execution and smooth movement: the bottle is deformable, and two tasks involve transporting or pouring liquid. Baselines often exhibit abrupt motions and oscillations that lead to dropped bottles or spilled water, whereas TeV maintains a steadier trajectory and completes the tasks more reliably.