E-TTS

TL;DR:

E-TTS is a plug-and-play embodied test-time scaling framework that jointly scales reasoning and action with history-aware verification and closed-loop feedback, improving multiple VLA base models across multiple robotic benchmarks without retraining.

Abstract

Recent test-time scaling methods for embodied tasks mainly scale actions, but robotic manipulation also depends on explicit reasoning and historical context. E-TTS addresses these two gaps through a modular embodied test-time scaling framework that unifies reasoning and action scaling for robotic manipulation.

E-TTS performs reasoning-action joint sampling and pairwise scoring, stores historical context in a history buffer for verification, and introduces feedback generation into the sampling process to form a closed-loop iterative refinement mechanism. Each component is composable, allowing E-TTS to be flexibly integrated with different VLA policies.

The framework is evaluated across 4 benchmarks, 6 environments, 3 embodiments, and 4 base VLA models. Without additional expert data collection or retraining, E-TTS consistently improves performance, with absolute success-rate gains of up to 33.14 percentage points in simulation and 26.62 percentage points in real-world experiments.

Figure 1: Overview. E-TTS integrates reasoning and action scaling through history-aware, closed-loop interactions with vision-language verifiers and is evaluated on SimplerEnv, LIBERO, LIBERO-Plus, VLABench, and real-world Franka tasks.

At a glance: 4 benchmarks, 6 environments, 4 base policies, and a maximum simulation gain of +33.14%.

E-TTS wraps around an existing VLA policy at inference time. At each decision step, the base policy generates multiple reasoning-action candidates. Separate verifiers evaluate the reasoning and action quality, and a joint score selects high-confidence candidates for execution.

Figure 2: Framework. E-TTS jointly samples reasoning and actions, verifies them with history-aware reasoning and action verifiers, and uses feedback-guided resampling when candidates are rejected.

Joint Reasoning-Action Scaling

Reasoning traces and low-level actions are sampled as coupled hypotheses, so test-time compute improves high-level planning and executable control together.
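A minimal sketch of joint sampling and selection, written in Python for illustration only. The names `Candidate`, `sample_candidates`, `joint_score`, and `select_best` are hypothetical, and the convex combination of the two verifier scores is an assumption: the page states that a joint score is used but does not give its form. The toy policy and verifiers are stand-ins so that the sketch runs.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    reasoning: str   # sampled reasoning trace
    action: tuple    # sampled low-level action (hypothetical 3-D delta)


def sample_candidates(policy, observation, n, rng):
    """Draw n coupled reasoning-action hypotheses from a stochastic policy."""
    return [policy(observation, rng) for _ in range(n)]


def joint_score(r_score, a_score, weight=0.5):
    """Assumed combination rule: convex mix of the two verifier scores."""
    return weight * r_score + (1.0 - weight) * a_score


def select_best(candidates, reasoning_verifier, action_verifier, weight=0.5):
    """Return the candidate with the highest joint score, and that score."""
    scored = [
        (joint_score(reasoning_verifier(c), action_verifier(c), weight), c)
        for c in candidates
    ]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_policy(observation, rng):
    dx = rng.uniform(-1.0, 1.0)
    return Candidate(
        reasoning=f"move by {dx:+.2f} toward target",
        action=(dx, 0.0, 0.0),
    )


def toy_reasoning_verifier(c):
    return 1.0 if "toward target" in c.reasoning else 0.0


def toy_action_verifier(c):
    # Prefers actions whose x-delta is close to 0.5.
    return 1.0 - min(1.0, abs(c.action[0] - 0.5))


rng = random.Random(0)
cands = sample_candidates(toy_policy, observation=None, n=8, rng=rng)
best, score = select_best(cands, toy_reasoning_verifier, toy_action_verifier)
```

Because each candidate carries both a reasoning trace and an action, a single selection step ranks the plan and the control command together rather than choosing them independently.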

History-Aware Verification

A history buffer stores recent observations, reasoning, and actions. The reasoning verifier uses this temporal context to evaluate consistency in long-horizon manipulation.
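One way to picture the history buffer is as a fixed-length queue of recent steps that is serialized into context for the verifier. The sketch below is an assumption-laden illustration: the class name `HistoryBuffer`, the `Step` fields, the window size, and the text format of `as_prompt` are all hypothetical and are not taken from the E-TTS implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    observation: str   # e.g. an image caption or frame identifier
    reasoning: str     # reasoning trace chosen at that step
    action: tuple      # action executed at that step


class HistoryBuffer:
    """Keeps only the most recent steps; older ones are dropped automatically."""

    def __init__(self, max_steps=4):
        self._steps = deque(maxlen=max_steps)

    def append(self, step):
        self._steps.append(step)

    def recent(self):
        """Return the stored steps, oldest first."""
        return list(self._steps)

    def as_prompt(self):
        """Serialize the window as text, labeling each step by its age."""
        n = len(self._steps)
        return "\n".join(
            f"t-{n - i}: obs={s.observation}; reasoning={s.reasoning}; action={s.action}"
            for i, s in enumerate(self._steps)
        )


buffer = HistoryBuffer(max_steps=2)
buffer.append(Step("frame0", "locate carrot", (0.0, 0.0, 0.0)))
buffer.append(Step("frame1", "reach carrot", (0.1, 0.0, 0.0)))
buffer.append(Step("frame2", "grasp carrot", (0.0, 0.0, -0.1)))
context = buffer.as_prompt()
```

A bounded window keeps the verifier prompt short while still exposing enough recent context to check that a new reasoning trace is consistent with what was just done.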

Feedback-Guided Refinement

Rejected candidates trigger structured feedback, which is injected into the next sampling round to refine the policy's reasoning and action proposals.
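The closed loop described above can be sketched as a bounded propose, verify, and feed-back cycle. This is a simplified illustration under stated assumptions: `refine_with_feedback`, the acceptance `threshold`, the `max_rounds` cap, and the fallback to the best rejected candidate are hypothetical choices, and the toy functions stand in for the VLA policy and verifiers.

```python
def refine_with_feedback(propose, verify, make_feedback, max_rounds=3, threshold=0.5):
    """Resample until a candidate is accepted or the round budget runs out.

    Returns (candidate, score, rounds_used). If no candidate reaches the
    threshold, the best-scoring rejected candidate is returned instead.
    """
    feedback = None
    best, best_score = None, float("-inf")
    for round_idx in range(max_rounds):
        candidate = propose(feedback)       # feedback is None in round one
        score = verify(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:
            return candidate, score, round_idx + 1
        feedback = make_feedback(candidate, score)  # injected into next round
    return best, best_score, max_rounds


# Toy stand-ins (hypothetical): candidates are integers, scores are candidate / 10.
def toy_propose(feedback):
    return 2 if feedback is None else feedback["last_candidate"] + 3


def toy_verify(candidate):
    return candidate / 10.0


def toy_feedback(candidate, score):
    return {
        "last_candidate": candidate,
        "message": f"score {score:.1f} is below threshold; reach further",
    }


result, result_score, rounds_used = refine_with_feedback(
    toy_propose, toy_verify, toy_feedback, max_rounds=3, threshold=0.7
)
```

In this toy run the first two proposals are rejected and the third is accepted, which mirrors how structured feedback steers later sampling rounds.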

The method supports different intermediate reasoning types: multimodal reasoning from E-CoT, spatial reasoning from MolmoAct, and textual task-plan reasoning from pi0.5 and ER-1.

E-TTS is evaluated as a framework rather than as a single-model improvement. The experiments cover multiple benchmark families and base policies: E-CoT on SimplerEnv WidowX; MolmoAct on SimplerEnv Google Robot, LIBERO, and LIBERO-Plus; pi0.5 on VLABench; ER-1 on SimplerEnv WidowX; and MolmoAct on real-world Franka tasks.


SimplerEnv: E-CoT and MolmoAct

On SimplerEnv, E-TTS improves both a multimodal-reasoning policy and a spatial-reasoning policy. With E-CoT on WidowX, E-TTS raises the average success rate from 6.67% to 39.81%, outperforming naive TTS and RoboMonkey. With MolmoAct on Google Robot, it improves both visual matching and variant aggregation.

Method	Spoon on Towel	Eggplant in Basket	Carrot on Plate	Average
E-CoT	20.00	0.00	0.00	6.67
E-CoT + naive TTS	41.67	4.17	25.00	23.61
E-CoT + RoboMonkey	33.33	8.33	37.50	26.38
E-CoT + E-TTS	58.33	22.22	38.89	39.81

Table 1: SimplerEnv WidowX visual matching results with E-CoT.
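As a quick arithmetic check, the averages in Table 1 and the headline simulation gain can be reproduced from the per-task success rates. The snippet is only a worked calculation over the reported numbers; the helper name `mean` is illustrative.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Per-task success rates (%) from Table 1: spoon, eggplant, carrot.
ecot = [20.00, 0.00, 0.00]
ecot_etts = [58.33, 22.22, 38.89]

ecot_avg = round(mean(ecot), 2)            # 6.67
etts_avg = round(mean(ecot_etts), 2)       # 39.81
gain_points = round(etts_avg - ecot_avg, 2)  # 33.14 percentage points
```

The 33.14 figure is therefore an absolute difference in average success rate, not a relative improvement.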

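Method	Visual Matching: Pick Coke Can	Visual Matching: Move Near	Visual Matching: Open/Close Drawer	Visual Matching: Average	Variant Aggregation: Pick Coke Can	Variant Aggregation: Move Near	Variant Aggregation: Open/Close Drawer	Variant Aggregation: Average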
MolmoAct	74.50	75.45	63.25	71.05	66.95	52.55	77.75	65.68
MolmoAct + naive TTS	76.47	80.00	66.60	74.36	56.00	50.00	80.00	62.00
MolmoAct + E-TTS	83.00	85.00	72.00	80.00	80.00	70.00	83.30	77.77

Table 2: SimplerEnv Google Robot results with MolmoAct.

LIBERO and LIBERO-Plus: MolmoAct

E-TTS also improves MolmoAct on Franka-based simulation benchmarks. LIBERO evaluates Spatial, Object, Goal, and Long-Horizon task suites, while LIBERO-Plus stresses robustness under viewpoint, object layout, robot initialization, instruction, lighting, texture, and sensor perturbations.

Benchmark / Method	Spatial	Object	Goal	Long	Average
LIBERO / MolmoAct	87.00	92.67	87.60	75.00	85.57
LIBERO / MolmoAct + naive TTS	88.90	93.10	70.00	75.00	81.75
LIBERO / MolmoAct + E-TTS	93.60	97.50	92.00	80.00	90.78
LIBERO-Plus / MolmoAct	88.80	90.40	84.30	77.46	85.24
LIBERO-Plus / MolmoAct + E-TTS	91.20	91.36	85.04	80.99	87.15

Table 3: LIBERO and LIBERO-Plus results. LIBERO-Plus average is computed from the four reported categories.

pi0.5 on VLABench

E-TTS also integrates with pi0.5 on VLABench, a long-horizon benchmark whose compositional tasks require multi-step planning, world-knowledge transfer, semantic instruction understanding, and physical-law awareness. E-TTS improves the average success rate on every listed track, with notable gains on tasks such as select_poker (SPo, Track 1) from 0.46 to 0.63 and add_condiment (AC, Track 4) from 0.10 to 0.16.

Method	Track	IF	SB	SCT	SPo	AC	SD	Avg.
pi0.5	Track 1	0.18	0.48	0.48	0.46	0.50	0.42	0.39
pi0.5 + E-TTS	Track 1	0.20	0.53	0.53	0.63	0.54	0.48	0.42
pi0.5	Track 2	-	0.02	0.15	0.40	0.06	0.14	0.20
pi0.5 + E-TTS	Track 2	-	0.02	0.16	0.48	0.08	0.14	0.24
pi0.5	Track 3	0.08	0.38	0.50	-	0.14	0.14	0.19
pi0.5 + E-TTS	Track 3	0.10	0.36	0.54	-	0.18	0.16	0.21
pi0.5	Track 4	0.04	0.38	0.21	0.06	0.10	0.20	0.15
pi0.5 + E-TTS	Track 4	0.06	0.48	0.24	0.08	0.16	0.22	0.16
pi0.5	Track 6	0.06	0.45	0.29	0.22	0.46	0.30	0.25
pi0.5 + E-TTS	Track 6	0.04	0.48	0.31	0.28	0.31	0.34	0.26

Table 4: VLABench success rates for representative tasks with pi0.5 and pi0.5 + E-TTS.

Reasoning-Action Scaling Trade-off

The paper studies how to allocate a fixed test-time sampling budget between reasoning scaling and action scaling. Over-prioritizing either side is suboptimal, while an approximately balanced 1/1 proportion achieves the best success rate across the reported tasks.

Figure 3: Success rate across different proportions. The 1/1 proportion achieves peak performance across all tasks.
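To make the trade-off concrete, the sketch below splits a fixed sampling budget between reasoning samples and action samples according to a proportion. It assumes an additive budget (reasoning samples plus action samples equals the total), which is an illustrative simplification; the page does not specify how the budget is counted, and `split_budget` is a hypothetical helper.

```python
def split_budget(total, reasoning_parts, action_parts):
    """Split `total` samples in the ratio reasoning_parts : action_parts.

    Each side always receives at least one sample.
    """
    if total < 2:
        raise ValueError("total must be at least 2 so both sides get a sample")
    if reasoning_parts <= 0 or action_parts <= 0:
        raise ValueError("both parts of the proportion must be positive")
    share = reasoning_parts / (reasoning_parts + action_parts)
    n_reasoning = round(total * share)
    n_reasoning = min(max(n_reasoning, 1), total - 1)
    return n_reasoning, total - n_reasoning


balanced = split_budget(16, 1, 1)         # (8, 8)
action_heavy = split_budget(16, 1, 3)     # (4, 12)
reasoning_heavy = split_budget(16, 3, 1)  # (12, 4)
```

Under this reading, the 1/1 setting corresponds to the `balanced` split, while the other two splits represent over-prioritizing one side.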

The real-robot evaluation uses a Franka Research 3 platform and evaluates E-TTS on four manipulation tasks with rigid and deformable objects. The visual setup and task-level results are shown below, followed by the expanded 160-trial real-world table.


Figure 4: Real-robot experiments. The paper evaluates E-TTS with a Franka Research 3 setup across Lion, Bag, Sanitizer, and Drawer tasks, showing consistent gains over the MolmoAct Finetuned baseline.

Method	Lion	Bag	Sanitizer	Drawer	Average
MolmoAct Finetuned	22.50	17.50	32.50	16.00	22.13
MolmoAct + E-TTS	60.00	42.50	57.50	35.00	48.75

Rebuttal Table 1: Expanded real-world evaluation over 160 trials. E-TTS improves the average success rate from 22.13% to 48.75%.

E-CoT: SimplerEnv WidowX, carrot on plate

E-CoT + ours: SimplerEnv WidowX, carrot on plate

E-CoT: SimplerEnv WidowX, eggplant in basket

E-CoT + ours: SimplerEnv WidowX, eggplant in basket

MolmoAct: LIBERO Goal, wine bottle on cabinet

MolmoAct + ours: LIBERO Goal, wine bottle on cabinet

MolmoAct: LIBERO Object, cream cheese in basket

MolmoAct + ours: LIBERO Object, cream cheese in basket

MolmoAct: LIBERO Long, black bowl in drawer

MolmoAct + ours: LIBERO Long, black bowl in drawer

MolmoAct: LIBERO Spatial, black bowl next to ramekin

MolmoAct + ours: LIBERO Spatial, black bowl next to ramekin

@inproceedings{ye2026etts,
  title={E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation},
  author={Ye, Wen and Li, Peiyan and Yuan, Tingyu and Xu, Yuan and Wu, Xiangnan and Zhao, Chaoyang and Liu, Jing and Liu, Nianfeng and Huang, Yan and Wang, Liang},
  booktitle={ECCV},
  year={2026}
}
