E-TTS

A New Embodied Test-Time Scaling Framework for Robotic Manipulation

Wen Ye,1,2,* Peiyan Li,1,2,* Tingyu Yuan,2,3 Yuan Xu,1,2 Xiangnan Wu,1,2
Chaoyang Zhao,3 Jing Liu,4 Nianfeng Liu,4 Yan Huang,1,2,4 Liang Wang1,2,
*Equal Contribution, Corresponding Author
1New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences
2School of Artificial Intelligence, University of Chinese Academy of Sciences
3Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences
4FiveAges

TL;DR:

E-TTS is a plug-and-play embodied test-time scaling framework that jointly scales reasoning and action with history-aware verification and closed-loop feedback, improving multiple VLA base models across multiple robotic benchmarks without retraining.

Abstract

Recent test-time scaling methods for embodied tasks mainly scale actions, but robotic manipulation also depends on explicit reasoning and historical context. E-TTS addresses these two gaps through a modular embodied test-time scaling framework that unifies reasoning and action scaling for robotic manipulation.

E-TTS performs reasoning-action joint sampling and pairwise scoring, stores historical context in a history buffer for verification, and introduces feedback generation into the sampling process to form a closed-loop iterative refinement mechanism. Each component is composable, allowing E-TTS to be flexibly integrated with different VLA policies.

The framework is evaluated across 4 benchmarks, 6 environments, 3 embodiments, and 4 base VLA models. Without additional expert data collection or retraining, E-TTS consistently improves performance, with absolute success-rate gains of up to 33.14 percentage points in simulation (6.67% to 39.81% with E-CoT on SimplerEnv WidowX) and 26.62 percentage points in real-world experiments (22.13% to 48.75% on the Franka tasks).

Figure 1: Overview. E-TTS integrates reasoning and action scaling through history-aware, closed-loop interactions with vision-language verifiers and is evaluated on SimplerEnv, LIBERO, LIBERO-Plus, VLABench, and real-world Franka tasks.

4 Benchmarks | 6 Environments | 4 Base Policies | +33.14% Max Simulation Gain

Method

E-TTS wraps around an existing VLA policy at inference time. At each decision step, the base policy generates multiple reasoning-action candidates. Separate verifiers evaluate the reasoning and action quality, and a joint score selects high-confidence candidates for execution.
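The selection step described above can be illustrated with a minimal sketch. The page does not give the exact scoring rule, so the weighted sum, the `weight` parameter, the acceptance `threshold`, and the `Candidate` type below are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One reasoning-action hypothesis with its two verifier scores."""
    reasoning: str
    action: List[float]
    reasoning_score: float  # assumed to lie in [0, 1], from a reasoning verifier
    action_score: float     # assumed to lie in [0, 1], from an action verifier


def joint_score(c: Candidate, weight: float = 0.5) -> float:
    """Combine both verifier scores; the weighted sum is an assumed rule."""
    return weight * c.reasoning_score + (1.0 - weight) * c.action_score


def select_candidate(candidates: List[Candidate],
                     threshold: float = 0.6) -> Optional[Candidate]:
    """Return the best candidate if it clears the threshold, else None."""
    if not candidates:
        return None
    best = max(candidates, key=joint_score)
    return best if joint_score(best) >= threshold else None


if __name__ == "__main__":
    pool = [
        Candidate("grasp the spoon handle", [0.1, 0.0, -0.2], 0.9, 0.8),
        Candidate("push the towel first", [0.3, 0.1, 0.0], 0.4, 0.95),
    ]
    print(select_candidate(pool))
```

Returning `None` when no candidate clears the threshold leaves room for a resampling step instead of executing a low-confidence action.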

Figure 2: Framework. E-TTS jointly samples reasoning and actions, verifies them with history-aware reasoning and action verifiers, and uses feedback-guided resampling when candidates are rejected.

Joint Reasoning-Action Scaling

Reasoning traces and low-level actions are sampled as coupled hypotheses, so test-time compute improves high-level planning and executable control together.
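One way to picture coupled sampling is to draw several reasoning traces and, for each trace, several actions conditioned on it. The sketch below assumes this nested scheme; the sampler signatures and the toy samplers are hypothetical stand-ins for a real VLA policy and are not taken from the paper.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical sampler interfaces standing in for a VLA policy.
ReasoningSampler = Callable[[str, random.Random], str]
ActionSampler = Callable[[str, str, random.Random], List[float]]


def sample_joint(obs: str,
                 sample_reasoning: ReasoningSampler,
                 sample_action: ActionSampler,
                 n_reasoning: int,
                 n_actions: int,
                 seed: int = 0) -> List[Tuple[str, List[float]]]:
    """Return n_reasoning * n_actions coupled (reasoning, action) pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_reasoning):
        reasoning = sample_reasoning(obs, rng)
        for _ in range(n_actions):
            # Each action is conditioned on the reasoning trace it is paired with.
            pairs.append((reasoning, sample_action(obs, reasoning, rng)))
    return pairs


def toy_reasoning(obs: str, rng: random.Random) -> str:
    """Toy stand-in: pick one of a few canned plans."""
    return rng.choice(["approach from above", "approach from the side"])


def toy_action(obs: str, reasoning: str, rng: random.Random) -> List[float]:
    """Toy stand-in: a small random 3-D end-effector displacement."""
    return [rng.gauss(0.0, 0.01) for _ in range(3)]


if __name__ == "__main__":
    for reasoning, action in sample_joint("image_0", toy_reasoning, toy_action, 2, 2):
        print(reasoning, action)
```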

History-Aware Verification

A history buffer stores recent observations, reasoning, and actions. The reasoning verifier uses this temporal context to evaluate consistency in long-horizon manipulation.
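A history buffer of this kind can be sketched as a fixed-length queue of recent steps. The capacity, the `Step` fields, and the text serialization below are assumptions for illustration; the page does not specify how the buffer is stored or how it is presented to the verifier.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class Step:
    """One past decision step (field types are illustrative)."""
    observation: str
    reasoning: str
    action: str


class HistoryBuffer:
    """Keeps only the most recent max_steps steps; older ones are evicted."""

    def __init__(self, max_steps: int = 4) -> None:
        if max_steps <= 0:
            raise ValueError("max_steps must be positive")
        self._steps: Deque[Step] = deque(maxlen=max_steps)

    def add(self, step: Step) -> None:
        self._steps.append(step)

    def recent(self) -> List[Step]:
        """Return stored steps, oldest first."""
        return list(self._steps)

    def as_context(self) -> str:
        """Serialize the history as text a verifier prompt could include."""
        n = len(self._steps)
        return "\n".join(
            f"t-{n - i}: obs={s.observation}; reasoning={s.reasoning}; action={s.action}"
            for i, s in enumerate(self._steps)
        )


if __name__ == "__main__":
    buf = HistoryBuffer(max_steps=2)
    buf.add(Step("frame_0", "reach for the bowl", "move +x"))
    buf.add(Step("frame_1", "close the gripper", "grip"))
    print(buf.as_context())
```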

Feedback-Guided Refinement

Rejected candidates trigger structured feedback, which is injected into the next sampling round to refine the policy's reasoning and action proposals.
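The closed loop can be sketched as a bounded propose-verify-feedback cycle. The function names, the string-based feedback, the round limit, and the toy gripper-width example are all hypothetical; they only illustrate the control flow, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

C = TypeVar("C")


def refine(propose: Callable[[Optional[str]], List[C]],
           verify: Callable[[C], Tuple[bool, str]],
           max_rounds: int = 3) -> Tuple[Optional[C], int]:
    """Resample with feedback until a candidate is accepted or rounds run out.

    Returns (accepted candidate or None, number of rounds used).
    """
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        reasons = []
        for candidate in propose(feedback):
            accepted, reason = verify(candidate)
            if accepted:
                return candidate, round_idx
            reasons.append(reason)
        # Rejection reasons become the feedback for the next sampling round.
        feedback = "; ".join(reasons)
    return None, max_rounds


def toy_propose(feedback: Optional[str]) -> List[float]:
    """Toy stand-in: propose gripper widths, narrower once feedback arrives."""
    return [0.10, 0.08] if feedback is None else [0.05, 0.04]


def toy_verify(width: float) -> Tuple[bool, str]:
    """Toy stand-in: accept widths close to an assumed 0.04 m target."""
    if abs(width - 0.04) < 0.005:
        return True, "ok"
    return False, "too wide" if width > 0.04 else "too narrow"


if __name__ == "__main__":
    print(refine(toy_propose, toy_verify))
```

Bounding the number of rounds keeps the extra inference cost predictable when no candidate is ever accepted.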

The method supports different intermediate reasoning types: multimodal reasoning from E-CoT, spatial reasoning from MolmoAct, and textual task-plan reasoning from pi0.5 and ER-1.

Experiments

E-TTS is evaluated as a framework rather than as a single-model improvement. The experiments cover multiple benchmark families and base policies: E-CoT on SimplerEnv WidowX; MolmoAct on SimplerEnv Google Robot, LIBERO, and LIBERO-Plus; pi0.5 on VLABench; ER-1 on SimplerEnv WidowX; and MolmoAct on real-world Franka tasks.

SimplerEnv: E-CoT and MolmoAct

On SimplerEnv, E-TTS improves both a multimodal-reasoning policy and a spatial-reasoning policy. With E-CoT on WidowX, E-TTS raises the average success rate from 6.67% to 39.81%, outperforming naive TTS (23.61%) and RoboMonkey (26.38%). With MolmoAct on Google Robot, it raises the average from 71.05% to 80.00% on visual matching and from 65.68% to 77.77% on variant aggregation.

Method             | Spoon on Towel | Eggplant in Basket | Carrot on Plate | Average
E-CoT              | 20.00          | 0.00               | 0.00            | 6.67
E-CoT + naive TTS  | 41.67          | 4.17               | 25.00           | 23.61
E-CoT + RoboMonkey | 33.33          | 8.33               | 37.50           | 26.38
E-CoT + E-TTS      | 58.33          | 22.22              | 38.89           | 39.81

Table 1: SimplerEnv WidowX visual matching results with E-CoT.

VM = Visual Matching, VA = Variant Aggregation

Method               | VM Pick Coke Can | VM Move Near | VM Open/Close Drawer | VM Average | VA Pick Coke Can | VA Move Near | VA Open/Close Drawer | VA Average
MolmoAct             | 74.50            | 75.45        | 63.25                | 71.05      | 66.95            | 52.55        | 77.75                | 65.68
MolmoAct + naive TTS | 76.47            | 80.00        | 66.60                | 74.36      | 56.00            | 50.00        | 80.00                | 62.00
MolmoAct + E-TTS     | 83.00            | 85.00        | 72.00                | 80.00      | 80.00            | 70.00        | 83.30                | 77.77

Table 2: SimplerEnv Google Robot results with MolmoAct.

LIBERO and LIBERO-Plus: MolmoAct

E-TTS also improves MolmoAct on Franka-based simulation benchmarks. LIBERO evaluates Spatial, Object, Goal, and Long-Horizon task suites, while LIBERO-Plus stresses robustness under viewpoint, object layout, robot initialization, instruction, lighting, texture, and sensor perturbations.

Benchmark / Method            | Spatial | Object | Goal  | Long  | Average
LIBERO / MolmoAct             | 87.00   | 92.67  | 87.60 | 75.00 | 85.57
LIBERO / MolmoAct + naive TTS | 88.90   | 93.10  | 70.00 | 75.00 | 81.75
LIBERO / MolmoAct + E-TTS     | 93.60   | 97.50  | 92.00 | 80.00 | 90.78
LIBERO-Plus / MolmoAct        | 88.80   | 90.40  | 84.30 | 77.46 | 85.24
LIBERO-Plus / MolmoAct + E-TTS | 91.20  | 91.36  | 85.04 | 80.99 | 87.15

Table 3: LIBERO and LIBERO-Plus results. LIBERO-Plus average is computed from the four reported categories.
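As a worked check of the caption's note, the LIBERO-Plus averages can be recomputed as the unweighted mean of the four reported categories. The numbers below are copied from Table 3; the variable names are illustrative, and equal weighting of the categories is the assumption being tested.

```python
from statistics import mean

# Spatial, Object, Goal, Long scores for LIBERO-Plus, copied from Table 3.
libero_plus = {
    "MolmoAct": [88.80, 90.40, 84.30, 77.46],
    "MolmoAct + E-TTS": [91.20, 91.36, 85.04, 80.99],
}

# Unweighted mean over the four categories, rounded to two decimals.
averages = {name: round(mean(scores), 2) for name, scores in libero_plus.items()}

# Absolute gain in percentage points from adding E-TTS.
gain = round(averages["MolmoAct + E-TTS"] - averages["MolmoAct"], 2)

if __name__ == "__main__":
    print(averages, gain)
```

The recomputed means agree with the 85.24 and 87.15 reported in the table, which is consistent with an unweighted average.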

pi0.5 on VLABench

E-TTS also integrates with pi0.5 on VLABench, a long-horizon benchmark with compositional tasks that require multi-step planning, world knowledge transfer, semantic instruction understanding, and physical-law awareness. E-TTS improves the average success rate on every listed track, with notable gains on tasks such as select_poker (0.46 to 0.63 on Track 1) and add_condiment (0.10 to 0.16 on Track 4).

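Method        | Track   | IF   | SB   | SCT  | SPo  | AC   | SD   | Avg.
pi0.5         | Track 1 | 0.18 | 0.48 | 0.48 | 0.46 | 0.50 | 0.42 | 0.39
pi0.5 + E-TTS | Track 1 | 0.20 | 0.53 | 0.53 | 0.63 | 0.54 | 0.48 | 0.42
pi0.5         | Track 2 | -    | 0.02 | 0.15 | 0.40 | 0.06 | 0.14 | 0.20
pi0.5 + E-TTS | Track 2 | -    | 0.02 | 0.16 | 0.48 | 0.08 | 0.14 | 0.24
pi0.5         | Track 3 | 0.08 | 0.38 | 0.50 | -    | 0.14 | 0.14 | 0.19
pi0.5 + E-TTS | Track 3 | 0.10 | 0.36 | 0.54 | -    | 0.18 | 0.16 | 0.21
pi0.5         | Track 4 | 0.04 | 0.38 | 0.21 | 0.06 | 0.10 | 0.20 | 0.15
pi0.5 + E-TTS | Track 4 | 0.06 | 0.48 | 0.24 | 0.08 | 0.16 | 0.22 | 0.16
pi0.5         | Track 6 | 0.06 | 0.45 | 0.29 | 0.22 | 0.46 | 0.30 | 0.25
pi0.5 + E-TTS | Track 6 | 0.04 | 0.48 | 0.31 | 0.28 | 0.31 | 0.34 | 0.26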

Table 4: VLABench success rates for representative tasks with pi0.5 and pi0.5 + E-TTS.

Reasoning-Action Scaling Trade-off

The paper studies how to allocate a fixed test-time sampling budget between reasoning scaling and action scaling. Over-prioritizing either side is suboptimal, while an approximately balanced 1/1 proportion achieves the best success rate across the reported tasks.
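To make the allocation question concrete, the sketch below splits a fixed sample budget according to a reasoning:action proportion. Treating the budget as a total sample count and dividing it proportionally is an assumption made for illustration; the page does not define how the proportion maps to sample counts, and `split_budget` is a hypothetical helper.

```python
from typing import Tuple


def split_budget(total: int, reasoning_parts: int, action_parts: int) -> Tuple[int, int]:
    """Split `total` samples by the proportion reasoning_parts:action_parts.

    Returns (reasoning samples, action samples); any remainder from integer
    division goes to the action side so the two counts always sum to `total`.
    """
    if total < 0 or reasoning_parts < 0 or action_parts < 0:
        raise ValueError("arguments must be non-negative")
    if reasoning_parts + action_parts == 0:
        raise ValueError("at least one side of the proportion must be positive")
    n_reasoning = total * reasoning_parts // (reasoning_parts + action_parts)
    return n_reasoning, total - n_reasoning


if __name__ == "__main__":
    for proportion in [(1, 3), (1, 1), (3, 1)]:
        print(proportion, split_budget(8, *proportion))
```

Under this reading, the 1/1 proportion corresponds to giving reasoning and action sampling equal shares of the budget.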

Figure 3: Success rate across different proportions. The 1/1 proportion achieves peak performance across all tasks.

Real-World Experiments

The real-robot evaluation uses a Franka Research 3 platform and evaluates E-TTS on four manipulation tasks with rigid and deformable objects. The visual setup and task-level results are shown below, followed by the expanded 160-trial real-world table.

Figure 4: Real-robot experiments. The paper evaluates E-TTS with a Franka Research 3 setup across Lion, Bag, Sanitizer, and Drawer tasks, showing consistent gains over the MolmoAct Finetuned baseline.

Method             | Lion  | Bag   | Sanitizer | Drawer | Average
MolmoAct Finetuned | 22.50 | 17.50 | 32.50     | 16.00  | 22.13
MolmoAct + E-TTS   | 60.00 | 42.50 | 57.50     | 35.00  | 48.75

Rebuttal Table 1: Expanded real-world evaluation over 160 trials, improving average success from 22.13% to 48.75%.

Videos

E-CoT: SimplerEnv WidowX, carrot on plate

E-CoT + ours: SimplerEnv WidowX, carrot on plate

E-CoT: SimplerEnv WidowX, eggplant in basket

E-CoT + ours: SimplerEnv WidowX, eggplant in basket

MolmoAct: LIBERO Goal, wine bottle on cabinet

MolmoAct + ours: LIBERO Goal, wine bottle on cabinet

MolmoAct: LIBERO Object, cream cheese in basket

MolmoAct + ours: LIBERO Object, cream cheese in basket

MolmoAct: LIBERO Long, black bowl in drawer

MolmoAct + ours: LIBERO Long, black bowl in drawer

MolmoAct: LIBERO Spatial, black bowl next to ramekin

MolmoAct + ours: LIBERO Spatial, black bowl next to ramekin

Citation

@inproceedings{ye2026etts,
  title={E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation},
  author={Ye, Wen and Li, Peiyan and Yuan, Tingyu and Xu, Yuan and Wu, Xiangnan and Zhao, Chaoyang and Liu, Jing and Liu, Nianfeng and Huang, Yan and Wang, Liang},
  booktitle={ECCV},
  year={2026}
}