E-TTS is evaluated as a framework rather than as a single-model improvement. The experiments cover multiple benchmark families and base policies: E-CoT on SimplerEnv WidowX, MolmoAct on SimplerEnv Google Robot, LIBERO and LIBERO-Plus, pi0.5 on VLABench, ER-1 on SimplerEnv WidowX, and MolmoAct on real-world Franka tasks.
SimplerEnv: E-CoT and MolmoAct
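On SimplerEnv, E-TTS improves both a multimodal-reasoning policy and a spatial-reasoning policy. With E-CoT on WidowX, E-TTS raises the average success rate from 6.67% to 39.81%, outperforming naive TTS (23.61%) and RoboMonkey (26.38%). With MolmoAct on Google Robot, it raises the average success rate from 71.05% to 80.00% on visual matching and from 65.68% to 77.77% on variant aggregation, whereas naive TTS lowers the variant aggregation average to 62.00%.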
| Method | Spoon on Towel | Eggplant in Basket | Carrot on Plate | Average |
|---|---|---|---|---|
| E-CoT | 20.00 | 0.00 | 0.00 | 6.67 |
| E-CoT + naive TTS | 41.67 | 4.17 | 25.00 | 23.61 |
| E-CoT + RoboMonkey | 33.33 | 8.33 | 37.50 | 26.38 |
| E-CoT + E-TTS | 58.33 | 22.22 | 38.89 | 39.81 |
Table 1: SimplerEnv WidowX visual matching results with E-CoT.
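The Average column in Table 1 is consistent, to within rounding, with an unweighted mean over the three tasks. The following is a minimal sketch of that arithmetic and of the gain in percentage points over a chosen baseline. The dictionary and function names are illustrative and not taken from the E-TTS code.

```python
# Per-task success rates (%) copied from Table 1, in the order:
# Spoon on Towel, Eggplant in Basket, Carrot on Plate.
TABLE1 = {
    "E-CoT": [20.00, 0.00, 0.00],
    "E-CoT + naive TTS": [41.67, 4.17, 25.00],
    "E-CoT + RoboMonkey": [33.33, 8.33, 37.50],
    "E-CoT + E-TTS": [58.33, 22.22, 38.89],
}


def mean(values):
    """Unweighted mean of a non-empty list of success rates."""
    return sum(values) / len(values)


def absolute_gain(method, baseline="E-CoT"):
    """Difference in average success rate, in percentage points."""
    return mean(TABLE1[method]) - mean(TABLE1[baseline])


if __name__ == "__main__":
    for name, rates in TABLE1.items():
        print(f"{name}: average {mean(rates):.2f}, "
              f"gain over E-CoT {absolute_gain(name):+.2f} points")
```

Under this assumption, the recomputed averages agree with the table to within 0.01, the small differences coming from rounding of the per-task values.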
| Method | Visual Matching: Pick Coke Can | Visual Matching: Move Near | Visual Matching: Open/Close Drawer | Visual Matching: Average | Variant Aggregation: Pick Coke Can | Variant Aggregation: Move Near | Variant Aggregation: Open/Close Drawer | Variant Aggregation: Average |
|---|---|---|---|---|---|---|---|---|
| MolmoAct | 74.50 | 75.45 | 63.25 | 71.05 | 66.95 | 52.55 | 77.75 | 65.68 |
| MolmoAct + naive TTS | 76.47 | 80.00 | 66.60 | 74.36 | 56.00 | 50.00 | 80.00 | 62.00 |
| MolmoAct + E-TTS | 83.00 | 85.00 | 72.00 | 80.00 | 80.00 | 70.00 | 83.30 | 77.77 |
Table 2: SimplerEnv Google Robot results with MolmoAct.
LIBERO and LIBERO-Plus: MolmoAct
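E-TTS also improves MolmoAct on Franka-based simulation benchmarks, raising the average success rate from 85.57% to 90.78% on LIBERO and from 85.24% to 87.15% on LIBERO-Plus. LIBERO evaluates Spatial, Object, Goal, and Long-Horizon task suites, while LIBERO-Plus stresses robustness under viewpoint, object layout, robot initialization, instruction, lighting, texture, and sensor perturbations. On LIBERO, naive TTS lowers the Goal suite from 87.60% to 70.00%, pulling its average below the MolmoAct baseline.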
| Benchmark / Method | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|
| LIBERO / MolmoAct | 87.00 | 92.67 | 87.60 | 75.00 | 85.57 |
| LIBERO / MolmoAct + naive TTS | 88.90 | 93.10 | 70.00 | 75.00 | 81.75 |
| LIBERO / MolmoAct + E-TTS | 93.60 | 97.50 | 92.00 | 80.00 | 90.78 |
| LIBERO-Plus / MolmoAct | 88.80 | 90.40 | 84.30 | 77.46 | 85.24 |
| LIBERO-Plus / MolmoAct + E-TTS | 91.20 | 91.36 | 85.04 | 80.99 | 87.15 |
Table 3: LIBERO and LIBERO-Plus results. LIBERO-Plus average is computed from the four reported categories.
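As a worked check of the caption, the LIBERO-Plus averages can be recomputed from the four category scores. This is a small sketch using the values in Table 3. The names `LIBERO_PLUS` and `category_average` are hypothetical helpers introduced here for illustration.

```python
# LIBERO-Plus category success rates (%) copied from Table 3.
LIBERO_PLUS = {
    "MolmoAct": {
        "Spatial": 88.80, "Object": 90.40, "Goal": 84.30, "Long": 77.46,
    },
    "MolmoAct + E-TTS": {
        "Spatial": 91.20, "Object": 91.36, "Goal": 85.04, "Long": 80.99,
    },
}


def category_average(scores):
    """Unweighted mean over the reported categories, rounded to 2 decimals."""
    return round(sum(scores.values()) / len(scores), 2)


def per_category_gain(base, improved):
    """Per-category difference in percentage points (improved minus base)."""
    return {name: round(improved[name] - base[name], 2) for name in base}


if __name__ == "__main__":
    base = LIBERO_PLUS["MolmoAct"]
    etts = LIBERO_PLUS["MolmoAct + E-TTS"]
    print(category_average(base), category_average(etts))
    print(per_category_gain(base, etts))
```

The recomputed averages match the 85.24 and 87.15 reported in the table, and the largest per-category gain falls on the Long suite.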
pi0.5 on VLABench
E-TTS also integrates with pi0.5 on VLABench, a long-horizon benchmark with compositional tasks that require multi-step planning, world knowledge transfer, semantic instruction understanding, and physical-law awareness. E-TTS improves the average success rate on every listed track, with notable gains on tasks such as select_poker (SPo, Track 1) from 0.46 to 0.63 and add_condiment (AC, Track 4) from 0.10 to 0.16.
| Method | Track | IF | SB | SCT | SPo | AC | SD | Avg. |
|---|---|---|---|---|---|---|---|---|
| pi0.5 | Track 1 | 0.18 | 0.48 | 0.48 | 0.46 | 0.50 | 0.42 | 0.39 |
| pi0.5 + E-TTS | Track 1 | 0.20 | 0.53 | 0.53 | 0.63 | 0.54 | 0.48 | 0.42 |
| pi0.5 | Track 2 | - | 0.02 | 0.15 | 0.40 | 0.06 | 0.14 | 0.20 |
| pi0.5 + E-TTS | Track 2 | - | 0.02 | 0.16 | 0.48 | 0.08 | 0.14 | 0.24 |
| pi0.5 | Track 3 | 0.08 | 0.38 | 0.50 | - | 0.14 | 0.14 | 0.19 |
| pi0.5 + E-TTS | Track 3 | 0.10 | 0.36 | 0.54 | - | 0.18 | 0.16 | 0.21 |
| pi0.5 | Track 4 | 0.04 | 0.38 | 0.21 | 0.06 | 0.10 | 0.20 | 0.15 |
| pi0.5 + E-TTS | Track 4 | 0.06 | 0.48 | 0.24 | 0.08 | 0.16 | 0.22 | 0.16 |
| pi0.5 | Track 6 | 0.06 | 0.45 | 0.29 | 0.22 | 0.46 | 0.30 | 0.25 |
| pi0.5 + E-TTS | Track 6 | 0.04 | 0.48 | 0.31 | 0.28 | 0.31 | 0.34 | 0.26 |
Table 4: VLABench success rates for representative tasks with pi0.5 and pi0.5 + E-TTS.
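Although the Avg. column improves on every track, Table 4 also contains a few individual task entries where pi0.5 + E-TTS scores below pi0.5. The sketch below lists them from the table values. Entries shown as "-" in the table are encoded as `None`, and the variable and function names are illustrative only.

```python
# Task columns of Table 4, in table order.
TASKS = ["IF", "SB", "SCT", "SPo", "AC", "SD"]

# Success rates copied from Table 4; None marks a "-" entry.
PI05 = {
    "Track 1": [0.18, 0.48, 0.48, 0.46, 0.50, 0.42],
    "Track 2": [None, 0.02, 0.15, 0.40, 0.06, 0.14],
    "Track 3": [0.08, 0.38, 0.50, None, 0.14, 0.14],
    "Track 4": [0.04, 0.38, 0.21, 0.06, 0.10, 0.20],
    "Track 6": [0.06, 0.45, 0.29, 0.22, 0.46, 0.30],
}
PI05_ETTS = {
    "Track 1": [0.20, 0.53, 0.53, 0.63, 0.54, 0.48],
    "Track 2": [None, 0.02, 0.16, 0.48, 0.08, 0.14],
    "Track 3": [0.10, 0.36, 0.54, None, 0.18, 0.16],
    "Track 4": [0.06, 0.48, 0.24, 0.08, 0.16, 0.22],
    "Track 6": [0.04, 0.48, 0.31, 0.28, 0.31, 0.34],
}


def regressions(base, improved):
    """Return (track, task, delta) for entries where `improved` is lower."""
    found = []
    for track, base_row in base.items():
        for task, b, e in zip(TASKS, base_row, improved[track]):
            if b is not None and e is not None and e < b:
                found.append((track, task, round(e - b, 2)))
    return found


if __name__ == "__main__":
    for track, task, delta in regressions(PI05, PI05_ETTS):
        print(f"{track} {task}: {delta:+.2f}")
```

On the table values this reports three entries: SB on Track 3, and IF and AC on Track 6.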
Reasoning-Action Scaling Trade-off
The paper studies how to allocate a fixed test-time sampling budget between reasoning scaling and action scaling. Over-prioritizing either side is suboptimal, while an approximately balanced 1/1 proportion achieves the best success rate across the reported tasks.
Figure 3: Success rate across different proportions. The 1/1 proportion achieves peak performance across all tasks.
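To make the allocation question concrete, the following is a minimal sketch of splitting a fixed sampling budget according to a reasoning/action proportion. It assumes the budget is a simple count of samples divided additively between the two sides. That budget model, the rounding rule, and the function name `split_budget` are assumptions made for illustration, not the paper's implementation.

```python
from typing import Tuple


def split_budget(total_samples: int,
                 reasoning_parts: int,
                 action_parts: int) -> Tuple[int, int]:
    """Split `total_samples` in the ratio reasoning_parts : action_parts.

    Assumption: the budget is additive (reasoning samples + action samples).
    Any remainder left by integer division goes to the action side.
    """
    if total_samples < 0:
        raise ValueError("total_samples must be non-negative")
    if reasoning_parts < 0 or action_parts < 0:
        raise ValueError("ratio parts must be non-negative")
    if reasoning_parts + action_parts == 0:
        raise ValueError("at least one ratio part must be positive")

    n_reasoning = (total_samples * reasoning_parts) // (reasoning_parts + action_parts)
    n_action = total_samples - n_reasoning
    return n_reasoning, n_action


if __name__ == "__main__":
    # Hypothetical budget of 8 samples under three proportions.
    for ratio in [(3, 1), (1, 1), (1, 3)]:
        print(ratio, split_budget(8, *ratio))
```

With a hypothetical budget of 8, the 1/1 proportion yields 4 reasoning samples and 4 action samples, while 3/1 and 1/3 skew the same budget toward one side.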