The evaluation platform for robotics researchers

Run every major simulator benchmark in hours, not days. See exactly where your policy fails. Compare against verified baselines.

Free for research partners

manifold-cli

$ manifold run

STEP 01

Test on every benchmark that matters

Evaluate broadly without spending weeks wiring up each new simulator. One harness runs LIBERO, RoboCasa, and your own scenarios across every major simulator.

Choosing the LIBERO benchmark from Manifold's benchmark picker, with RoboCasa, RoboMimic, CALVIN, and more in the list.

Pick a policy and a benchmark. No per-simulator harness to build.

STEP 02

Get results in a fraction of the time

Benchmarks are sharded across GPUs by default. Sharded across 8 GPUs, LIBERO runs 8x faster than on a single GPU, so an overnight job becomes a lunch break.

LIBERO-90 rollouts running in parallel, sharded across 8 GPUs.

LIBERO-90, 1,000 rollouts: single GPU vs sharded across 8.
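
To make the sharding arithmetic concrete, here is a minimal Python sketch of splitting 1,000 rollouts evenly across 8 workers. It is an illustration only, not Manifold's scheduler: the function names `shard_rollouts` and `ideal_speedup` are hypothetical, and the speedup shown is the ideal upper bound, assuming every rollout takes the same time and ignoring startup and communication overhead.

```python
def shard_rollouts(num_rollouts: int, num_shards: int) -> list[range]:
    """Split rollout indices into contiguous, near-equal shards (hypothetical helper)."""
    base, extra = divmod(num_rollouts, num_shards)
    shards = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards take one additional rollout each.
        size = base + (1 if i < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


def ideal_speedup(num_rollouts: int, num_shards: int) -> float:
    """Upper bound on speedup: wall-clock time is set by the largest shard."""
    largest = max(len(s) for s in shard_rollouts(num_rollouts, num_shards))
    return num_rollouts / largest


shards = shard_rollouts(1000, 8)
print([len(s) for s in shards])  # [125, 125, 125, 125, 125, 125, 125, 125]
print(ideal_speedup(1000, 8))    # 8.0
```

When the rollout count does not divide evenly, the largest shard sets the wall-clock time, so the ideal speedup drops slightly below the number of GPUs.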

STEP 03

Track progress over time and against the SOTA

Compare your daily performance against verified results from every major model, across every major benchmark. See your rank move against every published baseline, run over run.

Manifold's public leaderboard: policy scores climbing against verified baselines over time.

Verified baselines on shared benchmarks. Every run gets a citable manifold:// URI.

STEP 04

Discover where your policy fails

Cluster failed episodes by failure mode to see the specific task families and subtasks that break your policy, not just an aggregate score.

Failed episodes clustered by failure mode in a Manifold run detail.

Run detail from manifold.bifrost.ai: score, per-task pass rates, clustered failure modes.
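
As a rough illustration of why a per-task and per-failure-mode breakdown says more than one aggregate score, here is a small Python sketch. It is not Manifold's clustering method: it assumes each failed episode already carries a failure-mode label, and the record fields (`task`, `success`, `failure_mode`) and the toy episodes are hypothetical, invented purely for the example.

```python
from collections import defaultdict

# Toy episode records, invented for illustration only.
episodes = [
    {"task": "open_drawer", "success": True, "failure_mode": None},
    {"task": "open_drawer", "success": False, "failure_mode": "grasp_slip"},
    {"task": "pick_place", "success": False, "failure_mode": "grasp_slip"},
    {"task": "pick_place", "success": False, "failure_mode": "timeout"},
    {"task": "pick_place", "success": True, "failure_mode": None},
    {"task": "open_drawer", "success": True, "failure_mode": None},
]


def per_task_pass_rate(records):
    """Return {task: fraction of episodes that succeeded}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        passes[r["task"]] += int(r["success"])
    return {task: passes[task] / totals[task] for task in totals}


def group_failures(records):
    """Return {failure_mode: sorted list of tasks where it occurred}."""
    groups = defaultdict(set)
    for r in records:
        if not r["success"]:
            groups[r["failure_mode"]].add(r["task"])
    return {mode: sorted(tasks) for mode, tasks in groups.items()}


print(per_task_pass_rate(episodes))
print(group_failures(episodes))
```

In this toy data the aggregate score is 50%, but the breakdown shows that `grasp_slip` occurs in both tasks while `timeout` is confined to `pick_place`.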

Open-source release coming soon

The standards layer can't be proprietary

Manifold's runner, harness, and leaderboard schema will be open, so results compound across the field instead of staying locked inside a single lab. We're opening it up soon. Get on the waitlist and we'll bring you in early.

Focus on the science. Manifold runs the evals.