The evaluation platform for robotics researchers
Run every major simulator benchmark in hours, not days. See exactly where your policy fails. Compare against verified baselines.
Free for research partners
Test on every benchmark that matters
Evaluate broadly without spending weeks wiring up each new simulator. One harness runs LIBERO, RoboCasa, and your own scenarios across every major simulator.
Pick a policy and a benchmark. No per-simulator harness to build.
Get results in a fraction of the time
Benchmarks come sharded across GPUs by default. Sharded across 8 GPUs, LIBERO runs 8x faster than on a single GPU, so an overnight job becomes a lunch break.
LIBERO-90, 1,000 rollouts: single GPU vs sharded across 8.
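For a rough sense of what sharding means here, the sketch below splits 1,000 rollout indices into 8 near-equal contiguous shards, one per GPU. It is an illustration under stated assumptions, not Manifold's scheduler: the function name `shard_rollouts` and the contiguous-split strategy are hypothetical, and it assumes rollouts are independent of each other and take similar time.

```python
def shard_rollouts(num_rollouts: int, num_shards: int) -> list[range]:
    """Split rollout indices 0..num_rollouts-1 into contiguous, near-equal shards.

    Hypothetical helper for illustration; not part of any published Manifold API.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(num_rollouts, num_shards)
    shards = []
    start = 0
    for shard_id in range(num_shards):
        # The first `extra` shards take one additional rollout each,
        # so shard sizes never differ by more than one.
        size = base + (1 if shard_id < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


# 1,000 rollouts over 8 GPUs: each shard holds 125 rollouts.
shards = shard_rollouts(1000, 8)
for gpu_id, shard in enumerate(shards):
    print(f"GPU {gpu_id}: rollouts {shard.start}..{shard.stop - 1} ({len(shard)} total)")
```

With an even split, each GPU handles one eighth of the work, which is where a speedup close to 8x can come from when per-rollout cost is uniform and scheduling overhead is small.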
Track progress over time and against the SOTA
Compare your daily runs against verified results from every major model, across every major benchmark. See your rank move against every published baseline, run over run.
Verified baselines on shared benchmarks. Every run gets a citable manifold:// URI.
Discover where your policy fails
Cluster failed episodes by failure mode to see the specific task families and subtasks that break your policy, not just an aggregate score.
Run detail from manifold.bifrost.ai: score, per-task pass rates, clustered failure modes.
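To make the difference between an aggregate score and a failure breakdown concrete, here is a minimal sketch that computes per-task pass rates and groups failed episodes by failure mode. Everything in it is an assumption for illustration: the episode records, the field names (`task`, `success`, `failure_mode`), and the task and failure labels are hypothetical, and it groups by a label that is already assigned rather than performing the clustering itself.

```python
from collections import Counter, defaultdict

# Hypothetical episode records; field names and values are illustrative only.
episodes = [
    {"task": "open_drawer", "success": True, "failure_mode": None},
    {"task": "open_drawer", "success": False, "failure_mode": "missed_grasp"},
    {"task": "stack_blocks", "success": False, "failure_mode": "missed_grasp"},
    {"task": "stack_blocks", "success": False, "failure_mode": "dropped_object"},
]


def per_task_pass_rates(episodes):
    """Return {task: fraction of episodes for that task that succeeded}."""
    totals = Counter()
    passes = Counter()
    for ep in episodes:
        totals[ep["task"]] += 1
        passes[ep["task"]] += int(ep["success"])
    return {task: passes[task] / totals[task] for task in totals}


def failures_by_mode(episodes):
    """Return {failure_mode: [task, ...]} covering failed episodes only."""
    groups = defaultdict(list)
    for ep in episodes:
        if not ep["success"]:
            groups[ep["failure_mode"]].append(ep["task"])
    return dict(groups)


print(per_task_pass_rates(episodes))
print(failures_by_mode(episodes))
```

In this toy data the overall pass rate is 25%, but the grouping shows that one failure mode spans both tasks while the other is confined to a single task, which is the kind of detail an aggregate score hides.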
The standards layer can't be proprietary
Manifold's runner, harness, and leaderboard schema will be open, so results compound across the field instead of staying locked inside a single lab. We're opening it up soon. Get on the waitlist and we'll bring you in early.
Focus on the science
Manifold runs the evals