RoboLens
The Inspect AI for robotics.
An open-source evaluation framework for physical AI and VLA (vision-language-action) models. Define a robotics benchmark once, then run any policy against any compatible embodiment — a real robot or a simulator — with reproducible logs and first-class Rerun visualization.
One framework, two swappable inputs
LLM evals have a single swappable input: the model. Robotics evals have two — and RoboLens makes both first-class and orthogonal.
- Policy: the VLA
  The "brain". Maps an observation + a language instruction to an action chunk (a horizon of actions executed open-loop, as π0 / ACT / diffusion policies do).
- Embodiment: the robot or sim
  The "body + world". Produces observations, executes actions, and owns the action/observation spaces and control rate. Real-robot-first; sims are a stricter special case.
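The division of labor between the two inputs can be pictured with a pair of structural interfaces. The sketch below is not RoboLens's actual API: `SketchPolicy`, `SketchEmbodiment`, `rollout`, and the toy 1-D world are hypothetical names chosen for illustration. It shows a policy returning a chunk of actions and the embodiment executing that chunk open-loop before the policy is queried again.

```python
from typing import Protocol, Sequence


class SketchPolicy(Protocol):
    """Hypothetical interface: observation + instruction -> action chunk."""

    def act(self, observation: float, instruction: str) -> Sequence[float]: ...


class SketchEmbodiment(Protocol):
    """Hypothetical interface: owns the world state and executes actions."""

    def reset(self, seed: int) -> float: ...
    def step(self, action: float) -> float: ...


class TowardZeroPolicy:
    """Toy policy: emits a chunk of `horizon` equal steps toward position 0."""

    def __init__(self, horizon: int = 4, step_size: float = 0.5) -> None:
        self.horizon = horizon
        self.step_size = step_size

    def act(self, observation: float, instruction: str) -> list[float]:
        if observation > 0:
            direction = -1.0
        elif observation < 0:
            direction = 1.0
        else:
            direction = 0.0  # already at the target: hold still
        return [direction * self.step_size] * self.horizon


class LineWorld:
    """Toy 1-D embodiment: the observation is the current position."""

    def __init__(self) -> None:
        self.position = 0.0

    def reset(self, seed: int) -> float:
        self.position = float(seed)  # deterministic start derived from the seed
        return self.position

    def step(self, action: float) -> float:
        self.position += action
        return self.position


def rollout(policy: SketchPolicy, body: SketchEmbodiment, seed: int, max_steps: int) -> tuple[float, int]:
    """Run one episode; returns (final observation, number of policy queries)."""
    obs = body.reset(seed)
    steps = 0
    queries = 0
    while steps < max_steps:
        chunk = policy.act(obs, "reach the origin")
        queries += 1
        # Open-loop: the whole chunk runs before the policy sees a new observation.
        for action in chunk:
            if steps >= max_steps:
                break
            obs = body.step(action)
            steps += 1
    return obs, queries


final_obs, n_queries = rollout(TowardZeroPolicy(), LineWorld(), seed=2, max_steps=8)
print(final_obs, n_queries)
```

Because both sides are only described by their methods, any object with a matching `act` can be paired with any object with matching `reset` and `step`, which is the orthogonality described above.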
A Task — a dataset of Scenes (initial conditions, instructions, success
targets) plus scorers — is defined independently of both. Before any rollout,
RoboLens verifies the (policy, embodiment) pair is compatible and fails fast
and loud if not.
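A pre-rollout compatibility check of this kind might compare what the policy expects with what the embodiment provides. The following is a minimal sketch, not RoboLens's implementation: the `Spec` fields (`action_dim`, `cameras`, `control_hz`), `IncompatiblePairError`, and `check_compatible` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical description of the interface one side expects or provides."""

    action_dim: int
    cameras: frozenset
    control_hz: float


class IncompatiblePairError(ValueError):
    """Raised before any rollout starts, listing every mismatch found."""


def check_compatible(policy: Spec, embodiment: Spec) -> None:
    problems = []
    if policy.action_dim != embodiment.action_dim:
        problems.append(
            f"action_dim: policy emits {policy.action_dim}, "
            f"embodiment accepts {embodiment.action_dim}"
        )
    missing = policy.cameras - embodiment.cameras
    if missing:
        problems.append(f"cameras missing on embodiment: {sorted(missing)}")
    if policy.control_hz != embodiment.control_hz:
        problems.append(
            f"control_hz: policy expects {policy.control_hz}, "
            f"embodiment runs at {embodiment.control_hz}"
        )
    if problems:
        # Report all mismatches at once instead of stopping at the first one.
        raise IncompatiblePairError("; ".join(problems))


policy_spec = Spec(action_dim=7, cameras=frozenset({"wrist", "front"}), control_hz=10.0)
robot_spec = Spec(action_dim=6, cameras=frozenset({"front"}), control_hz=10.0)

try:
    check_compatible(policy_spec, robot_spec)
    message = ""
except IncompatiblePairError as err:
    message = str(err)
    print("refusing to run:", message)
```

Collecting every mismatch into one error keeps the failure "loud": the user sees the full list of problems in a single message rather than fixing them one run at a time.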
Quickstart
No hardware or simulator required — the dependency-free CubePick mock world
exercises the whole stack:
    from robolens import eval
    from robolens.mock import CubePickEmbodiment, ScriptedPolicy
    from robolens.scene import Scene
    from robolens.scorer import success_at_end
    from robolens.task import Task

    task = Task(
        name="cubepick-reach",
        scenes=[
            Scene(id=f"layout-{i}", instruction="reach the cube", init_seed=i)
            for i in range(5)
        ],
        scorer=success_at_end(),
        max_steps=80,
    )

    # The two swappable inputs: a policy (VLA) and an embodiment (robot/sim).
    (log,) = eval(task, ScriptedPolicy(), CubePickEmbodiment())
    print(log.status, log.results.metrics)  # success {'success_at_end': 1.0}
…or from the command line:
    robolens list                                  # registered components
    robolens run --task cubepick-reach --policy scripted --embodiment cubepick
    robolens inspect logs/cubepick-reach_*.json    # results table
Why RoboLens
- Real-world first
  Interfaces assume real-robot reality: human-in-the-loop reset, no privileged success oracle, wall-clock control rate. Simulators just offer more.
- Reproducible
  Every run yields an immutable, schema-versioned `EvalLog` with the resolved config, git revision, and package versions, re-readable across releases.
- Light core
  The core depends only on NumPy. Rerun and simulator/VLA backends are optional extras and separately installable plugins.
- Safe unattended
  An explicit error taxonomy separates "record and continue" from "halt and require a human", so a faulted robot never auto-advances overnight.
- Rerun visualization
  Stream camera images, 3D poses, joint/action time-series, and success markers to a Rerun recording.
- Pluggable
  Ship `robolens-maniskill` or `robolens-openvla` as separate packages; entry points make them appear in `robolens list` automatically.
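The "Safe unattended" split between recoverable and halting errors could be modelled as two exception families. This is a sketch under assumptions: `RecordAndContinue`, `HaltForHuman`, and `run_scenes` are hypothetical names, not RoboLens's actual taxonomy, and the example faults are invented for illustration.

```python
class RecordAndContinue(Exception):
    """Recoverable: log the failure for this scene and move on."""


class HaltForHuman(Exception):
    """Not recoverable without intervention: stop the whole run."""


def run_scenes(scene_ids, run_one):
    """Run scenes in order; returns (per-scene statuses, halted flag)."""
    statuses = {}
    for scene_id in scene_ids:
        try:
            run_one(scene_id)
            statuses[scene_id] = "ok"
        except RecordAndContinue as err:
            statuses[scene_id] = f"error: {err}"
        except HaltForHuman as err:
            statuses[scene_id] = f"halted: {err}"
            return statuses, True  # remaining scenes are never attempted
    return statuses, False


def fake_run(scene_id):
    """Stand-in for a rollout that fails in two different ways."""
    if scene_id == "layout-1":
        raise RecordAndContinue("policy inference timed out")
    if scene_id == "layout-2":
        raise HaltForHuman("joint limit fault")


statuses, halted = run_scenes(["layout-0", "layout-1", "layout-2", "layout-3"], fake_run)
print(statuses, halted)
```

In this sketch the timeout on `layout-1` is recorded and the run proceeds, while the fault on `layout-2` stops the loop, so `layout-3` is never started.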
How it maps to Inspect AI
If you know Inspect AI, you already know RoboLens.
| Inspect AI | RoboLens |
|---|---|
| Model | Policy (VLA) + Embodiment (two inputs) |
| Task = dataset + solver + scorer | Task = scenes + controller + scorer |
| Sample | Scene |
| Solver chain | Controller middleware (chunking, ensembling, smoothing) |
| `eval()` → `EvalLog` | `eval()` → `EvalLog` |
| `@task`/`@solver`/`@scorer` + registry | `@task`/`@policy`/`@embodiment`/`@scorer` + entry points |
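The "solver chain" to "controller middleware" analogy can be made concrete with function wrappers, each of which transforms the action chunk produced by the layer beneath it. This is an illustrative sketch only: `chain`, `smoothing`, `truncate`, and the `ChunkFn` alias are hypothetical and do not describe RoboLens's real controller API. The smoothing layer here is a plain exponential moving average, standing in for whatever smoothing or ensembling a real controller would apply.

```python
from typing import Callable, List

Chunk = List[float]
ChunkFn = Callable[[float], Chunk]         # observation -> action chunk
Middleware = Callable[[ChunkFn], ChunkFn]  # wraps one chunk producer in another


def smoothing(alpha: float) -> Middleware:
    """Exponential moving average over the actions within each chunk."""

    def wrap(inner: ChunkFn) -> ChunkFn:
        def produce(obs: float) -> Chunk:
            out: Chunk = []
            prev = None
            for action in inner(obs):
                prev = action if prev is None else alpha * action + (1 - alpha) * prev
                out.append(prev)
            return out

        return produce

    return wrap


def truncate(n: int) -> Middleware:
    """Keep only the first n actions of each chunk, forcing earlier re-planning."""

    def wrap(inner: ChunkFn) -> ChunkFn:
        def produce(obs: float) -> Chunk:
            return inner(obs)[:n]

        return produce

    return wrap


def chain(base: ChunkFn, *layers: Middleware) -> ChunkFn:
    """Apply layers in order, so the last layer listed runs outermost."""
    fn = base
    for layer in layers:
        fn = layer(fn)
    return fn


def raw_policy(obs: float) -> Chunk:
    """Stand-in for a VLA: a step change in the commanded action."""
    return [0.0, 1.0, 1.0, 1.0]


controller = chain(raw_policy, smoothing(alpha=0.5), truncate(3))
print(controller(0.0))
```

As with a solver chain, the order of layers matters: smoothing before truncation filters the full chunk and then drops the tail, whereas the reverse order would only ever smooth the first three actions.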
For LLMs: llms.txt · llms-full.txt.