RoboLens

The Inspect AI for robotics.

An open-source evaluation framework for physical AI and VLA (vision-language-action) models. Define a robotics benchmark once, then run any policy against any compatible embodiment — a real robot or a simulator — with reproducible logs and first-class Rerun visualization.

One framework, two swappable inputs

LLM evals have a single swappable input: the model. Robotics evals have two — and RoboLens makes both first-class and orthogonal.

Policy: the VLA

The "brain". Maps an observation plus a language instruction to an action chunk (a horizon of actions executed open-loop, as π0, ACT, and diffusion policies do).

Embodiment: the robot or sim

The "body + world". Produces observations, executes actions, and owns the action/observation spaces and control rate. Real-robot-first; sims are a stricter special case.
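
To make the split concrete, here is a minimal sketch of a rollout loop in which a policy returns an action chunk and the embodiment executes it open-loop before the policy is queried again. `ToyPolicy`, `ToyEmbodiment`, and `rollout` are illustrative assumptions for this sketch, not the RoboLens API; only NumPy is used.

```python
import numpy as np


class ToyPolicy:
    """Hypothetical policy: returns a chunk of `horizon` actions per query."""

    def __init__(self, horizon: int = 4) -> None:
        self.horizon = horizon
        self.queries = 0

    def act(self, observation: np.ndarray, instruction: str) -> np.ndarray:
        self.queries += 1
        # Each action moves 10% of the chunk-start observation toward the origin.
        step = -0.1 * observation
        return np.tile(step, (self.horizon, 1))  # shape: (horizon, action_dim)


class ToyEmbodiment:
    """Hypothetical 2-D point world: an action is a position delta."""

    def __init__(self) -> None:
        self.position = np.zeros(2)

    def reset(self, seed: int) -> np.ndarray:
        rng = np.random.default_rng(seed)
        self.position = rng.uniform(-1.0, 1.0, size=2)
        return self.position.copy()

    def step(self, action: np.ndarray) -> np.ndarray:
        self.position = self.position + action
        return self.position.copy()


def rollout(policy, embodiment, instruction, seed, max_steps):
    obs = embodiment.reset(seed)
    steps = 0
    while steps < max_steps:
        chunk = policy.act(obs, instruction)
        # Open-loop: execute the whole chunk without re-querying the policy.
        for action in chunk:
            obs = embodiment.step(action)
            steps += 1
            if steps >= max_steps:
                break
    return obs, steps


policy = ToyPolicy(horizon=4)
final_obs, steps = rollout(policy, ToyEmbodiment(), "reach the origin", seed=0, max_steps=80)
print(steps, policy.queries)
```

With a horizon of 4 and a budget of 80 steps, the policy is queried once per chunk rather than once per step, which is the point of chunked execution.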

A Task — a dataset of Scenes (initial conditions, instructions, success targets) plus scorers — is defined independently of both. Before any rollout, RoboLens verifies the (policy, embodiment) pair is compatible and fails fast and loud if not.
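
The fail-fast compatibility check could look roughly like the sketch below. It assumes each side can be summarized by a small spec (action dimension, camera names, control rate); the `Spec` class, its fields, `IncompatiblePairError`, and `check_compatible` are hypothetical names chosen for illustration and are not taken from RoboLens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical summary of what a policy expects or an embodiment offers."""

    action_dim: int
    cameras: frozenset[str]
    control_hz: float


class IncompatiblePairError(ValueError):
    """Raised before any rollout when a (policy, embodiment) pair cannot work."""


def check_compatible(policy: Spec, embodiment: Spec) -> None:
    problems = []
    if policy.action_dim != embodiment.action_dim:
        problems.append(
            f"action_dim: policy emits {policy.action_dim}, "
            f"embodiment accepts {embodiment.action_dim}"
        )
    missing = policy.cameras - embodiment.cameras
    if missing:
        problems.append(f"cameras missing on embodiment: {sorted(missing)}")
    if policy.control_hz != embodiment.control_hz:
        problems.append(
            f"control_hz: policy expects {policy.control_hz}, "
            f"embodiment runs at {embodiment.control_hz}"
        )
    if problems:
        # Report every mismatch in one error instead of stopping at the first.
        raise IncompatiblePairError("; ".join(problems))


vla = Spec(action_dim=7, cameras=frozenset({"wrist", "front"}), control_hz=10.0)
arm = Spec(action_dim=6, cameras=frozenset({"front"}), control_hz=10.0)

message = ""
try:
    check_compatible(vla, arm)
except IncompatiblePairError as err:
    message = str(err)
print(message)
```

Collecting all mismatches before raising keeps the error loud and actionable: the user sees every reason the pair was rejected in a single run.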

Quickstart

pip install robolens            # core (numpy only)
pip install "robolens[rerun]"   # + Rerun visualization

No hardware or simulator required — the dependency-free CubePick mock world exercises the whole stack:

from robolens import eval
from robolens.mock import CubePickEmbodiment, ScriptedPolicy
from robolens.scene import Scene
from robolens.scorer import success_at_end
from robolens.task import Task

task = Task(
    name="cubepick-reach",
    scenes=[Scene(id=f"layout-{i}", instruction="reach the cube", init_seed=i) for i in range(5)],
    scorer=success_at_end(),
    max_steps=80,
)

# The two swappable inputs: a policy (VLA) and an embodiment (robot/sim).
(log,) = eval(task, ScriptedPolicy(), CubePickEmbodiment())
print(log.status, log.results.metrics)   # success {'success_at_end': 1.0}

…or from the command line:

robolens list                                   # registered components
robolens run --task cubepick-reach --policy scripted --embodiment cubepick
robolens inspect logs/cubepick-reach_*.json     # results table

Why RoboLens

Real-world first

Interfaces assume real-robot reality: human-in-the-loop reset, no privileged success oracle, wall-clock control rate. Simulators just offer more.

Reproducible

Every run yields an immutable, schema-versioned EvalLog with the resolved config, git revision, and package versions, re-readable across releases.

Light core

The core depends only on NumPy. Rerun and simulator/VLA backends are optional extras and separately installable plugins.

Safe unattended

An explicit error taxonomy separates "record and continue" from "halt and require a human", so a faulted robot never auto-advances overnight.

Rerun visualization

Stream camera images, 3D poses, joint/action time-series, and success markers to a Rerun recording.

Pluggable

Ship robolens-maniskill or robolens-openvla as separate packages; entry points make them appear in robolens list automatically.
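
As an illustration of the "record and continue" versus "halt and require a human" distinction, the sketch below tags each error class with a severity and lets the run loop decide whether to move on. The class names (`Severity`, `EpisodeError`, `PolicyTimeout`, `RobotFault`) and the `run_all` helper are assumptions made for this example, not RoboLens's actual taxonomy.

```python
from enum import Enum


class Severity(Enum):
    RECORD_AND_CONTINUE = "record_and_continue"
    HALT = "halt"


class EpisodeError(Exception):
    """Hypothetical base class; subclasses declare how the runner must react."""

    severity = Severity.RECORD_AND_CONTINUE


class PolicyTimeout(EpisodeError):
    """Recoverable: log the failure and move on to the next scene."""


class RobotFault(EpisodeError):
    """Unrecoverable without a person: stop the whole run."""

    severity = Severity.HALT


def run_all(scene_ids, run_scene):
    results = {}
    for scene_id in scene_ids:
        try:
            run_scene(scene_id)
            results[scene_id] = "ok"
        except EpisodeError as err:
            results[scene_id] = type(err).__name__
            if err.severity is Severity.HALT:
                break  # never auto-advance past a faulted robot
    return results


def fake_scene(scene_id):
    # Stand-in for a real episode: scene 1 times out, scene 2 faults.
    if scene_id == "layout-1":
        raise PolicyTimeout("inference took too long")
    if scene_id == "layout-2":
        raise RobotFault("e-stop triggered")


results = run_all([f"layout-{i}" for i in range(5)], fake_scene)
print(results)
```

The timeout on `layout-1` is recorded and the run proceeds, while the fault on `layout-2` stops the loop, so `layout-3` and `layout-4` are never attempted.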

How it maps to Inspect AI

If you know Inspect AI, you already know RoboLens.

| Inspect AI | RoboLens |
| --- | --- |
| `Model` | `Policy` (VLA) + `Embodiment` (two inputs) |
| `Task = dataset + solver + scorer` | `Task = scenes + controller + scorer` |
| `Sample` | `Scene` |
| `Solver` chain | `Controller` middleware (chunking, ensembling, smoothing) |
| `eval()` → `EvalLog` | `eval()` → `EvalLog` |
| `@task`/`@solver`/`@scorer` + registry | `@task`/`@policy`/`@embodiment`/`@scorer` + entry points |

For LLMs: llms.txt · llms-full.txt.