
RoboLens

The Inspect AI for robotics.

An open-source evaluation framework for physical AI and VLA (vision-language-action) models. Define a robotics benchmark once, then run any policy against any compatible embodiment — a real robot or a simulator — with reproducible logs and first-class Rerun visualization.



One framework, two swappable inputs

LLM evals have a single swappable input: the model. Robotics evals have two — and RoboLens makes both first-class and orthogonal.

  • Policy — the VLA


    The "brain". Maps an observation + a language instruction to an action chunk (a horizon of actions executed open-loop, as π0 / ACT / diffusion policies do).

  • Embodiment — the robot or sim


    The "body + world". Produces observations, executes actions, and owns the action/observation spaces and control rate. Real-robot-first; sims are a stricter special case.

A Task — a dataset of Scenes (initial conditions, instructions, success targets) plus scorers — is defined independently of both. Before any rollout, RoboLens verifies the (policy, embodiment) pair is compatible and fails fast and loud if not.
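The pre-rollout compatibility check and the open-loop execution of an action chunk can be illustrated with a minimal sketch. Every name here (`Space`, `IncompatiblePairError`, `check_compatible`, `run_chunk`) is hypothetical and chosen for illustration; none of it is RoboLens's actual API, and a real check would likely compare observation spaces and control rates as well.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Space:
    """Hypothetical description of an action space (not a RoboLens type)."""
    shape: tuple
    dtype: str = "float32"


class IncompatiblePairError(ValueError):
    """Raised before any rollout when policy and embodiment do not match."""


def check_compatible(policy_action: Space, embodiment_action: Space) -> None:
    # Fail fast: compare the spaces up front instead of crashing mid-rollout.
    if policy_action != embodiment_action:
        raise IncompatiblePairError(
            f"policy emits {policy_action}, embodiment expects {embodiment_action}"
        )


def run_chunk(chunk: Sequence[Sequence[float]], step: Callable) -> int:
    """Execute every action of a chunk open-loop and return the step count."""
    for action in chunk:
        step(action)  # no new observation is consulted inside the chunk
    return len(chunk)


# Usage: a 7-DoF policy paired with a 7-DoF embodiment passes the check,
# then a horizon-4 chunk is executed without re-querying the policy.
check_compatible(Space((7,)), Space((7,)))
executed = []
steps = run_chunk([[0.0] * 7 for _ in range(4)], executed.append)
```

Running the check once, before the first scene, means a shape mismatch surfaces immediately instead of partway through a rollout on real hardware.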


Quickstart

pip install robolens            # core (numpy only)
pip install "robolens[rerun]"   # + Rerun visualization

No hardware or simulator required — the dependency-free CubePick mock world exercises the whole stack:

from robolens import eval
from robolens.mock import CubePickEmbodiment, ScriptedPolicy
from robolens.scene import Scene
from robolens.scorer import success_at_end
from robolens.task import Task

task = Task(
    name="cubepick-reach",
    scenes=[Scene(id=f"layout-{i}", instruction="reach the cube", init_seed=i) for i in range(5)],
    scorer=success_at_end(),
    max_steps=80,
)

# The two swappable inputs: a policy (VLA) and an embodiment (robot/sim).
(log,) = eval(task, ScriptedPolicy(), CubePickEmbodiment())
print(log.status, log.results.metrics)   # success {'success_at_end': 1.0}

…or from the command line:

robolens list                                   # registered components
robolens run --task cubepick-reach --policy scripted --embodiment cubepick
robolens inspect logs/cubepick-reach_*.json     # results table

Why RoboLens

  • Real-world first


    Interfaces assume real-robot reality: human-in-the-loop reset, no privileged success oracle, wall-clock control rate. Simulators just offer more.

  • Reproducible


    Every run yields an immutable, schema-versioned EvalLog with the resolved config, git revision, and package versions — re-readable across releases.

  • Light core


    The core depends only on NumPy. Rerun and simulator/VLA backends are optional extras and separately installable plugins.

  • Safe unattended


    An explicit error taxonomy separates "record and continue" from "halt and require a human", so a faulted robot never auto-advances overnight.

  • Rerun visualization


    Stream camera images, 3D poses, joint/action time-series, and success markers to a Rerun recording.

  • Pluggable


    Ship robolens-maniskill or robolens-openvla as separate packages — entry points make them appear in robolens list automatically.
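The "Safe unattended" point above can be sketched as two exception families and a scene loop that treats them differently. This is an assumption-laden illustration: `RecoverableError`, `HaltError`, and `run_scenes` are hypothetical names, not RoboLens's real error taxonomy or runner.

```python
class RecoverableError(Exception):
    """Record the failure for this scene and continue with the next one."""


class HaltError(Exception):
    """Stop the run; a human must clear the fault before anything else executes."""


def run_scenes(scene_ids, rollout):
    """Run scenes in order and return (records, halted_at)."""
    records = []
    for scene_id in scene_ids:
        try:
            records.append((scene_id, "ok", rollout(scene_id)))
        except RecoverableError as exc:
            # "Record and continue": log the error, move to the next scene.
            records.append((scene_id, "error", str(exc)))
        except HaltError as exc:
            # "Halt and require a human": never auto-advance past a fault.
            records.append((scene_id, "halted", str(exc)))
            return records, scene_id
    return records, None


def fake_rollout(scene_id):
    """Stand-in rollout that fails in two different ways (hypothetical)."""
    if scene_id == "layout-1":
        raise RecoverableError("policy timed out")
    if scene_id == "layout-2":
        raise HaltError("e-stop triggered")
    return 1.0


records, halted_at = run_scenes([f"layout-{i}" for i in range(5)], fake_rollout)
```

In this sketch the run stops at `layout-2` and the remaining scenes are never attempted, which is the behavior an overnight run on a faulted robot needs.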


How it maps to Inspect AI

If you know Inspect AI, you already know RoboLens.

| Inspect AI | RoboLens |
| --- | --- |
| Model | Policy (VLA) + Embodiment (two inputs) |
| Task = dataset + solver + scorer | Task = scenes + controller + scorer |
| Sample | Scene |
| Solver chain | Controller middleware (chunking, ensembling, smoothing) |
| eval() → EvalLog | eval() → EvalLog |
| @task/@solver/@scorer + registry | @task/@policy/@embodiment/@scorer + entry points |
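To make the "Solver chain" to "Controller middleware" analogy concrete, here is a minimal sketch of middleware as composable wrappers around a policy callable. The `smoothing` and `chain` helpers are hypothetical and are not RoboLens's controller API; the smoothing shown is a plain exponential moving average, picked only because it is the simplest of the three middleware kinds to demonstrate.

```python
def smoothing(policy, alpha=0.5):
    """Wrap `policy` so each emitted action is an exponential moving average."""
    previous = None

    def wrapped(observation):
        nonlocal previous
        action = policy(observation)
        if previous is None:
            previous = list(action)  # first action passes through unchanged
        else:
            previous = [alpha * a + (1 - alpha) * p for a, p in zip(action, previous)]
        return list(previous)

    return wrapped


def chain(policy, *middleware):
    """Apply middleware in order, innermost first, like a solver chain."""
    for wrap in middleware:
        policy = wrap(policy)
    return policy


# Usage: a toy policy that echoes its observation as the action.
smooth_policy = chain(lambda observation: observation, smoothing)
first = smooth_policy([0.0])
second = smooth_policy([1.0])
third = smooth_policy([1.0])
```

Because each middleware takes a policy and returns a policy, chunking or ensembling wrappers could be stacked the same way without the task or embodiment knowing about them.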

For LLMs: llms.txt · llms-full.txt.