# RoboLens

> The Inspect AI for robotics — an evaluation framework for physical AI and VLA (vision-language-action) models.

RoboLens is the open-source evaluation framework for physical AI and VLA (vision-language-action) models — the "Inspect AI for robotics". A benchmark is a Task (dataset of Scenes + scorers) run against two swappable inputs: a Policy (the VLA) and an Embodiment (a real robot or simulator).

# Guide

An open-source evaluation framework for **physical AI** and **VLA (vision-language-action)** models. Define a robotics benchmark once, then run *any* policy against *any* compatible embodiment — a real robot or a simulator — with reproducible logs and first-class [Rerun](https://github.com/rerun-io/rerun) visualization.

[Get started](https://robocurve.github.io/robolens/guide/quickstart/index.md) [Concepts](https://robocurve.github.io/robolens/guide/concepts/index.md) [GitHub](https://github.com/robocurve/robolens)

______________________________________________________________________

## One framework, two swappable inputs

LLM evals have a single swappable input: the model. **Robotics evals have two** — and RoboLens makes both first-class and orthogonal.

- **`Policy` — the VLA**

  The "brain". Maps an observation + a language instruction to an **action chunk** (a horizon of actions executed open-loop, as π0 / ACT / diffusion policies do).

- **`Embodiment` — the robot or sim**

  The "body + world". Produces observations, executes actions, and owns the action/observation spaces and control rate. Real-robot-first; sims are a stricter special case.

A **`Task`** — a dataset of `Scene`s (initial conditions, instructions, success targets) plus scorers — is defined *independently* of both. Before any rollout, RoboLens verifies the `(policy, embodiment)` pair is **compatible** and fails fast and loud if not.

______________________________________________________________________

## Quickstart

```
pip install robolens            # core (numpy only)
pip install "robolens[rerun]"   # + Rerun visualization
```

No hardware or simulator required — the dependency-free `CubePick` mock world exercises the whole stack:

```
from robolens import eval
from robolens.mock import CubePickEmbodiment, ScriptedPolicy
from robolens.scene import Scene
from robolens.scorer import success_at_end
from robolens.task import Task

task = Task(
    name="cubepick-reach",
    scenes=[Scene(id=f"layout-{i}", instruction="reach the cube", init_seed=i) for i in range(5)],
    scorer=success_at_end(),
    max_steps=80,
)

# The two swappable inputs: a policy (VLA) and an embodiment (robot/sim).
(log,) = eval(task, ScriptedPolicy(), CubePickEmbodiment())
print(log.status, log.results.metrics)   # success {'success_at_end': 1.0}
```

…or from the command line:

```
robolens list                                   # registered components
robolens run --task cubepick-reach --policy scripted --embodiment cubepick
robolens inspect logs/cubepick-reach_*.json     # results table
```

______________________________________________________________________

## Why RoboLens

- **Real-world first**

  Interfaces assume real-robot reality: human-in-the-loop reset, no privileged success oracle, wall-clock control rate. Simulators just offer more.

- **Reproducible**

  Every run yields an immutable, schema-versioned `EvalLog` with the resolved config, git revision, and package versions — re-readable across releases.

- **Light core**

  The core depends only on NumPy. Rerun and simulator/VLA backends are optional extras and separately installable plugins.

- **Safe unattended**

  An explicit error taxonomy separates "record and continue" from "halt and require a human", so a faulted robot never auto-advances overnight.

- **Rerun visualization**

  Stream camera images, 3D poses, joint/action time-series, and success markers to a [Rerun](https://github.com/rerun-io/rerun) recording.

- **Pluggable**

  Ship `robolens-maniskill` or `robolens-openvla` as separate packages — entry points make them appear in `robolens list` automatically.

______________________________________________________________________

## How it maps to Inspect AI

If you know [Inspect AI](https://inspect.aisi.org.uk/), you already know RoboLens.

| Inspect AI                             | RoboLens                                                  |
| -------------------------------------- | --------------------------------------------------------- |
| `Model`                                | `Policy` (VLA) **+** `Embodiment` *(two inputs)*          |
| `Task = dataset + solver + scorer`     | `Task = scenes + controller + scorer`                     |
| `Sample`                               | `Scene`                                                   |
| `Solver` chain                         | `Controller` middleware (chunking, ensembling, smoothing) |
| `eval()` → `EvalLog`                   | `eval()` → `EvalLog`                                      |
| `@task`/`@solver`/`@scorer` + registry | `@task`/`@policy`/`@embodiment`/`@scorer` + entry points  |

For LLMs: [`llms.txt`](https://robocurve.github.io/robolens/llms.txt) · [`llms-full.txt`](https://robocurve.github.io/robolens/llms-full.txt).

# Quickstart

## Install

```
pip install robolens            # core (numpy only)
pip install "robolens[rerun]"   # + Rerun visualization
```

For development, use [uv](https://github.com/astral-sh/uv):

```
uv venv && uv pip install -e ".[dev]"
uv run pytest
```

## Run your first evaluation

The dependency-free `CubePick` mock world lets you exercise the whole stack with no hardware or simulator:

```
from robolens import eval
from robolens.mock import CubePickEmbodiment, ScriptedPolicy
from robolens.scene import Scene
from robolens.scorer import success_at_end
from robolens.task import Task

task = Task(
    name="cubepick-reach",
    scenes=[Scene(id=f"layout-{i}", instruction="reach the cube", init_seed=i) for i in range(5)],
    scorer=success_at_end(),
    max_steps=80,
)

(log,) = eval(task, ScriptedPolicy(), CubePickEmbodiment())
print(log.status)                    # "success"
print(log.results.metrics)           # {"success_at_end": 1.0}
```

`eval()` returns a list of `EvalLog` objects, one per task, mirroring Inspect AI. Each log is immutable, schema-versioned, and written to `log_dir`.

## Use registry names

`task`, `policy`, and `embodiment` may also be **registry names** — the same mechanism the CLI uses:

```
from robolens import eval

(log,) = eval("cubepick-reach", "scripted", "cubepick")
```

## From the command line

```
robolens list                                          # all registered components
robolens list policies                                 # just policies
robolens run --task cubepick-reach --policy scripted --embodiment cubepick
robolens run --task cubepick-reach --policy scripted --embodiment cubepick -P chunk_size=6
robolens inspect logs/cubepick-reach_*.json            # print a saved log
```

## Next steps

- [Concepts](https://robocurve.github.io/robolens/guide/concepts/index.md) — the core abstractions.
- [Writing A Benchmark](https://robocurve.github.io/robolens/guide/writing-a-benchmark/index.md) — define your own `Task`.
- [Policies And Embodiments](https://robocurve.github.io/robolens/guide/policies-and-embodiments/index.md) — plug in a real VLA or robot/sim.

# Concepts

RoboLens factors a robotics evaluation into a few small, orthogonal pieces.

## The two inputs

Unlike LLM evals (one swappable input, the model), a robotics eval has **two**:

- `Policy` is the VLA "brain". Given an `Observation`, it returns an `ActionChunk`: a horizon of actions executed open-loop (because VLA inference is slower than the control rate). A horizon of `H = 1` is the degenerate reactive case.
- `Embodiment` is the "body + world": a real robot or a simulator. It produces observations, executes actions, and owns the action/observation spaces, the native control rate, and reset/safety machinery.

Both are runtime-checkable Protocols, so you can wrap an existing model or sim without inheriting anything. Convenience base classes (`PolicyBase`, `EmbodimentBase`) exist if you prefer.
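
As a minimal sketch of what structural typing buys you here, the example below uses only the standard `typing` module. `PolicyLike`, `WrappedModel`, and `NotAPolicy` are hypothetical stand-ins and are much smaller than the real RoboLens protocols.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class PolicyLike(Protocol):
    """Hypothetical stand-in for a policy protocol (not the RoboLens definition)."""

    def reset(self, scene: Any) -> None: ...

    def act(self, observation: Any) -> Any: ...


class WrappedModel:
    """An existing model wrapped structurally: no base class is inherited."""

    def __init__(self) -> None:
        self.steps = 0

    def reset(self, scene: Any) -> None:
        self.steps = 0

    def act(self, observation: Any) -> List[float]:
        self.steps += 1
        return [0.0] * 7


class NotAPolicy:
    """Has an inference method, but under the wrong name."""

    def predict(self, observation: Any) -> List[float]:
        return [0.0] * 7


# isinstance() against a runtime-checkable Protocol only checks that the
# methods exist; it does not validate signatures or return types.
is_policy = isinstance(WrappedModel(), PolicyLike)
is_not_policy = isinstance(NotAPolicy(), PolicyLike)
```

Because the runtime check is shallow, a deeper check on spaces and semantics is still needed before a rollout.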

## Tasks and scenes

A Task is an **embodiment-agnostic** benchmark: a dataset of Scenes plus scorer(s), a step horizon, and an epoch count. A `Scene` is the robotics analog of Inspect AI's `Sample` — one initial condition: an instruction, an optional success Target, and a seed.

## Compatibility

Before any rollout, `check_compatibility` verifies the `(policy, embodiment)` pair: action dimensions and `ActionSemantics` (control mode, rotation representation, gripper, frame), the observation cameras/state keys the policy requires (resolving a name remap), the control rate, and whether each scene is realizable on the embodiment. Hard mismatches fail fast with a `CompatibilityError`.
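
The idea of a fail-fast pair check can be sketched in a few lines. This is an illustration only: `SpaceSketch` and `check_pair` are hypothetical names, the local `CompatibilityError` is a stand-in class, and the sketch covers just three of the checks listed above (action dimension, control mode, required state keys).

```python
from dataclasses import dataclass
from typing import FrozenSet, List


class CompatibilityError(Exception):
    """Local stand-in for the framework error of the same name."""


@dataclass(frozen=True)
class SpaceSketch:
    action_dim: int
    control_mode: str
    state_keys: FrozenSet[str]


def check_pair(policy: SpaceSketch, embodiment: SpaceSketch) -> None:
    """Collect every hard mismatch, then fail once with all of them."""
    problems: List[str] = []
    if policy.action_dim != embodiment.action_dim:
        problems.append(f"action dim {policy.action_dim} != {embodiment.action_dim}")
    if policy.control_mode != embodiment.control_mode:
        problems.append(
            f"control mode {policy.control_mode!r} != {embodiment.control_mode!r}"
        )
    missing = policy.state_keys - embodiment.state_keys
    if missing:
        problems.append(f"missing state keys: {sorted(missing)}")
    if problems:
        raise CompatibilityError("; ".join(problems))


vla = SpaceSketch(7, "eef_delta_pose", frozenset({"eef_pose", "gripper"}))
arm = SpaceSketch(7, "eef_delta_pose", frozenset({"eef_pose", "gripper", "joint_pos"}))
check_pair(vla, arm)  # compatible: the embodiment offers a superset of the keys

other_arm = SpaceSketch(6, "joint_position", frozenset({"eef_pose"}))
try:
    check_pair(vla, other_arm)
    error_message = ""
except CompatibilityError as exc:
    error_message = str(exc)
```

Reporting all mismatches in one error, rather than stopping at the first, is a design choice of this sketch; it saves a fix-and-retry cycle when several things are wrong at once.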

## The rollout

`rollout` runs one trial as a single control-rate loop:

1. A `Controller` decides the next action, internally calling `policy.act()` and buffering the chunk (so open-loop execution and temporal ensembling compose without forking the loop).
1. An `Approver` reviews the action before it reaches the embodiment and passes, clamps, or vetoes it (a safety gate).
1. `embodiment.step(action)` executes it; everything is logged to sinks and recorded in an immutable `TrialRecord` (steps, a typed transcript, inference latencies).

Camera frames are streamed to a FrameStore and the record keeps lightweight references, so long multi-camera episodes stay memory-safe.
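
The three steps above can be sketched with a toy one-dimensional world. Everything here is hypothetical (`ChunkController`, `clamp_approver`, `toy_policy`, `run_trial`), and actions are plain floats rather than RoboLens types; the point is only how chunk buffering and an approver fit into one loop.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class ChunkController:
    """Buffers an action chunk and replays it open-loop, re-querying when empty."""

    def __init__(self, act: Callable[[float], List[float]]) -> None:
        self._act = act
        self._buffer: Deque[float] = deque()
        self.inference_calls = 0

    def next_action(self, observation: float) -> float:
        if not self._buffer:
            self._buffer.extend(self._act(observation))
            self.inference_calls += 1
        return self._buffer.popleft()


def clamp_approver(action: float, limit: float = 0.1) -> float:
    """Safety gate: clamp the command into [-limit, limit]."""
    return max(-limit, min(limit, action))


def toy_policy(position: float, horizon: int = 4) -> List[float]:
    """Return a chunk of `horizon` identical steps toward the goal at 1.0."""
    return [0.5 * (1.0 - position)] * horizon


def run_trial(max_steps: int = 20) -> Dict[str, object]:
    controller = ChunkController(toy_policy)
    position = 0.0
    steps = []
    for _ in range(max_steps):
        proposed = controller.next_action(position)  # 1. controller decides
        approved = clamp_approver(proposed)          # 2. approver reviews
        position += approved                         # 3. "embodiment" executes
        steps.append({"proposed": proposed, "approved": approved, "position": position})
        if abs(1.0 - position) < 0.05:
            break
    return {"steps": steps, "inference_calls": controller.inference_calls}


record = run_trial()
```

In this run the policy is queried once per four control steps, which is the reason for chunking: inference happens less often than actuation.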

## Scoring

A Scorer maps a recorded `TrialRecord` (+ the scene's `Target`) to a Score. Because scorers consume the *recorded* trajectory (not a live environment), scoring is reproducible from a saved log. Across the `epochs` of a scene, an **epoch reducer** (`mean`, `max`, `pass_at_k`, …) collapses scores; metrics then aggregate across scenes.

## Errors and safety

The error taxonomy resolves the "fail fast vs never-crash-overnight" tension:

| Class                                 | Handling                                                     |
| ------------------------------------- | ------------------------------------------------------------ |
| `CompatibilityError`, `ConfigError`   | fail fast, before any rollout                                |
| `PolicyError`                         | record the trial, continue (governed by `fail_on_error`)     |
| `EmbodimentFault`, `SafetyAbort`      | **always halt**: a faulted/unsafe robot never auto-advances  |
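
A rough sketch of how such a taxonomy can drive an unattended loop follows. The exception classes are local stand-ins, `run_scenes` is a hypothetical helper, and the reading of `fail_on_error` (stop at the first policy error when true) is an assumption, not documented behavior.

```python
from typing import Callable, List, Tuple


class PolicyError(Exception):
    """Stand-in: the policy failed during a trial."""


class EmbodimentFault(Exception):
    """Stand-in: the robot or simulator faulted."""


def run_scenes(
    trials: List[Callable[[], float]], fail_on_error: bool = False
) -> Tuple[List[Tuple[str, object]], bool]:
    """Run trial callables in order and return (outcomes, halted)."""
    outcomes: List[Tuple[str, object]] = []
    for trial in trials:
        try:
            outcomes.append(("ok", trial()))
        except PolicyError as exc:
            outcomes.append(("error", str(exc)))  # record and continue
            if fail_on_error:
                return outcomes, True
        except EmbodimentFault as exc:
            outcomes.append(("halted", str(exc)))  # always halt, whatever the flags
            return outcomes, True
    return outcomes, False


def ok() -> float:
    return 1.0


def bad_policy() -> float:
    raise PolicyError("NaN action")


def faulted() -> float:
    raise EmbodimentFault("e-stop pressed")


outcomes, halted = run_scenes([ok, bad_policy, ok, faulted, ok])
```

The last `ok` trial never runs: once the embodiment faults, the loop stops and waits for a human.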

## The eval log

`eval` orchestrates scenes × epochs and returns immutable `EvalLog`s (status, spec, results, stats, per-scene samples, error). Logs are written atomically as schema-versioned JSON with a read-back guarantee.

# Writing a benchmark

A benchmark is a Task: a dataset of scenes plus scorer(s). It is **embodiment-agnostic** — it describes *what* to evaluate, not *how* the robot is built.

```
from robolens.scene import Scene, Target
from robolens.scorer import success_at_end
from robolens.task import Epochs, Task

task = Task(
    name="cubepick-reach",
    scenes=[
        Scene(
            id=f"layout-{i}",
            instruction="reach the cube",
            target=Target(kind="reach_object", spec={"object": "cube"}),
            init_seed=i,
        )
        for i in range(50)
    ],
    scorer=success_at_end(),
    max_steps=200,
    epochs=Epochs(count=3, reducer="mean"),
)
```

## Scenes

Each Scene is one initial condition (the Inspect `Sample` analog):

- `id` — unique within the task.
- `instruction` — the language goal handed to the policy.
- `target` — an optional Target the scorer reads; its `kind` is resolved in the *embodiment's* namespace (compatibility checking verifies the embodiment can realize it).
- `init_seed` — combined with the eval seed and epoch index to seed each trial deterministically.
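
How the three values are combined is not specified here. One plausible scheme, shown purely as an assumption, hashes them together so that every `(eval seed, init_seed, epoch)` triple maps to a stable seed; `trial_seed` is a hypothetical helper, not a RoboLens function.

```python
import hashlib


def trial_seed(eval_seed: int, init_seed: int, epoch: int) -> int:
    """Derive a stable 32-bit seed from the three inputs (hypothetical scheme)."""
    payload = f"{eval_seed}:{init_seed}:{epoch}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:4], "big")


# The same triple always yields the same seed, across processes and machines.
seeds = {
    (scene, epoch): trial_seed(0, scene, epoch)
    for scene in range(3)
    for epoch in range(2)
}
```

A hash is used instead of simple addition so that, for example, scene 1 at epoch 0 and scene 0 at epoch 1 do not collapse to the same seed.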

## Epochs and reducers

Repeat each scene `epochs` times to measure stochastic policies. The Epochs reducer collapses the per-epoch scores of a scene before metrics aggregate across scenes. Builtin reducers: `mean`, `median`, `max`, `min`, `mode`, and `pass_at_<k>` (an unbiased pass@k estimator).
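
The two-stage aggregation can be illustrated with the standard library. `REDUCERS` and `reduce_then_aggregate` are hypothetical names, and averaging across scenes is an assumption made for this sketch; the actual metrics depend on the scorer.

```python
from statistics import mean, median, mode
from typing import Callable, Dict, List, Tuple

REDUCERS: Dict[str, Callable] = {
    "mean": mean,
    "median": median,
    "max": max,
    "min": min,
    "mode": mode,
}


def reduce_then_aggregate(
    scores_by_scene: Dict[str, List[float]], reducer: str = "mean"
) -> Tuple[Dict[str, float], float]:
    """Collapse each scene's per-epoch scores, then average across scenes."""
    reduce = REDUCERS[reducer]
    per_scene = {scene: reduce(scores) for scene, scores in scores_by_scene.items()}
    return per_scene, mean(list(per_scene.values()))


scores = {"layout-0": [1.0, 0.0, 1.0], "layout-1": [0.0, 0.0, 1.0]}
per_scene_mean, overall_mean = reduce_then_aggregate(scores, "mean")
per_scene_max, overall_max = reduce_then_aggregate(scores, "max")
```

With `max`, both scenes count as solved because each succeeded in at least one epoch; with `mean`, the same data gives a noticeably lower aggregate. The reducer choice is therefore part of the benchmark definition.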

## Multiple scorers

Pass a list to score several dimensions at once:

```
from robolens.scorer import episode_length, min_distance_to_goal, success_at_end
from robolens.task import Task

task = Task(
    name="cubepick-reach",
    scenes=[...],
    scorer=[success_at_end(), episode_length(), min_distance_to_goal()],
    max_steps=200,
)
```

## Registering for discovery

Wrap a task factory with the `task` decorator so it resolves by name in `eval("my-bench", ...)` and appears in `robolens list`:

```
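from robolens.registry import task
from robolens.scorer import success_at_end
from robolens.task import Task

@task("my-bench")
def my_bench(num_scenes: int = 50) -> Task:
    # Build `num_scenes` scenes here; the scene list is elided in this example.
    return Task(name="my-bench", scenes=[...], scorer=success_at_end(), max_steps=200)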
```

See [Plugins](https://robocurve.github.io/robolens/guide/plugins/index.md) to ship a benchmark from a separate package.

# Policies and embodiments

Both are runtime-checkable Protocols — implement the methods on any class (no inheritance required), or subclass the convenience base classes.

## A policy (VLA)

A Policy maps an observation to an ActionChunk. It declares a PolicyInfo (the action space it emits and the observations it requires) used for compatibility checking.

```
import numpy as np
from robolens.policy import PolicyConfig, PolicyInfo
from robolens.scene import Scene
from robolens.spaces import ActionSemantics, Box, ObservationSpace
from robolens.types import Action, ActionChunk, Observation


class MyVLA:
    def __init__(self) -> None:
        self.info = PolicyInfo(
            name="my-vla",
            action_space=Box(
                shape=(7,),
                semantics=ActionSemantics(
                    control_mode="eef_delta_pose", rotation_repr="rot6d", gripper="continuous"
                ),
            ),
            observation_space=ObservationSpace(
                state_keys=frozenset({"eef_pose", "gripper"}),
            ),
        )
        self.config = PolicyConfig(action_horizon=16)

    def reset(self, scene: Scene) -> None:
        ...  # clear any per-episode state

    def act(self, observation: Observation) -> ActionChunk:
        # The policy owns model-specific spatial preprocessing (resize/normalize).
        # `my_model_infer` is a placeholder for your own inference call.
        chunk = my_model_infer(observation)        # -> (H, 7) array
        actions = [Action(data=a) for a in chunk]
        return ActionChunk(actions=actions, inference_latency_s=...)
```

The policy owns model-specific spatial preprocessing; the embodiment emits raw frames. Temporal concerns (history, smoothing, ensembling) live in a [Controller](https://robocurve.github.io/robolens/guide/concepts/index.md).

## An embodiment (robot or sim)

An Embodiment produces observations and executes actions. It declares an EmbodimentInfo with its spaces, native control rate, and opt-in capability flags.

```
from robolens.embodiment import EmbodimentInfo, PRIVILEGED_SUCCESS, SEEDABLE
from robolens.scene import Scene
from robolens.spaces import Box, CameraSpec, ObservationSpace
from robolens.types import Action, Observation, StepResult


class MyArm:
    def __init__(self) -> None:
        self.info = EmbodimentInfo(
            name="my-arm",
            action_space=Box(shape=(7,), semantics=...),
            observation_space=ObservationSpace(
                cameras=(CameraSpec("base_rgb", 224, 224), CameraSpec("wrist_rgb", 224, 224)),
                state_keys=frozenset({"eef_pose", "gripper"}),
            ),
            control_hz=20.0,
            is_simulated=False,
            capabilities=frozenset({SEEDABLE}),  # real arms rarely have PRIVILEGED_SUCCESS
        )

    def reset(self, scene: Scene, *, seed: int | None = None) -> Observation:
        # On real hardware this may drive to home and block on operator confirmation.
        ...

    def step(self, action: Action) -> StepResult:
        # Returns as soon as the command is issued; the framework paces the loop
        # unless this embodiment declares the "self_paced" capability.
        ...

    def close(self) -> None:
        ...
```

## Real-robot vs simulator

The interfaces assume **real-robot reality**: no guaranteed privileged success, human-in-the-loop reset, wall-clock control. Simulators opt into more via `capabilities` (`SEEDABLE`, `AUTO_RESET`, `PRIVILEGED_SUCCESS`, `RENDERABLE`, …). A sim may put privileged success into `StepResult.info` for a scorer to read; a real robot typically relies on an operator verdict (`operator_scorer`) or a learned classifier.

## Compatibility

If the policy's action dimension/semantics or required observations don't match the embodiment, `eval` raises a `CompatibilityError` before any rollout. Use `remap=` to alias differing camera/state key names:

```
eval(task, MyVLA(), MyArm(), remap={"base_rgb": "camera_0"})
```

# Scoring

A Scorer maps a recorded TrialRecord (plus the scene's Target) to a Score. Scorers read the *recorded* trajectory — never a live environment — so scoring is **reproducible from a saved log**.

## Builtin scorers

```
from robolens.scorer import (
    success_at_end,        # 1.0 iff the episode terminated with reason "success"
    episode_length,        # number of steps taken
    min_distance_to_goal,  # closest the effector got (reads StepResult.info["distance"])
    reached_goal_state,    # success iff min distance <= threshold
    operator_scorer,       # reads a human verdict recorded during the rollout
)
```

## Custom scorers

A scorer is any object with a `name` and a `__call__(record, target) -> Score`:

```
from dataclasses import dataclass

import numpy as np

from robolens.scorer import Score

@dataclass(frozen=True)
class SmoothMotion:
    name: str = "smooth_motion"

    def __call__(self, record, target) -> Score:
        # Sum the absolute command values of each step; smaller totals mean gentler motion.
        magnitudes = [float(np.abs(s.action.data).sum()) for s in record.steps]
        return Score(value=-sum(magnitudes), explanation="negative total command magnitude")
```

Register it with the `scorer` decorator to resolve it by name.

## Epochs and reducers

When a `Task` runs `epochs > 1`, an **epoch reducer** collapses the per-epoch scores of a scene before metrics aggregate across scenes. Reducers are namespaced separately from metrics and are selected by name on `Epochs`:

| Reducer                        | Meaning                                           |
| ------------------------------ | ------------------------------------------------- |
| `mean`, `median`, `max`, `min` | numeric reductions (raise on non-numeric strings) |
| `mode`                         | most common value (works for categorical scores)  |
| `pass_at_<k>`                  | unbiased pass@k estimator (success = value ≥ 0.5) |

```
from robolens.task import Epochs, Task
Task(..., epochs=Epochs(count=5, reducer="pass_at_2"))
```
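
For reference, the usual unbiased pass@k estimator with `n` epochs and `c` successes is `1 - C(n - c, k) / C(n, k)`. The sketch below applies it with the 0.5 success threshold from the table. `pass_at_k` is a hypothetical helper, and raising when `k` exceeds the epoch count is an assumption about edge-case handling.

```python
from math import comb
from typing import Sequence


def pass_at_k(scores: Sequence[float], k: int, threshold: float = 0.5) -> float:
    """Unbiased pass@k: the chance that k of the n epochs, drawn without
    replacement, contain at least one success."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of epochs")
    c = sum(1 for score in scores if score >= threshold)
    # comb(n - c, k) is 0 when fewer than k epochs failed, giving exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


five_epochs = [1.0, 0.0, 1.0, 0.0, 0.0]   # n = 5, c = 2
estimate = pass_at_k(five_epochs, k=2)    # 1 - C(3, 2) / C(5, 2) = 1 - 3/10
```

Averaging over all size-`k` subsets of the recorded epochs is what makes the estimate unbiased, compared with simply checking the first `k` epochs.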

## Operator and VLM scoring (real world)

Real robots have no privileged success oracle. The dominant method is a **human verdict**, captured *once* during the rollout (as a transcript event) and read back by `operator_scorer`, which keeps scoring reproducible. A `VLMScorer` interface is reserved for scoring final frames with a vision-language classifier.

# Logging & Rerun

## The eval log

Every run produces an immutable EvalLog — the canonical, reproducible record. It mirrors Inspect AI: `version`, `status`, an `eval` spec (task/policy/embodiment, created time, git revision, package versions), `results` (aggregate metrics), `stats` (timing, inference latency), per-scene `samples`, and a structured `error`.

```
from robolens import eval, read_eval_log

(log,) = eval("cubepick-reach", "scripted", "cubepick", log_dir="logs")
again = read_eval_log("logs/cubepick-reach_xxxx.json")   # always re-readable
```

Logs are written **atomically** (temp file + rename), schema-versioned, and carry a read-back guarantee: a newer RoboLens always reads an older log.
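
The temp-file-plus-rename pattern looks roughly like this in plain Python. `write_json_atomic` is a hypothetical helper written for illustration, not the RoboLens writer.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, payload: dict) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic replacement of the destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as log_dir:
    target = os.path.join(log_dir, "example.json")
    write_json_atomic(target, {"version": 1, "status": "success"})
    with open(target, encoding="utf-8") as handle:
        round_trip = json.load(handle)
    leftovers = [name for name in os.listdir(log_dir) if name.endswith(".tmp")]
```

If the process dies mid-write, only the temporary file is affected, so a half-written log never appears under the final name.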

## Sinks

A `LogSink` observes the run lifecycle (`on_eval_start` → per trial `on_trial_start`/`log_step`/`on_trial_end` → `on_eval_end`). Builtins:

- JsonLogSink — always on; the canonical JSON record.
- RerunSink — optional, lazily imported.

```
from robolens.logging import JsonLogSink, RerunSink

eval(task, policy, embodiment, sinks=[JsonLogSink("logs"), RerunSink("run.rrd")])
```

## Rerun visualization

`RerunSink` streams camera images, proprioception, action vectors, reward, and termination markers to a [Rerun](https://github.com/rerun-io/rerun) recording. It imports `rerun-sdk` lazily — if it isn't installed, the sink warns once and no-ops, so core never depends on it. Install with `pip install "robolens[rerun]"`.
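
The "import lazily, warn once, then no-op" behavior is a general pattern. A minimal sketch, with a hypothetical `LazySink` class and a deliberately nonexistent module name standing in for a missing backend:

```python
import importlib
import warnings


class LazySink:
    """Hypothetical sink that degrades to a no-op when its backend is missing."""

    def __init__(self, module_name: str) -> None:
        self._module_name = module_name
        self._backend = None
        self._checked = False

    def _load(self):
        # Try the import only once, so the warning is also emitted only once.
        if not self._checked:
            self._checked = True
            try:
                self._backend = importlib.import_module(self._module_name)
            except ImportError:
                warnings.warn(f"{self._module_name} is not installed; sink disabled")
        return self._backend

    def log_step(self, step: dict) -> bool:
        backend = self._load()
        if backend is None:
            return False  # no-op: nothing is recorded
        return True       # a real sink would forward `step` to the backend here


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sink = LazySink("surely_not_installed_backend_xyz")
    results = [sink.log_step({"t": i}) for i in range(3)]

sink_warnings = [w for w in caught if "not installed" in str(w.message)]
```

Deferring the import to first use keeps the optional dependency out of the core import path while still telling the user why nothing was recorded.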

## Frame side-cars

Camera frames are large. With `store_frames=True`, the rollout streams frames to `<log_dir>/frames` through a `FrameStore`, and the `TrialRecord` keeps lightweight `FrameRef` handles, so long, multi-camera episodes stay memory-safe and remain scorable from disk.

```
eval(task, policy, embodiment, log_dir="logs", store_frames=True)
```
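
To make the side-car idea concrete, here is a small sketch of a store that writes each frame to disk and hands back a lightweight reference. `FrameStoreSketch` and `FrameRefSketch` are hypothetical, the on-disk layout is an assumption, and frames are raw bytes rather than image arrays.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameRefSketch:
    """Lightweight handle kept in the record instead of the pixel data."""

    camera: str
    step: int
    path: str


class FrameStoreSketch:
    def __init__(self, root: str) -> None:
        self.root = root

    def put(self, camera: str, step: int, frame: bytes) -> FrameRefSketch:
        camera_dir = os.path.join(self.root, camera)
        os.makedirs(camera_dir, exist_ok=True)
        path = os.path.join(camera_dir, f"{step:06d}.bin")
        with open(path, "wb") as handle:
            handle.write(frame)
        return FrameRefSketch(camera, step, path)

    def get(self, ref: FrameRefSketch) -> bytes:
        with open(ref.path, "rb") as handle:
            return handle.read()


with tempfile.TemporaryDirectory() as log_dir:
    store = FrameStoreSketch(os.path.join(log_dir, "frames"))
    refs = [store.put("base_rgb", step, bytes([step]) * 16) for step in range(3)]
    loaded = store.get(refs[2])
```

The record then holds only small references, and a scorer that needs pixels loads them from disk on demand.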

# Plugins & the registry

RoboLens components register by name and resolve from strings — the mechanism the CLI and `eval("...", "...", "...")` use. In-tree builtins register via decorators; out-of-tree packages publish **entry points**, so an installed plugin appears in `robolens list` without being imported first.

## Decorators

```
from robolens.registry import embodiment, policy, scorer, task

@policy("my-vla")
class MyVLA: ...

@embodiment("my-arm")
class MyArm: ...

@scorer("smooth")
def smooth(): ...

@task("my-bench")
def my_bench(): ...
```

## Resolving

```
from robolens.registry import registered, resolve

registered("policy")          # {"scripted": ..., "random": ..., "my-vla": ...}
policy = resolve("policy", "my-vla", checkpoint="...")   # constructor kwargs forwarded
```
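
Conceptually, a name registry with forwarded constructor arguments can be as small as the sketch below. `register`, `resolve_sketch`, and `_REGISTRY` are hypothetical and only mimic the shape of the API shown above; they are not the RoboLens implementation.

```python
from typing import Any, Callable, Dict

_REGISTRY: Dict[str, Dict[str, Callable[..., Any]]] = {"policy": {}, "task": {}}


def register(kind: str, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator factory: store the class or function under (kind, name)."""

    def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
        if name in _REGISTRY[kind]:
            raise ValueError(f"{kind} {name!r} is already registered")
        _REGISTRY[kind][name] = factory
        return factory  # the decorated object stays usable directly

    return decorator


def resolve_sketch(kind: str, name: str, **kwargs: Any) -> Any:
    """Look up a factory by name and call it with the given keyword arguments."""
    try:
        factory = _REGISTRY[kind][name]
    except KeyError:
        known = sorted(_REGISTRY.get(kind, {}))
        raise KeyError(f"unknown {kind} {name!r}; known: {known}") from None
    return factory(**kwargs)


@register("policy", "my-vla")
class MyVLASketch:
    def __init__(self, checkpoint: str = "default") -> None:
        self.checkpoint = checkpoint


instance = resolve_sketch("policy", "my-vla", checkpoint="step-1000")
```

Returning the original object from the decorator means registration has no effect on direct use: `MyVLASketch()` still works without going through the registry.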

## Shipping an out-of-tree plugin

Publish entry points from your package's `pyproject.toml`:

```
[project.entry-points."robolens.embodiments"]
maniskill = "robolens_maniskill:ManiSkillEmbodiment"

[project.entry-points."robolens.policies"]
openvla = "robolens_openvla:OpenVLAPolicy"
```

Groups: `robolens.tasks`, `robolens.policies`, `robolens.embodiments`, `robolens.scorers`, `robolens.sinks`. After `pip install robolens-maniskill`, it shows up in `robolens list` and resolves by name in `eval()` and the CLI.
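
Entry points can be listed without importing the plugin, using the standard `importlib.metadata` module. The `discover` helper below is hypothetical; it returns an empty mapping when no plugin for the group is installed.

```python
from importlib.metadata import entry_points
from typing import Dict


def discover(group: str) -> Dict[str, str]:
    """Map entry-point names to their 'module:attr' targets without importing them."""
    found = entry_points()
    if hasattr(found, "select"):      # Python 3.10 and later
        selected = found.select(group=group)
    else:                             # older dict-style return value
        selected = found.get(group, [])
    return {ep.name: ep.value for ep in selected}


policies = discover("robolens.policies")
embodiments = discover("robolens.embodiments")
```

The target object is only imported when `ep.load()` is called, so listing stays cheap even if a plugin pulls in a heavy simulator.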

This is how the ecosystem stays decoupled: this repository is the **framework**; specific simulators, VLA weights, and benchmarks live in their own packages.

# Command-line interface

The `robolens` CLI wraps the registry and `eval()`.

## `robolens list`

Show registered components (builtins + installed plugins):

```
robolens list                 # all kinds
robolens list policies        # just one kind
robolens list embodiments
```

## `robolens run`

Resolve a task/policy/embodiment from the registry and run an eval. Pass constructor arguments with `-T` (task), `-P` (policy), and `-E` (embodiment) as `key=value` (parsed as bool/int/float/None/str):

```
robolens run --task cubepick-reach --policy scripted --embodiment cubepick
robolens run --task cubepick-reach -T num_scenes=10 --policy scripted -P chunk_size=8 \
             --embodiment cubepick --log-dir logs --seed 0
```
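
The `key=value` coercion described above can be approximated as follows. `parse_value` and `parse_kv` are hypothetical helpers, and the accepted spellings (case-insensitive `true`, `false`, `none`) are assumptions rather than the CLI's documented rules.

```python
from typing import Any, Dict, Iterable


def parse_value(text: str) -> Any:
    """Coerce a string to bool, None, int, or float, falling back to str."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    if lowered == "none":
        return None
    for cast in (int, float):  # try int first so "8" does not become 8.0
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_kv(items: Iterable[str]) -> Dict[str, Any]:
    """Turn ['a=1', 'b=true'] into {'a': 1, 'b': True}."""
    parsed: Dict[str, Any] = {}
    for item in items:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        parsed[key] = parse_value(value)
    return parsed


policy_args = parse_kv(["chunk_size=8", "smooth=true", "gain=0.5", "ckpt=None", "name=pi0"])
```

Splitting on the first `=` only means values may themselves contain `=`, for example a path with query-style suffixes.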

The exit code is `0` on a successful eval, `1` otherwise.

## `robolens inspect`

Print a summary of a saved `EvalLog`:

```
robolens inspect logs/cubepick-reach_xxxx.json
```

```
task:        cubepick-reach
policy:      scripted
embodiment:  cubepick
status:      success
scenes:      5   trials: 5
metrics:
  success_at_end: 1
scenes:
  [success] scene-0: success_at_end=1
  ...
```

## `robolens --version`

```
robolens --version
```