# RoboLens

> The Inspect AI for robotics: an evaluation framework for physical AI and VLA (vision-language-action) models.

RoboLens is the open-source evaluation framework for physical AI and VLA (vision-language-action) models, the "Inspect AI for robotics". A benchmark is a Task (dataset of Scenes + scorers) run against two swappable inputs: a Policy (the VLA) and an Embodiment (a real robot or simulator).

# Guide

# RoboLens

The **Inspect AI** for robotics.

An open-source evaluation framework for **physical AI** and **VLA (vision-language-action)** models. Define a robotics benchmark once, then run *any* policy against *any* compatible embodiment (a real robot or a simulator) with reproducible logs and first-class [Rerun](https://github.com/rerun-io/rerun) visualization.

[Get started](https://robocurve.github.io/robolens/guide/quickstart/index.md) [Concepts](https://robocurve.github.io/robolens/guide/concepts/index.md) [GitHub](https://github.com/robocurve/robolens)

______________________________________________________________________

## One framework, two swappable inputs

LLM evals have a single swappable input: the model. **Robotics evals have two**, and RoboLens makes both first-class and orthogonal.

- **`Policy`: the VLA.** The "brain". Maps an observation + a language instruction to an **action chunk** (a horizon of actions executed open-loop, as π0 / ACT / diffusion policies do).
- **`Embodiment`: the robot or sim.** The "body + world". Produces observations, executes actions, and owns the action/observation spaces and control rate. Real-robot-first; sims are a stricter special case.

A **`Task`**, a dataset of `Scene`s (initial conditions, instructions, success targets) plus scorers, is defined *independently* of both.
Before any rollout, RoboLens verifies the `(policy, embodiment)` pair is **compatible** and fails fast and loud if not.

______________________________________________________________________

## Quickstart

```
pip install robolens            # core (numpy only)
pip install "robolens[rerun]"   # + Rerun visualization
```

No hardware or simulator is required. The dependency-free `CubePick` mock world exercises the whole stack:

```
from robolens import eval
from robolens.mock import CubePickEmbodiment, ScriptedPolicy
from robolens.scene import Scene
from robolens.scorer import success_at_end
from robolens.task import Task

task = Task(
    name="cubepick-reach",
    scenes=[Scene(id=f"layout-{i}", instruction="reach the cube", init_seed=i) for i in range(5)],
    scorer=success_at_end(),
    max_steps=80,
)

# The two swappable inputs: a policy (VLA) and an embodiment (robot/sim).
(log,) = eval(task, ScriptedPolicy(), CubePickEmbodiment())
print(log.status, log.results.metrics)  # success {'success_at_end': 1.0}
```

…or from the command line:

```
robolens list                                 # registered components
robolens run --task cubepick-reach --policy scripted --embodiment cubepick
robolens inspect logs/cubepick-reach_*.json   # results table
```

______________________________________________________________________

## Why RoboLens

- **Real-world first.** Interfaces assume real-robot reality: human-in-the-loop reset, no privileged success oracle, wall-clock control rate. Simulators just offer more.
- **Reproducible.** Every run yields an immutable, schema-versioned `EvalLog` with the resolved config, git revision, and package versions, re-readable across releases.
- **Light core.** The core depends only on NumPy. Rerun and simulator/VLA backends are optional extras and separately installable plugins.
- **Safe unattended.** An explicit error taxonomy separates "record and continue" from "halt and require a human", so a faulted robot never auto-advances overnight.
- **Rerun visualization.** Stream camera images, 3D poses, joint/action time-series, and success markers to a [Rerun](https://github.com/rerun-io/rerun) recording.
- **Pluggable.** Ship `robolens-maniskill` or `robolens-openvla` as separate packages; entry points make them appear in `robolens list` automatically.

______________________________________________________________________

## How it maps to Inspect AI

If you know [Inspect AI](https://inspect.aisi.org.uk/), you already know RoboLens.

| Inspect AI                             | RoboLens                                                  |
| -------------------------------------- | --------------------------------------------------------- |
| `Model`                                | `Policy` (VLA) **+** `Embodiment` *(two inputs)*          |
| `Task = dataset + solver + scorer`     | `Task = scenes + controller + scorer`                     |
| `Sample`                               | `Scene`                                                   |
| `Solver` chain                         | `Controller` middleware (chunking, ensembling, smoothing) |
| `eval()` → `EvalLog`                   | `eval()` → `EvalLog`                                      |
| `@task`/`@solver`/`@scorer` + registry | `@task`/`@policy`/`@embodiment`/`@scorer` + entry points  |

For LLMs: [`llms.txt`](https://robocurve.github.io/robolens/llms.txt) · [`llms-full.txt`](https://robocurve.github.io/robolens/llms-full.txt).
# Quickstart

## Install

```
pip install robolens            # core (numpy only)
pip install "robolens[rerun]"   # + Rerun visualization
```

For development, use [uv](https://github.com/astral-sh/uv):

```
uv venv && uv pip install -e ".[dev]"
uv run pytest
```

## Run your first evaluation

The dependency-free `CubePick` mock world lets you exercise the whole stack with no hardware or simulator:

```
from robolens import eval
from robolens.mock import CubePickEmbodiment, ScriptedPolicy
from robolens.scene import Scene
from robolens.scorer import success_at_end
from robolens.task import Task

task = Task(
    name="cubepick-reach",
    scenes=[Scene(id=f"layout-{i}", instruction="reach the cube", init_seed=i) for i in range(5)],
    scorer=success_at_end(),
    max_steps=80,
)

(log,) = eval(task, ScriptedPolicy(), CubePickEmbodiment())
print(log.status)           # "success"
print(log.results.metrics)  # {"success_at_end": 1.0}
```

`eval()` returns a list of `EvalLog`s (one per task, mirroring Inspect AI). Each log is immutable, schema-versioned, and written to `log_dir`.

## Use registry names

`task`, `policy`, and `embodiment` may also be **registry names**, the same mechanism the CLI uses:

```
from robolens import eval

(log,) = eval("cubepick-reach", "scripted", "cubepick")
```

## From the command line

```
robolens list                                 # all registered components
robolens list policies                        # just policies
robolens run --task cubepick-reach --policy scripted --embodiment cubepick
robolens run --task cubepick-reach --policy scripted --embodiment cubepick -P chunk_size=6
robolens inspect logs/cubepick-reach_*.json   # print a saved log
```

## Next steps

- [Concepts](https://robocurve.github.io/robolens/guide/concepts/index.md): the core abstractions.
- [Writing A Benchmark](https://robocurve.github.io/robolens/guide/writing-a-benchmark/index.md): define your own `Task`.
- [Policies And Embodiments](https://robocurve.github.io/robolens/guide/policies-and-embodiments/index.md): plug in a real VLA or robot/sim.
# Concepts

RoboLens factors a robotics evaluation into a few small, orthogonal pieces.

## The two inputs

Unlike LLM evals (one swappable input, the model), a robotics eval has **two**:

- `Policy`: the VLA "brain". Given an `Observation`, it returns an `ActionChunk`: a horizon of `H` actions executed open-loop (because VLA inference is slower than the control rate). `H = 1` is the degenerate reactive case.
- `Embodiment`: the "body + world", a real robot or a simulator. It produces observations, executes actions, and owns the action/observation spaces, the native control rate, and reset/safety machinery.

Both are runtime-checkable `Protocol`s, so you can wrap an existing model or sim without inheriting anything. Convenience base classes (`PolicyBase`, `EmbodimentBase`) exist if you prefer.

## Tasks and scenes

A `Task` is an **embodiment-agnostic** benchmark: a dataset of `Scene`s plus scorer(s), a step horizon, and an epoch count. A `Scene` is the robotics analog of Inspect AI's `Sample`. It describes one initial condition: an instruction, an optional success `Target`, and a seed.

## Compatibility

Before any rollout, `check_compatibility` verifies the `(policy, embodiment)` pair:

- action dimensions and `ActionSemantics` (control mode, rotation representation, gripper, frame),
- the observation cameras/state keys the policy requires (resolving a name remap),
- the control rate, and
- whether each scene is realizable on the embodiment.

Hard mismatches fail fast with a `CompatibilityError`.

## The rollout

`rollout` runs one trial as a single control-rate loop:

1. A `Controller` decides the next action, internally calling `policy.act()` and buffering the chunk (so open-loop execution and temporal ensembling compose without forking the loop).
2. An `Approver` reviews the action before it reaches the embodiment and can pass, clamp, or veto it (a safety gate).
3. `embodiment.step(action)` executes it; everything is logged to sinks and recorded in an immutable `TrialRecord` (steps, a typed transcript, inference latencies).
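The three steps above can be sketched as one loop. This is a toy illustration under stated assumptions, not RoboLens's implementation: `ToyPolicy`, `ChunkController`, `clamp_approver`, and `ToyEmbodiment` are hypothetical names, the world is one-dimensional, and the approver only clamps (it never vetoes).

```python
class ToyPolicy:
    """Hypothetical policy: returns a chunk of `horizon` scalar actions per call."""

    def __init__(self, horizon: int = 4) -> None:
        self.horizon = horizon
        self.calls = 0  # how many times inference was requested

    def act(self, observation: float) -> list[float]:
        self.calls += 1
        # Head toward zero; deliberately oversized so the approver has work to do.
        step = -2.0 if observation > 0 else 2.0
        return [step] * self.horizon


class ChunkController:
    """Buffers a chunk and replays it open-loop; queries the policy only when empty."""

    def __init__(self, policy: ToyPolicy) -> None:
        self.policy = policy
        self.buffer: list[float] = []

    def next_action(self, observation: float) -> float:
        if not self.buffer:
            self.buffer = list(self.policy.act(observation))
        return self.buffer.pop(0)


def clamp_approver(action: float, limit: float = 1.0) -> float:
    """Safety gate: clamp the command magnitude to `limit`."""
    return max(-limit, min(limit, action))


class ToyEmbodiment:
    """1-D world: the state is a position, an action is a displacement."""

    def __init__(self, start: float) -> None:
        self.position = start

    def step(self, action: float) -> float:
        self.position += action
        return self.position


def rollout(policy: ToyPolicy, embodiment: ToyEmbodiment, max_steps: int) -> list[dict]:
    controller = ChunkController(policy)
    record = []
    obs = embodiment.position
    for t in range(max_steps):
        proposed = controller.next_action(obs)  # 1. controller decides
        approved = clamp_approver(proposed)     # 2. approver reviews
        obs = embodiment.step(approved)         # 3. embodiment executes
        record.append({"t": t, "proposed": proposed, "approved": approved, "obs": obs})
    return record


policy = ToyPolicy(horizon=4)
record = rollout(policy, ToyEmbodiment(start=5.0), max_steps=8)
print(policy.calls, record[-1]["obs"])
```

With a horizon of 4 and 8 steps, the policy is queried only twice. Because each chunk runs open-loop, the toy arm overshoots its goal before the next query, which is the trade-off a longer horizon makes.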
Camera frames are streamed to a `FrameStore` and the record keeps lightweight references, so long multi-camera episodes stay memory-safe.

## Scoring

A `Scorer` maps a recorded `TrialRecord` (+ the scene's `Target`) to a `Score`. Because scorers consume the *recorded* trajectory (not a live environment), scoring is reproducible from a saved log. Across the `epochs` of a scene, an **epoch reducer** (`mean`, `max`, `pass_at_k`, …) collapses scores; metrics then aggregate across scenes.

## Errors and safety

The error taxonomy resolves the "fail fast vs never-crash-overnight" tension:

| Error class                           | Handling                                                    |
| ------------------------------------- | ----------------------------------------------------------- |
| `CompatibilityError`, `ConfigError`   | fail fast, before any rollout                               |
| `PolicyError`                         | record the trial, continue (governed by `fail_on_error`)    |
| `EmbodimentFault`, `SafetyAbort`      | **always halt**: a faulted/unsafe robot never auto-advances |

## The eval log

`eval` orchestrates scenes × epochs and returns immutable `EvalLog`s (status, spec, results, stats, per-scene samples, error). Logs are written atomically as schema-versioned JSON with a read-back guarantee.

# Writing a benchmark

A benchmark is a `Task`: a dataset of scenes plus scorer(s). It is **embodiment-agnostic**: it describes *what* to evaluate, not *how* the robot is built.

```
from robolens.scene import Scene, Target
from robolens.scorer import success_at_end
from robolens.task import Epochs, Task

task = Task(
    name="cubepick-reach",
    scenes=[
        Scene(
            id=f"layout-{i}",
            instruction="reach the cube",
            target=Target(kind="reach_object", spec={"object": "cube"}),
            init_seed=i,
        )
        for i in range(50)
    ],
    scorer=success_at_end(),
    max_steps=200,
    epochs=Epochs(count=3, reducer="mean"),
)
```

## Scenes

Each `Scene` is one initial condition (the Inspect `Sample` analog):

- `id`: unique within the task.
- `instruction`: the language goal handed to the policy.
- `target`: an optional `Target` the scorer reads; its `kind` is resolved in the *embodiment's* namespace (compatibility checking verifies the embodiment can realize it).
- `init_seed`: combined with the eval seed and epoch index to seed each trial deterministically.

## Epochs and reducers

Repeat each scene `epochs` times to measure stochastic policies. The `Epochs` reducer collapses the per-epoch scores of a scene before metrics aggregate across scenes. Builtin reducers: `mean`, `median`, `max`, `min`, `mode`, and `pass_at_k` (an unbiased pass@k estimator, selected as e.g. `pass_at_2`).

## Multiple scorers

Pass a list to score several dimensions at once:

```
from robolens.scorer import episode_length, min_distance_to_goal, success_at_end
from robolens.task import Task

task = Task(
    name="cubepick-reach",
    scenes=[...],
    scorer=[success_at_end(), episode_length(), min_distance_to_goal()],
    max_steps=200,
)
```

## Registering for discovery

Wrap a task factory with the `@task` decorator so it resolves by name in `eval("my-bench", ...)` and appears in `robolens list`:

```
from robolens.registry import task
from robolens.scorer import success_at_end
from robolens.task import Task

@task("my-bench")
def my_bench(num_scenes: int = 50) -> Task:
    return Task(name="my-bench", scenes=[...], scorer=success_at_end(), max_steps=200)
```

See [Plugins](https://robocurve.github.io/robolens/guide/plugins/index.md) to ship a benchmark from a separate package.

# Policies and embodiments

Both are runtime-checkable `Protocol`s: implement the methods on any class (no inheritance required), or subclass the convenience base classes.

## A policy (VLA)

A `Policy` maps an observation to an `ActionChunk`. It declares a `PolicyInfo` (the action space it emits and the observations it requires) used for compatibility checking.
```
import time

import numpy as np

from robolens.policy import PolicyConfig, PolicyInfo
from robolens.scene import Scene
from robolens.spaces import ActionSemantics, Box, ObservationSpace
from robolens.types import Action, ActionChunk, Observation

class MyVLA:
    def __init__(self) -> None:
        self.info = PolicyInfo(
            name="my-vla",
            action_space=Box(
                shape=(7,),
                semantics=ActionSemantics(
                    control_mode="eef_delta_pose", rotation_repr="rot6d", gripper="continuous"
                ),
            ),
            observation_space=ObservationSpace(
                state_keys=frozenset({"eef_pose", "gripper"}),
            ),
        )
        self.config = PolicyConfig(action_horizon=16)

    def reset(self, scene: Scene) -> None:
        ...  # clear any per-episode state

    def act(self, observation: Observation) -> ActionChunk:
        # The policy owns model-specific preprocessing (resize/normalize/history).
        start = time.perf_counter()
        # my_model_infer is a placeholder for your model's inference call.
        chunk = np.asarray(my_model_infer(observation))  # -> (H, 7) array
        latency_s = time.perf_counter() - start
        actions = [Action(data=a) for a in chunk]
        return ActionChunk(actions=actions, inference_latency_s=latency_s)
```

The policy owns model-specific spatial preprocessing; the embodiment emits raw frames. Temporal concerns (history, smoothing, ensembling) live in a [Controller](https://robocurve.github.io/robolens/guide/concepts/index.md).

## An embodiment (robot or sim)

An `Embodiment` produces observations and executes actions. It declares an `EmbodimentInfo` with its spaces, native control rate, and opt-in capability flags.
```
from robolens.embodiment import EmbodimentInfo, PRIVILEGED_SUCCESS, SEEDABLE
from robolens.scene import Scene
from robolens.spaces import Box, CameraSpec, ObservationSpace
from robolens.types import Action, Observation, StepResult

class MyArm:
    def __init__(self) -> None:
        self.info = EmbodimentInfo(
            name="my-arm",
            action_space=Box(shape=(7,), semantics=...),
            observation_space=ObservationSpace(
                cameras=(CameraSpec("base_rgb", 224, 224), CameraSpec("wrist_rgb", 224, 224)),
                state_keys=frozenset({"eef_pose", "gripper"}),
            ),
            control_hz=20.0,
            is_simulated=False,
            capabilities=frozenset({SEEDABLE}),  # real arms rarely have PRIVILEGED_SUCCESS
        )

    def reset(self, scene: Scene, *, seed: int | None = None) -> Observation:
        # On real hardware this may drive to home and block on operator confirmation.
        ...

    def step(self, action: Action) -> StepResult:
        # Returns as soon as the command is issued; the framework paces the loop
        # unless this embodiment declares the "self_paced" capability.
        ...

    def close(self) -> None:
        ...
```

## Real-robot vs simulator

The interfaces assume **real-robot reality**: no guaranteed privileged success, human-in-the-loop reset, wall-clock control. Simulators opt into more via `capabilities` (`SEEDABLE`, `AUTO_RESET`, `PRIVILEGED_SUCCESS`, `RENDERABLE`, …). A sim may put privileged success into `StepResult.info` for a scorer to read; a real robot typically relies on an operator verdict (`operator_scorer`) or a learned classifier.

## Compatibility

If the policy's action dimension/semantics or required observations don't match the embodiment, `eval` raises a `CompatibilityError` before any rollout. Use `remap=` to alias differing camera/state key names:

```
eval(task, MyVLA(), MyArm(), remap={"base_rgb": "camera_0"})
```

# Scoring

A `Scorer` maps a recorded `TrialRecord` (plus the scene's `Target`) to a `Score`. Scorers read the *recorded* trajectory, never a live environment, so scoring is **reproducible from a saved log**.
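To see why reading only recorded data makes scores reproducible, here is a minimal sketch that does not use RoboLens at all. `ToyStep` and `min_distance` are hypothetical stand-ins for a recorded step and a scorer, and the JSON layout is invented for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyStep:
    """Hypothetical recorded step: distance to goal after the action ran."""

    distance: float


def min_distance(steps: list[ToyStep]) -> float:
    """Score a recorded trajectory by its closest approach to the goal."""
    return min(step.distance for step in steps)


# A trajectory as it might be recorded during a rollout.
recorded = [ToyStep(0.9), ToyStep(0.4), ToyStep(0.15), ToyStep(0.3)]
live_score = min_distance(recorded)

# Round-trip through JSON, as if reading a saved log in a later session.
saved = json.dumps([{"distance": s.distance} for s in recorded])
reloaded = [ToyStep(**row) for row in json.loads(saved)]
replayed_score = min_distance(reloaded)

print(live_score == replayed_score)
```

The scorer is a pure function of the record, so a new metric can be computed over old logs without rerunning the robot.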
## Builtin scorers

```
from robolens.scorer import (
    success_at_end,        # 1.0 iff the episode terminated with reason "success"
    episode_length,        # number of steps taken
    min_distance_to_goal,  # closest the effector got (reads StepResult.info["distance"])
    reached_goal_state,    # success iff min distance <= threshold
    operator_scorer,       # reads a human verdict recorded during the rollout
)
```

## Custom scorers

A scorer is any object with a `name` and a `__call__(record, target) -> Score`:

```
from dataclasses import dataclass

import numpy as np

from robolens.scorer import Score

@dataclass(frozen=True)
class SmoothMotion:
    name: str = "smooth_motion"

    def __call__(self, record, target) -> Score:
        # Sum of absolute command components per step (L1 magnitude of each action).
        deltas = [float(np.abs(s.action.data).sum()) for s in record.steps]
        return Score(value=-sum(deltas), explanation="negative total command magnitude")
```

Register it with the `@scorer` decorator to resolve it by name.

## Epochs and reducers

When a `Task` runs `epochs > 1`, an **epoch reducer** collapses the per-epoch scores of a scene before metrics aggregate across scenes. Reducers are namespaced separately from metrics and are selected by name on `Epochs`:

| Reducer                        | Meaning                                           |
| ------------------------------ | ------------------------------------------------- |
| `mean`, `median`, `max`, `min` | numeric reductions (raise on non-numeric strings) |
| `mode`                         | most common value (works for categorical scores)  |
| `pass_at_k` (e.g. `pass_at_2`) | unbiased pass@k estimator (success = value ≥ 0.5) |

```
from robolens.task import Epochs, Task

Task(..., epochs=Epochs(count=5, reducer="pass_at_2"))
```

## Operator and VLM scoring (real world)

Real robots have no privileged success oracle. The dominant method is a **human verdict**, captured *once* during the rollout (as a transcript event) and read back by `operator_scorer`, which keeps scoring reproducible. A `VLMScorer` interface is reserved for scoring final frames with a vision-language classifier.

# Logging & Rerun

## The eval log

Every run produces an immutable `EvalLog`: the canonical, reproducible record.
It mirrors Inspect AI: `version`, `status`, an `eval` spec (task/policy/embodiment, created time, git revision, package versions), `results` (aggregate metrics), `stats` (timing, inference latency), per-scene `samples`, and a structured `error`.

```
from robolens import eval, read_eval_log

(log,) = eval("cubepick-reach", "scripted", "cubepick", log_dir="logs")
again = read_eval_log("logs/cubepick-reach_xxxx.json")  # always re-readable
```

Logs are written **atomically** (temp file + rename), schema-versioned, and carry a read-back guarantee: a newer RoboLens always reads an older log.

## Sinks

A `LogSink` observes the run lifecycle (`on_eval_start` → per trial `on_trial_start`/`log_step`/`on_trial_end` → `on_eval_end`). Builtins:

- `JsonLogSink`: always on; the canonical JSON record.
- `RerunSink`: optional, lazily imported.

```
from robolens.logging import JsonLogSink, RerunSink

eval(task, policy, embodiment, sinks=[JsonLogSink("logs"), RerunSink("run.rrd")])
```

## Rerun visualization

`RerunSink` streams camera images, proprioception, action vectors, reward, and termination markers to a [Rerun](https://github.com/rerun-io/rerun) recording. It imports `rerun-sdk` lazily: if it isn't installed, the sink warns once and no-ops, so core never depends on it. Install with `pip install "robolens[rerun]"`.

## Frame side-cars

Camera frames are large. With `store_frames=True`, the rollout streams frames to `/frames` through a `FrameStore` and the `TrialRecord` keeps lightweight `FrameRef` handles, so long, multi-camera episodes stay memory-safe and remain scorable from disk.

```
eval(task, policy, embodiment, log_dir="logs", store_frames=True)
```

# Plugins & the registry

RoboLens components register by name and resolve from strings, the mechanism the CLI and `eval("...", "...", "...")` use. In-tree builtins register via decorators; out-of-tree packages publish **entry points**, so an installed plugin appears in `robolens list` without being imported first.
## Decorators

```
from robolens.registry import embodiment, policy, scorer, task

@policy("my-vla")
class MyVLA: ...

@embodiment("my-arm")
class MyArm: ...

@scorer("smooth")
def smooth(): ...

@task("my-bench")
def my_bench(): ...
```

## Resolving

```
from robolens.registry import registered, resolve

registered("policy")  # {"scripted": ..., "random": ..., "my-vla": ...}
policy = resolve("policy", "my-vla", checkpoint="...")  # constructor kwargs forwarded
```

## Shipping an out-of-tree plugin

Publish entry points from your package's `pyproject.toml`:

```
[project.entry-points."robolens.embodiments"]
maniskill = "robolens_maniskill:ManiSkillEmbodiment"

[project.entry-points."robolens.policies"]
openvla = "robolens_openvla:OpenVLAPolicy"
```

Groups: `robolens.tasks`, `robolens.policies`, `robolens.embodiments`, `robolens.scorers`, `robolens.sinks`.

After `pip install robolens-maniskill`, it shows up in `robolens list` and resolves by name in `eval()` and the CLI. This is how the ecosystem stays decoupled: this repository is the **framework**; specific simulators, VLA weights, and benchmarks live in their own packages.

# Command-line interface

The `robolens` CLI wraps the registry and `eval()`.

## `robolens list`

Show registered components (builtins + installed plugins):

```
robolens list               # all kinds
robolens list policies      # just one kind
robolens list embodiments
```

## `robolens run`

Resolve a task/policy/embodiment from the registry and run an eval. Pass constructor arguments with `-T` (task), `-P` (policy), and `-E` (embodiment) as `key=value` (parsed as bool/int/float/None/str):

```
robolens run --task cubepick-reach --policy scripted --embodiment cubepick
robolens run --task cubepick-reach -T num_scenes=10 --policy scripted -P chunk_size=8 \
    --embodiment cubepick --log-dir logs --seed 0
```

The exit code is `0` on a successful eval, `1` otherwise.
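The `key=value` coercion for `-T`, `-P`, and `-E` could work roughly as follows. This is a sketch of one plausible parsing order (bool, None, int, float, then str), not the CLI's actual code. `parse_value` and `parse_kwargs` are hypothetical names, and the case-insensitive matching of `true`/`false`/`none` is an assumption.

```python
def parse_value(text: str):
    """Coerce a CLI string to bool, None, int, or float, else leave it as str."""
    lowered = text.lower()
    if lowered in ("true", "false"):  # assumption: booleans match case-insensitively
        return lowered == "true"
    if lowered == "none":
        return None
    for cast in (int, float):  # try int first so "8" stays an int
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def parse_kwargs(pairs: list[str]) -> dict[str, object]:
    """Turn ["chunk_size=8", ...] into constructor keyword arguments."""
    out: dict[str, object] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        out[key] = parse_value(value)
    return out


print(parse_kwargs(["chunk_size=8", "gain=0.5", "render=false", "ckpt=None", "name=pi0"]))
```

Splitting on the first `=` only means values that themselves contain `=` (for example a path or query string) survive intact.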
## `robolens inspect`

Print a summary of a saved `EvalLog`:

```
robolens inspect logs/cubepick-reach_xxxx.json
```

```
task: cubepick-reach
policy: scripted
embodiment: cubepick
status: success
scenes: 5
trials: 5
metrics:
  success_at_end: 1
scenes:
  [success] scene-0: success_at_end=1
  ...
```

## `robolens --version`

```
robolens --version
```