Concepts
RoboLens factors a robotics evaluation into a few small, orthogonal pieces.
The two inputs
Unlike LLM evals (one swappable input, the model), a robotics eval has two:
- Policy: the VLA "brain". Given an Observation, it returns an ActionChunk: a horizon of H actions executed open-loop (because VLA inference is slower than the control rate). H = 1 is the degenerate reactive case.
- Embodiment: the "body + world", either a real robot or a simulator. It produces observations, executes actions, and owns the action/observation spaces, the native control rate, and the reset/safety machinery.
Both are runtime-checkable Protocols, so you can wrap an existing model or sim
without inheriting anything. Convenience base classes (PolicyBase,
EmbodimentBase) exist if you prefer.
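
As a minimal sketch of what "runtime-checkable Protocols" buys you, the example below uses Python's standard `typing.Protocol`. Only `policy.act()` and `embodiment.step(action)` appear in this document; the `reset` method, the field names on `Observation` and `ActionChunk`, and the toy classes are assumptions made for illustration, not RoboLens's actual definitions.

```python
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class Observation:
    images: dict[str, Any]   # camera name -> frame (hypothetical field)
    state: dict[str, float]  # proprioceptive state (hypothetical field)


@dataclass
class ActionChunk:
    actions: list[list[float]]  # H actions, executed open-loop


@runtime_checkable
class Policy(Protocol):
    def act(self, obs: Observation) -> ActionChunk: ...


@runtime_checkable
class Embodiment(Protocol):
    def reset(self, seed: int) -> Observation: ...  # assumed method
    def step(self, action: list[float]) -> Observation: ...


class ZeroPolicy:
    """Wraps a 'model' without inheriting from anything."""

    def act(self, obs: Observation) -> ActionChunk:
        return ActionChunk(actions=[[0.0] * 7])  # H = 1, the reactive case


class CountingSim:
    """Toy simulator whose only state is a step counter."""

    def __init__(self) -> None:
        self.t = 0

    def reset(self, seed: int) -> Observation:
        self.t = 0
        return Observation(images={}, state={"t": 0.0})

    def step(self, action: list[float]) -> Observation:
        self.t += 1
        return Observation(images={}, state={"t": float(self.t)})


# Structural typing: isinstance succeeds because the methods exist.
print(isinstance(ZeroPolicy(), Policy), isinstance(CountingSim(), Embodiment))
```

Note that `isinstance` against a runtime-checkable Protocol only checks that the methods are present, not their signatures, so a separate compatibility check is still needed before a rollout.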
Tasks and scenes
A Task is an embodiment-agnostic benchmark: a dataset
of Scenes plus scorer(s), a step horizon, and an epoch
count. A Scene is the robotics analog of Inspect AI's Sample — one initial
condition: an instruction, an optional success Target,
and a seed.
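
The relationship between a Task and its Scenes can be pictured with plain dataclasses. This is an illustrative sketch only: the field names (`max_steps`, `epochs`, `scorers`, `target`) and the `trials` helper are hypothetical and may not match the real classes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator, Optional, Sequence


@dataclass(frozen=True)
class Scene:
    instruction: str
    seed: int
    target: Optional[Any] = None  # optional success Target


@dataclass(frozen=True)
class Task:
    scenes: Sequence[Scene]
    scorers: Sequence[Callable[..., Any]]
    max_steps: int   # step horizon (field name is an assumption)
    epochs: int = 1  # how many times each scene is repeated


def trials(task: Task) -> Iterator[tuple[Scene, int]]:
    """Yield every (scene, epoch) pair a run would execute."""
    for scene in task.scenes:
        for epoch in range(task.epochs):
            yield scene, epoch


task = Task(
    scenes=[
        Scene("pick up the red block", seed=0),
        Scene("open the drawer", seed=1, target="drawer_open"),
    ],
    scorers=[],
    max_steps=200,
    epochs=3,
)
print(len(list(trials(task))))  # 2 scenes x 3 epochs
```

Nothing in the Task refers to a particular robot or simulator, which is what lets the same benchmark run against different embodiments.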
Compatibility
Before any rollout, check_compatibility verifies the
(policy, embodiment) pair: action dimensions and ActionSemantics
(control mode, rotation representation, gripper, frame), the observation
cameras/state keys the policy requires (resolving a name remap), the control rate,
and whether each scene is realizable on the embodiment. Hard mismatches fail fast
with a CompatibilityError.
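
The checks described above might look roughly like the sketch below. It covers only action dimension, control mode, and camera names with a remap; `PolicySpec`, `EmbodimentSpec`, and `check_compatibility_sketch` are hypothetical stand-ins and do not reflect the real signature of check_compatibility.

```python
from dataclasses import dataclass, field


class CompatibilityError(Exception):
    """Raised before any rollout when the pair cannot work together."""


@dataclass
class PolicySpec:  # hypothetical description of what a policy expects
    action_dim: int
    control_mode: str
    cameras: list[str]
    camera_remap: dict[str, str] = field(default_factory=dict)


@dataclass
class EmbodimentSpec:  # hypothetical description of what a body provides
    action_dim: int
    control_mode: str
    cameras: list[str]


def check_compatibility_sketch(policy: PolicySpec, emb: EmbodimentSpec) -> None:
    problems = []
    if policy.action_dim != emb.action_dim:
        problems.append(f"action dim {policy.action_dim} != {emb.action_dim}")
    if policy.control_mode != emb.control_mode:
        problems.append(
            f"control mode {policy.control_mode!r} != {emb.control_mode!r}"
        )
    for cam in policy.cameras:
        resolved = policy.camera_remap.get(cam, cam)  # apply the name remap
        if resolved not in emb.cameras:
            problems.append(f"camera {cam!r} (resolved to {resolved!r}) missing")
    if problems:
        # Collect everything first so one error reports every mismatch.
        raise CompatibilityError("; ".join(problems))


emb = EmbodimentSpec(7, "ee_delta", ["cam_front", "cam_wrist"])
ok = PolicySpec(7, "ee_delta", ["front"], camera_remap={"front": "cam_front"})
check_compatibility_sketch(ok, emb)  # passes silently
```

Collecting all mismatches before raising is a design choice of this sketch; it saves a fix-and-rerun cycle per problem.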
The rollout
rollout runs one trial as a single control-rate loop:
- A Controller decides the next action, internally calling policy.act() and buffering the chunk (so open-loop execution and temporal ensembling compose without forking the loop).
- An Approver reviews the action before it reaches the embodiment and can pass, clamp, or veto it (a safety gate).
- embodiment.step(action) executes it; everything is logged to sinks and recorded in an immutable TrialRecord (steps, a typed transcript, inference latencies).
Camera frames are streamed to a FrameStore and the
record keeps lightweight references, so long multi-camera episodes stay
memory-safe.
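
The single-loop structure can be sketched as follows. This is a simplified illustration under stated assumptions: `ChunkController`, `ClampApprover`, `rollout_sketch`, the toy policy and simulator, and the convention that a veto is signalled by returning `None` are all hypothetical. Sinks, the transcript, latencies, and the FrameStore are omitted.

```python
from collections import deque
from dataclasses import dataclass


class ChunkController:
    """Buffers an action chunk and replays it open-loop."""

    def __init__(self, policy):
        self.policy = policy
        self.buffer = deque()

    def next_action(self, obs):
        if not self.buffer:  # chunk exhausted: run inference again
            self.buffer.extend(self.policy.act(obs))
        return self.buffer.popleft()


class ClampApprover:
    """Clamps each component to [-limit, limit]; vetoes NaN actions."""

    def __init__(self, limit):
        self.limit = limit

    def review(self, action):
        if any(a != a for a in action):  # NaN is the only value != itself
            return None                  # veto (assumed convention)
        return [max(-self.limit, min(self.limit, a)) for a in action]


@dataclass(frozen=True)
class TrialRecord:
    steps: tuple   # ((action, observation), ...)
    vetoed: bool


def rollout_sketch(policy, embodiment, max_steps, approver):
    controller = ChunkController(policy)
    obs = embodiment.reset()
    steps = []
    for _ in range(max_steps):           # one iteration per control tick
        action = approver.review(controller.next_action(obs))
        if action is None:
            return TrialRecord(tuple(steps), vetoed=True)
        obs = embodiment.step(action)
        steps.append((tuple(action), obs))
    return TrialRecord(tuple(steps), vetoed=False)


class TwoStepPolicy:
    """Toy policy that returns a chunk of H = 2 actions per inference."""

    def __init__(self):
        self.calls = 0

    def act(self, obs):
        self.calls += 1
        return [[2.0], [0.5]]


class IntegratorSim:
    """Toy 1-D world: the observation is the running sum of actions."""

    def reset(self):
        self.x = 0.0
        return self.x

    def step(self, action):
        self.x += action[0]
        return self.x


policy = TwoStepPolicy()
record = rollout_sketch(policy, IntegratorSim(), max_steps=4,
                        approver=ClampApprover(limit=1.0))
print(policy.calls, record.steps[-1])
```

Because the controller owns the buffer, the loop itself never needs to know the chunk length: four control ticks here trigger only two inference calls.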
Scoring
A Scorer maps a recorded TrialRecord (+ the scene's
Target) to a Score. Because scorers consume the
recorded trajectory (not a live environment), scoring is reproducible from a
saved log. Across the epochs of a scene, an epoch reducer (mean, max,
pass_at_k, …) collapses scores; metrics then aggregate across scenes.
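
To make the two-stage aggregation concrete, here is a small sketch: per-scene epoch scores are first reduced, then a metric averages across scenes. The `pass_at_k_sketch` function uses the common unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of epochs and c the number of passing ones; whether RoboLens's pass_at_k uses this exact definition is an assumption, as are the `threshold` parameter and the sample data.

```python
from math import comb
from statistics import mean


def pass_at_k_sketch(scores, k, threshold=1.0):
    """Probability that at least one of k epochs drawn from `scores` passes."""
    n = len(scores)
    if k > n:
        raise ValueError("k cannot exceed the number of epochs")
    c = sum(s >= threshold for s in scores)  # number of passing epochs
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative epoch scores for two scenes (made-up values).
per_scene = {
    "scene_a": [1.0, 0.0, 0.0, 1.0],
    "scene_b": [0.0, 0.0, 0.0, 0.0],
}

# Stage 1: an epoch reducer collapses each scene's scores to one number.
by_mean = {name: mean(s) for name, s in per_scene.items()}
by_max = {name: max(s) for name, s in per_scene.items()}
by_pass2 = {name: pass_at_k_sketch(s, k=2) for name, s in per_scene.items()}

# Stage 2: a metric aggregates the reduced scores across scenes.
overall_mean = mean(list(by_mean.values()))
overall_pass2 = mean(list(by_pass2.values()))
print(by_mean, by_max, by_pass2, overall_mean, overall_pass2)
```

The choice of reducer changes what the final number means: `mean` rewards consistency, while `max` and pass-at-k reward a scene being solvable at least sometimes.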
Errors and safety
The error taxonomy resolves the "fail fast vs never-crash-overnight" tension:
| Class | Policy |
|---|---|
| CompatibilityError, ConfigError | fail fast, before any rollout |
| PolicyError | record the trial, continue (governed by fail_on_error) |
| EmbodimentFault, SafetyAbort | always halt: a faulted/unsafe robot never auto-advances |
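
One way to picture how these policies play out inside a run loop is the sketch below. The exception class names come from the table; everything else (`run_trials_sketch`, the record dictionaries, and treating fail_on_error as a simple boolean that stops the run) is an assumption for illustration.

```python
class PolicyError(Exception):
    pass


class EmbodimentFault(Exception):
    pass


class SafetyAbort(Exception):
    pass


def run_trials_sketch(trial_fns, fail_on_error=False):
    """Run trials in order, applying the per-class error policy."""
    records = []
    for i, fn in enumerate(trial_fns):
        try:
            records.append({"trial": i, "status": "ok", "result": fn()})
        except PolicyError as exc:
            # Record the failed trial; continue unless told otherwise.
            records.append({"trial": i, "status": "error", "error": str(exc)})
            if fail_on_error:
                break
        except (EmbodimentFault, SafetyAbort) as exc:
            # Never auto-advance past a faulted or unsafe robot.
            records.append({"trial": i, "status": "halted", "error": str(exc)})
            break
    return records


def ok_trial():
    return "done"


def bad_policy_trial():
    raise PolicyError("inference timed out")


def faulted_trial():
    raise EmbodimentFault("joint limit exceeded")


trials = [ok_trial, bad_policy_trial, ok_trial, faulted_trial, ok_trial]
print([r["status"] for r in run_trials_sketch(trials)])
```

The fail-fast classes (CompatibilityError, ConfigError) do not appear in the loop because they are expected to be raised before the first trial starts.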
The eval log
eval orchestrates scenes × epochs and returns immutable
EvalLogs (status, spec, results, stats, per-scene samples,
error). Logs are written atomically as schema-versioned JSON with a read-back
guarantee.
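
A common way to get an atomic write with a read-back check uses only the Python standard library: write to a temporary file in the same directory, flush and fsync it, rename it over the destination with `os.replace`, then reload and compare. The sketch below follows that pattern; the function name, the `schema_version` key, and the log fields are hypothetical, and the library's own writer may work differently.

```python
import json
import os
import tempfile

SCHEMA_VERSION = 1  # assumed versioning scheme


def write_log_atomic_sketch(path, log):
    payload = {"schema_version": SCHEMA_VERSION, **log}
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for an atomic rename.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before renaming
        os.replace(tmp, path)     # readers see the old or new file, never half
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Read-back check: the file on disk must parse to what was written.
    with open(path, encoding="utf-8") as f:
        if json.load(f) != payload:
            raise OSError(f"read-back mismatch for {path}")
    return payload


out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "eval.json")
written = write_log_atomic_sketch(
    out_path, {"status": "success", "results": {"success_rate": 0.5}}
)
print(written["schema_version"], os.listdir(out_dir))
```

The read-back comparison assumes the log contains only JSON-native types; tuples, for example, would come back as lists and trip the check.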