API reference¶
Generated automatically from the source docstrings. The public,
stability-guaranteed surface is everything exported by robolens.__all__
(eval, eval_set, read_eval_log, EvalLog and the other log dataclasses);
the sections below document the full framework.
Core types & spaces¶
types
¶
Core observation/action data types exchanged between policy and embodiment.
These are the wire format of a rollout. They are deliberately small, immutable,
and NumPy-native. Arrays are raw (the policy owns model-specific preprocessing);
images are (H, W, C) uint8.
The dataclasses set eq=False because they carry NumPy arrays, whose
element-wise == does not yield a single bool — identity/round-trip semantics
are what callers actually need here.
Observation
dataclass
¶
Observation(images: Mapping[str, ImageArray] = dict(), state: Mapping[str, StateArray] = dict(), instruction: str | None = None, image_times: Mapping[str, float] = dict(), state_time: float = 0.0, extra: Mapping[str, Any] = dict())
A single multi-modal observation produced by an embodiment.
images are keyed by camera name; state holds proprioception keyed by a
controlled vocabulary (e.g. "eef_pos", "gripper"). instruction is
the language goal for this step (usually constant across an episode, but may
change for long-horizon tasks).
Action
dataclass
¶
A single action to apply to an embodiment.
Semantics (control mode, rotation representation, gripper kind, frame) live on
the action space, not on every action instance — see robolens.spaces.
ActionChunk
dataclass
¶
ActionChunk(actions: Sequence[Action], control_hz: float | None = None, inference_latency_s: float | None = None, meta: Mapping[str, Any] = dict())
A horizon of actions predicted by one policy inference.
Modern VLAs (π0, ACT, diffusion policies) predict H future actions that
are executed open-loop because inference is slower than the control rate.
H == 1 is the degenerate "reactive policy" case. control_hz is the
rate the chunk was intended to be played at (None defers to the
embodiment's native rate); inference_latency_s, when measured, is logged.
StepResult
dataclass
¶
StepResult(observation: Observation, reward: float | None = None, terminated: bool = False, termination_reason: str | None = None, truncated: bool = False, info: Mapping[str, Any] = dict())
The outcome of applying one action to an embodiment.
terminated means the task ended (success or hard failure);
termination_reason disambiguates (e.g. "success", "collision",
"fault", "out_of_bounds"). truncated means a time/horizon cutoff.
A simulator may expose privileged success via info.
spaces
¶
Action/observation spaces and action semantics.
Spaces describe the shape of actions and observations; ActionSemantics describes
what an action means (control mode, rotation representation, gripper kind,
reference frame). Semantics are what make compatibility checking real (a 7-DoF
VLA vs a 6-DoF arm; delta vs absolute poses) and make temporal ensembling
correct.
This module ships a minimal-but-functional core for the tracer slice; richer
validation and the full StateSpec vocabulary are layered on in a later step
without changing these signatures.
ActionSemantics
dataclass
¶
ActionSemantics(control_mode: ControlMode, rotation_repr: RotationRepr = 'none', gripper: GripperKind = 'none', frame: Frame = 'base')
What an action vector means. Attached to an action Box.
Box
dataclass
¶
Box(shape: tuple[int, ...], low: NDArray[floating[Any]] | None = None, high: NDArray[floating[Any]] | None = None, semantics: ActionSemantics | None = None)
A continuous box-shaped space. Optional low/high bounds and, for
action spaces, ActionSemantics.
CameraSpec
dataclass
¶
An image stream an embodiment provides or a policy requires.
StateField
dataclass
¶
One proprioception field: its key, shape, unit, and dtype.
StateSpec
dataclass
¶
StateSpec(fields: tuple[StateField, ...] = ())
A richer description of an embodiment's proprioception than a bare key set.
ObservationSpace
dataclass
¶
ObservationSpace(cameras: tuple[CameraSpec, ...] = (), state_keys: frozenset[str] = frozenset(), state: StateSpec | None = None)
The observations an embodiment provides / a policy requires.
state_keys is the compatibility-relevant set of proprioception keys.
state optionally carries the richer StateSpec (shapes/units).
Policy & embodiment¶
policy
¶
The Policy (VLA) interface — one of RoboLens's two swappable inputs.
A Policy is the "brain": given an Observation (plus the scene's instruction), it
returns an ActionChunk to be executed open-loop.
The public contract is a runtime-checkable Policy Protocol so callers can wrap
existing models without inheriting. PolicyBase is an optional convenience ABC
with sane defaults.
PolicyConfig
dataclass
¶
PolicyConfig(action_horizon: int = 1, replan_interval: int | None = None, temperature: float | None = None)
Inference-time configuration, recorded in the eval log.
The VLA analog of Inspect's GenerateConfig: action-chunk handling and
sampling knobs that affect reproducibility.
PolicyInfo
dataclass
¶
PolicyInfo(name: str, action_space: Box, observation_space: ObservationSpace = ObservationSpace(), control_hz: float | None = None)
Static description of a policy used for compatibility checking + logging.
Policy
¶
Bases: Protocol
The VLA contract.
PolicyBase
¶
Bases: ABC
Optional base class providing defaults; inherit only for the helpers.
embodiment
¶
The Embodiment interface — RoboLens's second swappable input.
An Embodiment is the "body + world": a real robot or a simulator. It produces
observations, executes actions, and owns the action/observation spaces, the
native control rate, and reset/safety machinery.
Designed around real-robot reality: reset may drive to a home pose and block
on human confirmation; there is no guaranteed privileged success oracle.
Simulators are a stricter special case that opt into extra capabilities.
Per R1 (see the design doc): step() returns as soon as the command is issued
and does NOT block for the control period — the framework owns pacing — unless
the embodiment declares the "self_paced" capability.
EmbodimentInfo
dataclass
¶
EmbodimentInfo(name: str, action_space: Box, observation_space: ObservationSpace, control_hz: float | None = None, is_simulated: bool = False, capabilities: frozenset[Capability] = frozenset(), supported_setups: frozenset[str] = frozenset(), supported_target_kinds: frozenset[str] = frozenset())
Static description of an embodiment for compatibility checking + logging.
Embodiment
¶
Bases: Protocol
The robot/simulator contract.
Tasks & scenes¶
scene
¶
Scenes — the robotics analog of Inspect AI's Sample.
A Scene is one initial condition of a benchmark: a language instruction,
an optional success Target, an optional seed, and metadata. A benchmark
Task iterates over a dataset of scenes (e.g. 50 object layouts), repeated
epochs times.
Field mapping to Inspect: Sample(input, target, id, metadata, setup) ↔
Scene(instruction, target, id, metadata, setup, init_seed).
Target
dataclass
¶
A success specification the scorer reads. Embodiment-namespaced.
kind names what the embodiment must realize/evaluate (e.g.
"reach_object"); spec carries the parameters. Kept intentionally open
for the tracer; richer typed targets land with the scorer milestone.
task
¶
The Task — an embodiment-agnostic benchmark definition.
Mirrors Inspect AI's Task = dataset + scorer + epochs/reducer, adapted for
robotics: the dataset is a sequence of Scene initial
conditions and the rollout horizon (max_steps) and control rate live here.
Epochs
dataclass
¶
Repeat count plus the reducer used to combine per-epoch scores.
Mirrors Inspect's Epochs(count, reducer); reducer is a registered name
(default "mean").
Scoring¶
scorer
¶
Scoring: Scores, the Scorer protocol, epoch reducers, and builtin scorers.
Mirrors Inspect AI's @scorer/reducer split. A scorer maps a recorded
trajectory (+ the scene's Target) to a Score; an epoch reducer
collapses the per-epoch scores of one scene into a single score before metrics
aggregate across scenes.
Scorers consume the recorded trajectory (not a live environment), so scoring is reproducible from a saved log.
Score
dataclass
¶
The outcome a scorer assigns to one trajectory.
VLMScorer
¶
Reserved interface (R10): score from a VLM classifier over final frames.
Implemented in a later milestone; instantiating and calling it raises so the contract is visible but no half-baked behavior ships.
value_to_float
¶
Coerce a score value to a float for metric aggregation.
Source code in src/robolens/scorer.py
reduce_mode
¶
Most common raw value (works for categorical scores). Deterministic.
Source code in src/robolens/scorer.py
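A minimal sketch of a mode reducer, written only from the one-line description above. The function name `reduce_mode_sketch` is hypothetical, and the tie-break rule (the value seen first wins) is an assumption; the source only states that the reducer is deterministic.

```python
from collections import Counter
from typing import Hashable, Sequence


def reduce_mode_sketch(values: Sequence[Hashable]) -> Hashable:
    """Return the most common raw value; ties go to the first one seen (assumed)."""
    if not values:
        raise ValueError("cannot reduce an empty sequence of scores")
    # Counter.most_common orders equal counts by first occurrence, so the
    # result depends only on the input order, never on hash ordering.
    return Counter(values).most_common(1)[0][0]
```

Because the values are only counted, never averaged, this works for categorical scores such as "success" / "collision" as well as for numbers.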
pass_at_k
¶
Unbiased pass@k estimator over the epoch scores (success = value >= 0.5).
Source code in src/robolens/scorer.py
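For reference, the widely used unbiased pass@k estimator is 1 - C(n - c, k) / C(n, k), where n is the number of epochs and c the number of successes. The sketch below applies it with the success threshold stated above (value >= 0.5). The name `pass_at_k_sketch` and the argument layout are hypothetical; whether RoboLens computes it in exactly this form is an assumption.

```python
from math import comb
from typing import Sequence


def pass_at_k_sketch(epoch_values: Sequence[float], k: int) -> float:
    """Unbiased pass@k over per-epoch scores; an epoch passes if value >= 0.5."""
    n = len(epoch_values)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of epochs")
    c = sum(1 for v in epoch_values if v >= 0.5)
    # comb(n - c, k) is 0 when there are fewer than k failures, so the
    # estimate is exactly 1.0 whenever any draw of k epochs must hit a success.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With one success in four epochs, `pass_at_k_sketch([1, 0, 0, 0], 1)` evaluates to 0.25, the plain success rate, as expected for k = 1.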
Rollout, controllers & safety¶
rollout
¶
The rollout engine — the closed control loop at the heart of RoboLens.
One rollout runs a single trial (one scene, one epoch): it drives the
policy↔embodiment loop through the Controller
(open-loop chunk execution) and the Approver safety
gate, logging each step to the sinks, and returns an immutable
TrialRecord that scorers consume.
StepRecord
dataclass
¶
StepRecord(t: int, observation: Observation, action: Action, result: StepResult, image_refs: Mapping[str, FrameRef] | None = None)
One step of a recorded trajectory.
When a FrameStore is used, observation has its
images stripped and image_refs holds on-disk handles instead (R5).
TrialRecord
dataclass
¶
TrialRecord(scene_id: str, epoch: int, seed: int | None, steps: list[StepRecord] = list(), terminated: bool = False, truncated: bool = False, termination_reason: str | None = None, status: str = 'success', error: str | None = None, inference_latencies: list[float] = list(), operator_judgement: str | None = None, events: list[Event] = list())
The full record of one trial — the unit scorers consume.
derive_seed
¶
Deterministically combine eval/scene seeds and the epoch index (R2).
Distinct epochs of the same scene get distinct seeds so repeats actually vary
for stochastic policies, while a fixed (eval_seed, scene_seed, epoch)
reproduces bitwise.
Source code in src/robolens/rollout.py
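One way to get the properties described above (distinct seeds per epoch, identical seed for an identical triple) is to hash the three integers together. This is a sketch under assumptions: the name `derive_seed_sketch`, the use of SHA-256, and the 64-bit output width are all illustrative choices, not the library's implementation, and handling of a `None` seed is left out.

```python
import hashlib


def derive_seed_sketch(eval_seed: int, scene_seed: int, epoch: int) -> int:
    """Mix (eval_seed, scene_seed, epoch) into one reproducible 64-bit seed."""
    # A fixed textual encoding keeps the result stable across platforms and
    # Python versions, unlike the built-in hash().
    payload = f"{eval_seed}:{scene_seed}:{epoch}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big")
```

Hashing, rather than adding the three numbers, avoids accidental collisions such as (seed 0, epoch 1) and (seed 1, epoch 0) mapping to the same value.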
rollout
¶
rollout(policy: Policy, embodiment: Embodiment, scene: Scene, *, max_steps: int, seed: int | None, epoch: int, controller: Controller, approver: Approver, sink: LogSink, control_hz: float | None = None, frame_store: FrameStore | None = None) -> TrialRecord
Run a single trial and return its record.
Generic exceptions raised by the policy are wrapped as
PolicyError; by the embodiment as
EmbodimentFault. Already-typed RoboLens errors
(incl. SafetyAbort) propagate unchanged, so the
eval orchestrator can apply the correct continue-vs-halt policy.
Source code in src/robolens/rollout.py
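The wrapping rule above can be illustrated with a small helper. The exception classes here are local stand-ins that mirror the names in the errors module, and `call_policy_sketch` is a hypothetical helper; the real rollout loop may structure this differently.

```python
from typing import Any, Callable


class RoboLensError(Exception):
    """Local stand-in for the RoboLens base error."""


class PolicyError(RoboLensError):
    """Local stand-in: the policy raised during inference."""


class SafetyAbort(RoboLensError):
    """Local stand-in: an approver vetoed an action."""


def call_policy_sketch(fn: Callable[..., Any], *args: Any) -> Any:
    """Call policy code, wrapping only untyped exceptions as PolicyError."""
    try:
        return fn(*args)
    except RoboLensError:
        # Already-typed errors carry their own continue-vs-halt meaning.
        raise
    except Exception as exc:
        # Generic failures are attributed to the policy; the cause is kept.
        raise PolicyError(str(exc)) from exc
```

An embodiment call would follow the same pattern with EmbodimentFault in place of PolicyError.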
controller
¶
Controllers — the rollout middleware layer (Inspect's @solver analog).
A Controller owns the per-control-step decision of which action to send to the
embodiment. It internally decides when to call policy.act() (a slow
VLA inference returning an ActionChunk), buffers the
returned chunk, and pops the next action each step. This single-method, stateful
shape (R3) is what lets advanced controllers — e.g. a temporal-ensembling
controller that re-infers every step and blends overlapping predictions —
compose without forking the rollout loop.
DefaultController plays the first replan_interval actions of each chunk,
then re-infers (replan_interval=None ⇒ play the whole chunk before replanning).
Controller
¶
Bases: Protocol
Decides the next action to execute, calling the policy as needed.
DefaultController
¶
Open-loop chunk execution with periodic replanning.
Source code in src/robolens/controller.py
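The replanning rule (play the first replan_interval actions of each chunk, then re-infer) can be sketched with plain lists. `ChunkPlayerSketch`, its `next_action` method, and the `infer` callable are hypothetical names used for illustration; they are not the Controller protocol's actual signature.

```python
from typing import Callable, List, Optional


class ChunkPlayerSketch:
    """Buffer a predicted chunk and replay it open-loop, replanning when empty."""

    def __init__(self, infer: Callable[[int], List[float]],
                 replan_interval: Optional[int] = None) -> None:
        self._infer = infer                  # stand-in for a slow policy call
        self._replan_interval = replan_interval
        self._buffer: List[float] = []
        self.inference_steps: List[int] = []  # steps at which inference ran

    def next_action(self, t: int) -> float:
        if not self._buffer:
            chunk = self._infer(t)
            if not chunk:
                raise ValueError("policy returned an empty chunk")
            self.inference_steps.append(t)
            # None means: play the whole chunk before replanning.
            limit = len(chunk) if self._replan_interval is None else self._replan_interval
            self._buffer = list(chunk[:limit])
        return self._buffer.pop(0)
```

With a chunk length of 4 and replan_interval=2, inference runs at steps 0, 2, 4, and so on, and the last two actions of every chunk are discarded.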
SmoothingController
¶
SmoothingController(inner: Controller, alpha: float = 0.5)
Wrap another controller and exponentially smooth its action stream.
Demonstrates the middleware composition the single-method interface enables:
the wrapped controller owns inference/replanning while this layer applies an
exponential moving average (alpha toward the new action) on top. Only
valid for additive/continuous action spaces (the caller's responsibility).
Source code in src/robolens/controller.py
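The exponential moving average itself is a one-line update: smoothed = alpha * new + (1 - alpha) * previous. The sketch below applies it to plain lists of floats. `EmaSmootherSketch` is a hypothetical name, and passing the first action through unchanged is an assumption about initialization.

```python
from typing import List, Optional, Sequence


class EmaSmootherSketch:
    """Exponentially smooth a stream of action vectors."""

    def __init__(self, alpha: float = 0.5) -> None:
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
        self.alpha = alpha
        self._prev: Optional[List[float]] = None

    def smooth(self, action: Sequence[float]) -> List[float]:
        if self._prev is None:
            out = list(action)  # nothing to blend with yet (assumed behavior)
        else:
            # alpha weights the new action; (1 - alpha) weights the history.
            out = [self.alpha * a + (1.0 - self.alpha) * p
                   for a, p in zip(action, self._prev)]
        self._prev = out
        return out
```

alpha = 1.0 disables smoothing entirely, while alpha = 0.0 freezes the output at the first action.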
EnsemblingController
¶
EnsemblingController(action_space: Box, m: float = 0.1)
ACT/ALOHA-style temporal ensembling over overlapping action chunks.
Queries the policy every control step and blends, for the current step, the
predictions of all still-relevant recent chunks. A chunk queried at global
step q predicts step t via its action at index t - q (valid while
0 <= t - q < len(chunk)). Predictions are weighted exp(-m * i) with
i = 0 for the oldest contributing chunk (ALOHA's convention: older
predictions dominate, which smooths motion); larger m favors the oldest.
Only valid for additive action representations: the constructor refuses rotation reps and binary grippers that cannot be linearly averaged (R8).
Source code in src/robolens/controller.py
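The indexing and weighting rule above can be checked on a toy example. In this sketch, `chunks` maps each query step q to the list of action vectors that chunk predicted; the function name `ensemble_action_sketch` and this data layout are hypothetical, and normalizing the weights to sum to 1 is an assumption (it matches the usual ACT formulation).

```python
import math
from typing import Dict, List, Sequence


def ensemble_action_sketch(chunks: Dict[int, Sequence[Sequence[float]]],
                           t: int, m: float = 0.1) -> List[float]:
    """Blend every chunk's prediction for step t with weights exp(-m * i)."""
    # A chunk queried at step q predicts step t at index t - q.
    # Sorting by q puts the oldest contributing chunk first (i = 0).
    preds = [chunk[t - q] for q, chunk in sorted(chunks.items())
             if 0 <= t - q < len(chunk)]
    if not preds:
        raise ValueError(f"no chunk covers step {t}")
    weights = [math.exp(-m * i) for i in range(len(preds))]
    total = sum(weights)
    dim = len(preds[0])
    return [sum(w * p[d] for w, p in zip(weights, preds)) / total
            for d in range(dim)]
```

With m = 0 all contributing chunks are averaged equally; as m grows, the blend approaches the oldest chunk's prediction.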
approver
¶
The Approver — a safety gate between policy output and the embodiment.
Every action passes through Approver.review before embodiment.step. This
is the robotics analog of Inspect AI's ApprovalPolicy and is more
safety-critical: an approver may pass, clamp, or veto an action (a veto raises
SafetyAbort). In the tracer slice the default approver
passes everything through; clamping/operator approval land in rollout hardening.
Approver
¶
Bases: Protocol
Reviews an action before it reaches the embodiment.
May return the action unchanged, return a modified (e.g. clamped) action, or
raise SafetyAbort to halt the eval.
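A sketch of the three outcomes (pass, clamp, veto) on a plain list of floats. `ClampingApproverSketch`, its parameters, and the exact `review` signature are assumptions for illustration, and `SafetyAbort` is a local stand-in for the RoboLens error of the same name.

```python
from typing import List, Sequence


class SafetyAbort(Exception):
    """Local stand-in for the RoboLens SafetyAbort error."""


class ClampingApproverSketch:
    """Clamp each component to +/- limit; veto anything beyond veto_above."""

    def __init__(self, limit: float, veto_above: float) -> None:
        self.limit = limit
        self.veto_above = veto_above

    def review(self, action: Sequence[float]) -> List[float]:
        # A wildly out-of-range command is treated as unsafe, not clamped.
        if any(abs(a) > self.veto_above for a in action):
            raise SafetyAbort(f"action {list(action)} exceeds veto threshold")
        # In-range components pass unchanged; the rest are clamped.
        return [max(-self.limit, min(self.limit, a)) for a in action]
```

Separating the clamp limit from the veto threshold lets small overshoots be corrected silently while large ones stop the run.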
AutoApprover
¶
Approve every action unchanged (the permissive default).
frames
¶
FrameStore — rollout-owned streaming of camera frames to disk (R5).
A long multi-camera episode would exhaust memory if every frame were retained in
the TrialRecord. Instead the rollout streams frames to
disk through a FrameStore and keeps only lightweight FrameRef handles. This is
owned by the rollout, NOT by any log sink, so trajectories are
recorded (and scorable) independent of which optional sinks are enabled.
FrameRef
dataclass
¶
A handle to a camera frame stored on disk.
transcript
¶
A typed transcript of rollout events.
Each trial records an ordered stream of events (reset, inference, step, approval,
operator judgement, error). This is the robotics analog of Inspect AI's
transcript and is the data a results viewer renders. Events are deliberately
lightweight: a kind, the step index t (-1 for pre-loop events), and a
small data payload.
Event
dataclass
¶
One entry in a trial's transcript.
Compatibility & errors¶
compat
¶
Compatibility checking between a policy and an embodiment.
Before any rollout, RoboLens verifies that a (policy, embodiment) pair can
actually run together: the action spaces agree in dimension and semantics, the
embodiment provides every observation the policy requires (resolving a name
remap), the control rates are reconcilable (R1), and — given a task — every
scene is realizable on the embodiment (R7).
Hard mismatches are error issues that fail fast; soft ones are warnings.
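The error/warning split can be pictured as a list of issues with a severity. This is a simplified sketch: `Issue`, `check_pair_sketch`, and `is_ok` are hypothetical names, only two of the checks are shown, and treating a control-rate difference as a warning rather than an error is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Issue:
    severity: str  # "error" (hard mismatch) or "warning" (soft mismatch)
    message: str


def check_pair_sketch(policy_shape: Tuple[int, ...],
                      embodiment_shape: Tuple[int, ...],
                      policy_hz: Optional[float],
                      embodiment_hz: Optional[float]) -> List[Issue]:
    """Collect issues instead of raising, so a caller can report them all."""
    issues: List[Issue] = []
    if policy_shape != embodiment_shape:
        issues.append(Issue("error",
                            f"action shape {policy_shape} != {embodiment_shape}"))
    if (policy_hz is not None and embodiment_hz is not None
            and policy_hz != embodiment_hz):
        issues.append(Issue("warning",
                            f"control rate {policy_hz} Hz vs {embodiment_hz} Hz"))
    return issues


def is_ok(issues: List[Issue]) -> bool:
    """A pair is runnable when no issue is a hard error."""
    return not any(i.severity == "error" for i in issues)
```

Returning every issue at once, rather than stopping at the first, lets a user fix a misconfigured pair in a single pass.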
CompatibilityReport
dataclass
¶
The outcome of a compatibility check.
check_compatibility
¶
check_compatibility(policy: Policy, embodiment: Embodiment, task: Task | None = None, *, remap: dict[str, str] | None = None) -> CompatibilityReport
Return a structured compatibility report (does not raise).
Source code in src/robolens/compat.py
assert_compatible
¶
assert_compatible(policy: Policy, embodiment: Embodiment, task: Task | None = None, *, remap: dict[str, str] | None = None) -> CompatibilityReport
Check compatibility and raise CompatibilityError on hard errors.
Source code in src/robolens/compat.py
errors
¶
RoboLens error taxonomy.
The split below resolves the "fail fast vs never-crash-overnight" tension:
ConfigError/CompatibilityError are raised before any rollout: bad configuration should fail loudly and immediately.
PolicyError is recorded as a failed trial; whether it aborts the eval is governed by fail_on_error (Inspect semantics).
EmbodimentFault and SafetyAbort always halt the eval regardless of fail_on_error: a faulted or unsafe robot must never auto-advance to the next scene unattended.
RoboLensError
¶
Bases: Exception
Base class for all RoboLens errors.
ConfigError
¶
Bases: RoboLensError
Invalid task / policy / embodiment configuration. Fail fast.
CompatibilityError
¶
Bases: RoboLensError
A policy and embodiment are not compatible. Fail fast, before any rollout.
PolicyError
¶
Bases: RoboLensError
The policy raised during inference. Recorded as a failed trial.
EmbodimentFault
¶
Bases: RoboLensError
The embodiment/robot faulted. Always halts the eval and requires a human.
SafetyAbort
¶
Bases: RoboLensError
An approver vetoed an action / e-stop. Always halts the eval.
Evaluation & logs¶
eval
¶
The eval() entry point — orchestrates scenes x epochs into an EvalLog.
Mirrors Inspect AI's eval(): it runs a task's scenes (repeated over epochs),
scores each recorded trajectory, reduces epochs, aggregates metrics, and returns
a list of immutable EvalLog objects (one per task). The tracer
slice accepts already-constructed objects; registry-string resolution
(policy="openvla/7b") is layered on with the registry milestone.
eval
¶
eval(task: Task | str, policy: Policy | str, embodiment: Embodiment | str, *, log_dir: str = 'logs', sinks: list[LogSink] | None = None, seed: int | None = 0, fail_on_error: bool | float = False, controller: Controller | None = None, approver: Approver | None = None, remap: dict[str, str] | None = None, store_frames: bool = False) -> list[EvalLog]
Run task with policy on embodiment; return [EvalLog].
task/policy/embodiment may be objects or registry names
(e.g. policy="scripted"), resolved through the registry — the Inspect-style
ergonomic that keeps logs and the CLI reproducible.
fail_on_error follows Inspect semantics for PolicyError (True =
fail on first, False = never, 0<x<1 = proportion, x>1 = count).
EmbodimentFault/SafetyAbort always halt regardless.
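The four fail_on_error modes can be written as one predicate. This is a sketch under assumptions: `should_abort_sketch` is a hypothetical helper, thresholds are treated as "more than" (as in Inspect AI's documentation), and values at or outside the stated ranges (for example exactly 1) are treated as counts because the source does not specify them.

```python
from typing import Union


def should_abort_sketch(fail_on_error: Union[bool, float],
                        n_errors: int, n_total: int) -> bool:
    """Decide whether accumulated PolicyErrors should abort the eval."""
    # bool is a subclass of int, so it must be tested before the numeric cases.
    if fail_on_error is True:
        return n_errors > 0
    if fail_on_error is False:
        return False
    threshold = float(fail_on_error)
    if 0.0 < threshold < 1.0:
        # Proportion mode: abort once the error rate exceeds the threshold.
        return n_total > 0 and n_errors / n_total > threshold
    # Count mode: abort once more than `threshold` trials have errored.
    return n_errors > threshold
```

EmbodimentFault and SafetyAbort never reach this predicate; they halt the run unconditionally.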
When store_frames is set, camera frames are streamed to
<log_dir>/frames as binary side-cars (R5) rather than kept in memory.
Raises CompatibilityError (fail fast, before any
rollout) if the policy and embodiment are incompatible.
Source code in src/robolens/eval.py
eval_set
¶
eval_set(tasks: Task | str | Sequence[Task | str], policy: Policy | str, embodiment: Embodiment | str, *, log_dir: str = 'logs', seed: int | None = 0, fail_on_error: bool | float = False, controller: Controller | None = None, approver: Approver | None = None, remap: dict[str, str] | None = None, store_frames: bool = False, retry_attempts: int = 0) -> tuple[bool, list[EvalLog]]
Run a set of tasks and return (success, logs) (mirrors Inspect AI).
success is True iff every task's log has status == "success".
Resumption of a partially-completed run (skipping already-finished scenes via
a stable run id) is reserved for a follow-up: retry_attempts is accepted
now so the signature need not change later, but it is not yet honored.
Source code in src/robolens/eval.py
log
¶
The immutable evaluation log — RoboLens's reproducible record of a run.
Mirrors Inspect AI's EvalLog: version + status + eval spec +
results + stats + per-scene samples + error. Serialized to JSON
with a schema version so newer RoboLens always reads older logs (a read-back
guarantee enforced by golden tests in a later step).
EvalSpec
dataclass
¶
EvalSpec(task: str, policy: str, embodiment: str, created: str, robolens_version: str, git_commit: str | None = None, policy_config: dict[str, Any] = dict(), embodiment_info: dict[str, Any] = dict(), seed: int | None = None)
Top-level identity of an eval: what was run, with what, when.
EvalStats
dataclass
¶
EvalStats(started_at: str, completed_at: str, duration_s: float, total_steps: int, mean_inference_latency_s: float | None = None, frames_dir: str | None = None)
Timing and execution statistics for a run.
SceneResult
dataclass
¶
SceneResult(scene_id: str, status: str, reduced: dict[str, float] = dict(), epochs: list[dict[str, float]] = list(), error: str | None = None)
Per-scene result: the reduced score(s) plus the raw per-epoch scores.
EvalResults
dataclass
¶
Aggregate results across all scenes.
EvalLog
dataclass
¶
EvalLog(version: int, status: str, eval: EvalSpec, results: EvalResults, stats: EvalStats, samples: list[SceneResult] = list(), error: str | None = None)
The full record returned by eval and persisted to disk.
Logging sinks¶
sink
¶
The LogSink protocol and a no-op base implementation.
A sink observes a run's lifecycle. The rollout engine and eval() call these
hooks in a fixed order: on_eval_start → (per trial: on_trial_start →
log_step* → on_trial_end) → on_eval_end.
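The hook order can be made concrete with a sink that only records which hooks fired. The hook names come from the text above; their argument lists are not documented in this section, so the no-argument and single-argument signatures below, as well as `RecordingSinkSketch` and `drive_sketch`, are assumptions.

```python
from typing import List


class RecordingSinkSketch:
    """A sink that remembers the order in which its hooks were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def on_eval_start(self) -> None:
        self.calls.append("on_eval_start")

    def on_trial_start(self, trial: int) -> None:
        self.calls.append("on_trial_start")

    def log_step(self, t: int) -> None:
        self.calls.append("log_step")

    def on_trial_end(self, trial: int) -> None:
        self.calls.append("on_trial_end")

    def on_eval_end(self) -> None:
        self.calls.append("on_eval_end")


def drive_sketch(sink: RecordingSinkSketch, n_trials: int, n_steps: int) -> None:
    """Call the hooks in the documented lifecycle order."""
    sink.on_eval_start()
    for trial in range(n_trials):
        sink.on_trial_start(trial)
        for t in range(n_steps):
            sink.log_step(t)
        sink.on_trial_end(trial)
    sink.on_eval_end()
```

A sink like this is also a convenient test double for checking that an orchestrator honors the lifecycle.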
json_log
¶
The canonical JSON eval-log sink.
Writes the immutable EvalLog to log_dir once the run
finishes. The write is atomic (temp file + os.replace) so an interrupted
overnight run never leaves a half-written log.
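The temp file + os.replace pattern looks roughly like this with the standard library. `write_json_atomic_sketch` is a hypothetical helper, and the fsync call and cleanup on failure are illustrative choices, not a description of the sink's actual code.

```python
import json
import os
import tempfile
from typing import Any, Dict


def write_json_atomic_sketch(path: str, payload: Dict[str, Any]) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file must live in the same directory (same filesystem) so that
    # os.replace is a rename rather than a copy.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Never leave a stray temp file behind on failure or interruption.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies before os.replace, the target path is untouched; if it dies after, the target already holds the full new content.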
rerun_sink
¶
Optional Rerun visualization sink.
Logs camera images, proprioception, action vectors, and success markers to a
Rerun (https://github.com/rerun-io/rerun) recording. rerun-sdk is imported
lazily inside methods so the core package never depends on it; if it is not
installed, the sink warns once and becomes a no-op (so unattended runs and the
core-only import gate are unaffected).
Install with pip install "robolens[rerun]".
RerunSink
¶
RerunSink(recording_path: str | None = None, *, application_id: str = 'robolens', spawn: bool = False)
Stream a rollout to a Rerun recording (.rrd) or a live viewer.
Source code in src/robolens/logging/rerun_sink.py
Registry & CLI¶
registry
¶
Registry and decorators for tasks, policies, embodiments, scorers, and sinks.
Mirrors Inspect AI's extension model: components register by name via decorators
and are resolved from strings (so eval(policy="scripted") and the CLI work).
Out-of-tree packages publish components through importlib.metadata entry-point
groups, so an installed robolens-openvla appears in robolens list without
being imported first.
Entry-point groups:
robolens.tasks, robolens.policies, robolens.embodiments,
robolens.scorers, robolens.sinks.
register
¶
Register a factory under kind/name (defaults to its __name__).
Source code in src/robolens/registry.py
task
¶
policy
¶
embodiment
¶
scorer
¶
sink
¶
registered
¶
Return all registered factories for kind (builtins + plugins).
Source code in src/robolens/registry.py
resolve
¶
Construct a registered component by name with the given keyword args.
Source code in src/robolens/registry.py
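A decorator-based registry of this kind fits in a few lines. The sketch below is a self-contained stand-in, not the RoboLens module: the function names mirror register / registered / resolve for readability, but the signatures are assumed and entry-point discovery of plugins is omitted.

```python
from typing import Any, Callable, Dict, Optional

# kind -> name -> factory. A module-level dict keeps the sketch short.
_REGISTRY: Dict[str, Dict[str, Callable[..., Any]]] = {}


def register(kind: str, name: Optional[str] = None) -> Callable:
    """Decorator: store a factory under kind/name (default: its __name__)."""
    def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
        _REGISTRY.setdefault(kind, {})[name or factory.__name__] = factory
        return factory  # the decorated object stays usable directly
    return decorator


def registered(kind: str) -> Dict[str, Callable[..., Any]]:
    """Return a copy of all factories registered for a kind."""
    return dict(_REGISTRY.get(kind, {}))


def resolve(kind: str, name: str, **kwargs: Any) -> Any:
    """Construct a registered component by name with keyword arguments."""
    try:
        factory = _REGISTRY[kind][name]
    except KeyError:
        raise KeyError(f"no {kind} registered as {name!r}") from None
    return factory(**kwargs)
```

Resolving by string is what lets a log or a command line name a component without importing its class.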
cli
¶
The robolens command-line interface.
Subcommands:
robolens list [tasks|policies|embodiments|scorers|sinks]
Show registered components (builtins + installed plugins).
robolens run --task T --policy P --embodiment E
Run an eval, resolving components from the registry. Pass constructor args with -T/-P/-E k=v.
Mock world¶
cubepick
¶
CubePick — a deterministic 2D toy world for exercising the full stack.
A point end-effector in the unit square must reach a cube. The action is a 2D
end-effector position delta. Success is declared (and exposed as privileged
info["success"]) when the effector is within goal_radius of the cube.
Fully deterministic given a seed; no third-party dependencies.
CubePickEmbodiment
¶
CubePickEmbodiment(*, max_step: float = 0.1, goal_radius: float = 0.05, start: tuple[float, float] = (0.1, 0.1))
A 2D reach-the-cube simulator.
Source code in src/robolens/mock/cubepick.py
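The core dynamics of such a world fit in one function. This is a sketch built from the description and the constructor defaults above; `step_point_sketch` is a hypothetical name, and the details (scaling oversized deltas to max_step, clipping to the unit square, and treating a distance equal to goal_radius as success) are assumptions.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def step_point_sketch(pos: Point, delta: Point, cube: Point,
                      max_step: float = 0.1,
                      goal_radius: float = 0.05) -> Tuple[Point, bool]:
    """Apply a 2D position delta and report whether the cube was reached."""
    dx, dy = delta
    norm = math.hypot(dx, dy)
    if norm > max_step:
        # Scale the delta down so no single step exceeds max_step (assumed).
        dx, dy = dx * max_step / norm, dy * max_step / norm
    # Keep the effector inside the unit square (assumed).
    x = min(1.0, max(0.0, pos[0] + dx))
    y = min(1.0, max(0.0, pos[1] + dy))
    success = math.hypot(cube[0] - x, cube[1] - y) <= goal_radius
    return (x, y), success
```

Starting from the default (0.1, 0.1), a cube at (0.1, 0.3) is reached after two full-length steps straight up.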
policies
¶
Mock policies for the CubePick world.
ScriptedPolicy: a deterministic oracle that walks the effector to the cube. It predicts a full action chunk by simulating its own future motion, so the chunk is a genuine open-loop trajectory (H > 1).
RandomPolicy: emits random deltas; mostly fails.
NoopPolicy: emits zero actions; never succeeds.
ScriptedPolicy
¶
Deterministic oracle: walk straight to the cube, in chunks.
Source code in src/robolens/mock/policies.py
RandomPolicy
¶
Emit random small deltas. Deterministic given the construction seed.
Source code in src/robolens/mock/policies.py
NoopPolicy
¶
Emit zero actions; never moves.