Combined-score weights

RobotArena Language-Following Results

All numbers in this viewer come from results.json, which is auto-generated from results/ra_eval_*/ by viewer/build.py. To add a new model, drop a new ra_eval_<date>/<session>/<ckpt>/<scene>/{results.yaml,episode_log.yaml} tree on disk and reload — no registration step.
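The directory convention above can be sketched as a short discovery routine. This is an illustration only, not the code in viewer/build.py: the function name discover_runs and the keys of the returned dicts are assumptions made for the example.

```python
from pathlib import Path


def discover_runs(results_root):
    """Find every <scene> directory that holds both expected YAML files.

    Hypothetical helper: viewer/build.py may walk the tree differently.
    """
    runs = []
    # Pattern mirrors results/ra_eval_<date>/<session>/<ckpt>/<scene>/
    for scene_dir in sorted(Path(results_root).glob("ra_eval_*/*/*/*")):
        if not scene_dir.is_dir():
            continue
        has_results = (scene_dir / "results.yaml").is_file()
        has_log = (scene_dir / "episode_log.yaml").is_file()
        if has_results and has_log:
            campaign, session, ckpt, scene = scene_dir.parts[-4:]
            runs.append(
                {"campaign": campaign, "session": session, "ckpt": ckpt, "scene": scene}
            )
    return runs
```

Under this sketch a new model shows up as soon as its tree exists on disk, which matches the "no registration step" behaviour described above.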

How to read this: the 3 setups · the control rules · the 5 scores

Every model is the same kind of robot policy (a pi-zero arm controller with a vision-language brain, either PaliGemma or Qwen3-VL; see §1). They differ in how the instruction reaches the arm. There are three setups:

Flat: bare action policy
camera + full task → action policy → arm moves
No planner. The whole task ("put the spoon on the towel") goes straight in. The simplest setup; our baseline.

Two-model: external planner
camera + full task → VLM planner → "grasp the spoon" → action policy
Two networks. A vision-language model writes the next sub-step in words; a separate action policy carries it out.

Hierarchical: one backbone
camera + full task → ONE model (writes the sub-step and moves the arm) → arm moves
A single Qwen3-VL model does both jobs: it plans the next sub-step and drives the arm. No second network.

Test-time control rules — hand-written safety nets bolted on when the robot runs (no extra AI). They patch the most common mistakes a model makes when following its own sub-steps.

These target the grasp → carry → place family: they help pick-and-place, stacking, put-in-container and multi-object tasks. They do little or nothing for push / reach / pour. Each model row shows which rules it used as labelled chips.

The bold tag on each card (e.g. grasp→place) is exactly the chip shown on each model row below — so you can read a row without memorising anything.

grasp→place · Auto-advance after grasp
The instant it grabs, the sub-step flips from "grasp the spoon" to "place the spoon", so it stops re-grabbing what it already holds.
Helps: pick-and-place · stack · put-in

anti-drop · Ignore open-flickers
A brief gripper-open blip is ignored: it must stay open 4 steps to really let go. (The sim's dark gripper is easily misread as empty.)
Helps: any grasping task

name object · Name the exact object
A vague sub-step "grasp the thing" is rewritten to "grasp the blue spoon".
Helps: any task

multi-object · Move to the next object
On "spoon AND fork on the plate", once the spoon is placed it moves on to the fork instead of fixating on the spoon.
Helps: multi-object (long-horizon)

verify grasp · Check & retry the grasp
Right after grabbing, it confirms it's holding the object the sub-step asked for; if it grabbed the wrong thing it drops and retries (up to 2×).
Helps: grasping tasks

loop guard · No infinite loops
A safety stop so it can never loop forever repeating one sub-step.
Helps: all tasks
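As an illustration of how small these rules are, the anti-drop rule can be sketched as a debounce counter. This is a minimal sketch, not the project's implementation: the class name OpenDebounce, its update method, and the reading of "stay open 4 steps" as "release is confirmed on the 4th consecutive open reading" are all assumptions.

```python
class OpenDebounce:
    """Treat the gripper as released only after N consecutive open readings."""

    def __init__(self, required_open_steps=4):
        self.required_open_steps = required_open_steps
        self.open_streak = 0

    def update(self, gripper_open):
        """Feed one per-step reading; return True once a release is confirmed."""
        if gripper_open:
            self.open_streak += 1
        else:
            self.open_streak = 0  # any closed reading resets the count
        return self.open_streak >= self.required_open_steps


# A one-step flicker (second reading) is ignored; four opens in a row count.
debounce = OpenDebounce()
readings = [False, True, False, True, True, True, True]
released = [debounce.update(r) for r in readings]
# released == [False, False, False, False, False, False, True]
```

The other rules follow the same pattern: a few lines of state kept alongside the policy, with no extra model involved.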

The five score columns

Default: the canonical task, worded plainly. Example: "put the spoon on the towel"
Perturbed: the same task, reworded with typos or synonyms; tests language robustness. Example: "place the utensil on the cloth"
Long-horizon: two things in one instruction; scored on the final state with partial credit. Example: "put the spoon AND fork on the plate"
Fine-grained: single motor atoms scored on their own: reach · grasp · lift · move · place
SimplerEnv: a different simulator (4 widowx tasks), used as an independent cross-check of the RobotArena numbers.

Training data (the "what it learned from")

Baseline: the raw Bridge instruction only.
Sub-step: adds short per-moment sub-steps ("grasp the spoon" → "lift it" → "place it").
Repaired sub-step: the cleaned-up sub-step labels (fixed the dropped "place" phase, wrong object names, etc.).
Each section's open/closed state is remembered across reloads.

1. Models

Every model is a pi-zero FM policy fine-tuned on Bridge v2 (Berkeley RAIL teleop data, 60,119 trajectories). The variants differ in (a) the language supervision used at training time and (b) the underlying VL backbone (PaliGemma-3B vs Qwen3-VL-2B).

The Training data column describes the exact form of language each model saw during training — paraphrasing, subtask decomposition, structured properties, lexical perturbations, etc. Click "Real samples from training data" on any row to see verified samples pulled directly from the dataset's tread.subtask_commands / nils.paraphrases column for episode 0 (the "put clothes in laundry machine" trajectory). All models are evaluated on the same RobotArena scenes, so cross-model comparison is apples-to-apples.

2. Perturbation Types

Each of the 9 perturbation types rewrites the instruction text only — the scene, objects, target, and success criterion are identical to the default run.

The 9 types above are the canonical-scene perturbations. The long-horizon and fine-grained benchmark tasks use their own paraphrase corpus (configs/task_paraphrases.yaml) — expand below to see every perturbation instruction actually given for each task (env 16+; env 0–15 run the canonical wording).

3. Results

Scene set: hide scenes where the best model's perturbed rate is below a chosen threshold (%).

A scene is hidden only when no model can reach the threshold on perturbed instructions — a proxy for "this scene is too hard / broken / out-of-distribution". Everything in §3 then re-aggregates over only the kept scenes. Default: 0% (no filtering).
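The filter rule can be written down in a few lines. This is a sketch under assumptions: the function name kept_scenes and the nested-dict input shape are invented for the example and are not taken from results.json.

```python
def kept_scenes(perturbed_rates, threshold_pct=0.0):
    """Return the scenes where at least one model reaches the threshold.

    perturbed_rates: {scene: {model: perturbed success rate in percent}}
    (hypothetical shape, chosen for the example).
    """
    kept = []
    for scene, per_model in perturbed_rates.items():
        best = max(per_model.values(), default=0.0)
        if best >= threshold_pct:  # hidden only when the best model is below it
            kept.append(scene)
    return kept


rates = {
    "scene_a": {"m1": 10.0, "m2": 40.0},
    "scene_b": {"m1": 0.0, "m2": 5.0},
}
print(kept_scenes(rates))                      # default 0%: nothing is hidden
print(kept_scenes(rates, threshold_pct=20.0))  # scene_b is hidden
```

With the default threshold of 0% every scene passes, which is why the default setting applies no filtering.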

All models are evaluated on the same 34 scenes (4 default + 30 perturbation), so this table is already an apples-to-apples comparison.

  • Default % — mean success rate when the policy receives the original Bridge instruction. Averaged over visible scenes (4 default scenes + 30 perturbation scenes, all with their canonical instruction).
  • Default scenes — how many scenes contributed to the Default mean (e.g. 34/34 means full coverage). Hover for total episode count + eps/scene.
  • Perturbed % — mean success rate over the same scenes but with each instruction replaced by 27 paraphrases (3 paraphrases × 9 perturbation types). Episode-weighted from the per-perturbation breakdown when available.
  • Pert. scenes — same as Default scenes but for the perturbation eval.
  • Δ (pp) — Perturbed − Default in percentage points. Negative = the model loses ground on perturbed instructions; closer to 0 = more language-robust.
  • Combined % — single ranking score, computed as a weighted sum across the components listed in the Combined-score weights panel at the top of the page (default = 0.4·Default + 0.6·Perturbed; you can add SimplerEnv weight on the fly). Per-model components that are missing get renormalised out so a model without SimplerEnv data is still scored on the components it has. Use this column to pick the best overall model under the weighting you care about.
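The Combined % and Δ columns can be sketched as follows. The component keys ("default", "perturbed", "simplerenv") and the function names are assumptions for the example, and renormalisation is interpreted here as dividing by the sum of the weights of the components that are present; the viewer's own code may differ in detail.

```python
def combined_score(components, weights):
    """Weighted mean over the components a model actually has.

    components: {"default": 50.0, "perturbed": 30.0, "simplerenv": None}
    weights:    {"default": 0.4, "perturbed": 0.6, "simplerenv": 0.0}
    Missing (None) components are dropped and the remaining weights are
    renormalised so they sum to 1.
    """
    available = {
        key: w
        for key, w in weights.items()
        if w > 0 and components.get(key) is not None
    }
    total = sum(available.values())
    if total == 0:
        return None  # nothing to score
    return sum(w * components[key] for key, w in available.items()) / total


def delta_pp(default_pct, perturbed_pct):
    """Perturbed minus Default, in percentage points."""
    return perturbed_pct - default_pct


default_weights = {"default": 0.4, "perturbed": 0.6}
model = {"default": 50.0, "perturbed": 30.0, "simplerenv": None}
print(combined_score(model, default_weights))  # 0.4*50 + 0.6*30
print(delta_pp(50.0, 30.0))                    # negative: loses ground when perturbed
```

Because a missing component is dropped rather than counted as zero, adding SimplerEnv weight does not penalise a model that has no SimplerEnv data.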

Mean of per-model means within each family (PaliGemma, Qwen3-VL, HLC+VLA). The n models columns show how many family members have data for each eval type; the means are taken over those models only.
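A minimal sketch of the family aggregation, assuming each model has a single rate per eval type (or None when it has no data) and a known family label. The function name and input shapes are hypothetical.

```python
from statistics import mean


def family_means(model_rates, family_of):
    """Return {family: (mean of per-model means, n models with data)}.

    model_rates: {model: rate in percent, or None if the model has no data}
    family_of:   {model: family label}
    """
    grouped = {}
    for model, rate in model_rates.items():
        if rate is None:
            continue  # models without data for this eval type are excluded
        grouped.setdefault(family_of[model], []).append(rate)
    return {fam: (mean(rates), len(rates)) for fam, rates in grouped.items()}


rates = {"a": 40.0, "b": 60.0, "c": None, "d": 10.0}
families = {"a": "Qwen3-VL", "b": "Qwen3-VL", "c": "Qwen3-VL", "d": "PaliGemma"}
print(family_means(rates, families))
```

Each model counts once regardless of how many episodes it ran, so a family's mean is not dominated by its most heavily evaluated member.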

RobotArena's 30 perturbation scenes (+ 4 default scenes) decompose into 7 task types, classified by the structure of the original Bridge instruction. This tab breaks down each model's perturbed success rate per task type — use it to find which capabilities each model has and which it lacks.

    Each cell shows the model's mean perturbed success rate across the scenes in that task-type bin. Color-coded green (best in column) → red (worst). Sort by any column. The bottom row aggregates across all models per task type to show which capabilities are universally hard.

    Each row is one scene; metrics are aggregated across all models for that scene. Use this view to find the hardest and easiest scenes, and to spot scenes where perturbations cause an unusually large drop.

    • Default % — mean success rate across all models when given the original Bridge instruction. Low = the scene is intrinsically hard (physics / object placement / success-checker stringency); high = most models can solve it under ideal conditions.
    • Perturbed % — mean success rate across all models with the 27-paraphrase perturbed instructions. Low = the scene is hard and/or language-sensitive.
    • Combined % — weighted sum of scene-level Default and Perturbed means using the active weights panel (SimplerEnv has no per-scene data so it's auto-dropped here). Sort ascending to see the hardest scenes first.
    • Δ (pp) — Perturbed − Default. Large negative = scene is much harder under perturbation (language-sensitive); near-zero = perturbations don't hurt much.
    • Range — max − min perturbed rate across models. Large = some models can solve it but others can't (more model-discriminative); small = all models perform similarly (uninformative scene).
    • Best / Worst model — which model has the highest / lowest perturbed rate on this scene.
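The per-scene columns above reduce to a handful of aggregates over the models. The sketch below assumes two dicts per scene mapping a model alias to its success rate in percent; the function name and the output keys are invented for the example.

```python
def scene_summary(default_by_model, perturbed_by_model):
    """Aggregate one scene across models (rates in percent)."""
    default_mean = sum(default_by_model.values()) / len(default_by_model)
    perturbed_mean = sum(perturbed_by_model.values()) / len(perturbed_by_model)
    best = max(perturbed_by_model, key=perturbed_by_model.get)
    worst = min(perturbed_by_model, key=perturbed_by_model.get)
    return {
        "default": default_mean,
        "perturbed": perturbed_mean,
        "delta_pp": perturbed_mean - default_mean,  # Perturbed minus Default
        "range": perturbed_by_model[best] - perturbed_by_model[worst],  # max minus min
        "best": best,
        "worst": worst,
    }


summary = scene_summary(
    {"m1": 80.0, "m2": 60.0},
    {"m1": 50.0, "m2": 30.0},
)
print(summary)
```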

    Each cell is the per-model mean success rate over scenes for a single perturbation type (averaging the 3 paraphrases per type, then over scenes). The breakdown comes from walking each campaign's episode_log.yaml files via import_campaign.py / backfill_breakdown.py; models whose original campaign episode logs are no longer on disk show no value.
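The two-stage averaging (paraphrases first, then scenes) can be sketched as below for a single model. The episode tuple layout is an assumption made for the example; the real episode_log.yaml schema and the logic in import_campaign.py / backfill_breakdown.py are not reproduced here.

```python
from collections import defaultdict
from statistics import mean


def per_type_rates(episodes):
    """episodes: iterable of (scene, perturbation_type, paraphrase_id, success)."""
    # Step 1: success rate of each (scene, type, paraphrase).
    by_paraphrase = defaultdict(list)
    for scene, ptype, paraphrase, success in episodes:
        by_paraphrase[(scene, ptype, paraphrase)].append(1.0 if success else 0.0)

    # Step 2: average the paraphrases within each (scene, type).
    by_scene_type = defaultdict(list)
    for (scene, ptype, _), outcomes in by_paraphrase.items():
        by_scene_type[(scene, ptype)].append(mean(outcomes))

    # Step 3: average over scenes, giving one rate per perturbation type.
    by_type = defaultdict(list)
    for (_, ptype), rates in by_scene_type.items():
        by_type[ptype].append(mean(rates))
    return {ptype: 100.0 * mean(rates) for ptype, rates in by_type.items()}


episodes = [
    ("s1", "typo", 0, True), ("s1", "typo", 0, True),
    ("s1", "typo", 1, False),
    ("s1", "typo", 2, True),
    ("s2", "typo", 0, False),
]
print(per_type_rates(episodes))
```

Averaging in stages weights every scene equally, whereas an episode-weighted mean would favour scenes with more episodes; the two agree only when episode counts are balanced.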

    4. Episode Video Viewer

    Side-by-side episode playback with synchronized controls (press play on either and both play; seek one and both seek). In RobotArena mode the panels show one default-instruction episode (left) and one perturbed-instruction episode (right) for the same (model, scene); the scene dropdown marks scenes that also have a default video. In SimplerEnv mode the panels show a successful episode (left) and a failed episode (right) for the same (model, widowx task).

    Coverage: Videos appear here once their main.mp4 files exist locally under results/ra_eval_*/<sess>/<ckpt>/<scene>/episodes/episode_*/main.mp4. Run viewer/fetch_videos.sh to pull them from the compute clusters (hala/sof1). The index is auto-rebuilt by viewer/build.py on each viewer reload.

    Charts & Cross-Model Analysis

    Visual summaries computed live from the same results.json that drives the tables under Eval Results. Every model row is a real RobotArena + SimplerEnv campaign — nothing is fabricated. Hover any data point for the exact alias and rate.

    Per-model scoreboard — every metric, every model

    The same numbers that drive §3 of Eval Results, but in chart form for at-a-glance comparison. Each row is one model; each colored bar is one metric. Hover any bar for the exact percentage and the underlying eps count. Sort by the metric you care about most.

    Per-perturbation matrix — how each model handles each rewrite type

    Color-coded grid of success rates per (model × perturbation type). 9 perturbation types rewrite the instruction in different ways while keeping the scene identical (verb synonym, color reference, typo, verbose, etc. — see §2). Bright = high success; dark = low. Hover any cell for the exact percentage + paraphrase examples.

    Annotator size effect — Qwen3-VL 32B vs 8B

    Same training recipe (controlled-paraphrase), same checkpoint step, same RobotArena scenes. The only thing that changes between paired bars is which Qwen3-VL variant was used to generate the language paraphrases at dataset-construction time. Use this to read off whether spending more compute on the annotator translates into a stronger policy.

    RobotArena vs SimplerEnv — do they tell the same story?

    Three scatter plots side-by-side, one per RobotArena flavor (Default / Perturbed / Combined), all plotted against the same SimplerEnv-widowx overall success on the Y axis. Each dot is one trained checkpoint. The plain-English takeaway lives in the caption beneath: it spells out which RA flavor SimplerEnv tracks most strongly, which dots disagree the most, and what the spread of r / ρ implies about benchmark substitutability.

    RobotArena internal consistency — Default vs Perturbed

    A separate sanity scatter: RA Default % on X vs RA Perturbed % on Y, one dot per model. If the two RA flavors are nearly equivalent signals (r ≈ 1), Default eval alone is a fine cheap proxy; if they diverge, you genuinely need both evals.
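For reference, the two correlation statistics quoted in these scatter captions (Pearson r and Spearman ρ) can be computed without any dependencies. This is a generic sketch of the standard definitions, not the viewer's own code; the function names are chosen for the example, and the inputs must not be constant (zero variance makes r undefined).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


default_pct = [10.0, 20.0, 30.0, 40.0]    # illustrative values only
perturbed_pct = [12.0, 25.0, 33.0, 80.0]  # illustrative values only
print(pearson_r(default_pct, perturbed_pct), spearman_rho(default_pct, perturbed_pct))
```

When ρ is close to 1 but r is noticeably lower, the two evals rank the models the same way even though the relationship between them is not linear.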

    Training-step progression

    For model lines where we have multiple checkpoint steps on disk, this connects them so you can see whether more training keeps helping. A flat or declining line at higher step counts is the typical sign of overtraining or curriculum saturation.

    Rollout Debugger

    One workspace for understanding what the model actually did on each scene. Pick filters, watch the focus rollout, then jump across "same scene other models", "same model all instructions", "same instruction other scenes" strips below to spot the failure mode. Pin 2+ rollouts to the compare lane for synchronized side-by-side playback.

    Annotation Explorer WIP — unified view

    One episode-centric view over every annotation source. Pick a source (a baked dataset variant or a subtask-annotation run), type an episode #, and inspect the bridge video with that source's annotation — the subtask timeline overlay for runs, the paraphrase pool for variants — alongside the source's metrics. Same data as the two Annotation Quality tabs (nothing recomputed).
