All numbers in this viewer come from results.json, which is auto-generated
from results/ra_eval_*/ by viewer/build.py. To add a new model,
drop a new ra_eval_<date>/<session>/<ckpt>/<scene>/{results.yaml,episode_log.yaml}
tree on disk and reload — no registration step.
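The layout above can be walked with a single glob. The following is a minimal sketch of that discovery step, not the actual code in viewer/build.py; the function name `discover_scene_dirs` and the returned field names are hypothetical.

```python
from pathlib import Path


def discover_scene_dirs(results_root):
    """Return one record per scene directory that contains a results.yaml.

    Assumes the layout ra_eval_<date>/<session>/<ckpt>/<scene>/ described
    above; the record keys are illustrative, not the real results.json schema.
    """
    root = Path(results_root)
    found = []
    for results_file in sorted(root.glob("ra_eval_*/*/*/*/results.yaml")):
        scene_dir = results_file.parent
        eval_dir, session, ckpt, scene = scene_dir.relative_to(root).parts
        found.append({
            "date": eval_dir[len("ra_eval_"):],
            "session": session,
            "ckpt": ckpt,
            "scene": scene,
            "has_episode_log": (scene_dir / "episode_log.yaml").exists(),
        })
    return found
```

Because discovery is purely path-based, a new tree is picked up as soon as it exists on disk, which is consistent with there being no registration step.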
Every model is the same kind of robot policy (a pi-zero arm controller with a Qwen3-VL vision-language brain). They differ in how the instruction reaches the arm. There are three setups:
These target the grasp → carry → place family: they help pick-and-place, stacking, put-in-container and multi-object tasks. They do little or nothing for push / reach / pour. Each model row shows which rules it used as labelled chips.
The bold tag on each card (e.g. grasp→place) is exactly the chip shown on each model row below — so you can read a row without memorising anything.
Every model is a pi-zero FM policy fine-tuned on Bridge v2 (Berkeley RAIL teleop
data, 60,119 trajectories). The variants differ in (a) the language supervision used at training
time and (b) the underlying VL backbone (PaliGemma-3B vs Qwen3-VL-2B).
The Training data column describes the exact form of language each model
saw during training — paraphrasing, subtask decomposition, structured properties, lexical
perturbations, etc. Click "Real samples from training data" on any row to see verified samples
pulled directly from the dataset's tread.subtask_commands /
nils.paraphrases column for
episode 0 (the "put clothes in laundry machine" trajectory). All models are evaluated on
the same RobotArena scenes, so cross-model comparison is apples-to-apples.
Each of the 9 perturbation types rewrites the instruction text only — the scene, objects, target, and success criterion are identical to the default run.
The 9 types above are the canonical-scene perturbations.
The long-horizon and fine-grained benchmark tasks use their own paraphrase corpus
(configs/task_paraphrases.yaml) — expand below to see every perturbation instruction
actually given for each task (env 16+; env 0–15 run the canonical wording).
A scene is hidden only when no model can reach the threshold on perturbed instructions — a proxy for "this scene is too hard / broken / out-of-distribution". Everything in §3 then re-aggregates over only the kept scenes. Default: 0% (no filtering).
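As a rough illustration of the filter rule, the sketch below keeps a scene when at least one model's perturbed success rate meets the threshold. The function name, the input shape, and the choice of `>=` for "reach" are assumptions, not taken from the viewer's source.

```python
def visible_scenes(perturbed_rates, threshold_pct=0.0):
    """Return the scenes that stay visible under the threshold filter.

    perturbed_rates: {scene: {model: perturbed success rate in percent}}
    (hypothetical shape). A scene is hidden only if no model reaches the
    threshold, so one sufficiently good model is enough to keep it.
    """
    kept = []
    for scene, per_model in perturbed_rates.items():
        best = max(per_model.values(), default=0.0)
        if best >= threshold_pct:
            kept.append(scene)
    return kept


rates = {
    "scene_a": {"model_1": 0.0, "model_2": 10.0},
    "scene_b": {"model_1": 0.0, "model_2": 0.0},
}
default_view = visible_scenes(rates)        # threshold 0%: nothing is hidden
filtered_view = visible_scenes(rates, 5.0)  # scene_b drops out
```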
All models are evaluated on the same 34 scenes (4 default + 30 perturbation), so this table is already an apples-to-apples comparison.
Default % — mean success rate when the policy receives the original Bridge
instruction. Averaged over visible scenes (4 default scenes + 30 perturbation scenes,
all with their canonical instruction).
Default scenes — how many scenes contributed to the Default mean
(e.g. 34/34 means full coverage). Hover for total episode count + eps/scene.
Perturbed % — mean success rate over the same scenes but with each
instruction replaced by 27 paraphrases (3 paraphrases × 9 perturbation types).
Episode-weighted from the per-perturbation breakdown when available.
Pert. scenes — same as Default scenes but for the perturbation eval.
Δ (pp) — Perturbed − Default in percentage points. Negative = the model
loses ground on perturbed instructions; closer to 0 = more language-robust.
Combined % — single ranking score, computed as a weighted sum
across the components listed in the Combined-score weights panel at the top of the page
(default = 0.4·Default + 0.6·Perturbed; you can add SimplerEnv weight on the fly).
Per-model components that are missing get renormalised out so a model without SimplerEnv data
is still scored on the components it has. Use this column to pick the best overall model
under the weighting you care about.
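The Δ and Combined columns can be sketched as follows. This is an illustrative reading of the description above, not the viewer's implementation: the function names and component keys are hypothetical, and dividing by the sum of the weights that are actually present is an assumption about how "renormalised out" works.

```python
def delta_pp(default_pct, perturbed_pct):
    """Perturbed minus Default, in percentage points."""
    return perturbed_pct - default_pct


def combined_score(components, weights):
    """Weighted mean over the components a model actually has.

    components: {name: percent or None}; weights: {name: weight}.
    Missing components (None or absent) are dropped and the remaining
    weights are renormalised to sum to 1 (assumed behaviour).
    """
    present = {
        name: w for name, w in weights.items()
        if w > 0 and components.get(name) is not None
    }
    total = sum(present.values())
    if total == 0:
        return None  # nothing to score
    return sum(components[name] * w for name, w in present.items()) / total


weights = {"default": 0.4, "perturbed": 0.6, "simpler": 0.5}
no_simpler = combined_score(
    {"default": 50.0, "perturbed": 40.0, "simpler": None}, weights)
with_simpler = combined_score(
    {"default": 50.0, "perturbed": 40.0, "simpler": 80.0}, weights)
```

Under this reading, a model without SimplerEnv data gets the same score it would get with the default 0.4/0.6 weights, so it is not penalised for the missing component.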
Mean of per-model means within each family (PaliGemma, Qwen3-VL, HLC+VLA). The
n models columns show how many family members have data for each eval type;
the means are taken over those models only.
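A minimal sketch of the family aggregation, assuming each model has a single per-model mean (or no data) for a given eval type. The function name and input shapes are hypothetical.

```python
def family_means(model_rates, model_family):
    """Mean of per-model means within each family.

    model_rates: {model: percent or None}; model_family: {model: family}.
    Models without data are skipped, so n_models counts only the family
    members that contributed to the mean.
    """
    totals = {}
    for model, rate in model_rates.items():
        if rate is None:
            continue
        family = model_family[model]
        running_sum, n = totals.get(family, (0.0, 0))
        totals[family] = (running_sum + rate, n + 1)
    return {
        family: {"mean": running_sum / n, "n_models": n}
        for family, (running_sum, n) in totals.items()
    }


example = family_means(
    {"pg_a": 40.0, "pg_b": 60.0, "qw_a": 30.0, "qw_b": None},
    {"pg_a": "PaliGemma", "pg_b": "PaliGemma",
     "qw_a": "Qwen3-VL", "qw_b": "Qwen3-VL"},
)
```

Each model counts once regardless of how many episodes it ran, which is what distinguishes this from an episode-weighted mean.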
RobotArena's 30 perturbation scenes (+ 4 default scenes) decompose into 7 task types, classified by the structure of the original Bridge instruction. This tab breaks down each model's perturbed success rate per task type — use it to find which capabilities each model has and which it lacks.
Each cell shows the model's mean perturbed success rate across the scenes in that task-type bin. Color-coded green (best in column) → red (worst). Sort by any column. The bottom row aggregates across all models per task type to show which capabilities are universally hard.
Each row is one scene; metrics are aggregated across all models for that scene. Use this view to find the hardest and easiest scenes, and to spot scenes where perturbations cause an unusually large drop.
Default % — mean success rate across all models when given the original
Bridge instruction. Low = the scene is intrinsically hard (physics / object placement /
success-checker stringency); high = most models can solve it under ideal conditions.
Perturbed % — mean success rate across all models with the 27-paraphrase
perturbed instructions. Low = the scene is hard and/or language-sensitive.
Combined % — weighted sum of scene-level Default and Perturbed
means using the active weights panel (SimplerEnv has no per-scene data so it's
auto-dropped here). Sort ascending to see the hardest scenes first.
Δ (pp) — Perturbed − Default. Large negative = scene is much harder
under perturbation (language-sensitive); near-zero = perturbations don't hurt much.
Range — max − min perturbed rate across models. Large = some models
can solve it but others can't (more model-discriminative); small = all models perform
similarly (uninformative scene).
Best / Worst model — which model has the highest / lowest perturbed
rate on this scene.
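The per-scene columns above reduce to a few aggregates over the models' rates for one scene. The sketch below shows one way to compute them; the function name, input shapes, and output keys are hypothetical.

```python
def scene_summary(default_by_model, perturbed_by_model):
    """Aggregate one scene's metrics across models.

    Both arguments map model name -> success rate in percent and are
    assumed to be non-empty.
    """
    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    default_pct = mean(default_by_model.values())
    perturbed_pct = mean(perturbed_by_model.values())
    best = max(perturbed_by_model, key=perturbed_by_model.get)
    worst = min(perturbed_by_model, key=perturbed_by_model.get)
    return {
        "default_pct": default_pct,
        "perturbed_pct": perturbed_pct,
        "delta_pp": perturbed_pct - default_pct,
        "range_pp": perturbed_by_model[best] - perturbed_by_model[worst],
        "best_model": best,
        "worst_model": worst,
    }


summary = scene_summary(
    {"model_a": 60.0, "model_b": 40.0},
    {"model_a": 50.0, "model_b": 20.0},
)
```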
Each cell is the per-model mean success rate over scenes for a single perturbation type
(averaging the 3 paraphrases per type, then over scenes). Breakdown comes from walking
each campaign's episode_log.yaml files via
import_campaign.py / backfill_breakdown.py; models whose
original campaign episode logs are no longer on disk show —.
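The two-stage averaging described above (paraphrases within a scene first, then scenes) can be sketched like this. The episode record fields (`pert_type`, `scene`, `paraphrase`, `success`) are assumptions about what an episode log entry contains, and this is not the code in import_campaign.py or backfill_breakdown.py.

```python
from collections import defaultdict


def per_type_rates(episodes):
    """Success rate in percent per perturbation type for one model.

    Stage 1: success rate of each (type, scene, paraphrase).
    Stage 2: mean over the paraphrases of each (type, scene).
    Stage 3: mean over scenes for each type.
    """
    by_paraphrase = defaultdict(list)
    for ep in episodes:
        key = (ep["pert_type"], ep["scene"], ep["paraphrase"])
        by_paraphrase[key].append(1.0 if ep["success"] else 0.0)

    by_scene = defaultdict(list)
    for (ptype, scene, _), outcomes in by_paraphrase.items():
        by_scene[(ptype, scene)].append(sum(outcomes) / len(outcomes))

    by_type = defaultdict(list)
    for (ptype, _), rates in by_scene.items():
        by_type[ptype].append(sum(rates) / len(rates))

    return {ptype: 100.0 * sum(r) / len(r) for ptype, r in by_type.items()}


episodes = [
    {"pert_type": "typo", "scene": "s1", "paraphrase": 0, "success": True},
    {"pert_type": "typo", "scene": "s1", "paraphrase": 0, "success": True},
    {"pert_type": "typo", "scene": "s1", "paraphrase": 1, "success": False},
    {"pert_type": "typo", "scene": "s2", "paraphrase": 0, "success": True},
]
typo_rates = per_type_rates(episodes)
```

Averaging in stages gives every scene equal weight within a type, even when scenes have different episode counts.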
Side-by-side episode playback with synchronized controls (press play on either and both play; seek one and both seek).
In RobotArena mode the panels show one default-instruction episode (left)
and one perturbed-instruction episode (right) for the same (model, scene); in the scene dropdown,
★ marks scenes that also have a default video.
In SimplerEnv mode the panels show a successful episode (left) and a failed
episode (right) for the same (model, widowx task).
Coverage: Videos appear here once their main.mp4 files exist locally
under results/ra_eval_*/<sess>/<ckpt>/<scene>/episodes/episode_*/main.mp4.
Run viewer/fetch_videos.sh to pull them from the compute clusters (hala/sof1).
The index is auto-rebuilt by viewer/build.py on each viewer reload.
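To make the path convention concrete, here is a small sketch that groups video paths by checkpoint and scene. The regex follows the pattern quoted above; the function name and the shape of the resulting index are hypothetical and may not match what viewer/build.py actually writes.

```python
import re
from collections import defaultdict

# Mirrors results/ra_eval_*/<sess>/<ckpt>/<scene>/episodes/episode_*/main.mp4
VIDEO_PATH = re.compile(
    r"results/(?P<campaign>ra_eval_[^/]+)/(?P<sess>[^/]+)/(?P<ckpt>[^/]+)/"
    r"(?P<scene>[^/]+)/episodes/(?P<episode>episode_[^/]+)/main\.mp4$"
)


def index_videos(paths):
    """Group episode names by (ckpt, scene); non-matching paths are ignored."""
    index = defaultdict(list)
    for path in paths:
        match = VIDEO_PATH.search(path)
        if match is None:
            continue
        index[(match["ckpt"], match["scene"])].append(match["episode"])
    return {key: sorted(episodes) for key, episodes in index.items()}


video_index = index_videos([
    "results/ra_eval_0101/s1/ckptA/scene_03/episodes/episode_0002/main.mp4",
    "results/ra_eval_0101/s1/ckptA/scene_03/episodes/episode_0001/main.mp4",
    "results/ra_eval_0101/s1/ckptA/scene_03/results.yaml",
])
```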
Visual summaries computed live from the same results.json that drives the
tables under Eval Results. Every model row is a real RobotArena + SimplerEnv
campaign — nothing is fabricated. Hover any data point for the exact alias and rate.
The same numbers that drive §3 of Eval Results, but in chart form for at-a-glance comparison. Each row is one model; each colored bar is one metric. Hover any bar for the exact percentage and the underlying eps count. Sort by the metric you care about most.
Color-coded grid of success rates per (model × perturbation type). 9 perturbation types rewrite the instruction in different ways while keeping the scene identical (verb synonym, color reference, typo, verbose, etc. — see §2). Bright = high success; dark = low. Hover any cell for the exact percentage + paraphrase examples.
Same training recipe (controlled-paraphrase), same checkpoint step, same RobotArena scenes. The only thing that changes between paired bars is which Qwen3-VL variant was used to generate the language paraphrases at dataset-construction time. Use this to read off whether spending more compute on the annotator translates into a stronger policy.
Three scatter plots side-by-side, one per RobotArena flavor (Default / Perturbed / Combined), all plotted against the same SimplerEnv-widowx overall success on the Y axis. Each dot is one trained checkpoint. The plain-English takeaway lives in the caption beneath: it spells out which RA flavor SimplerEnv tracks most strongly, which dots disagree the most, and what the spread of r / ρ implies about benchmark substitutability.
A separate sanity scatter: RA Default % on X vs RA Perturbed % on Y, one dot per model. If the two RA flavors are nearly equivalent signals (r ≈ 1), Default eval alone is a fine cheap proxy; if they diverge, you genuinely need both evals.
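For reference, the two correlation statistics mentioned in these scatter plots can be computed as below. This is a dependency-free sketch with hypothetical function names; the viewer may well compute r and ρ differently, for example in the browser.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; returns None if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return None
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(xs), average_ranks(ys))
```

Pearson r measures how linear the relationship is, while Spearman ρ only asks whether the two evals order the models the same way, so a high ρ with a lower r suggests the benchmarks agree on ranking but not on scale.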
For model lines where we have multiple checkpoint steps on disk, this connects them so you can see whether more training keeps helping. A flat or declining line at higher step counts is the typical sign of overtraining or curriculum saturation.
One workspace for understanding what the model actually did on each scene. Pick filters, watch the focus rollout, then jump across "same scene other models", "same model all instructions", "same instruction other scenes" strips below to spot the failure mode. Pin 2+ rollouts to the compare lane for synchronized side-by-side playback.
One episode-centric view over every annotation source. Pick a source (a baked dataset variant or a subtask-annotation run), type an episode #, and inspect the bridge video with that source's annotation — the subtask timeline overlay for runs, the paraphrase pool for variants — alongside the source's metrics. Same data as the two Annotation Quality tabs (nothing recomputed).