Setup phase · ~10 minutes

Most AI teams measure variants after the fact. Someone trains a new model, results land, and the team argues about whether the gain is “good enough” to ship. That argument is expensive, and the answer drifts based on who’s in the room. This tutorial replaces that argument with two saved artifacts:
- An eval template. A saved config that says how to score a variant against a baseline. What script to run, which baseline to compare against, which metrics matter.
- A decision policy. A saved set of rules that says when a variant counts as shipping, when it counts as rejecting, and when the team should iterate.
Prerequisites. You’ve completed Create your project. The project exists and you can open Project Settings.
Define your allowed metrics
Before configuring the eval, declare the metrics this project tracks. Allowed metrics are the project-level vocabulary every experiment, eval template, and decision rule under this project draws from. They appear as dropdown options when creating experiments and as suggestions in the eval template’s metric picker.

- Open Project Settings → Allowed Metrics.
- Click + Add a metric.
- Fill in:
  - Key: machine identifier (lowercase, e.g., `spatialscore`). This is what the eval script emits and what decision rules reference.
  - Label: how the metric is displayed in the UI (e.g., “SpatialScore”).
  - Unit: what the value represents (e.g., `accuracy`, `%`, `seconds`).
  - Baseline: the current value to compare future variants against.
- Repeat for every metric this project will track.
| Key | Label | Unit | Baseline |
|---|---|---|---|
| `spatialscore` | SpatialScore | accuracy | 0.42 |
| `omnispatial` | OmniSpatial | accuracy | 0.31 |
| `space10` | SpaCE-10 | accuracy | 0.55 |
| `mindcube` | MindCube | accuracy | 0.38 |
These are Qwen/Qwen2.5-VL-7B-Instruct’s scores on each benchmark today. Future variants will be measured as a delta against these numbers.
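It can help to picture this table as a small shared registry: the keys are what the eval script emits, and the baselines are what deltas are computed against. A minimal sketch of that idea (the dict layout and `delta` helper are illustrative assumptions, not Remyx’s actual schema):

```python
# Hypothetical sketch of the allowed-metrics registry above.
# Keys are the machine identifiers the eval script emits;
# baselines are Qwen/Qwen2.5-VL-7B-Instruct's current scores.
ALLOWED_METRICS = {
    "spatialscore": {"label": "SpatialScore", "unit": "accuracy", "baseline": 0.42},
    "omnispatial":  {"label": "OmniSpatial",  "unit": "accuracy", "baseline": 0.31},
    "space10":      {"label": "SpaCE-10",     "unit": "accuracy", "baseline": 0.55},
    "mindcube":     {"label": "MindCube",     "unit": "accuracy", "baseline": 0.38},
}

def delta(key: str, variant_score: float) -> float:
    """Delta of a variant's score against the stored baseline."""
    return round(variant_score - ALLOWED_METRICS[key]["baseline"], 4)

# A variant scoring 0.45 on SpatialScore is a +0.03 delta over the baseline.
print(delta("spatialscore", 0.45))  # 0.03
```

Everything downstream (the eval template and the decision policy) speaks in these keys, which is why getting the vocabulary right first matters.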

Configure the eval template
The eval template tells Remyx how to measure.

- Scroll down to Project Settings → Evaluation.
- Click + Configure eval.
- Pick a runtime.
- Custom. Point Remyx at an evaluation script that already lives in your repo. Provide the script path and a baseline asset (the model, dataset, or process variants will be compared against).
  - lighteval. Pick a task from the supported whitelist (e.g., `gsm8k`, `mmlu`). Metric defaults populate automatically.
- Add the metrics you want this template to track. The listbox suggests the allowed metrics you just defined; select from the suggestions or type your own.
- Click Save and lock.
- Runtime: Custom
- Eval script path: `docker/eval_stage/`
- Baseline asset: `Qwen/Qwen2.5-VL-7B-Instruct` (a HuggingFace model slug)
- Metrics: `spatialscore`, `omnispatial`, `space10`, `mindcube` (the four allowed metrics from the previous step)


Locking the template doesn’t run the eval. Locking captures the spec so the policy editor below can autocomplete metric names from your template, and so future runs use the same harness every time.
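When a run does happen, the custom script’s job is to emit a value for every allowed metric key. The exact output contract isn’t documented here, so treat this sketch as an assumption: a script under `docker/eval_stage/` that prints one JSON object keyed by the metric identifiers, with placeholder scoring logic.

```python
import json

def run_eval(model_id: str) -> dict:
    """Hypothetical eval harness: score `model_id` on each spatial benchmark.
    Real scoring logic would live in the repo's eval stage; the values
    below are placeholders, not measurements."""
    return {
        "spatialscore": 0.45,
        "omnispatial": 0.33,
        "space10": 0.55,
        "mindcube": 0.39,
    }

if __name__ == "__main__":
    scores = run_eval("Qwen/Qwen2.5-VL-7B-Instruct")
    # Emit one JSON object keyed by the allowed metric identifiers,
    # so downstream tooling can diff it against the baseline.
    print(json.dumps(scores))
```

The key point is the interface, not the scoring: the keys the script emits must match the allowed-metric keys exactly, or decision rules will have nothing to reference.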
When to use Custom vs. lighteval
Use lighteval when the metric you care about is one of the standard benchmarks the runtime ships with (`gsm8k`, `mmlu`, etc.). The metric chip auto-populates with the task’s canonical metric (`exact_match`, `accuracy`), and you don’t need to maintain an eval script in your repo.

Use Custom when your repo already ships its own eval script. Your script defines what runs, what metrics come out, and what the baseline asset is.

You can create additional templates later if you have multiple evaluation strategies you’d like to deploy.
What counts as a Baseline asset?
A Baseline asset is whatever your project’s variants will be compared against. The exact form depends on your eval script and what your project produces.
Lock in the decision policy
The decision policy tells Remyx what counts as progress. The eval template gives you measurements; the policy gives you a verdict against pre-committed rules.

- Open Project Settings → Decision Policy.
- Build a ship rule. This defines what the variant must achieve to count as progress worth shipping.
  - Pick the combinator (`all` or `any`).
  - Add metric conditions: pick a metric (autocomplete from your locked template), pick an operator (`>=`, `>`, `improves_by`, etc.), set a threshold, and optionally set a confidence floor.
  - Add field conditions if you want to gate on confidence or sample size (e.g., `delta_confidence >= 0.80`).
- Build a reject rule. This defines what disqualifies the variant from shipping.
  - Same shape as ship. Common pattern: a `no_regression` condition on every metric you’d be unhappy to see drop.
- Optional: add notes with free-form context the team should remember about why these criteria exist.
- Click Save policy. A “Changes saved” flash confirms.
`iterate` is the automatic fallback. When neither ship nor reject matches, Remyx classifies the run as `iterate` and prompts the team to keep working.
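The operators in these conditions map naturally onto small predicates over a metric’s value, baseline, and threshold. A sketch of that mapping (the operator names come from the UI above; the implementations, including the regression tolerance, are my assumptions):

```python
# Hypothetical predicate implementations for the condition operators.
# Each takes the variant's value, the baseline, and the rule's threshold.
OPERATORS = {
    ">=":            lambda value, baseline, t: value >= t,
    ">":             lambda value, baseline, t: value > t,
    "improves_by":   lambda value, baseline, t: (value - baseline) >= t,
    "no_regression": lambda value, baseline, t: value >= baseline - t,
}

# Baseline 0.42; a variant scoring 0.45 improves by at least 0.02.
assert OPERATORS["improves_by"](0.45, 0.42, 0.02)
# A variant at 0.40 regresses past a 0.01 tolerance, so no_regression fails.
assert not OPERATORS["no_regression"](0.40, 0.42, 0.01)
```

Thinking of each condition as a pure predicate is what makes the policy auditable: the same inputs always produce the same verdict, with no room for in-the-room renegotiation.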

A worked example
For a project comparing two VLMs on four spatial benchmarks, a defensible policy might read:

Ship (combinator: `all`)
- `spatialscore.delta >= 0.02`, confidence floor `0.80`
- `omnispatial.delta >= 0.02`, confidence floor `0.80`
- `delta_confidence >= 0.80` (field condition)

Reject (combinator: `any`)
- `spatialscore.delta < -0.01`
- `omnispatial.delta < -0.01`
- `space10.delta < -0.01`
- `mindcube.delta < -0.01`
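Under the hood, such a policy is just two lists of predicates plus the `iterate` fallback. A self-contained sketch of how a verdict could be computed (the encoding, and the choice to check reject before ship, are my assumptions, not Remyx’s actual engine):

```python
# Hypothetical encoding of the worked-example policy above.
SHIP_CONDITIONS = [  # combinator: all
    lambda r: r["spatialscore.delta"] >= 0.02 and r["delta_confidence"] >= 0.80,
    lambda r: r["omnispatial.delta"] >= 0.02 and r["delta_confidence"] >= 0.80,
    lambda r: r["delta_confidence"] >= 0.80,  # field condition
]
REJECT_CONDITIONS = [  # combinator: any
    lambda r: r["spatialscore.delta"] < -0.01,
    lambda r: r["omnispatial.delta"] < -0.01,
    lambda r: r["space10.delta"] < -0.01,
    lambda r: r["mindcube.delta"] < -0.01,
]

def verdict(result: dict) -> str:
    # Checking reject first is an assumption; the docs only specify
    # that iterate is the fallback when neither rule matches.
    if any(c(result) for c in REJECT_CONDITIONS):
        return "reject"
    if all(c(result) for c in SHIP_CONDITIONS):
        return "ship"
    return "iterate"

run = {"spatialscore.delta": 0.03, "omnispatial.delta": 0.02,
       "space10.delta": 0.00, "mindcube.delta": 0.01,
       "delta_confidence": 0.91}
print(verdict(run))  # ship
```

A run with small positive deltas and no regressions falls through to `iterate`, and a single regression past the tolerance triggers `reject` regardless of headline gains, which is exactly the behavior the reject rule is there to guarantee.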
Recap
You now have:

- A locked eval template defining how variants will be scored
- A locked decision policy defining what counts as ship, reject, or iterate
- A team-agreed bar that lives in the project alongside its history
Tips
Lock the eval before writing the policy
The decision-policy editor autocompletes metric names from the locked eval template. Writing a policy without a locked template means typing metric names from memory, which leads to typos that won’t be caught until evaluation runs.
Write reject rules even when ship is what you care about
A reject rule with `no_regression` on every metric you wouldn’t want to see drop is cheap insurance. Without it, a variant that improves the headline metric but tanks a secondary one falls through to iterate, which can be misleading.
The notes field is for context, not criteria
Put context there that future-you (or a future teammate) will need to remember why this policy is what it is. The structured rules are what get evaluated against results.
Next
Scope an experiment from a recommendation
Turn a paper into a structured experiment with hypothesis, target metric, and tags.
Series overview
Full series arc
ExperimentOps concepts
Why pre-committed criteria matter