Setup phase · ~10 minutes

Most AI teams measure variants after the fact. Someone trains a new model, results land, and the team argues about whether the gain is “good enough” to ship. That argument is expensive, and the answer drifts based on who’s in the room. This tutorial replaces that argument with two saved artifacts:
- An eval template. A saved config that says how to score a variant against a baseline. What script to run, which baseline to compare against, which metrics matter.
- A decision policy. A saved set of rules that says when a variant counts as shipping, when it counts as rejecting, and when the team should iterate.
Prerequisites. You’ve completed Create your project. The project exists and you can open Project Settings.
Define your allowed metrics
Before configuring the eval, declare the metrics this project tracks. Allowed metrics are the project-level vocabulary every experiment, eval template, and decision rule under this project draws from. They appear as dropdown options when creating experiments and as suggestions in the eval template’s metric picker.

- Open Project Settings → Allowed Metrics.
- Click + Add a metric.
- Fill in:
  - Key: machine identifier (lowercase, e.g., `spatialscore`). This is what the eval script emits and what decision rules reference.
  - Label: how the metric is displayed in the UI (e.g., “SpatialScore”).
  - Unit: what the value represents (e.g., `accuracy`, `%`, `seconds`).
  - Baseline: the current value to compare future variants against.
- Repeat for every metric this project will track.
| Key | Label | Unit | Baseline |
|---|---|---|---|
| `spatialscore` | SpatialScore | accuracy | 0.42 |
| `omnispatial` | OmniSpatial | accuracy | 0.31 |
| `space10` | SpaCE-10 | accuracy | 0.55 |
| `mindcube` | MindCube | accuracy | 0.38 |
These are Qwen/Qwen2.5-VL-7B-Instruct’s scores on each benchmark today. Future variants will be measured as a delta against these numbers.
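It can help to picture this table as a small shared registry: the keys are what the eval script emits, and the baselines are what deltas are computed against. A minimal sketch of that idea (the dict layout and `delta` helper are illustrative assumptions, not Remyx’s actual schema):

```python
# Hypothetical sketch of the allowed-metrics registry above.
# Keys are the machine identifiers the eval script emits;
# baselines are Qwen/Qwen2.5-VL-7B-Instruct's current scores.
ALLOWED_METRICS = {
    "spatialscore": {"label": "SpatialScore", "unit": "accuracy", "baseline": 0.42},
    "omnispatial":  {"label": "OmniSpatial",  "unit": "accuracy", "baseline": 0.31},
    "space10":      {"label": "SpaCE-10",     "unit": "accuracy", "baseline": 0.55},
    "mindcube":     {"label": "MindCube",     "unit": "accuracy", "baseline": 0.38},
}

def delta(key: str, variant_score: float) -> float:
    """Delta of a variant's score against the stored baseline."""
    return round(variant_score - ALLOWED_METRICS[key]["baseline"], 4)

# A variant scoring 0.45 on SpatialScore is a +0.03 delta over the baseline.
print(delta("spatialscore", 0.45))  # 0.03
```

Everything downstream (the eval template and the decision policy) speaks in these keys, which is why getting the vocabulary right first matters.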

Configure the eval template
The eval template tells Remyx how to measure.

- Scroll down to Project Settings → Evaluation.
- Click + Configure eval.
- Pick a runtime.
- Custom. Point Remyx at an evaluation script that already lives in your repo. Provide the script path and a baseline asset (the model, dataset, or process variants will be compared against).
  - lighteval. Pick a task from the supported whitelist (e.g., `gsm8k`, `mmlu`). Metric defaults populate automatically.
- Add the metrics you want this template to track. The listbox suggests the allowed metrics you just defined; select from the suggestions or type your own.
- Click Save and lock.
- Runtime: Custom
- Eval script path: `docker/eval_stage/`
- Baseline asset: `Qwen/Qwen2.5-VL-7B-Instruct` (a HuggingFace model slug)
- Metrics: `spatialscore`, `omnispatial`, `space10`, `mindcube` (the four allowed metrics from the previous step)


Locking the template doesn’t run the eval. Locking captures the spec so the policy editor below can autocomplete metric names from your template, and so future runs use the same harness every time.
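When a run does happen, the custom script’s job is to emit a value for every allowed metric key. The exact output contract isn’t documented here, so treat this sketch as an assumption: a script under `docker/eval_stage/` that prints one JSON object keyed by the metric identifiers, with placeholder scoring logic.

```python
import json

def run_eval(model_id: str) -> dict:
    """Hypothetical eval harness: score `model_id` on each spatial benchmark.
    Real scoring logic would live in the repo's eval stage; the values
    below are placeholders, not measurements."""
    return {
        "spatialscore": 0.45,
        "omnispatial": 0.33,
        "space10": 0.55,
        "mindcube": 0.39,
    }

if __name__ == "__main__":
    scores = run_eval("Qwen/Qwen2.5-VL-7B-Instruct")
    # Emit one JSON object keyed by the allowed metric identifiers,
    # so downstream tooling can diff it against the baseline.
    print(json.dumps(scores))
```

The key point is the interface, not the scoring: the keys the script emits must match the allowed-metric keys exactly, or decision rules will have nothing to reference.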
When to use Custom vs. lighteval
Use lighteval when the metric you care about is one of the standard benchmarks the runtime ships with (`gsm8k`, `mmlu`, etc.). The metric chip auto-populates with the task’s canonical metric (`exact_match`, `accuracy`), and you don’t need to maintain an eval script in your repo.

Use Custom when your repo already ships its own eval script. Your script defines what runs, what metrics come out, and what the baseline asset is.

You can create additional templates later if you have multiple evaluation strategies you’d like to deploy.
What counts as a Baseline asset?
A Baseline asset is whatever your project’s variants will be compared against. The exact form depends on your eval script and what your project produces.
Lock in the decision policy
The decision policy tells Remyx what counts as progress. The eval template gives you measurements; the policy gives you a verdict against pre-committed rules.

- Open Project Settings → Decision Policy.
- Build a ship rule. This defines what the variant must achieve to count as progress worth shipping.
  - Pick the combinator (`all` or `any`).
  - Add metric conditions: pick a metric (autocomplete from your locked template), pick an operator (`>=`, `>`, `improves_by`, etc.), set a threshold, and optionally set a confidence floor.
  - Add field conditions if you want to gate on confidence or sample size (e.g., `delta_confidence >= 0.80`).
- Build a reject rule. This defines what disqualifies the variant from shipping.
  - Same shape as ship. Common pattern: a `no_regression` condition on every metric you’d be unhappy to see drop.
- Optional: add notes with free-form context the team should remember about why these criteria exist.
- Click Save policy. A “Changes saved” flash confirms.
`iterate` is the automatic fallback. When neither ship nor reject matches, Remyx classifies the run as `iterate` and prompts the team to keep working.
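The operators in these conditions map naturally onto small predicates over a metric’s value, baseline, and threshold. A sketch of that mapping (the operator names come from the UI above; the implementations, including the regression tolerance, are my assumptions):

```python
# Hypothetical predicate implementations for the condition operators.
# Each takes the variant's value, the baseline, and the rule's threshold.
OPERATORS = {
    ">=":            lambda value, baseline, t: value >= t,
    ">":             lambda value, baseline, t: value > t,
    "improves_by":   lambda value, baseline, t: (value - baseline) >= t,
    "no_regression": lambda value, baseline, t: value >= baseline - t,
}

# Baseline 0.42; a variant scoring 0.45 improves by at least 0.02.
assert OPERATORS["improves_by"](0.45, 0.42, 0.02)
# A variant at 0.40 regresses past a 0.01 tolerance, so no_regression fails.
assert not OPERATORS["no_regression"](0.40, 0.42, 0.01)
```

Thinking of each condition as a pure predicate is what makes the policy auditable: the same inputs always produce the same verdict, with no room for in-the-room renegotiation.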

A worked example
For a project comparing two VLMs on four spatial benchmarks, a defensible policy might read:

Ship (combinator: `all`)
- `spatialscore.delta >= 0.02`, confidence floor `0.80`
- `omnispatial.delta >= 0.02`, confidence floor `0.80`
- `delta_confidence >= 0.80` (field condition)

Reject (combinator: `any`)
- `spatialscore.delta < -0.01`
- `omnispatial.delta < -0.01`
- `space10.delta < -0.01`
- `mindcube.delta < -0.01`
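Under the hood, such a policy is just two lists of predicates plus the `iterate` fallback. A self-contained sketch of how a verdict could be computed (the encoding, and the choice to check reject before ship, are my assumptions, not Remyx’s actual engine):

```python
# Hypothetical encoding of the worked-example policy above.
SHIP_CONDITIONS = [  # combinator: all
    lambda r: r["spatialscore.delta"] >= 0.02 and r["delta_confidence"] >= 0.80,
    lambda r: r["omnispatial.delta"] >= 0.02 and r["delta_confidence"] >= 0.80,
    lambda r: r["delta_confidence"] >= 0.80,  # field condition
]
REJECT_CONDITIONS = [  # combinator: any
    lambda r: r["spatialscore.delta"] < -0.01,
    lambda r: r["omnispatial.delta"] < -0.01,
    lambda r: r["space10.delta"] < -0.01,
    lambda r: r["mindcube.delta"] < -0.01,
]

def verdict(result: dict) -> str:
    # Checking reject first is an assumption; the docs only specify
    # that iterate is the fallback when neither rule matches.
    if any(c(result) for c in REJECT_CONDITIONS):
        return "reject"
    if all(c(result) for c in SHIP_CONDITIONS):
        return "ship"
    return "iterate"

run = {"spatialscore.delta": 0.03, "omnispatial.delta": 0.02,
       "space10.delta": 0.00, "mindcube.delta": 0.01,
       "delta_confidence": 0.91}
print(verdict(run))  # ship
```

A run with small positive deltas and no regressions falls through to `iterate`, and a single regression past the tolerance triggers `reject` regardless of headline gains, which is exactly the behavior the reject rule is there to guarantee.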
Recap
You now have:

- A locked eval template defining how variants will be scored
- A locked decision policy defining what counts as ship, reject, or iterate
- A team-agreed bar that lives in the project alongside its history
Tips
Lock the eval before writing the policy
The decision-policy editor autocompletes metric names from the locked eval template. Writing a policy without a locked template means typing metric names from memory, which leads to typos that won’t be caught until evaluation runs.
Write reject rules even when ship is what you care about
A reject rule with `no_regression` on every metric you wouldn’t want to see drop is cheap insurance. Without it, a variant that improves the headline metric but tanks a secondary one falls through to iterate, which can be misleading.
The notes field is for context, not criteria
Put context there that future-you (or a future teammate) will need to remember why this policy is what it is. The structured rules are what get evaluated against results.
Next
Scope an experiment from a recommendation
Turn a paper into a structured experiment with hypothesis, target metric, and tags.
Series overview
Full series arc
ExperimentOps concepts
Why pre-committed criteria matter