
Setup phase · ~10 minutes
Follow along with the video walkthrough on YouTube.
Most AI teams measure variants after the fact. Someone trains a new model, results land, and the team argues about whether the gain is “good enough” to ship. That argument is expensive and the answer drifts based on who’s in the room. This tutorial replaces that argument with two saved artifacts:
  • An eval template. A saved config that says how to score a variant against a baseline. What script to run, which baseline to compare against, which metrics matter.
  • A decision policy. A saved set of rules that says when a variant counts as shipping, when it counts as rejecting, and when the team should iterate.
Both are written before any variant exists. When results land in Run an evaluation, the team’s bar is already on the page next to the numbers.
Prerequisites. You’ve completed Create your project. The project exists and you can open Project Settings.

Define your allowed metrics

Before configuring the eval, declare the metrics this project tracks. Allowed metrics are the project-level vocabulary every experiment, eval template, and decision rule under this project draws from. They appear as dropdown options when creating experiments and as suggestions in the eval template’s metric picker.
  1. Open Project Settings → Allowed Metrics.
  2. Click + Add a metric.
  3. Fill in:
    • Key: machine identifier (lowercase, e.g., spatialscore). This is what the eval script emits and what decision rules reference.
    • Label: how the metric is displayed in the UI (e.g., “SpatialScore”).
    • Unit: what the value represents (e.g., accuracy, %, seconds).
    • Baseline: the current value to compare future variants against.
  4. Repeat for every metric this project will track.
For VQASynth, define one row per spatial benchmark:
Key            Label          Unit        Baseline
spatialscore   SpatialScore   accuracy    0.42
omnispatial    OmniSpatial    accuracy    0.31
space10        SpaCE-10       accuracy    0.55
mindcube       MindCube       accuracy    0.38
The baseline values are what Qwen/Qwen2.5-VL-7B-Instruct scores on each benchmark today. Future variants will be measured as a delta against these numbers.
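To make the delta concrete, here is a minimal sketch of the comparison in plain Python: a variant's score minus the project baseline for the same metric key. The baselines are the ones from the table above; the variant scores are invented purely for illustration and are not real results.

```python
# Illustrative only: comparing a hypothetical variant against the project baselines.
# The baselines come from the Allowed Metrics table above; the variant scores are made up.
baselines = {
    "spatialscore": 0.42,
    "omnispatial": 0.31,
    "space10": 0.55,
    "mindcube": 0.38,
}

variant_scores = {  # hypothetical results for a future variant
    "spatialscore": 0.45,
    "omnispatial": 0.33,
    "space10": 0.54,
    "mindcube": 0.40,
}

for key, baseline in baselines.items():
    delta = variant_scores[key] - baseline
    print(f"{key}: {variant_scores[key]:.2f} (baseline {baseline:.2f}, delta {delta:+.2f})")
```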
[Screenshot: Allowed Metrics section with four spatial benchmark rows]
Keep metric labels business-oriented and short. They show up on charts, in dropdowns, and in decision-policy rule descriptions. “SpatialScore” reads better than “Spatial Score Accuracy %”; spatialscore as a key reads better than spatial_score_metric.

Configure the eval template

The eval template tells Remyx how to measure.
  1. Scroll down to Project Settings → Evaluation.
  2. Click + Configure eval.
  3. Pick a runtime.
    • Custom. Point Remyx at an evaluation script that already lives in your repo. Provide the script path and a baseline asset (the model, dataset, or process variants will be compared against).
    • lighteval. Pick a task from the supported whitelist (e.g., gsm8k, mmlu). Metric defaults populate automatically.
  4. Add the metrics you want this template to track. The listbox suggests the allowed metrics you just defined; select from the suggestions or type your own.
  5. Click Save and lock.
For VQASynth, that looks like:
  • Runtime: Custom
  • Eval script path: docker/eval_stage/
  • Baseline asset: Qwen/Qwen2.5-VL-7B-Instruct (a HuggingFace model slug)
  • Metrics: spatialscore, omnispatial, space10, mindcube (the four allowed metrics from the previous step)
[Screenshot: Eval template form with Custom runtime, script path, baseline asset, and metrics]
The template appears with a green locked badge. Locked templates are what Run an evaluation runs against, and they autocomplete metric names into the decision policy below.
[Screenshot: Eval template saved with green locked badge]
Locking the template doesn’t run the eval. Locking captures the spec so the policy editor below can autocomplete metric names from your template, and so future runs use the same harness every time.
Use lighteval when the metric you care about is one of the standard benchmarks the runtime ships with (gsm8k, mmlu, etc.). The metric chip auto-populates with the task’s canonical metric (exact_match, accuracy), and you don’t need to maintain an eval script in your repo.
Use Custom when your repo already ships its own eval script. Your script defines what runs, what metrics come out, and what the baseline asset is.
You can create additional templates later if you have multiple evaluation strategies you’d like to deploy.
A Baseline asset is whatever your project’s variants will be compared against. The exact form depends on your eval script and what your project produces.
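To ground the Custom runtime, here is a hypothetical outline of the kind of script a directory like docker/eval_stage/ could contain. It is not the actual VQASynth eval stage: the run_benchmark helper, the CLI flags, and the JSON output are placeholders for whatever your harness really does. The one detail worth preserving is that the emitted keys match the allowed metrics defined earlier, since those are the names decision rules reference.

```python
"""Hypothetical sketch of a custom eval script; not the real contents of docker/eval_stage/."""
import argparse
import json

# The four allowed-metric keys defined in Project Settings.
BENCHMARKS = ["spatialscore", "omnispatial", "space10", "mindcube"]


def run_benchmark(model_id: str, benchmark: str) -> float:
    """Placeholder: wire this to your actual benchmark runner.

    Returns a dummy score here so the sketch runs end to end.
    """
    return 0.0


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True, help="e.g. a HuggingFace model slug")
    parser.add_argument("--output", default="metrics.json")
    args = parser.parse_args()

    # Emit one value per allowed-metric key so decision rules can reference them by name.
    metrics = {name: run_benchmark(args.model, name) for name in BENCHMARKS}

    with open(args.output, "w") as f:
        json.dump(metrics, f, indent=2)


if __name__ == "__main__":
    main()
```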

Lock in the decision policy

The decision policy tells Remyx what counts as progress. The eval template gives you measurements; the policy gives you a verdict against pre-committed rules.
  1. Open Project Settings → Decision Policy.
  2. Build a ship rule. This defines what the variant must achieve to count as progress worth shipping.
    • Pick the combinator (all or any).
    • Add metric conditions: pick a metric (autocomplete from your locked template), pick an operator (>=, >, improves_by, etc.), set a threshold, optionally set a confidence floor.
    • Add field conditions if you want to gate on confidence or sample size (e.g., delta_confidence >= 0.80).
  3. Build a reject rule. This defines what disqualifies the variant from shipping.
    • Same shape as ship. Common pattern: a no_regression condition on every metric you’d be unhappy to see drop.
  4. Optional: add notes with free-form context the team should remember about why these criteria were chosen.
  5. Click Save policy. A “Changes saved” flash confirms.
iterate is the automatic fallback. When neither ship nor reject matches, Remyx classifies the run as iterate and prompts the team to keep working.
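Conceptually, the verdict works like a small rule check: evaluate the reject rule, evaluate the ship rule, and fall back to iterate when neither matches. The sketch below illustrates the all/any combinators and the fallback; it is not Remyx’s implementation, and the choice to let a matching reject win over a matching ship is this sketch’s assumption, not documented behavior.

```python
from typing import Callable

# Conceptual sketch of ship / reject / iterate classification; not Remyx's implementation.
Condition = Callable[[dict], bool]  # takes a results dict, returns pass / fail


def rule_matches(combinator: str, conditions: list[Condition], results: dict) -> bool:
    checks = [cond(results) for cond in conditions]
    return all(checks) if combinator == "all" else any(checks)


def verdict(results: dict,
            ship: tuple[str, list[Condition]],
            reject: tuple[str, list[Condition]]) -> str:
    # Assumption: a matching reject rule wins if both rules happen to match.
    if rule_matches(*reject, results=results):
        return "reject"
    if rule_matches(*ship, results=results):
        return "ship"
    return "iterate"  # automatic fallback when neither rule matches
```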
[Screenshot: Decision Policy editor with ship and reject rules and iterate fallback]
For a project comparing two VLMs on four spatial benchmarks, a defensible policy might read:
Ship (combinator: all)
  • spatialscore.delta >= 0.02, confidence floor 0.80
  • omnispatial.delta >= 0.02, confidence floor 0.80
  • delta_confidence >= 0.80 (field condition)
Reject (combinator: any)
  • spatialscore.delta < -0.01
  • omnispatial.delta < -0.01
  • space10.delta < -0.01
  • mindcube.delta < -0.01
Notes: “spatialscore and omnispatial must each improve by at least 2% at 80% confidence to ship. Any single benchmark regression worse than 1% rejects.”
If the variant improves on some benchmarks but doesn’t clear the ship bar, and nothing regresses badly enough to trigger reject, the team iterates.
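Written out as data, the example policy above could look like the sketch below. The field names and operator spellings are illustrative rather than Remyx’s actual schema; the metric names, thresholds, and confidence floors are exactly the ones listed above.

```python
# Illustrative representation of the example policy; field names are not Remyx's schema.
policy = {
    "ship": {
        "combinator": "all",
        "conditions": [
            {"metric": "spatialscore.delta", "op": ">=", "threshold": 0.02, "confidence_floor": 0.80},
            {"metric": "omnispatial.delta", "op": ">=", "threshold": 0.02, "confidence_floor": 0.80},
            {"field": "delta_confidence", "op": ">=", "threshold": 0.80},
        ],
    },
    "reject": {
        "combinator": "any",
        "conditions": [
            {"metric": f"{name}.delta", "op": "<", "threshold": -0.01}
            for name in ("spatialscore", "omnispatial", "space10", "mindcube")
        ],
    },
    "notes": (
        "spatialscore and omnispatial must each improve by at least 2% at 80% confidence "
        "to ship. Any single benchmark regression worse than 1% rejects."
    ),
}
```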

Recap

You now have:
  • A locked eval template defining how variants will be scored
  • A locked decision policy defining what counts as ship, reject, or iterate
  • A team-agreed bar that lives in the project alongside its history
When you run an evaluation in Run an evaluation, the policy renders alongside results with a per-predicate ✓ / ✗ / · indicator.

Tips

The decision-policy editor autocompletes metric names from the locked eval template. Writing a policy without a locked template means typing metric names from memory, which leads to typos that won’t be caught until evaluation runs.
A reject rule with no_regression on every metric you wouldn’t want to see drop is cheap insurance. Without it, a variant that improves the headline metric but tanks a secondary one falls through to iterate, which makes a real regression read as a neutral “keep working” result.
Put context in the policy’s notes field that future-you (or a future teammate) will need to remember why this policy is what it is. The structured rules are what get evaluated against results; the notes carry the reasoning behind them.

Next

Scope an experiment from a recommendation

Turn a paper into a structured experiment with hypothesis, target metric, and tags.

Series overview

Full series arc

ExperimentOps concepts

Why pre-committed criteria matter