

Decide phase · ~15 minutes

You’ve configured the harness in Define how progress gets measured, scoped an experiment in Scope an experiment from a recommendation, and implemented it in Implement an Experiment. Now you find out whether the variant is worth shipping. This tutorial triggers the eval against the locked template, reads the side-by-side results, and logs the decision against the policy you wrote in advance.
Prerequisites
  • You’ve completed the previous tutorials in the series. The project, eval template, decision policy, scoped experiment, and implementation PR all exist.
  • A Modal account, connected via Integrations. Modal sandboxes run under your workspace, and your account pays for the compute.
  • A Hugging Face account connected via Integrations, so the eval can pull the baseline and variant model weights.

Trigger the head-to-head

The eval runs as a two-arm Modal job. One sandbox runs the baseline asset (defined in your eval template); the other runs the variant asset (what your PR produced). Each runs the locked eval script and emits the metrics defined in the template.

From the experiment detail page, click Run evaluation. A dialog asks for the baseline and variant references. The baseline defaults to the asset on the eval template; the variant is the model produced by the implementation PR. Or from Claude Code:
> Run an evaluation on the depth-estimator-swap experiment,
  baseline Qwen/Qwen2.5-VL-7B-Instruct,
  variant myorg/Qwen2.5-VL-VQASynth-FT-v1.
  [Claude calls provision_validation]
Either path calls provision_validation, which builds an eval-environment Dockerfile (using gitingest plus Gemini-driven generation against your repo), submits it as two parallel sandbox jobs to Modal, and registers a webhook to receive results. Because the arms run in parallel, wall-clock time is that of the slower arm (typically the bigger model, or whichever has the slower benchmarks). While the eval runs, you can step away; the experiment detail page surfaces the results once both arms complete.
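For intuition, here is a minimal sketch of the two-arm pattern using Modal’s Python SDK. The function run_eval_arm, the image contents, and the placeholder metrics are illustrative assumptions; the real jobs are built from the generated eval-environment Dockerfile, not this stub.

import modal

app = modal.App("two-arm-eval-sketch")
image = modal.Image.debian_slim().pip_install("torch", "transformers")

# Hypothetical stand-in for one eval arm; the real environment comes from
# the generated Dockerfile, and the metrics come from the locked eval script.
@app.function(image=image, gpu="A10G", timeout=60 * 60)
def run_eval_arm(model_ref: str) -> dict:
    # Pull the weights from Hugging Face, run the eval, and emit the
    # metrics defined in the template.
    return {"model": model_ref, "vqa_accuracy": 0.0}  # placeholder values

@app.local_entrypoint()
def main():
    # spawn() starts both arms without blocking, so they run in parallel;
    # wall-clock time is whichever arm finishes last.
    baseline = run_eval_arm.spawn("Qwen/Qwen2.5-VL-7B-Instruct")
    variant = run_eval_arm.spawn("myorg/Qwen2.5-VL-VQASynth-FT-v1")
    print(baseline.get())
    print(variant.get())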

Read the results

Once both arms complete, the experiment’s detail page populates with the side-by-side. Each metric in the template appears with the baseline value, the variant value, the delta, and a directional confidence indicator. Below the metrics, the Decision policy block from the locked template renders next to the actuals. Each predicate from the policy is evaluated against the actual deltas, with a ✓ / ✗ / · indicator next to each. The classification is one of three things:
  • Ship. Every predicate of the ship rule passes. The variant cleared the bar set in advance.
  • Reject. Any predicate of the reject rule fires. The variant tripped a guard rail (e.g., regressed a benchmark by more than the threshold).
  • Iterate. Neither ship nor reject matched. The variant moved some metrics but didn’t decisively earn a ship-or-stop verdict.
These match the three terminal states of an experiment in Remyx.
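For a concrete picture of that three-way logic, here is a small sketch replaying a locked policy against actual deltas. The metric names, thresholds, and predicate shapes are hypothetical, not Remyx’s internal schema.

# Hypothetical replay of a locked decision policy against actual deltas.
deltas = {"vqa_accuracy": 0.031, "mindcube": -0.004}

ship_predicates = [
    ("vqa_accuracy delta >= 0.02", deltas["vqa_accuracy"] >= 0.02),
]
reject_predicates = [
    ("mindcube delta <= -0.03", deltas["mindcube"] <= -0.03),
]

if all(passed for _, passed in ship_predicates):
    classification = "ship"      # every ship predicate passed
elif any(fired for _, fired in reject_predicates):
    classification = "reject"    # a guard rail fired
else:
    classification = "iterate"   # neither rule matched decisively

print(classification)  # -> ship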

Log the decision

The classification is the engine’s recommendation. The decision is yours. Sometimes you’ll override (the policy said reject, but the regression was on a metric you’ve already deprecated; ship anyway with a note). Sometimes you’ll defer (the engine said ship, but you want a second eval on a different dataset before committing). Log what you decided and why.

From the UI: open the experiment, click the disposition button (Ship / Iterate / Reject), and write the rationale in the modal that opens. Or from Claude Code:
> Log a decision on the depth-estimator-swap experiment:
  Ship to 100%. Three of four benchmarks improved at >= 2%
  with confidence above 0.85. The MindCube regression was
  inside the noise floor.
  [Claude calls log_decision]
A good rationale captures three things: what (the disposition), why (the reasoning grounded in the actual results), and what’s next (any follow-up work it implies). Future-you reading this in six months will need all three.
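As one example, a logged decision that carries all three parts might look like the record below; the field names are illustrative assumptions, not the actual log_decision payload.

# Illustrative decision record; field names are assumptions.
decision = {
    "disposition": "ship",  # what
    "rationale": (          # why, grounded in the actual results
        "3 of 4 benchmarks improved >= 2% at confidence above 0.85; "
        "the MindCube regression was inside the noise floor."
    ),
    "follow_up": "Re-check MindCube on a larger eval set next round.",  # what comes next
}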

What just happened

You closed a loop. The experiment moves from running to its terminal state (shipped, rejected, or iterating). The rationale is captured alongside the metrics. The policy that governed the decision is recorded with the experiment, so anyone reading it later can replay the logic.

This is also when the project’s outcomes timeline updates. A new node lands on the chart, the project’s hit rate adjusts, and the decision becomes part of the project’s standing context. Future discovery recommendations and future cluster patterns reason over this result automatically.

Recap

You now have:
  • A variant scored against the locked eval template on Modal infrastructure
  • A side-by-side comparison with deltas and confidence per metric
  • The decision policy evaluated against the actuals, predicate by predicate
  • A logged decision with rationale
  • A closed-loop experiment that’s now part of the project’s history

Tips

  • A 3% improvement at 0.4 confidence is weaker evidence than a 1% improvement at 0.95 confidence. The policy locked in the previous setup tutorial should already gate on confidence. If you find yourself wanting to ship despite low confidence, that’s a sign to re-run with a larger evaluation set rather than override the policy.
  • The discipline that makes pre-committed criteria valuable is not changing them after the fact. If a result makes you want to adjust the policy, file that as a separate change. Let the current experiment ship or reject under the policy that was in place when it was scoped, then update the policy for the next round.
  • Sometimes the right call differs from the engine’s classification. Write down why in the rationale field. Six months from now, an override without explanation looks like an unexplained anomaly in the project’s outcomes record.

Next

Stay in the loop

Read the patterns across your experiments, the decisions behind them, and the directions paying off.

Run another experiment

Loop back to scoping with a new recommendation.

Series overview

Full series arc