An eval template and a decision policy are the two project-level artifacts that fix how progress gets measured before any variant exists. The template says how to score; the policy says what counts as progress. Both are committed ahead of time so the bar is the same for every experiment under the project, and so a result can’t be graded against a goalpost that moved after the numbers landed.
This page documents the capability. For the step-by-step setup, see Define how progress gets measured.

Eval templates

An eval template is a saved, project-level spec for scoring a variant against a baseline. It declares the runtime, the baseline, and the metrics once, so every evaluation under the project uses the same harness. Locking a template doesn’t run anything; it captures the spec. Two runtimes are supported:
Point Remyx at an evaluation script that already lives in your repo. You provide:
  • Eval script path: the path in your repo Remyx invokes (e.g. docker/eval_stage/).
  • Baseline asset: the model, dataset, or process variants are compared against (e.g. a HuggingFace model slug like Qwen/Qwen2.5-VL-7B-Instruct).
Your script defines what runs and which metrics it emits. Use this when your project already ships its own evaluation harness.
The template draws its metrics from the project’s allowed metrics, the shared vocabulary every experiment, template, and decision rule references. Once locked, the template autocompletes those metric names into the decision-policy editor, which keeps rule fields free of typos that wouldn’t surface until a run.
Locking a template captures the spec; it does not launch an evaluation. To actually run the eval on compute, see Validation & compute below.
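As a rough illustration of what a locked template captures, the spec can be pictured as an immutable record checked against the project's allowed metrics. This is a sketch under assumptions, not Remyx's schema: the class name EvalTemplate, its field names, and the ALLOWED_METRICS values are hypothetical, and only the example script path and baseline slug come from this page.

```python
from dataclasses import dataclass

# Hypothetical allowed-metrics vocabulary; a real project defines its own.
ALLOWED_METRICS = {"accuracy", "f1", "latency_ms"}


@dataclass(frozen=True)  # frozen: a locked template is not edited in place
class EvalTemplate:
    eval_script_path: str      # path in the repo that gets invoked
    baseline_asset: str        # what variants are compared against
    metrics: tuple[str, ...]   # names drawn from the allowed metrics

    def __post_init__(self):
        # Reject metric names outside the shared vocabulary at lock time,
        # rather than letting a typo surface during a run.
        unknown = [m for m in self.metrics if m not in ALLOWED_METRICS]
        if unknown:
            raise ValueError(f"metrics not in allowed set: {unknown}")


# Constructing the record runs nothing; it only captures the spec.
template = EvalTemplate(
    eval_script_path="docker/eval_stage/",
    baseline_asset="Qwen/Qwen2.5-VL-7B-Instruct",
    metrics=("accuracy", "latency_ms"),
)
```

Validating names when the record is created mirrors the reason the policy editor autocompletes from the template: a misspelled metric is caught before any evaluation is launched.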

Decision policy

The decision policy is a per-disposition rule builder. For each disposition you define one rule, and a rule is a combinator (all or any) over a set of conditions:
Disposition | What its rule defines
ship | What the variant must achieve to count as progress worth shipping.
reject | What disqualifies the variant, typically a no_regression condition on each metric you’d be unhappy to see drop.
iterate is the automatic fallback: when neither the ship rule nor the reject rule matches, the run classifies as iterate.
Conditions are predicates over the experiment’s aggregated metrics and a few standard fields:
  • Metric conditions: a metric from your locked template, an operator (>=, >, improves_by, and so on), a threshold, and an optional confidence floor.
  • Field conditions: gates on fields like delta_confidence, validation_status, and sample_size (e.g. delta_confidence >= 0.80).
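The rule structure described above (a combinator over conditions, with iterate as the fallback) can be sketched in a few lines. Everything here is illustrative: the Condition and Rule classes, the classify function, and the sample metric names are hypothetical, only plain comparison operators are modeled (improves_by and no_regression are left out because their exact semantics are not specified on this page), and checking the reject rule before the ship rule is an assumption, since the page does not say which wins when both match.

```python
import operator
from dataclasses import dataclass

# Only simple comparisons are modeled in this sketch.
OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


@dataclass
class Condition:
    field: str        # a metric name or a standard field such as delta_confidence
    op: str           # one of the keys in OPS
    threshold: float

    def holds(self, values: dict) -> bool:
        # A field with no value yet cannot satisfy the condition.
        return self.field in values and OPS[self.op](values[self.field], self.threshold)


@dataclass
class Rule:
    combinator: str   # "all" or "any"
    conditions: list

    def matches(self, values: dict) -> bool:
        results = [c.holds(values) for c in self.conditions]
        return all(results) if self.combinator == "all" else any(results)


def classify(ship: Rule, reject: Rule, values: dict) -> str:
    # Assumption: reject is checked first; the page does not state precedence.
    if reject.matches(values):
        return "reject"
    if ship.matches(values):
        return "ship"
    return "iterate"  # fallback when neither rule matches


ship_rule = Rule("all", [Condition("accuracy", ">=", 0.85),
                         Condition("delta_confidence", ">=", 0.80)])
reject_rule = Rule("any", [Condition("latency_ms", ">", 200)])
```

With these rules, a run reporting accuracy 0.86, delta_confidence 0.90, and latency_ms 120 classifies as ship; dropping accuracy to 0.80 yields iterate; raising latency_ms to 250 yields reject.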

Visualization only

The decision policy is a read-side visualization aid. When results are in, Remyx marks each predicate as met, missed, or not-yet-measured against the experiment’s aggregated metrics, so you can see whether the evidence clears the bar you set. It does not auto-dispose the experiment. A human still reviews the evidence and logs the decision (ship, iterate, or abandon) and why.
This is deliberate. The policy tells you whether the evidence meets a pre-committed standard; it does not make the call about whether a change actually moved a business outcome. Automated disposition is explicitly deferred: the system automates the toil up to the judgment and stops there. See Continuous Experimentation for why this gate stays human.
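The three-way marking of a predicate can be pictured as a small pure function over the aggregated metrics. This is a minimal sketch, assuming a "greater than or equal" predicate and a plain dict of metrics; the Status enum and predicate_status function are hypothetical names, not part of Remyx.

```python
from enum import Enum


class Status(Enum):
    MET = "met"
    MISSED = "missed"
    NOT_YET_MEASURED = "not-yet-measured"


def predicate_status(metrics: dict, name: str, threshold: float) -> Status:
    """Mark one `name >= threshold` predicate against aggregated metrics."""
    value = metrics.get(name)
    if value is None:
        # No result has landed for this metric yet, so it is neither met nor missed.
        return Status.NOT_YET_MEASURED
    return Status.MET if value >= threshold else Status.MISSED
```

The function only reports status. It returns nothing resembling a disposition, which matches the split described above: the rendering shows whether the evidence clears the bar, and a person logs the decision.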

Validation & compute

When an eval template runs, Remyx can launch the run on a connected compute provider instead of asking you to run it by hand. This is the validation path: it executes the template’s harness, then writes the results back onto the experiment. Supported providers:
Provider | Status
Modal | Supported
Weights & Biases Launch | Supported
MLflow | Adapter stubbed
Connect a provider from Connectors. When a validation run finishes, results flow back onto the experiment via a signed webhook. Fields like observed_delta and delta_confidence land on the experiment and feed the decision-policy predicates described above.
1. Provision the run: Remyx submits the locked template’s harness to your connected provider (Modal or W&B Launch).
2. Run on your compute: The eval executes on the provider against the template’s baseline asset and metrics.
3. Results post back: The provider posts results to Remyx over a signed webhook. observed_delta, delta_confidence, and related fields are written onto the experiment.
4. Policy renders: The decision policy re-renders against the new metrics, and the team logs the decision.
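To make the "signed webhook" step concrete, here is one common way a receiver can check a signed payload before trusting its fields. The scheme is an assumption: this page does not document how Remyx signs webhooks, so the HMAC-SHA256 hex signature, the shared secret, and the function names sign and verify_and_parse are all hypothetical. Only the field names observed_delta and delta_confidence come from the text above.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    # Assumed scheme: hex-encoded HMAC-SHA256 over the raw request body.
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_and_parse(secret: bytes, body: bytes, signature: str) -> dict:
    expected = sign(secret, body)
    # compare_digest avoids leaking how many characters matched via timing.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch")
    # Parse only after the signature has been verified.
    return json.loads(body)


secret = b"example-shared-secret"  # placeholder value for the sketch
body = json.dumps({"observed_delta": 0.03, "delta_confidence": 0.91}).encode()
payload = verify_and_parse(secret, body, sign(secret, body))
```

Verifying against the raw bytes, before any JSON parsing, means a payload altered in transit is rejected before its values can reach the experiment record.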

Where to set these

Both artifacts are settable two ways:
Project Settings panels: Allowed Metrics, Evaluation (the eval template), and Decision Policy (the rule builder). Results and the rendered policy appear on the experiment in Outcomes.

Related pages:
  • Outcomes: Where results and the rendered decision policy appear on each experiment.
  • Projects: Where eval templates and decision policies are declared.
  • Continuous Experimentation: Why the decision stays a human gate.
  • Define how progress gets measured: Step-by-step setup of the template and policy.