An eval template and a decision policy are the two project-level artifacts that fix how progress gets measured before any variant exists. The template says how to score; the policy says what counts as progress. Both are committed ahead of time so the bar is the same for every experiment under the project, and so a result can’t be graded against a goalpost that moved after the numbers landed.
This page documents the capability. For the step-by-step setup, see Define how progress gets measured.
Eval templates
An eval template is a saved, project-level spec for scoring a variant against a baseline. It declares the runtime, the baseline, and the metrics once, so every evaluation under the project uses the same harness. Locking a template doesn’t run anything; it captures the spec. Two runtimes are supported:
- Custom
- lighteval
With the Custom runtime, you point Remyx at an evaluation script that already lives in your repo. You provide:
- Eval script path: the path in your repo that Remyx invokes (e.g. docker/eval_stage/).
- Baseline asset: the model, dataset, or process that variants are compared against (e.g. a HuggingFace model slug like Qwen/Qwen2.5-VL-7B-Instruct).
Locking a template captures the spec; it does not launch an evaluation. To actually run the eval on compute, see Validation & compute below.
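As a rough illustration of what a locked spec holds, the sketch below models a template as an immutable record. The class name, field names, the `accuracy` metric, and the rule that only the Custom runtime needs a script path are all assumptions made for the example, not Remyx's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Hypothetical model of a template spec; the real schema may differ.
@dataclass(frozen=True)  # frozen mirrors "locked": fields cannot be reassigned
class EvalTemplate:
    runtime: str                             # "custom" or "lighteval"
    baseline_asset: str                      # e.g. a HuggingFace model slug
    metrics: Tuple[str, ...]                 # metric names scored for every variant
    eval_script_path: Optional[str] = None   # assumed to apply to "custom" only

    def __post_init__(self) -> None:
        if self.runtime not in ("custom", "lighteval"):
            raise ValueError(f"unsupported runtime: {self.runtime}")
        if self.runtime == "custom" and not self.eval_script_path:
            raise ValueError("the custom runtime needs an eval script path")


# Creating the record captures the spec; nothing is executed here.
template = EvalTemplate(
    runtime="custom",
    baseline_asset="Qwen/Qwen2.5-VL-7B-Instruct",
    metrics=("accuracy",),  # hypothetical metric name
    eval_script_path="docker/eval_stage/",
)
```

Making the record immutable is one way to express that every experiment under the project is scored against the same spec.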
Decision policy
The decision policy is a per-disposition rule builder. For each disposition you define one rule, and a rule is a combinator (all or any) over a set of conditions:
| Disposition | What its rule defines |
|---|---|
| ship | What the variant must achieve to count as progress worth shipping. |
| reject | What disqualifies the variant, typically a no_regression condition on each metric you’d be unhappy to see drop. |
iterate is the automatic fallback: when neither the ship rule nor the reject rule matches, the run classifies as iterate.
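The classification logic can be sketched as follows, treating each condition as an opaque predicate over a metrics dictionary. The function names and metric keys are hypothetical, and one detail is an assumption: this page does not say which rule wins when both match, so the sketch checks reject first.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]
Condition = Callable[[Metrics], bool]


def make_rule(combinator: str, conditions: List[Condition]) -> Condition:
    """Combine conditions with 'all' or 'any' into a single rule."""
    if combinator not in ("all", "any"):
        raise ValueError(f"unknown combinator: {combinator}")
    combine = all if combinator == "all" else any
    return lambda metrics: combine(cond(metrics) for cond in conditions)


def classify(metrics: Metrics, ship_rule: Condition, reject_rule: Condition) -> str:
    # Assumption: reject takes precedence if both rules match.
    if reject_rule(metrics):
        return "reject"
    if ship_rule(metrics):
        return "ship"
    return "iterate"  # fallback when neither rule matches


# Hypothetical metric keys, for illustration only.
ship = make_rule("all", [
    lambda m: m["accuracy_delta"] >= 0.02,
    lambda m: m["delta_confidence"] >= 0.80,
])
reject = make_rule("any", [lambda m: m["accuracy_delta"] < 0.0])

print(classify({"accuracy_delta": 0.03, "delta_confidence": 0.9}, ship, reject))
print(classify({"accuracy_delta": 0.01, "delta_confidence": 0.9}, ship, reject))
```

The first call satisfies every ship condition; the second satisfies neither rule and falls through to iterate.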
Conditions are predicates over the experiment’s aggregated metrics and a few standard fields:
- Metric conditions: a metric from your locked template, an operator (>=, >, improves_by, and so on), a threshold, and an optional confidence floor.
- Field conditions: gates on fields like delta_confidence, validation_status, and sample_size (e.g. delta_confidence >= 0.80).
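A minimal sketch of the two condition kinds is shown below. The class names and the layout of the experiment dictionary (a per-metric `value` and `confidence`) are assumptions for illustration. Only `>=` and `>` are implemented; `improves_by` is left out because its exact semantics are not defined on this page.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Only operators with unambiguous meaning; improves_by is deliberately omitted.
OPERATORS = {">=": operator.ge, ">": operator.gt}


@dataclass
class MetricCondition:
    metric: str
    op: str
    threshold: float
    confidence_floor: Optional[float] = None  # optional, as described above

    def holds(self, experiment: Dict[str, Any]) -> bool:
        # Hypothetical layout: metrics[name] = {"value": ..., "confidence": ...}
        entry = experiment["metrics"][self.metric]
        if not OPERATORS[self.op](entry["value"], self.threshold):
            return False
        if self.confidence_floor is None:
            return True
        return entry["confidence"] >= self.confidence_floor


@dataclass
class FieldCondition:
    field: str
    op: str
    value: float

    def holds(self, experiment: Dict[str, Any]) -> bool:
        return OPERATORS[self.op](experiment[self.field], self.value)


experiment = {
    "metrics": {"accuracy": {"value": 0.81, "confidence": 0.9}},
    "delta_confidence": 0.85,
    "sample_size": 500,
}

print(MetricCondition("accuracy", ">=", 0.80).holds(experiment))
print(FieldCondition("delta_confidence", ">=", 0.80).holds(experiment))
```

With a confidence floor set, a metric condition fails when the threshold is met but the attached confidence is too low.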
Visualization only
This is deliberate. The policy tells you whether the evidence meets a pre-committed standard; it does not make the call about whether a change actually moved a business outcome. Automated disposition is explicitly deferred: the system automates the toil up to the judgment and stops there. See Continuous Experimentation for why this gate stays human.
Validation & compute
When an eval template runs, Remyx can launch the run on a connected compute provider instead of asking you to run it by hand. This is the validation path: it executes the template’s harness, then writes the results back onto the experiment. Supported providers:
| Provider | Status |
|---|---|
| Modal | Supported |
| Weights & Biases Launch | Supported |
| MLflow | Adapter stubbed |
observed_delta and delta_confidence land on the experiment and feed the decision-policy predicates described above.
Provision the run
Remyx submits the locked template’s harness to your connected provider (Modal or W&B Launch).
Run on your compute
The eval executes on the provider against the template’s baseline asset and metrics.
Results post back
The provider posts results to Remyx over a signed webhook.
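This page does not specify the signing scheme. As a generic illustration of how a signed webhook is usually checked, the sketch below assumes an HMAC-SHA256 signature over the raw request body with a shared secret; the function names, secret, and payload are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected, signature)


# Hypothetical secret and payload, for illustration only.
secret = b"shared-secret"
body = b'{"observed_delta": 0.03, "delta_confidence": 0.85}'
signature = sign(secret, body)

print(verify(secret, body, signature))
print(verify(secret, body + b" ", signature))
```

A receiver would reject any payload whose signature does not verify before writing results anywhere.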
observed_delta, delta_confidence, and related fields are written onto the experiment.
Where to set these
Both artifacts are settable two ways:
- Web UI
- MCP tools
In the web UI, use the Project Settings panels: Allowed Metrics, Evaluation (the eval template), and Decision Policy (the rule builder). Results and the rendered policy appear on the experiment in Outcomes.
Related
- Outcomes: where results and the rendered decision policy appear on each experiment
- Projects: where eval templates and decision policies are declared
- Continuous Experimentation: why the decision stays a human gate
- Define how progress gets measured: step-by-step setup of the template and policy