Evaluation & Decision Policy

Nav: Manage > Projects > project (Pro sections) | URL: /projects

An eval template and a decision policy are two of the three Pro gating artifacts on a Project (the third is target metrics). Together they fix how progress gets measured before any variant exists: the template says how to score; the policy says what counts as progress (Ship, Reject, or Iterate). Both are committed ahead of time so agent drafts are graded against the same bar as human work, and so a result can’t be graded against a goalpost that moved after the numbers landed.

This page documents the capability. For the step-by-step setup, see Define how progress gets measured.

Eval templates

An eval template is a saved, project-level spec for scoring a variant against a baseline. It declares the runtime, the baseline, and the metrics once, so every evaluation under the project uses the same harness. Locking a template doesn’t run anything; it captures the spec. Two runtimes are supported:

Custom

Point Remyx at an evaluation script that already lives in your repo. You provide:

Eval script path: the path in your repo that Remyx invokes (e.g. docker/eval_stage/).
Baseline asset: the model, dataset, or process that variants are compared against (e.g. a HuggingFace model slug like Qwen/Qwen2.5-VL-7B-Instruct).

Your script defines what runs and which metrics it emits. Use this when your project already ships its own evaluation harness.

lighteval

Pick a task from a whitelist of standard benchmarks (e.g. gsm8k, mmlu). The task’s canonical metric (e.g. exact_match or accuracy) auto-populates, and you don’t maintain an eval script in your repo. Use this when the metric you care about is one of the standard tasks the runtime ships with.

The template draws its metrics from the project’s allowed metrics, the shared vocabulary every experiment, template, and decision rule references. Once locked, the template autocompletes those metric names into the decision-policy editor, which keeps rule fields free of typos that wouldn’t surface until a run.
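As a rough illustration of what a template spec captures, the sketch below models the two runtimes and checks metric names against a project’s allowed metrics. The class, field names, and the ALLOWED_METRICS set are hypothetical and chosen for this example; they are not Remyx’s actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Hypothetical allowed-metrics vocabulary for one project (assumption).
ALLOWED_METRICS: Set[str] = {"exact_match", "accuracy"}


@dataclass(frozen=True)
class EvalTemplate:
    """Illustrative spec only; field names are assumptions."""
    runtime: str                             # "custom" or "lighteval"
    metrics: Tuple[str, ...]                 # metric names the template scores
    baseline_asset: Optional[str] = None     # used by the custom runtime
    eval_script_path: Optional[str] = None   # used by the custom runtime
    task: Optional[str] = None               # used by the lighteval runtime


def validate_template(template: EvalTemplate, allowed: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the spec is consistent."""
    problems: List[str] = []
    if template.runtime == "custom":
        if not template.eval_script_path:
            problems.append("custom runtime needs an eval script path")
        if not template.baseline_asset:
            problems.append("custom runtime needs a baseline asset")
    elif template.runtime == "lighteval":
        if not template.task:
            problems.append("lighteval runtime needs a task")
    else:
        problems.append(f"unknown runtime: {template.runtime}")
    # Every metric must come from the project's shared vocabulary.
    for name in template.metrics:
        if name not in allowed:
            problems.append(f"metric not in allowed metrics: {name}")
    return problems
```

Checking names at spec time, as this sketch does, mirrors the point above: a misspelled metric is caught before a run rather than during one.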

Locking a template captures the spec; it does not launch an evaluation. To actually run the eval on compute, see Validation & compute below.

Decision policy

The decision policy is a per-disposition rule builder. For each disposition you define one rule, and a rule is a combinator (all or any) over a set of conditions:

Disposition	What its rule defines
`ship`	What the variant must achieve to count as progress worth shipping.
`reject`	What disqualifies the variant, typically a `no_regression` condition on each metric you’d be unhappy to see drop.

iterate is the automatic fallback: when neither the ship rule nor the reject rule matches, the run classifies as iterate.

Conditions are predicates over the experiment’s aggregated metrics and a few standard fields:

Metric conditions: a metric from your locked template, an operator (>=, >, improves_by, and so on), a threshold, and an optional confidence floor.
Field conditions: gates on fields like delta_confidence, validation_status, and sample_size (e.g. delta_confidence >= 0.80).
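The rule structure described above (one rule per disposition, each a combinator over conditions, with iterate as the fallback) can be sketched as follows. This is a minimal illustration under stated assumptions, not Remyx’s implementation: the class names are hypothetical, only plain comparison operators are modeled (improves_by and no_regression are left out because their semantics are not specified here), and checking the reject rule before the ship rule is an assumption.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

# Only simple comparisons are modeled in this sketch.
OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


@dataclass
class Condition:
    name: str         # a metric name or a field such as delta_confidence
    op: str           # one of the keys in OPS
    threshold: float


@dataclass
class Rule:
    combinator: str   # "all" or "any"
    conditions: List[Condition]


def condition_met(cond: Condition, values: Dict[str, float]) -> bool:
    # Assumption: a value that has not been measured counts as not met.
    if cond.name not in values:
        return False
    return OPS[cond.op](values[cond.name], cond.threshold)


def rule_matches(rule: Rule, values: Dict[str, float]) -> bool:
    results = [condition_met(c, values) for c in rule.conditions]
    if not results:
        return False  # an empty rule never matches in this sketch
    return all(results) if rule.combinator == "all" else any(results)


def classify(ship: Rule, reject: Rule, values: Dict[str, float]) -> str:
    # Assumption: reject takes precedence if both rules match.
    if rule_matches(reject, values):
        return "reject"
    if rule_matches(ship, values):
        return "ship"
    return "iterate"  # fallback when neither rule matches
```

For example, a ship rule of all(accuracy >= 0.8, delta_confidence >= 0.80) paired with a reject rule of any(observed_delta < 0) classifies a run with high accuracy but low confidence as iterate.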

Visualization only

The decision policy is a read-side visualization aid. When results are in, Remyx marks each predicate as met, missed, or not-yet-measured against the experiment’s aggregated metrics, so you can see whether the evidence clears the bar you set. It does not auto-dispose the experiment. A human still reviews the evidence and logs the decision (ship, iterate, or abandon) and why.
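The three-state marking can be pictured with a small sketch. It assumes each predicate is a (name, operator, threshold) triple and that a missing value means not-yet-measured; the function names and the tuple shape are hypothetical. Note that it only labels predicates and never returns a disposition, in keeping with the read-side role described above.

```python
import operator
from typing import Dict, List, Tuple

OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}

Predicate = Tuple[str, str, float]  # (name, operator, threshold), assumed shape


def predicate_status(name: str, op: str, threshold: float,
                     metrics: Dict[str, float]) -> str:
    """Label one predicate as met, missed, or not-yet-measured."""
    value = metrics.get(name)
    if value is None:
        return "not-yet-measured"
    return "met" if OPS[op](value, threshold) else "missed"


def render_policy(predicates: List[Predicate],
                  metrics: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pair each predicate's text with its status; no decision is made here."""
    return [
        (f"{name} {op} {threshold}", predicate_status(name, op, threshold, metrics))
        for name, op, threshold in predicates
    ]
```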

This is deliberate. The policy tells you whether the evidence meets a pre-committed standard; it does not make the call about whether a change actually moved a business outcome. Automated disposition is explicitly deferred: the system automates the toil up to the judgment and stops there. See Continuous Experimentation for why this gate stays human.

Validation & compute

When an eval template runs, Remyx can launch the run on a connected compute provider instead of asking you to run it by hand. This is the validation path: it executes the template’s harness, then writes the results back onto the experiment. Supported providers:

Provider	Status
Modal	Supported
Weights & Biases Launch	Supported
MLflow	Adapter stubbed

Connect a provider from Connectors. When a validation run finishes, results flow back onto the experiment via a signed webhook. Fields like observed_delta and delta_confidence land on the experiment and feed the decision-policy predicates described above.

Provision the run: Remyx submits the locked template’s harness to your connected provider (Modal or W&B Launch).
Run on your compute: the eval executes on the provider against the template’s baseline asset and metrics.
Results post back: the provider posts results to Remyx over a signed webhook. observed_delta, delta_confidence, and related fields are written onto the experiment.
Policy renders: the decision policy re-renders against the new metrics, and the team logs the decision.
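To make the signed-webhook step concrete, here is a generic sketch of verifying a signed payload before trusting its fields. The signing scheme is not specified on this page, so HMAC-SHA256 over the raw body with a shared secret is an assumption, as are the function names; the real scheme may differ.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_and_parse(secret: bytes, body: bytes, signature: str) -> Dict[str, Any]:
    """Reject the payload unless the signature matches, then parse the JSON."""
    expected = sign(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch")
    return json.loads(body)
```

Verifying against the raw bytes, before any JSON parsing, matters in a design like this: re-serializing the payload can change whitespace or key order and invalidate the signature.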

Where to set these

Both artifacts can be set in two ways:

Web UI

Use the project’s Pro sections: Target metrics, Evaluation (the eval template), and Decision policy. The rendered policy (Ship / Reject / Iterate) appears against the evidence at the gate, both on the PR and in the project’s experiment history.

MCP tools

From a connected agent, the decision policy is managed with set_decision_policy, get_decision_policy, and clear_decision_policy. Validation runs are driven with provision_validation, check_validation_status, and get_validation_results.
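One way an agent might chain the three validation tools is sketched below. Only the tool names come from this page; the call_tool callable, the experiment_id argument, and the "state" field with its values are all hypothetical placeholders, so check the actual tool schemas before relying on any of them.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical signature: call_tool(tool_name, arguments) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_validation(call_tool: ToolCaller, experiment_id: str,
                   poll_seconds: float = 5.0, max_polls: int = 60) -> Dict[str, Any]:
    """Provision a validation run, poll until it finishes, then fetch results."""
    args = {"experiment_id": experiment_id}  # argument name is an assumption
    call_tool("provision_validation", args)
    for _ in range(max_polls):
        status = call_tool("check_validation_status", args)
        # "state" and its values are assumed for illustration.
        if status.get("state") in ("succeeded", "failed"):
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("validation did not finish within the polling budget")
    if status["state"] == "failed":
        raise RuntimeError("validation run failed")
    return call_tool("get_validation_results", args)
```

Passing the tool caller in as a parameter keeps the sketch independent of any particular MCP client library.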

Related

Projects: where eval templates and decision policies are declared (Pro).
Inbox: where Decide cards carry the evidence to the human gate.
Continuous Experimentation: why the decision stays a human gate.
Define how progress gets measured: step-by-step setup of the template and policy.
