# Evaluation & Decision Policy

> Commit how progress gets measured before a result exists: eval templates declare the harness, decision policy renders the bar against the evidence, and validation runs the eval on connected compute.

**Nav:** Experiments > Evaluation | **URL:** [`/projects`](https://engine.remyx.ai/projects) > Project Settings

An eval template and a decision policy are the two project-level artifacts that fix *how progress gets measured* before any variant exists. The template says **how to score**; the policy says **what counts as progress**. Both are committed ahead of time so the bar is the same for every experiment under the project, and so a result can't be graded against a goalpost that moved after the numbers landed.

This page documents the capability. For the step-by-step setup, see [Define how progress gets measured](/tutorials/get-started/define-how-progress-gets-measured).

***

## Eval templates

An eval template is a saved, project-level spec for scoring a variant against a baseline. It declares the runtime, the baseline, and the metrics once, so every evaluation under the project uses the same harness. Locking a template doesn't run anything; it captures the spec.

Two runtimes are supported:

<Tabs>
  <Tab title="Custom">
    Point Remyx at an evaluation script that already lives in your repo. You provide:

    * **Eval script path**: the path in your repo Remyx invokes (e.g. `docker/eval_stage/`).
    * **Baseline asset**: the model, dataset, or process that variants are compared against (e.g. a HuggingFace model slug like `Qwen/Qwen2.5-VL-7B-Instruct`).

    Your script defines what runs and which metrics it emits. Use this when your project already ships its own evaluation harness.
  </Tab>

  <Tab title="lighteval">
    Pick a task from a whitelist of standard benchmarks (e.g. `gsm8k`, `mmlu`). The task's canonical metric (`exact_match`, `accuracy`) auto-populates, and you don't maintain an eval script in your repo.

    Use this when the metric you care about is one of the standard tasks the runtime ships with.
  </Tab>
</Tabs>

The template draws its metrics from the project's **allowed metrics**, the shared vocabulary every experiment, template, and decision rule references. Once locked, the template autocompletes those metric names into the decision-policy editor, which keeps rule fields free of typos that wouldn't surface until a run.

<Note>
  Locking a template captures the spec; it does not launch an evaluation. To actually run the eval on compute, see [Validation & compute](#validation--compute) below.
</Note>
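
To make the shape of a template concrete, here is a minimal Python sketch of the information each runtime asks for and a consistency check against the allowed metrics. It is an illustration only, not Remyx's actual schema: the `EvalTemplate` class, its field names, the `validate_template` helper, and the sample `ALLOWED_METRICS` set are all hypothetical, and the sketch assumes the lighteval runtime needs only a task and its metrics.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Hypothetical vocabulary; in Remyx this comes from the project's allowed metrics.
ALLOWED_METRICS: Set[str] = {"exact_match", "accuracy"}


@dataclass(frozen=True)
class EvalTemplate:
    """Hypothetical in-memory picture of a template spec (not the Remyx schema)."""
    runtime: str                            # "custom" or "lighteval"
    metrics: Tuple[str, ...]                # names drawn from the allowed metrics
    baseline_asset: Optional[str] = None    # used by the custom runtime
    eval_script_path: Optional[str] = None  # used by the custom runtime
    task: Optional[str] = None              # used by the lighteval runtime


def validate_template(template: EvalTemplate, allowed: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the spec is consistent."""
    problems: List[str] = []
    if template.runtime == "custom":
        if not template.eval_script_path:
            problems.append("custom runtime needs an eval script path")
        if not template.baseline_asset:
            problems.append("custom runtime needs a baseline asset")
    elif template.runtime == "lighteval":
        if not template.task:
            problems.append("lighteval runtime needs a task")
    else:
        problems.append(f"unknown runtime: {template.runtime}")

    # Metric names outside the shared vocabulary are the typos worth catching early.
    unknown = [m for m in template.metrics if m not in allowed]
    if unknown:
        problems.append(f"metrics not in allowed metrics: {unknown}")
    return problems


if __name__ == "__main__":
    template = EvalTemplate(
        runtime="custom",
        metrics=("accuracy",),
        baseline_asset="Qwen/Qwen2.5-VL-7B-Instruct",
        eval_script_path="docker/eval_stage/",
    )
    print(validate_template(template, ALLOWED_METRICS))
```

Checking metric names at the moment the spec is captured mirrors the point above: a misspelled metric is cheaper to catch at lock time than after a run.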

***

## Decision policy

The decision policy is a per-disposition rule builder. For each disposition you define one rule, and a rule is a combinator (`all` or `any`) over a set of conditions:

| Disposition | What its rule defines                                                                                             |
| ----------- | ----------------------------------------------------------------------------------------------------------------- |
| `ship`      | What the variant must achieve to count as progress worth shipping.                                                |
| `reject`    | What disqualifies the variant, typically a `no_regression` condition on each metric you'd be unhappy to see drop. |

`iterate` is the automatic fallback: when neither the `ship` rule nor the `reject` rule matches, the run classifies as iterate.

Conditions are predicates over the experiment's aggregated metrics and a few standard fields:

* **Metric conditions**: a metric from your locked template, an operator (`>=`, `>`, `improves_by`, and so on), a threshold, and an optional confidence floor.
* **Field conditions**: gates on fields like `delta_confidence`, `validation_status`, and `sample_size` (e.g. `delta_confidence >= 0.80`).
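
The rule structure above (one rule per disposition, each a combinator over conditions, with `iterate` as the fallback) can be sketched in a few lines of Python. This is a hedged illustration, not Remyx's implementation. The `Condition` and `Rule` classes and the `classify` function are hypothetical names, and the `<` operator is an assumption. `improves_by` and `no_regression` are left out because this page doesn't specify their semantics. The order in which `reject` and `ship` are checked when both match is also an assumption of the sketch.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping

# Only ">=" and ">" are named on this page; "<" is an assumption for the example.
OPS: Dict[str, Callable[[float, float], bool]] = {
    ">=": operator.ge,
    ">": operator.gt,
    "<": operator.lt,
}


@dataclass
class Condition:
    field: str        # a metric name or a standard field such as delta_confidence
    op: str           # key into OPS
    threshold: float


@dataclass
class Rule:
    combinator: str   # "all" or "any"
    conditions: List[Condition]


def condition_met(cond: Condition, values: Mapping[str, float]) -> bool:
    """A condition on a field with no value yet is treated as not met."""
    if cond.field not in values:
        return False
    return OPS[cond.op](values[cond.field], cond.threshold)


def rule_matches(rule: Rule, values: Mapping[str, float]) -> bool:
    if rule.combinator not in ("all", "any"):
        raise ValueError(f"unknown combinator: {rule.combinator}")
    results = [condition_met(c, values) for c in rule.conditions]
    if not results:
        return False  # assumption: an empty rule never matches
    return all(results) if rule.combinator == "all" else any(results)


def classify(policy: Mapping[str, Rule], values: Mapping[str, float]) -> str:
    """Return the disposition whose rule matches, falling back to iterate."""
    # Assumption: reject is checked before ship when both rules would match.
    for disposition in ("reject", "ship"):
        rule = policy.get(disposition)
        if rule is not None and rule_matches(rule, values):
            return disposition
    return "iterate"


if __name__ == "__main__":
    policy = {
        "ship": Rule("all", [
            Condition("accuracy", ">=", 0.80),
            Condition("delta_confidence", ">=", 0.80),
        ]),
        "reject": Rule("any", [Condition("accuracy", "<", 0.70)]),
    }
    print(classify(policy, {"accuracy": 0.85, "delta_confidence": 0.90}))
```

The classification here is only what the policy would display for a run; it is not a decision taken on anyone's behalf.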

### Visualization only

<Warning>
  The decision policy is a **read-side visualization aid**. When results are in, Remyx marks each predicate as met, missed, or not-yet-measured against the experiment's aggregated metrics, so you can see whether the evidence clears the bar you set. It does **not** auto-dispose the experiment. A human still reviews the evidence and logs the decision (ship, iterate, or abandon) and why.
</Warning>

This is deliberate. The policy tells you whether the evidence meets a pre-committed standard; it does not make the call about whether a change actually moved a business outcome. Automated disposition is explicitly deferred: the system automates the toil up to the judgment and stops there. See [Continuous Experimentation](/concepts/continuous-experimentation) for why this gate stays human.
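
The met / missed / not-yet-measured marking described in the warning above can be pictured with a small Python sketch. It assumes each predicate is a simple field, operator, and threshold triple, and that a field with no value yet is reported as not yet measured. The function names and status strings are hypothetical, not Remyx's API. The sketch returns statuses only and never a disposition, matching the read-side role of the policy.

```python
import operator
from typing import Callable, Dict, List, Mapping, Optional, Tuple

OPS: Dict[str, Callable[[float, float], bool]] = {
    ">=": operator.ge,
    ">": operator.gt,
}

Predicate = Tuple[str, str, float]  # (field, operator, threshold)


def render_predicate(pred: Predicate, values: Mapping[str, Optional[float]]) -> str:
    """Mark one predicate as met, missed, or not yet measured."""
    field, op, threshold = pred
    value = values.get(field)
    if value is None:
        return "not_yet_measured"
    return "met" if OPS[op](value, threshold) else "missed"


def render_rule(preds: List[Predicate],
                values: Mapping[str, Optional[float]]) -> List[Tuple[str, str]]:
    """Return (field, status) pairs for display; no decision is made here."""
    return [(pred[0], render_predicate(pred, values)) for pred in preds]


if __name__ == "__main__":
    ship_rule = [
        ("accuracy", ">=", 0.80),
        ("delta_confidence", ">=", 0.80),
        ("sample_size", ">=", 500),
    ]
    measured = {"accuracy": 0.84, "delta_confidence": 0.72}
    for field, status in render_rule(ship_rule, measured):
        print(f"{field}: {status}")
```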

***

## Validation & compute

To run an eval template, Remyx can launch the evaluation on a connected compute provider so you don't have to run it by hand. This is the **validation** path: it executes the template's harness, then writes the results back onto the experiment.

Supported providers:

| Provider                    | Status          |
| --------------------------- | --------------- |
| **Modal**                   | Supported       |
| **Weights & Biases Launch** | Supported       |
| MLflow                      | Adapter stubbed |

Connect a provider from [Connectors](/platform/manage/connectors). When a validation run finishes, results flow back onto the experiment via a **signed webhook**. Fields like `observed_delta` and `delta_confidence` land on the experiment and feed the decision-policy predicates described above.

<Steps>
  <Step title="Provision the run">
    Remyx submits the locked template's harness to your connected provider (Modal or W\&B Launch).
  </Step>

  <Step title="Run on your compute">
    The eval executes on the provider against the template's baseline asset and metrics.
  </Step>

  <Step title="Results post back">
    The provider posts results to Remyx over a signed webhook. `observed_delta`, `delta_confidence`, and related fields are written onto the experiment.
  </Step>

  <Step title="Policy renders">
    The decision policy re-renders against the new metrics, and the team logs the decision.
  </Step>
</Steps>
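
This page doesn't specify how the webhook in the third step is signed. As a general illustration of what a signed webhook involves, the Python sketch below assumes an HMAC-SHA256 signature computed over the raw request body with a shared secret, which is a common scheme but not confirmed for Remyx. The function names and the payload shape are hypothetical; only the `observed_delta` and `delta_confidence` field names come from this page.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def sign(body: bytes, secret: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_and_parse(body: bytes, signature: str, secret: bytes) -> Dict[str, Any]:
    """Reject the payload unless the signature matches, then parse it as JSON."""
    expected = sign(body, secret)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature were correct.
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch")
    return json.loads(body)


if __name__ == "__main__":
    secret = b"example-shared-secret"  # placeholder, not a real credential
    body = json.dumps({"observed_delta": 0.03, "delta_confidence": 0.86}).encode()
    results = verify_and_parse(body, sign(body, secret), secret)
    print(results["delta_confidence"])
```

Verifying against the raw bytes, before any JSON parsing, matters in schemes like this: re-serializing a parsed payload can change whitespace or key order and invalidate an otherwise correct signature.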

***

## Where to set these

You can set both artifacts in two ways:

<Tabs>
  <Tab title="Web UI">
    Project Settings panels: **Allowed Metrics**, **Evaluation** (the eval template), and **Decision Policy** (the rule builder). Results and the rendered policy appear on the experiment in [Outcomes](/platform/experiments/outcomes).
  </Tab>

  <Tab title="MCP tools">
    From a connected agent, the decision policy is managed with `set_decision_policy`, `get_decision_policy`, and `clear_decision_policy`. Validation runs are driven with `provision_validation`, `check_validation_status`, and `get_validation_results`.
  </Tab>
</Tabs>
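
From an agent's side, the three validation tools named above suggest a provision, poll, fetch sequence. The Python sketch below shows that flow under stated assumptions. The tool names come from this page, but the `call_tool` callable, the `experiment_id` argument, and the `status` values (`completed`, `failed`) are hypothetical stand-ins for whatever your MCP client and the tools' real schemas use.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical adapter: takes a tool name and arguments, returns the tool's result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_validation(call_tool: ToolCaller,
                   experiment_id: str,
                   poll_seconds: float = 30.0,
                   max_polls: int = 120) -> Dict[str, Any]:
    """Provision a validation run, poll until it finishes, and return the results."""
    args = {"experiment_id": experiment_id}  # assumed argument name
    call_tool("provision_validation", args)

    for _ in range(max_polls):
        status = call_tool("check_validation_status", args)
        state = status.get("status")          # assumed response field
        if state == "completed":              # assumed status value
            return call_tool("get_validation_results", args)
        if state == "failed":                 # assumed status value
            raise RuntimeError(f"validation failed for {experiment_id}")
        time.sleep(poll_seconds)

    raise TimeoutError(f"validation still running after {max_polls} polls")


if __name__ == "__main__":
    # Stand-in for a real MCP client so the sketch runs on its own.
    states = iter(["running", "completed"])

    def fake_call_tool(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
        if name == "check_validation_status":
            return {"status": next(states)}
        if name == "get_validation_results":
            return {"observed_delta": 0.03, "delta_confidence": 0.86}
        return {}

    print(run_validation(fake_call_tool, "exp-123", poll_seconds=0))
```

Bounding the loop with `max_polls` keeps an agent from waiting indefinitely on a run that never reports back.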

***

## Related

<CardGroup cols={2}>
  <Card title="Outcomes" icon="chart-column" href="/platform/experiments/outcomes">
    Where results and the rendered decision policy appear on each experiment
  </Card>

  <Card title="Projects" icon="folder" href="/platform/manage/projects">
    Where eval templates and decision policies are declared
  </Card>

  <Card title="Continuous Experimentation" icon="arrows-rotate" href="/concepts/continuous-experimentation">
    Why the decision stays a human gate
  </Card>

  <Card title="Define how progress gets measured" icon="ruler" href="/tutorials/get-started/define-how-progress-gets-measured">
    Step-by-step setup of the template and policy
  </Card>
</CardGroup>
