> ## Documentation Index
> Fetch the complete documentation index at: https://docs.remyx.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Evaluation & Decision Policy

> Commit how progress gets measured before a result exists: eval templates declare the harness, decision policy renders the bar against the evidence, and validation runs the eval on connected compute.

**Nav:** Experiments > Evaluation | **URL:** [`/projects`](https://engine.remyx.ai/projects) > Project Settings

An eval template and a decision policy are the two project-level artifacts that fix *how progress gets measured* before any variant exists. The template says **how to score**; the policy says **what counts as progress**. Both are committed ahead of time so the bar is the same for every experiment under the project, and so a result can't be graded against a goalpost that moved after the numbers landed.

This page documents the capability. For the step-by-step setup, see [Define how progress gets measured](/tutorials/get-started/define-how-progress-gets-measured).

***

## Eval templates

An eval template is a saved, project-level spec for scoring a variant against a baseline. It declares the runtime, the baseline, and the metrics once, so every evaluation under the project uses the same harness. Locking a template doesn't run anything; it captures the spec.

Two runtimes are supported:

Point Remyx at an evaluation script that already lives in your repo. You provide:

* **Eval script path**: the path in your repo Remyx invokes (e.g. `docker/eval_stage/`).
* **Baseline asset**: the model, dataset, or process variants are compared against (e.g. a HuggingFace model slug like `Qwen/Qwen2.5-VL-7B-Instruct`).

Your script defines what runs and which metrics it emits. Use this when your project already ships its own evaluation harness.

Pick a task from a whitelist of standard benchmarks (e.g. `gsm8k`, `mmlu`).
The task's canonical metric (`exact_match`, `accuracy`) auto-populates, and you don't maintain an eval script in your repo. Use this when the metric you care about is one of the standard tasks the runtime ships with.

The template draws its metrics from the project's **allowed metrics**, the shared vocabulary every experiment, template, and decision rule references. Once locked, the template autocompletes those metric names into the decision-policy editor, which keeps rule fields free of typos that wouldn't surface until a run.

Locking a template captures the spec; it does not launch an evaluation. To actually run the eval on compute, see [Validation & compute](#validation--compute) below.

***

## Decision policy

The decision policy is a per-disposition rule builder. For each disposition you define one rule, and a rule is a combinator (`all` or `any`) over a set of conditions:

| Disposition | What its rule defines |
| ----------- | --------------------- |
| `ship` | What the variant must achieve to count as progress worth shipping. |
| `reject` | What disqualifies the variant, typically a `no_regression` condition on each metric you'd be unhappy to see drop. |

`iterate` is the automatic fallback: when neither the `ship` rule nor the `reject` rule matches, the run classifies as iterate.

Conditions are predicates over the experiment's aggregated metrics and a few standard fields:

* **Metric conditions**: a metric from your locked template, an operator (`>=`, `>`, `improves_by`, and so on), a threshold, and an optional confidence floor.
* **Field conditions**: gates on fields like `delta_confidence`, `validation_status`, and `sample_size` (e.g. `delta_confidence >= 0.80`).

### Visualization only

The decision policy is a **read-side visualization aid**.
When results are in, Remyx marks each predicate as met, missed, or not-yet-measured against the experiment's aggregated metrics, so you can see whether the evidence clears the bar you set.

It does **not** auto-dispose the experiment. A human still reviews the evidence and logs the decision (ship, iterate, or abandon) and why.

This is deliberate. The policy tells you whether the evidence meets a pre-committed standard; it does not make the call about whether a change actually moved a business outcome. Automated disposition is explicitly deferred: the system automates the toil up to the judgment and stops there. See [Continuous Experimentation](/concepts/continuous-experimentation) for why this gate stays human.

***

## Validation & compute

When an eval template runs, Remyx can launch the run on a connected compute provider instead of asking you to run it by hand. This is the **validation** path: it executes the template's harness, then writes the results back onto the experiment.

Supported providers:

| Provider | Status |
| --------------------------- | --------------- |
| **Modal** | Supported |
| **Weights & Biases Launch** | Supported |
| MLflow | Adapter stubbed |

Connect a provider from [Connectors](/platform/manage/connectors).

When a validation run finishes, results flow back onto the experiment via a **signed webhook**. Fields like `observed_delta` and `delta_confidence` land on the experiment and feed the decision-policy predicates described above.

1. Remyx submits the locked template's harness to your connected provider (Modal or W&B Launch).
2. The eval executes on the provider against the template's baseline asset and metrics.
3. The provider posts results to Remyx over a signed webhook.
4. `observed_delta`, `delta_confidence`, and related fields are written onto the experiment.
5. The decision policy re-renders against the new metrics, and the team logs the decision.
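
The re-render described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not Remyx's implementation: the `Condition` and `Rule` classes, the function names, the `<` and `<=` operators, the treatment of unmeasured fields as "not met" inside a combinator, and the choice to check `reject` before `ship` are all hypothetical. Only the `all`/`any` combinators, the `>=` and `>` operators, the three predicate states, and the `iterate` fallback come from this page.

```python
import operator
from dataclasses import dataclass

# Hypothetical operator table. Only `>=` and `>` are named on this page;
# `<` and `<=` are added here purely for illustration.
OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


@dataclass(frozen=True)
class Condition:
    name: str         # metric or field name, e.g. "delta_confidence"
    op: str           # key into OPS
    threshold: float


@dataclass(frozen=True)
class Rule:
    combinator: str   # "all" or "any"
    conditions: tuple


def render_condition(cond, metrics):
    """Mark one predicate as met, missed, or not yet measured."""
    value = metrics.get(cond.name)
    if value is None:
        return "not_yet_measured"
    return "met" if OPS[cond.op](value, cond.threshold) else "missed"


def render_rule(rule, metrics):
    """Return per-condition statuses plus whether the rule as a whole matches.

    Assumption: an unmeasured condition counts as not met, and a rule with
    no conditions never matches.
    """
    statuses = [render_condition(c, metrics) for c in rule.conditions]
    met = [s == "met" for s in statuses]
    combine = all if rule.combinator == "all" else any
    return statuses, bool(met) and combine(met)


def suggested_disposition(policy, metrics):
    """What the evidence suggests; a human still logs the actual decision.

    Assumption: `reject` is checked before `ship` when both could match.
    """
    if render_rule(policy["reject"], metrics)[1]:
        return "reject"
    if render_rule(policy["ship"], metrics)[1]:
        return "ship"
    return "iterate"


policy = {
    "ship": Rule("all", (Condition("accuracy", ">=", 0.80),
                         Condition("delta_confidence", ">=", 0.80))),
    "reject": Rule("any", (Condition("observed_delta", "<", 0.0),)),
}

# Before the webhook delivers `delta_confidence`, one predicate is unmeasured.
metrics = {"accuracy": 0.83, "observed_delta": 0.02}
before = render_rule(policy["ship"], metrics)
suggestion_before = suggested_disposition(policy, metrics)

# After the validation results land, the same policy renders differently.
metrics["delta_confidence"] = 0.91
after = render_rule(policy["ship"], metrics)
suggestion_after = suggested_disposition(policy, metrics)
```

In this sketch the `ship` rule renders as `met` / `not_yet_measured` before the results arrive and falls through to `iterate`; once `delta_confidence` is present, both predicates are met. The output is only a rendering of the evidence against the pre-committed bar, which matches the page's point that the decision itself is logged by a person.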

***

## Where to set these

Both artifacts can be set two ways:

* In the Project Settings panels: **Allowed Metrics**, **Evaluation** (the eval template), and **Decision Policy** (the rule builder). Results and the rendered policy appear on the experiment in [Outcomes](/platform/experiments/outcomes).
* From a connected agent: the decision policy is managed with `set_decision_policy`, `get_decision_policy`, and `clear_decision_policy`. Validation runs are driven with `provision_validation`, `check_validation_status`, and `get_validation_results`.

***

## Related

* [Outcomes](/platform/experiments/outcomes): where results and the rendered decision policy appear on each experiment
* Project Settings (under [`/projects`](https://engine.remyx.ai/projects)): where eval templates and decision policies are declared
* [Continuous Experimentation](/concepts/continuous-experimentation): why the decision stays a human gate
* [Define how progress gets measured](/tutorials/get-started/define-how-progress-gets-measured): step-by-step setup of the template and policy