# ExperimentOps

> The methodology behind systematic AI experimentation: capture decisions, surface patterns, compound what your team learns

Every AI team experiments. New techniques ship weekly. Coding agents generate implementations in hours. But most of that effort doesn't compound because the decisions, context, and cross-experiment patterns never get captured in a system.

ExperimentOps is the practice of tracking the full lifecycle of every AI experiment your team runs, including the **decisions**: what the team tried, why they tried it, whether it worked, and what to do next. Over time, this builds institutional knowledge that persists through team changes and reveals strategic patterns that are invisible when experiments are tracked in isolation.

<Note>
  **Shipped today vs. on the roadmap.** The capabilities described on this page are live in production. ExperimentOps is the foundation. The architecture extends further over time, with a causal model that updates as evidence arrives from logs, commits, A/B tests, and eventually instrumented decision points. See [Causal Intelligence](/concepts/causal-intelligence) and the [Roadmap](/roadmap) for the forward-looking direction.
</Note>

***

## Why This Matters Now

The #1 production blocker cited by AI practitioners in 2026 is **evaluation, testing, and measurement**: knowing whether the system is actually working and getting better over time. The bottleneck is not generating ideas or writing code. It's validating what works, at production depth, with real business outcomes.

Four structural problems prevent most teams from solving this:

<AccordionGroup>
  <Accordion title="Context lives in people's heads, not systems">
    A data scientist spends two months testing retrieval strategies. The reasoning behind the final choice (why hybrid search won, what alternatives were tested, what tradeoffs were considered) lives in their memory and a few Slack messages. MLflow logged the runs but not the interpretation. When that person leaves, the next engineer starts from scratch.

    Every experiment without captured context is knowledge that walks out the door when someone changes teams or leaves the company.
  </Accordion>

  <Accordion title="Experiments are isolated, patterns are invisible">
    A team runs 14 experiments in a quarter. Five explored retrieval improvements and all produced positive results. Three explored routing and none worked. But nobody sees this pattern because each experiment is tracked in a different tool: a Jira ticket, an MLflow run, a Notion page.

    The strategic signal ("retrieval is our strongest direction, routing is not working") is invisible unless someone manually reviews all experiments and connects the dots. Nobody has time for that meta-analysis.
  </Accordion>

  <Accordion title="Leadership can't see what's working">
    A CTO managing three AI projects needs to know: which are producing results? Which are stalled? Where should the team invest more? Today, getting that answer requires being in every room or scheduling status meetings with each team lead.

    Without a portfolio view across AI projects, resource allocation is based on status reports, not outcomes.
  </Accordion>

  <Accordion title="The pace of change is outrunning teams">
    What was best practice weeks ago may already be suboptimal. New techniques, architectures, and tooling ship faster than any team can evaluate.

    The result is a persistent backlog of improvements that never gets touched because maintenance, firefighting, and stakeholder management consume all the bandwidth.
  </Accordion>
</AccordionGroup>

***

## How ExperimentOps Works

ExperimentOps adds three layers that traditional experiment tracking doesn't capture:

### 1. The Decision Layer

Every experiment in Remyx records not just *what happened* (metrics, artifacts) but *why it happened* and *what the team decided to do about it*.

| Field             | What it captures                                    | Example                                                                           |
| ----------------- | --------------------------------------------------- | --------------------------------------------------------------------------------- |
| **Source**        | Where the idea came from                            | Paper, GitHub repo, HuggingFace model, hypothesis, incident, Remyx recommendation |
| **Hypothesis**    | What the team expects                               | "Adding temporal decay to embeddings will increase 7-day retention by 1-2%"       |
| **Target metric** | The business outcome that matters                   | Resolution rate, conversion rate, CSAT                                            |
| **Decision**      | What the team decided after seeing results, and why | "Ship to 100%. Explore temporal features in search ranking next."                 |

The decision field captures the reasoning that artifact-tracking tools miss. When a team logs "Ship to 100%. The re-ranker specifically helps with multi-topic tickets where the old retriever returned tangentially related articles," that reasoning persists through team changes, onboarding, and leadership transitions.

MLflow tracks what you did. Remyx tracks what it meant and what to do next.
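As a rough illustration of the four fields above, the sketch below models an experiment record as a plain Python dataclass. It is not the Remyx SDK: `ExperimentRecord`, `record_decision`, and `VERDICTS` are hypothetical names chosen for this example, and the rule that a decision must carry reasoning is an assumption about how a team might enforce the practice.

```python
from dataclasses import dataclass
from typing import Optional

# The three outcomes a team can log: ship, iterate, or abandon.
VERDICTS = {"ship", "iterate", "abandon"}


@dataclass
class ExperimentRecord:
    """Hypothetical record holding the decision-layer fields."""
    source: str            # where the idea came from, e.g. "paper" or "incident"
    hypothesis: str        # what the team expects to happen
    target_metric: str     # the business outcome that matters
    verdict: Optional[str] = None
    reasoning: Optional[str] = None
    next_step: Optional[str] = None

    def record_decision(self, verdict: str, reasoning: str,
                        next_step: Optional[str] = None) -> None:
        """Attach a decision; refuse one that carries no reasoning."""
        if verdict not in VERDICTS:
            raise ValueError(f"verdict must be one of {sorted(VERDICTS)}")
        if not reasoning.strip():
            raise ValueError("a decision needs its reasoning")
        self.verdict = verdict
        self.reasoning = reasoning
        self.next_step = next_step


exp = ExperimentRecord(
    source="paper",
    hypothesis="Adding temporal decay to embeddings will increase "
               "7-day retention by 1-2%",
    target_metric="7-day retention",  # illustrative choice for this example
)
exp.record_decision(
    "ship",
    reasoning="Ship to 100%.",
    next_step="Explore temporal features in search ranking next.",
)
```

Keeping the reasoning mandatory is the design choice that matters here: a verdict without a "why" is exactly the gap that artifact-only tracking leaves open.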

### 2. The Pattern Layer

After enough experiments, Remyx groups them by direction and surfaces which directions are producing results:

```
Retrieval quality (5 experiments): 5 of 5 positive, avg +3.2%  → HIGH SIGNAL
Prompt engineering (2 experiments): 2 of 2 positive, avg +1.2% → MODERATE
Routing / model selection (2 experiments): 0 of 2 significant  → LOW SIGNAL
```

This meta-analysis is something no individual on the team has time to do by hand. It requires looking across all experiments, grouping them by theme, and computing which themes consistently produce results. For a team running 10+ experiments per quarter, this is the difference between iterating randomly and iterating strategically.
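The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not Remyx's implementation: the function names are hypothetical, the individual deltas are made-up values, and the thresholds in `signal_label` are assumptions picked only to show how a hit rate might map to a label.

```python
from collections import defaultdict
from statistics import mean


def summarize_directions(experiments):
    """Group (direction, delta_pct, positive) tuples and compute hit rates."""
    groups = defaultdict(list)
    for direction, delta_pct, positive in experiments:
        groups[direction].append((delta_pct, positive))

    summary = {}
    for direction, results in groups.items():
        hits = sum(1 for _, positive in results if positive)
        summary[direction] = {
            "count": len(results),
            "hits": hits,
            "hit_rate": hits / len(results),
            "avg_delta": mean(delta for delta, _ in results),
        }
    return summary


def signal_label(stats, min_count=3, high_rate=0.8, moderate_rate=0.5):
    """Map stats to a label. The thresholds are assumptions for illustration."""
    if stats["count"] >= min_count and stats["hit_rate"] >= high_rate:
        return "HIGH SIGNAL"
    if stats["hit_rate"] >= moderate_rate:
        return "MODERATE"
    return "LOW SIGNAL"


# Made-up deltas (in percent) for illustration only.
experiments = [
    ("retrieval", 2.0, True), ("retrieval", 3.0, True),
    ("retrieval", 3.5, True), ("retrieval", 4.0, True),
    ("retrieval", 3.5, True),
    ("prompt", 1.0, True), ("prompt", 1.4, True),
    ("routing", 0.1, False), ("routing", -0.3, False),
]

summary = summarize_directions(experiments)
for direction, stats in summary.items():
    print(f"{direction}: {stats['hits']} of {stats['count']} positive, "
          f"avg {stats['avg_delta']:+.1f}% -> {signal_label(stats)}")
```

Requiring a minimum experiment count before a direction can be labelled high signal keeps two lucky results from outranking a direction with a longer track record.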

<Card title="Patterns and Insights" icon="bolt-lightning" href="/platform/experiments/insights">
  See how pattern detection works in the Insights view
</Card>

### 3. The Recommendation Layer

For each high-signal direction, Remyx searches its resource index for techniques, models, and methods that align with the cluster's theme. Your team's own experiment history feeds the ranker directly: candidates that align with the direction you've actually shipped rank above ones that are merely topically related, and a learned **preference model** fit over your past experiments breaks ties. Past experiments inform future discovery, so recommendations improve as the team's data grows.

That same loop can run on a schedule, surfacing recommendations daily and opening reviewable draft PRs into your repo, with a person still approving every merge.
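The ordering rule described above (alignment with shipped directions first, preference score as the tie-breaker) can be sketched as a two-part sort key. Everything here is hypothetical: `rank_candidates`, the candidate names, and the `score` field are invented for illustration, and the `preference` callable is only a stand-in for whatever learned model produces the score.

```python
def rank_candidates(candidates, shipped_directions, preference):
    """Order candidates: shipped-direction alignment first, preference second."""
    shipped = set(shipped_directions)

    def sort_key(candidate):
        aligned = bool(shipped & set(candidate["tags"]))
        # Tuples compare left to right, so the preference score only
        # matters between candidates with the same alignment.
        return (aligned, preference(candidate))

    return sorted(candidates, key=sort_key, reverse=True)


# Hypothetical candidates; "score" stands in for a preference-model output.
candidates = [
    {"name": "router-distillation", "tags": ["routing"], "score": 0.9},
    {"name": "cross-encoder-reranker", "tags": ["retrieval"], "score": 0.4},
    {"name": "hybrid-search-tuning", "tags": ["retrieval", "search"], "score": 0.7},
]

ranked = rank_candidates(
    candidates,
    shipped_directions={"retrieval"},
    preference=lambda c: c["score"],
)
names = [c["name"] for c in ranked]
print(names)
```

Note that the routing candidate lands last despite having the highest score, because in this sketch alignment with a shipped direction always outranks the preference score.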

<Card title="Automated discovery PRs" icon="compass" href="/platform/discover/outrider">
  How Outrider runs the discovery-to-draft loop on a schedule, with the human at the merge
</Card>

***

## The Experiment Lifecycle

<Steps>
  <Step title="Origin">
    An experiment starts from one of several sources:

    * **Research**: A relevant paper, repo, or model surfaced by Remyx or found by the team
    * **Hypothesis**: An engineer has an idea based on domain knowledge
    * **Incident**: A production issue revealed a gap worth investigating
    * **Recommendation**: Remyx detected a pattern and suggested a next step

    For research-sourced experiments, Remyx assembles a **launch context** with resource metadata, Docker environment, target repo structure, and an AI-generated implementation plan grounded in actual file paths.
  </Step>

  <Step title="Implementation">
    The team implements the experiment using the tools they already have:

    * **GitHub**: PRs linked to experiments, status synced via webhooks
    * **Linear / Jira**: Tickets created automatically, bidirectional status sync
    * **Claude Code**: AI agent generates a draft PR via MCP, implementing the technique in the team's codebase
  </Step>

  <Step title="Validation">
    The experiment runs against the target metric. Remyx tracks A/B test configuration, traffic splits, duration, and results. The observed delta and statistical confidence are recorded and tied to the business outcome.
  </Step>

  <Step title="Decision">
    The team logs their interpretation: ship, iterate, or abandon, and why. A good decision captures reasoning, context, and next steps that help the next person (or the same person six months later) understand what happened and what it means.
  </Step>

  <Step title="Patterns">
    As experiments accumulate, Remyx surfaces cross-experiment patterns. Tag-based clustering identifies which directions produce consistent results. Recommendations suggest what to try next, grounded in the team's own data.
  </Step>
</Steps>
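For the Validation step, the observed delta and its statistical confidence can be computed in several ways. The sketch below uses a pooled two-proportion z-test, which is an assumption for illustration: this page does not say which test Remyx applies, and `two_proportion_test` and its sample counts are hypothetical.

```python
import math


def two_proportion_test(control_hits, control_n, treatment_hits, treatment_n):
    """Return (observed delta, two-sided p-value) for two conversion rates."""
    p_control = control_hits / control_n
    p_treatment = treatment_hits / treatment_n
    delta = p_treatment - p_control

    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (control_hits + treatment_hits) / (control_n + treatment_n)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treatment_n))
    if std_err == 0:
        # Both arms are all-success or all-failure: no evidence of a difference.
        return delta, 1.0

    z = delta / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return delta, p_value


# Made-up counts: 400 of 1000 conversions in control, 460 of 1000 in treatment.
delta, p_value = two_proportion_test(400, 1000, 460, 1000)
print(f"delta={delta:+.3f}, p={p_value:.4f}")
```

Recording both numbers alongside the decision lets a later reader judge whether a "ship" verdict rested on a large, well-supported effect or a marginal one.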

***

## ExperimentOps vs. Traditional Experiment Tracking

|                               | MLflow / W\&B                                  | Remyx ExperimentOps                                                 |
| ----------------------------- | ---------------------------------------------- | ------------------------------------------------------------------- |
| **Tracks**                    | Model runs, hyperparameters, artifacts         | Full experiment lifecycle including decisions and business outcomes |
| **Entry point**               | `mlflow.log_param()` in training code          | Hypothesis, research resource, or production incident               |
| **Decision capture**          | Not supported                                  | First-class `log_decision` with reasoning and next steps            |
| **Cross-experiment patterns** | Manual analysis                                | Automatic tag-based clustering with hit rates                       |
| **Resource discovery**        | None                                           | Papers, repos, and models matched to experiment history             |
| **Leadership view**           | Dashboard of model metrics                     | Portfolio of projects with health indicators                        |
| **Integrations**              | Model registry, artifact store                 | GitHub, Linear, Jira, Slack, Claude Code MCP                        |
| **Question answered**         | "What hyperparameters produced the best loss?" | "Which directions are working, and what should we try next?"        |

<Note>
  Remyx operates at a higher level than MLflow or W\&B. Link your MLflow runs as artifacts on a Remyx experiment to connect the training details with the business decision.
</Note>

***

## Who Uses ExperimentOps

<Tabs>
  <Tab title="ML Engineers & Data Scientists" icon="code">
    Track every experiment from hypothesis through results. Link PRs, tickets, and datasets. Log decisions so the next person understands why things were done this way. See which directions are producing results across the team's work.

    Frees up time spent on maintenance triage and context reconstruction so engineers can focus on the experiments that actually improve the product.
  </Tab>

  <Tab title="Team Leads" icon="users">
    See which projects are on track and which need attention. Understand hit rates across experiment directions. Know when a direction has been sufficiently explored or when it's worth doubling down.

    Turns "which direction should we invest in?" from a debate into a data-informed decision.
  </Tab>

  <Tab title="CTOs / VPs" icon="building">
    **Get visibility without being the bottleneck.** The Portfolio view shows every project's experiment velocity, positive hit rate, and metric trends. Know which teams are producing results without reviewing every experiment or scheduling syncs. Let the system be the structured experimentation process so you don't have to be in every room.

    Replaces status meetings with a live portfolio view that shows which projects are producing results and which need attention.
  </Tab>
</Tabs>

***

## Core Principles

**Decisions over metrics.** The most valuable artifact of an experiment is the team's interpretation: "This worked because X, and next we should try Y."

**Patterns over individual results.** A single experiment result is a data point. Ten experiments grouped by direction reveal a strategy.

**Compounding over starting fresh.** Every experiment builds institutional knowledge. That knowledge persists through team changes and informs future recommendations.

**Visibility over status meetings.** Leadership sees which projects are working by looking at a screen, not by scheduling a sync.

***

## Get Started

<CardGroup cols={2}>
  <Card title="Quick Start" icon="rocket" href="/quickstart">
    Create your first experiment in 5 minutes
  </Card>

  <Card title="Outcomes" icon="chart-column" href="/platform/experiments/outcomes">
    Explore the experiment timeline and detail views
  </Card>

  <Card title="Connectors" icon="link" href="/platform/manage/connectors">
    Connect GitHub, Linear, Jira, Slack, and Claude Code
  </Card>

  <Card title="MCP Server" icon="server" href="/resources/mcp-server">
    Use Remyx tools from Claude Code or Slack
  </Card>
</CardGroup>