ExperimentOps is your system of record — what your team tried, why, and what you decided. Continuous Experimentation puts that record to work: it runs the discovery-to-decision loop on a cadence, so keeping up with the field becomes a background process instead of a second job — the way CI/CD turned “we deploy when someone remembers to” into “every change runs the pipeline.” The bottleneck it removes is the legwork before a decision. New techniques ship across arXiv, Hugging Face, and GitHub every week; ideas are plentiful, and the cost is finding the few that fit your codebase and turning them into something you can actually evaluate. The loop does that work — surface, validate, draft — and leaves you the judgment, so effort flows to the most fruitful directions.
What “continuous” means, precisely. The loop runs on a schedule and does the watching, reading, and first-draft work automatically. It does not make the call — every cycle surfaces a reviewable artifact (a ranked recommendation, a draft PR, a scored variant) and a person decides. The CI/CD analogy, taken honestly: the pipeline runs on every change, but a human still approves the deploy.

The loop

A single experiment in Remyx already has a lifecycle: it comes from somewhere, gets implemented, gets evaluated, and ends in a decision. Continuous Experimentation is that lifecycle run as a standing loop, where the output of each turn feeds the input of the next.

Discover

Your Feed surfaces new papers, repos, and models ranked against your team’s actual shipping history. This runs daily without you asking.

Draft

Remyx’s automated discovery-PR agent (Outrider), a scheduled GitHub Action, picks the candidate most implementable against your codebase and opens a draft PR wiring it into a real call site, or a discussion Issue when a clean integration isn’t possible. This is the step that used to require a human to sit down and start reading.

Evaluate

The variant is scored against the eval template and decision policy your team committed to ahead of time, so the bar is fixed before the result is known.

Decide

A person reviews the evidence and logs the call — ship, iterate, or abandon — and why. This is the human gate, and it stays human.

Compound

The decision becomes part of the record. The next discovery pass is ranked against a history that now includes this experiment. The loop tightens.
The first two steps are where Continuous Experimentation does new work: it closes the gap between “a relevant paper exists” and “there’s a reviewable proposal in front of me.” Steps three through five are the ExperimentOps loop you already run, now fed automatically.
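As a rough illustration of how one turn feeds the next, the five steps can be sketched as a single function that appends to a shared history. Every name here (`run_cycle`, `ExperimentHistory`, the callables passed in) is hypothetical and chosen for this sketch; none of it is a Remyx API. The point it shows is structural: the decision comes from a human-supplied callable, and the history grows by one experiment per cycle.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types for illustration only; not Remyx APIs.

@dataclass
class Experiment:
    candidate: str
    score: float
    decision: str    # "ship", "iterate", or "abandon"
    rationale: str


@dataclass
class ExperimentHistory:
    experiments: List[Experiment] = field(default_factory=list)


def run_cycle(
    history: ExperimentHistory,
    discover: Callable[[ExperimentHistory], List[str]],
    draft: Callable[[str], str],
    evaluate: Callable[[str], float],
    human_decide: Callable[[str, float], Tuple[str, str]],
) -> Experiment:
    candidates = discover(history)       # Discover: ranked against the history so far
    proposal = draft(candidates[0])      # Draft: a reviewable artifact for the top candidate
    score = evaluate(proposal)           # Evaluate: scored against a pre-committed bar
    decision, rationale = human_decide(proposal, score)  # Decide: the human gate
    experiment = Experiment(candidates[0], score, decision, rationale)
    history.experiments.append(experiment)  # Compound: the next cycle sees this one
    return experiment


# Two cycles with stand-in callables; the second discovery pass sees the first result.
history = ExperimentHistory()
for _ in range(2):
    run_cycle(
        history,
        discover=lambda h: [f"candidate-{len(h.experiments) + 1}"],
        draft=lambda c: f"draft PR for {c}",
        evaluate=lambda p: 0.5,
        human_decide=lambda p, s: ("iterate", "placeholder rationale"),
    )
```

Passing `human_decide` in as an argument, rather than computing the decision inside the function, mirrors the design described above: the automated steps produce the proposal and the score, and the call itself is supplied from outside.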

CI/CD for AI experimentation

The analogy is exact in the places that matter and worth stating plainly where it isn’t.
| Software CI/CD | Remyx Continuous Experimentation |
| --- | --- |
| A commit triggers the pipeline | A schedule (or a context change) triggers a discovery + draft run |
| The pipeline builds and tests automatically | Outrider selects, drafts, validates, and self-reviews automatically |
| A failing build never reaches review | A recommendation that can’t be cleanly integrated becomes a discussion Issue with the attempted diff attached |
| The team sets the gates (required checks, coverage) ahead of time | The team sets the eval template and decision policy ahead of time |
| A human approves the deploy | A human reviews and merges the PR, and logs the decision |
Where the analogy breaks — and where overclaiming would erode trust — is the last row. CI/CD ends in an automated deploy because the correctness bar is a passing test suite. AI experimentation ends in a judgment about whether a change actually moved a business outcome, which is not something to hand to a cron job. Remyx automates the toil up to that judgment and stops there.
This is also why the Maturity Progression stages stay read-only and passive through Stage 3. Continuous Experimentation reads your repo, ranks against your history, and proposes changes you approve. It does not touch production behavior. The first capability that does (Stage 4 counterfactual perturbations) ships only behind shadow-mode audits.
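One way to picture "the team sets the eval template and decision policy ahead of time" is a policy object that is immutable once created, with evaluation returning evidence rather than a verdict. This is a minimal sketch under stated assumptions: `DecisionPolicy`, `Evidence`, `score_variant`, the `accuracy` metric, and the 0.02 threshold are all made up for illustration and do not describe Remyx's actual data model.

```python
from dataclasses import dataclass

# Hypothetical names and values for illustration only.

@dataclass(frozen=True)
class DecisionPolicy:
    metric: str
    min_improvement: float  # committed before any result is known


@dataclass(frozen=True)
class Evidence:
    metric: str
    baseline: float
    variant: float
    meets_bar: bool  # an input to the human decision, not the decision itself


def score_variant(policy: DecisionPolicy, baseline: float, variant: float) -> Evidence:
    # Compare the observed improvement against the bar fixed in the policy.
    improvement = variant - baseline
    return Evidence(policy.metric, baseline, variant, improvement >= policy.min_improvement)


policy = DecisionPolicy(metric="accuracy", min_improvement=0.02)
evidence = score_variant(policy, baseline=0.80, variant=0.83)
```

Because the dataclass is frozen, attempting to change `policy.min_improvement` after seeing a result raises an error, which is the code-level analogue of fixing the bar before the result is known. `Evidence.meets_bar` is deliberately only a flag for the reviewer; the ship, iterate, or abandon call stays with a person.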

How recommendations get sharper

A loop is only worth running continuously if each turn is better than the last. Two mechanisms make Remyx’s recommendations improve as your history grows, rather than re-surfacing the same generic results:
First, the structured experiments Remyx extracts from your merge log feed the ranker as context, so a candidate aligned with the direction you’ve actually shipped ranks above a merely topical one. This shifts the top results meaningfully versus ranking from your interest description alone, and the reasoning cites specific past work instead of shallow keyword overlap.
Second, Remyx fits a per-team preference model over your past experiments (learning from the order and lineage in which you shipped things) and scores new candidates with it as a tiebreaker behind relevance. The model populates lazily and becomes meaningful past a few dozen experiments. It sharpens ranking only; it is deliberately not wired to auto-generate or auto-select experiments, which would put it in the decision seat the human holds.
Both mechanisms converge on the same ExperimentHistory whether you reached it through a Project or a repo-driven Research Interest, so the loop sees one coherent picture of your work.
Beyond history, a Deep Research brief feeds the ranker as a forward-looking axis: it captures where your team intends to go next, so candidates aligned with that direction rank up even before you’ve shipped against it.
If a recommendation or draft is wrong, the cost is a PR you close; nothing reaches your default branch or your users without a person putting it there.
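The "tiebreaker behind relevance" ordering described in this section can be illustrated with a two-key sort. This is a sketch, not Remyx's ranker: the `Candidate` fields, the scores, and the assumption that relevance ties are exact (a real ranker might bucket or round scores before comparing) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical candidate record; the scores are made-up illustration values.

@dataclass
class Candidate:
    title: str
    relevance: float   # stand-in for the history-aware relevance score
    preference: float  # stand-in for the per-team preference model score


def rank(candidates: List[Candidate]) -> List[Candidate]:
    # Sort by relevance first; preference only matters when relevance is equal.
    return sorted(candidates, key=lambda c: (c.relevance, c.preference), reverse=True)


candidates = [
    Candidate("A", relevance=0.9, preference=0.1),
    Candidate("B", relevance=0.7, preference=0.9),
    Candidate("C", relevance=0.9, preference=0.6),
]
ranked = [c.title for c in rank(candidates)]
# ranked == ["C", "A", "B"]
```

Candidate B has the highest preference score but still ranks last, because preference never overrides relevance. C edges out A only because their relevance is tied. The function returns an ordering and nothing else, consistent with the preference model sharpening ranking without selecting experiments.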

Get started

Automated discovery PRs

Set up the scheduled draft-PR loop on a repo

Feed

Create the Research Interest that drives recommendations

ExperimentOps

The system of record this loop runs on top of

Maturity Progression

Why the loop stays passive and read-only