Documentation Index

Fetch the complete documentation index at: https://docs.remyx.ai/llms.txt

Use this file to discover all available pages before exploring further.

Causal Intelligence

ExperimentOps captures decisions and surfaces patterns across the experiments your team has already run. Causal intelligence is what those captured experiments build toward: a maintained model of how your system actually works, updated as new evidence arrives from production logs, commit history, A/B tests, and instrumented decision points. The capabilities described here ship in stages. See Maturity Progression for how they map to the customer journey, and the Roadmap for current shipping status.
Most of what follows is forward-looking. We publish the architecture in full so customers, advisors, and researchers can engage with the direction, push back where they disagree, and see the shape of the system well before each piece ships.

The case for broader evidence

When an AI team ships a change and resolution rate goes up, the question that matters is what caused the lift. Was it the prompt change, the retrieval upgrade, both, neither? Without a causal answer, the team’s next decision is guesswork. A/B testing answers this question well for the changes that warrant it. The limit is throughput. Each test carries fixed operational cost (instrumentation, traffic allocation, time to significance) that doesn’t scale with the rate at which a modern AI team ships candidate changes. Remyx’s approach is to broaden the evidence base. Use A/B testing where it earns its operational cost. Get causal estimates from cheaper evidence everywhere else, and use perturbation-based evidence to answer the questions A/B testing structurally cannot.

Three evidence sources

Remyx integrates three evidence sources into one causal model. Each has different strengths and different operational costs, and the system uses all three together.
Production logs windowed against commit boundaries. Each meaningful commit (a model swap, prompt change, retrieval update, routing rule) is a regime change in the data-generating process. The windows before and after a commit form a natural before-after comparison. Under the assumption that nothing else material changed in the window, this supports causal effect estimates from data your team is already collecting.
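The before-after comparison can be sketched as a simple difference in window means around the commit timestamp. This is an illustrative stand-in, not Remyx's estimator: the `LogEvent` fields, the seven-day default window, and the pooled standard error are all assumptions for the sketch, and the estimate is causal only under the stated "nothing else changed in the window" assumption.

```python
# Illustrative sketch: estimate a commit's effect from logs by comparing
# metric windows before and after the commit boundary. Field names and
# the window length are hypothetical, not the actual Remyx schema.
from dataclasses import dataclass
from statistics import mean, stdev
from math import sqrt

@dataclass
class LogEvent:
    ts: float        # event timestamp (epoch seconds)
    resolved: float  # outcome metric, e.g. 1.0 if the session resolved

def window_effect(events, commit_ts, window=7 * 86400):
    """Before/after difference in means around a commit boundary."""
    before = [e.resolved for e in events if commit_ts - window <= e.ts < commit_ts]
    after = [e.resolved for e in events if commit_ts <= e.ts < commit_ts + window]
    diff = mean(after) - mean(before)
    # Pooled standard error for a rough confidence interval
    se = sqrt(stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after))
    return diff, se

# Usage: a synthetic log where resolution rate steps up at the commit
events = [LogEvent(ts=t, resolved=0.6 if t < 100000 else 0.7)
          for t in range(99990, 100010)]
diff, se = window_effect(events, commit_ts=100000, window=20)
```

The key design point is that this reuses data the team already collects; the identification assumption, not the arithmetic, is what carries the causal weight.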
When your team runs randomized controlled trials through a platform like Statsig, Eppo, or LaunchDarkly, or through an in-house system, Remyx integrates with the platform and incorporates this evidence into the model. Randomization is the gold standard for causal identification, so these results carry the most weight.
Remyx’s lightweight client SDK helps instrument your system’s decision points, applying counterfactual perturbations (CTF-RAND) that generate the evidence the causal inference engine needs to identify effects it otherwise cannot — which part of the pipeline is doing the work, the effect of a treatment specifically on the population that received it, effect estimates at low-traffic decision points. The SDK runs in shadow mode first to audit safety before any perturbation is applied to live traffic.
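The shadow-mode pattern can be illustrated as a wrapper at one decision point. This is a hypothetical sketch, not the actual SDK API: the function name, logging shape, and model identifiers are all assumptions. The point it demonstrates is that in shadow mode the counterfactual choice is recorded but never served, so the audit carries no production risk.

```python
# Hypothetical sketch of shadow-mode instrumentation at a decision point.
# Not the real SDK API: names and the log record shape are illustrative.
import random

def decision_point(default_choice, alternatives, apply_live=False, log=None):
    """Sample a counterfactual perturbation at this decision point.
    In shadow mode (apply_live=False) the perturbed choice is only
    logged; flipping apply_live serves the perturbation for real."""
    perturbed = random.choice(alternatives)
    record = {"default": default_choice, "perturbed": perturbed, "applied": apply_live}
    if log is not None:
        log(record)
    return perturbed if apply_live else default_choice

# Shadow-mode audit: the default model still serves; the counterfactual is recorded
records = []
served = decision_point("model-a", ["model-b", "model-c"], log=records.append)
```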
The causal model incorporates each new piece of evidence into its posterior, rather than treating sources separately or letting later evidence overwrite earlier results.
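In the simplest conjugate-Gaussian case, this kind of evidence fusion reduces to inverse-variance (precision) weighting: each source's estimate contributes in proportion to its precision, so a tight A/B result dominates a noisy quasi-experimental one without discarding it. A minimal sketch, with illustrative numbers and no claim that this is Remyx's actual posterior machinery:

```python
# Minimal sketch of evidence fusion via inverse-variance weighting --
# the conjugate Gaussian case of a Bayesian posterior update.
# Numbers are illustrative, not real effect estimates.
def fuse(estimates):
    """estimates: list of (mean, variance) pairs from different sources.
    Returns the precision-weighted posterior (mean, variance)."""
    precision = sum(1.0 / v for _, v in estimates)
    mean = sum(m / v for m, v in estimates) / precision
    return mean, 1.0 / precision

# A noisy quasi-experimental window estimate, fused with a tighter A/B result:
posterior = fuse([(0.04, 0.02), (0.05, 0.005)])
```

Note that fusing shrinks the posterior variance below either source's alone; new evidence refines the estimate rather than overwriting it.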

How teams use the model

The causal model sits underneath the product. Your team interacts with three workflows on top of it.

Answer questions in natural language

“Did the prompt change last Tuesday cause the latency regression?”
The system parses the question, identifies what kind of evidence could answer it, routes to the relevant sources, and returns an estimate with a confidence interval and a clear note on whether the available data is strong enough to support a causal conclusion or only a correlational one.

Triage hypotheses prospectively

When your team proposes ten things to try, the system ranks them by how much each would actually teach you given what you already know, and identifies the cheapest evidence path for each. Each hypothesis gets one of these paths.
  • Already answerable from existing data.
  • Answerable with quasi-experimental analysis.
  • Requires an A/B test.
  • Requires CTF-RAND at decision point X for two weeks.
  • Unidentifiable even with full instrumentation.
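The ranking step above can be sketched as scoring each hypothesis by how much its cheapest evidence path would shrink the current posterior variance on the effect. Everything here is an illustrative assumption — the `Hypothesis` fields, the variance-reduction scoring rule, and the example numbers — not Remyx's actual triage logic:

```python
# Illustrative sketch of prospective triage: rank hypotheses by the
# posterior-variance reduction their cheapest evidence path would buy.
# Fields, scoring rule, and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    prior_var: float     # current posterior variance on the effect
    evidence_var: float  # variance of the cheapest available evidence
    path: str            # cheapest evidence path label

def info_gain(h):
    """Variance reduction if the cheapest evidence were collected."""
    post_var = 1.0 / (1.0 / h.prior_var + 1.0 / h.evidence_var)
    return h.prior_var - post_var

def triage(hypotheses):
    return sorted(hypotheses, key=info_gain, reverse=True)

ranked = triage([
    Hypothesis("swap retriever", prior_var=0.10, evidence_var=0.01, path="Requires an A/B test"),
    Hypothesis("shorten prompt", prior_var=0.02, evidence_var=0.01, path="Already answerable from existing data"),
])
```

A hypothesis the model is already confident about scores low even if its evidence is cheap, which is what "ranked by how much each would actually teach you" means in practice.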

Surface what’s not yet known

When you ask a question the current evidence cannot answer, the system tells you what data would answer it. That turns “we don’t know” from a dead end into a planning surface.

How this fits with A/B testing

Causal intelligence works alongside A/B testing. The three evidence layers each handle a different operational price point and answer a different class of question.
| | A/B testing | Causal data fusion (quasi-experimental) | Counterfactual randomization |
|---|---|---|---|
| Evidence quality | Randomized | Quasi-experimental, with RCT evidence fused in when available | Counterfactual, from instrumented perturbations |
| Operational cost | High per change | Low marginal cost (reuses existing data) | Moderate (SDK instrumentation, shadow-mode audit before live) |
| Coverage | Changes worth gating with a traffic split | Every commit boundary in the log history | Decision points where the SDK is instrumented |
| Question types | Average effect of a randomized treatment | Average effects across changes the team already shipped | Attribution, mediation, effect on the treated, low-traffic effects |
| When to use | Decisions worth the cost of randomization | Continuous learning from the work the team is already doing | Questions A/B testing structurally cannot answer |
A/B tests sharpen the model where they earn their operational cost. Quasi-experimental evidence is the entry point for teams not yet platformed for A/B testing, producing causal estimates from observational data they already collect. Counterfactual randomization extends the model to questions that no amount of A/B testing can resolve.

Theoretical foundation

The system makes only those causal claims that can be backed by evidence you could in principle collect, and avoids the layers of counterfactual reasoning that depend on assumptions which cannot be checked against data. Pearl’s Causal Hierarchy (PCH) organizes causal questions into three rungs of increasing strength: L1 (associational, “what is”), L2 (interventional, “what happens if I do”), and L3 (counterfactual, “what would have happened if I had done”). Yang & Bareinboim (2025) introduce intermediate rungs L2.25 and L2.5 for the counterfactual evidence that CTF-RAND collects. The evidence sources map to these layers as follows.
| Source | Layer | Notes |
|---|---|---|
| Observational logs | L1 | Supports correlational queries and structural discovery |
| Quasi-experiments (logs + commits) | L2 | Local identification under “nothing else changed in the window” |
| A/B tests | L2 | Strong identification via randomization |
| CTF-RAND, default | L2.25 | Same overridden value applies across the trajectory |
| CTF-RAND, mediation opt-in | L2.5 | Per-child value assignment for mediation analysis |
Readers who want the formal account should start with the further reading below.

Shipping status

For current status of each capability, see the Roadmap. For how these capabilities map to the customer journey, see Maturity Progression.

Further reading

Yang, K., & Bareinboim, E. (2025). A Hierarchy of Graphical Models for Counterfactual Inferences. Background on the graphical models used in causal and counterfactual inference, including the L2.25 and L2.5 layers referenced above.

Pearl, J., & Mackenzie, D. (2018). The Book of Why. Accessible introduction to the Causal Hierarchy and the difference between association, intervention, and counterfactual reasoning.