Documentation Index

Fetch the complete documentation index at: https://docs.remyx.ai/llms.txt

Use this file to discover all available pages before exploring further.

Causal Intelligence

ExperimentOps captures decisions and surfaces patterns across the experiments your team has already run. Causal intelligence is what those captured experiments build toward: a maintained model of how your system actually works, updated as new evidence arrives from production logs, commit history, A/B tests, and instrumented decision points. The capabilities described here ship in stages. See Maturity Progression for how they map to the customer journey, and the Roadmap for current shipping status.
Most of what follows is forward-looking. We publish the architecture in full so customers, advisors, and researchers can engage with the direction, push back where they disagree, and see the shape of the system well before each piece ships.

The case for broader evidence

When an AI team ships a change and resolution rate goes up, the question that matters is what caused the lift. Was it the prompt change, the retrieval upgrade, both, neither? Without a causal answer, the team’s next decision is guesswork. A/B testing answers this question well for the changes that warrant it. The limit is throughput. Each test carries fixed operational cost (instrumentation, traffic allocation, time to significance) that doesn’t scale with the rate at which a modern AI team ships candidate changes. Remyx’s approach is to broaden the evidence base. Use A/B testing where it earns its operational cost. Get causal estimates from cheaper evidence everywhere else, and use perturbation-based evidence to answer the questions A/B testing structurally cannot.

Three evidence sources

Remyx integrates three evidence sources into one causal model. Each has different strengths and different operational costs, and the system uses all three together.
Production logs windowed against commit boundaries. Each meaningful commit (a model swap, prompt change, retrieval update, routing rule) is a regime change in the data-generating process. The windows before and after a commit form a natural before-after comparison. Under the assumption that nothing else material changed in the window, this supports causal effect estimates from data your team is already collecting.
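The before-after comparison can be sketched as a simple difference in window means around the commit timestamp. This is an illustrative stand-in, not Remyx's estimator: the `LogEvent` fields, the seven-day default window, and the pooled standard error are all assumptions for the sketch, and the estimate is causal only under the stated "nothing else changed in the window" assumption.

```python
# Illustrative sketch: estimate a commit's effect from logs by comparing
# metric windows before and after the commit boundary. Field names and
# the window length are hypothetical, not the actual Remyx schema.
from dataclasses import dataclass
from statistics import mean, stdev
from math import sqrt

@dataclass
class LogEvent:
    ts: float        # event timestamp (epoch seconds)
    resolved: float  # outcome metric, e.g. 1.0 if the session resolved

def window_effect(events, commit_ts, window=7 * 86400):
    """Before/after difference in means around a commit boundary."""
    before = [e.resolved for e in events if commit_ts - window <= e.ts < commit_ts]
    after = [e.resolved for e in events if commit_ts <= e.ts < commit_ts + window]
    diff = mean(after) - mean(before)
    # Pooled standard error for a rough confidence interval
    se = sqrt(stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after))
    return diff, se

# Usage: a synthetic log where resolution rate steps up at the commit
events = [LogEvent(ts=t, resolved=0.6 if t < 100000 else 0.7)
          for t in range(99990, 100010)]
diff, se = window_effect(events, commit_ts=100000, window=20)
```

The key design point is that this reuses data the team already collects; the identification assumption, not the arithmetic, is what carries the causal weight.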
When your team runs randomized controlled trials through a platform like Statsig, Eppo, or LaunchDarkly, or through an in-house system, Remyx integrates with the platform and incorporates this evidence into the model. Randomization is the gold standard for causal identification, so these results carry the most weight.
Remyx’s lightweight client SDK helps instrument your system’s decision points, applying counterfactual perturbations (CTF-RAND) that generate the evidence the causal inference engine needs to identify effects it otherwise cannot — which part of the pipeline is doing the work, the effect of a treatment specifically on the population that received it, effect estimates at low-traffic decision points. The SDK runs in shadow mode first to audit safety before any perturbation is applied to live traffic.
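The shadow-mode pattern can be illustrated as a wrapper at one decision point. This is a hypothetical sketch, not the actual SDK API: the function name, logging shape, and model identifiers are all assumptions. The point it demonstrates is that in shadow mode the counterfactual choice is recorded but never served, so the audit carries no production risk.

```python
# Hypothetical sketch of shadow-mode instrumentation at a decision point.
# Not the real SDK API: names and the log record shape are illustrative.
import random

def decision_point(default_choice, alternatives, apply_live=False, log=None):
    """Sample a counterfactual perturbation at this decision point.
    In shadow mode (apply_live=False) the perturbed choice is only
    logged; flipping apply_live serves the perturbation for real."""
    perturbed = random.choice(alternatives)
    record = {"default": default_choice, "perturbed": perturbed, "applied": apply_live}
    if log is not None:
        log(record)
    return perturbed if apply_live else default_choice

# Shadow-mode audit: the default model still serves; the counterfactual is recorded
records = []
served = decision_point("model-a", ["model-b", "model-c"], log=records.append)
```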
The causal model incorporates each new piece of evidence into its posterior, rather than treating sources separately or letting later evidence overwrite earlier results.
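In the simplest conjugate-Gaussian case, this kind of evidence fusion reduces to inverse-variance (precision) weighting: each source's estimate contributes in proportion to its precision, so a tight A/B result dominates a noisy quasi-experimental one without discarding it. A minimal sketch, with illustrative numbers and no claim that this is Remyx's actual posterior machinery:

```python
# Minimal sketch of evidence fusion via inverse-variance weighting --
# the conjugate Gaussian case of a Bayesian posterior update.
# Numbers are illustrative, not real effect estimates.
def fuse(estimates):
    """estimates: list of (mean, variance) pairs from different sources.
    Returns the precision-weighted posterior (mean, variance)."""
    precision = sum(1.0 / v for _, v in estimates)
    mean = sum(m / v for m, v in estimates) / precision
    return mean, 1.0 / precision

# A noisy quasi-experimental window estimate, fused with a tighter A/B result:
posterior = fuse([(0.04, 0.02), (0.05, 0.005)])
```

Note that fusing shrinks the posterior variance below either source's alone; new evidence refines the estimate rather than overwriting it.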

How teams use the model

The causal model sits underneath the product. Your team interacts with three workflows on top of it.

Answer questions in natural language

“Did the prompt change last Tuesday cause the latency regression?”
The system parses the question, identifies what kind of evidence could answer it, routes to the relevant sources, and returns an estimate with a confidence interval and a clear note on whether the available data is strong enough to support a causal conclusion or only a correlational one.

Triage hypotheses prospectively

When your team proposes ten things to try, the system ranks them by how much each would actually teach you given what you already know, and identifies the cheapest evidence path for each. Each hypothesis gets one of these paths.
  • Already answerable from existing data.
  • Answerable with quasi-experimental analysis.
  • Requires an A/B test.
  • Requires CTF-RAND at decision point X for two weeks.
  • Unidentifiable even with full instrumentation.
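The ranking step above can be sketched as scoring each hypothesis by how much its cheapest evidence path would shrink the current posterior variance on the effect. Everything here is an illustrative assumption — the `Hypothesis` fields, the variance-reduction scoring rule, and the example numbers — not Remyx's actual triage logic:

```python
# Illustrative sketch of prospective triage: rank hypotheses by the
# posterior-variance reduction their cheapest evidence path would buy.
# Fields, scoring rule, and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    prior_var: float     # current posterior variance on the effect
    evidence_var: float  # variance of the cheapest available evidence
    path: str            # cheapest evidence path label

def info_gain(h):
    """Variance reduction if the cheapest evidence were collected."""
    post_var = 1.0 / (1.0 / h.prior_var + 1.0 / h.evidence_var)
    return h.prior_var - post_var

def triage(hypotheses):
    return sorted(hypotheses, key=info_gain, reverse=True)

ranked = triage([
    Hypothesis("swap retriever", prior_var=0.10, evidence_var=0.01, path="Requires an A/B test"),
    Hypothesis("shorten prompt", prior_var=0.02, evidence_var=0.01, path="Already answerable from existing data"),
])
```

A hypothesis the model is already confident about scores low even if its evidence is cheap, which is what "ranked by how much each would actually teach you" means in practice.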

Surface what’s not yet known

When you ask a question the current evidence cannot answer, the system tells you what data would answer it. That turns “we don’t know” from a dead end into a planning surface.

How this fits with A/B testing

Causal intelligence works alongside A/B testing. The three evidence layers each handle a different operational price point and answer a different class of question.
| | A/B testing | Causal data fusion (quasi-experimental) | Counterfactual randomization |
|---|---|---|---|
| Evidence quality | Randomized | Quasi-experimental, with RCT evidence fused in when available | Counterfactual, from instrumented perturbations |
| Operational cost | High per change | Low marginal cost (reuses existing data) | Moderate (SDK instrumentation, shadow-mode audit before live) |
| Coverage | Changes worth gating with a traffic split | Every commit boundary in the log history | Decision points where the SDK is instrumented |
| Question types | Average effect of a randomized treatment | Average effects across changes the team already shipped | Attribution, mediation, effect on the treated, low-traffic effects |
| When to use | Decisions worth the cost of randomization | Continuous learning from the work the team is already doing | Questions A/B testing structurally cannot answer |
A/B tests sharpen the model where they earn their operational cost. Quasi-experimental evidence is the entry point for teams not yet platformed for A/B testing, producing causal estimates from observational data they already collect. Counterfactual randomization extends the model to questions that no amount of A/B testing can resolve.

Theoretical foundation

The system makes only those causal claims that can be backed by evidence you could in principle collect, and avoids the layers of counterfactual reasoning that depend on assumptions which cannot be checked against data. Pearl’s Causal Hierarchy (PCH) organizes causal questions into three rungs of increasing strength: L1 (associational, “what is”), L2 (interventional, “what happens if I do”), and L3 (counterfactual, “what would have happened if I had done”). Yang & Bareinboim (2025) introduce intermediate rungs L2.25 and L2.5 for the counterfactual evidence that CTF-RAND collects. The evidence sources map to these layers as follows.
| Source | Layer | Notes |
|---|---|---|
| Observational logs | L1 | Supports correlational queries and structural discovery |
| Quasi-experiments (logs + commits) | L2 | Local identification under “nothing else changed in the window” |
| A/B tests | L2 | Strong identification via randomization |
| CTF-RAND, default | L2.25 | Same overridden value applies across the trajectory |
| CTF-RAND, mediation opt-in | L2.5 | Per-child value assignment for mediation analysis |
Readers who want the formal account should start with the further reading below.

Shipping status

For current status of each capability, see the Roadmap. For how these capabilities map to the customer journey, see Maturity Progression.

Further reading

Yang, K., & Bareinboim, E. (2025). A Hierarchy of Graphical Models for Counterfactual Inferences. Background on the graphical models used in causal and counterfactual inference, including the L2.25 and L2.5 layers referenced above.

Pearl, J., & Mackenzie, D. (2018). The Book of Why. Accessible introduction to the Causal Hierarchy and the difference between association, intervention, and counterfactual reasoning.