Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
{
"label": "Running well powered experiments",
"position": 10,
"collapsible": true,
"collapsed": false,
"link": {
"type": "doc",
"id": "guides/advanced-experimentation/running-well-powered-experiments/index"
}
}
@@ -0,0 +1,40 @@
---
sidebar_position: 5
---

# Ambition, disagreement, and tactical vs. strategic measurement

When traffic is scarce, the limiting factor is often **not** whether a 2% lift is detectable—it is whether the idea is worth shipping at all.

## Test bolder ideas

Marginal tweaks rarely justify weeks of engineering and coordination when each variant only receives a trickle of users. **Ambitious, controversial** ideas fail more often, but when they do move a metric, they move it enough to measure, and they help settle strategic disagreements.

Experimentation is a safety net: use it to try concepts that would not pass a consensus deck. Expect that **many** ideas will not win on the first iteration; treat that as a reason to **explore widely**, not to stop testing.

## Learn from non-users and unhappy users

Some of the best hypotheses come from people who **considered** your product and walked away, or from churned and dissatisfied users. Ask structured questions about blockers and anxieties.

Do not treat every feature request literally—the classic “faster horses” warning still applies. A useful framing (from performance feedback) is that **the proposed fix may be wrong, but the note is right**: something is off, and it is your job to invent the right solution. Catalog complaints, then design **your own** responses; internally, surface **where teams disagree** and prioritize experiments that resolve those disagreements.

## Tactical A/B tests vs. strategic holdouts

Not every decision should be settled with a standard A/B on a distant revenue metric.

**Tactical** questions suit randomized comparisons on **immediate** outcomes: Did the new search UI reduce redundant queries? Did the form capture valid emails more reliably?

**Strategic** questions—whether sustained investment in an area (search, onboarding, a new surface) is worth the opportunity cost—often need **longer horizons**, **holdouts**, or separate program-level measurement so you separate “did we ship a good iteration?” from “should we fund this bet for the next year?”

### Example: search

- **Tactical**: A/B changes that improve query reformulation, latency, or click position—metrics close to the product surface.
- **Strategic**: Whether the organization should fund a larger search initiative; that may be better served by a **holdout** or long-running program metric than by a single short A/B.

### Example: email capture

- **Tactical**: Experiments on the “Send me more” flow—copy, validation, UX—to improve capture quality.
- **Tactical follow-on**: What you do with those emails (reactivation journeys) and how that moves downstream conversions.
- **Strategic**: Whether relying on that intermediate step matches the company’s long-term relationship to customers—often a slower, broader evaluation than one funnel test.

Use [proximal metrics and entry points](/guides/advanced-experimentation/running-well-powered-experiments/design-targeting-dilution-proximal-metrics/) so tactical tests stay measurable; use [runtime and baselines](/guides/advanced-experimentation/running-well-powered-experiments/runtime-seasonality-sample-size/) when the strategic question truly needs time.
@@ -0,0 +1,36 @@
---
sidebar_position: 4
---

# Rich covariates for CUPED++

Standard CUPED that only uses **pre-experiment activity** on the outcome metric helps less for **brand-new users**, because there is little or no pre-period behavior to condition on. Eppo’s CUPED++ uses a **regression** on pre-period metrics available in the experiment **and** on properties attached to assignments. Passing informative fields through assignment logs can materially shrink confidence intervals.

You often have more information than you think:

### Acquisition and intent

- **Origination**: channel, campaign, UTM parameters, app install source. These encode what was promised to the user and which media they trust.
- **First-session context**: landing page, referral type, or signup flow variant.

### Geography and environment

- **GeoIP** gives at least country; finer location can support features such as urban density, commute patterns, or proximity to relevant physical infrastructure (for example education or retail anchors), when those map to your product.

### Weather and season

Weather can look quaint, but it shifts time indoors, commute friction, and mood. Encoding **rain, cold snaps, or seasonal buckets** can improve predictions for engagement-heavy products when those patterns line up with usage.

### Time structure

- Time of day, day of week, proximity to **payday**, school holidays, major retail events. Many products see spikes on “fresh start” dates or long weekends. Give each feature **a small number of meaningful levels** so the model can use them without overfitting.
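
A minimal sketch of what "a small number of meaningful levels" could look like before the values are attached to an assignment. The property names (`day_part`, `day_type`, `near_payday`), the hour boundaries, and the payday rule are all hypothetical choices for illustration, not fields Eppo requires.

```python
from datetime import datetime


def time_features(ts: datetime) -> dict[str, str]:
    """Bucket a timestamp into a few coarse, hypothetical assignment properties."""
    hour = ts.hour
    if 5 <= hour < 12:
        day_part = "morning"
    elif 12 <= hour < 18:
        day_part = "afternoon"
    elif 18 <= hour < 23:
        day_part = "evening"
    else:
        day_part = "night"
    # weekday() returns 0 for Monday through 6 for Sunday.
    day_type = "weekend" if ts.weekday() >= 5 else "weekday"
    # Crude payday proxy: assumes pay lands around the 1st and 15th of the month.
    near_payday = "yes" if ts.day in (1, 2, 15, 16) else "no"
    return {"day_part": day_part, "day_type": day_type, "near_payday": near_payday}


# Saturday 2 March 2024, 09:30
print(time_features(datetime(2024, 3, 2, 9, 30)))
```

Four day parts, two day types, and a yes/no flag give the regression a handful of levels to work with instead of hundreds of distinct timestamps.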

### Demographics and firmographics

When available and appropriate for your use case and policies, demographic or account-level descriptors can explain baseline differences—used responsibly and in line with privacy commitments.

### What to expect in the product

Eppo’s analysis UI summarizes **lift and uncertainty**, not the full regression table. For stakeholders who want to go deeper into **which factors** drove adjustment, teams sometimes export aggregates or replicate a slice of the logic in a notebook; that workflow is outside the default UI path.

If you need a convincing narrative before launch, a **historical rehearsal** (a fake or retrospective split on pre-period data) can illustrate how much variance reduction you get from a given covariate set—paired with the intuition that a **large CUPED adjustment** usually means strong pre-existing differences between arms, not a frivolous correction.
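
One way to run such a rehearsal in a notebook is sketched below. It uses simulated data and a plain least-squares fit, so it is an illustration of the idea rather than Eppo's CUPED++ implementation; the covariates (`channel`, `weekend`) and every number in it are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated new users: no pre-period outcome, only assignment-time properties.
channel = rng.integers(0, 3, size=n)  # hypothetical acquisition channel, 3 levels
weekend = rng.integers(0, 2, size=n)  # hypothetical weekend-signup flag
channel_effect = np.array([5.0, 10.0, 20.0])
outcome = channel_effect[channel] + 3.0 * weekend + rng.normal(0.0, 4.0, size=n)

# Design matrix: intercept, one-hot channel (first level dropped), weekend flag.
X = np.column_stack([
    np.ones(n),
    channel == 1,
    channel == 2,
    weekend,
]).astype(float)

# Ordinary least squares fit of the outcome on the covariates.
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
residual = outcome - X @ beta

# Share of outcome variance the covariates explain (R squared).
variance_reduction = 1.0 - residual.var() / outcome.var()
# Approximate factor by which a confidence interval would narrow.
ci_width_ratio = np.sqrt(1.0 - variance_reduction)

print(f"variance explained by covariates: {variance_reduction:.0%}")
print(f"approximate CI width vs. unadjusted: {ci_width_ratio:.2f}x")
```

Swapping the simulated columns for real historical data and comparing covariate sets side by side gives stakeholders a concrete sense of which properties are worth logging.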
@@ -0,0 +1,29 @@
---
sidebar_position: 3
---

# Targeting, dilution, and proximal metrics

Power is not only a property of your formulas—it is a property of **who** enters the analysis and **what** you measure.

## Very small audiences

If you have **fewer than roughly a hundred** subjects per variant, treat the experiment as partly qualitative: talk to users, review support tickets, and run interviews. The numbers may still rule out huge effects, but you will not resolve fine-grained product questions by statistics alone.
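
To make "rule out huge effects" concrete, the sketch below approximates the minimum detectable effect for a conversion metric at this scale. It assumes a two-sided two-proportion z-test with the conventional 5% significance level and 80% power, and a hypothetical 20% baseline rate; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_variant: int, baseline_rate: float,
                              alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate absolute MDE for a two-sided, two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference, using the baseline rate for both arms.
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_variant)
    return (z_alpha + z_power) * se


mde = minimum_detectable_effect(n_per_variant=100, baseline_rate=0.20)
print(f"MDE at 100 per variant: {mde:.1%} absolute")
```

Under these assumptions the detectable difference is roughly 16 percentage points, that is, moving a 20% rate to about 36%. Anything subtler needs qualitative evidence or more traffic.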

## Prefer metrics close to the change

Pick outcomes that the intervention plausibly moves **directly**: engagement with the new UI, task completion, satisfaction on the flow you changed. Downstream metrics (long-term retention, company-wide revenue) are legitimate as guardrails or long-horizon follow-ups, but they dilute signal when the sample is small and the change is narrow.

Where it helps, **binarize or bucket** continuous outcomes so that a few extreme values do not dominate the estimate at your scale: for example, the share of users who needed **more than fifteen minutes** to complete a task can be more stable and interpretable than the average completion time alone.
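
The stability claim can be checked with a small simulation. The sketch below draws hypothetical, heavy-tailed task times (log-normal with a 10-minute median, an assumption made purely for illustration) and compares how much the mean and the over-fifteen-minutes share fluctuate across repeated samples of 200 users.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_repeats = 200, 2000

# Hypothetical task times in minutes: log-normal with a 10-minute median.
times = rng.lognormal(mean=np.log(10.0), sigma=1.5, size=(n_repeats, n_users))

mean_time = times.mean(axis=1)            # average completion time per sample
slow_share = (times > 15.0).mean(axis=1)  # share needing more than 15 minutes

# Relative noise: spread of the estimate across samples divided by its average.
rel_noise_mean = mean_time.std() / mean_time.mean()
rel_noise_share = slow_share.std() / slow_share.mean()

print(f"relative noise, mean time:  {rel_noise_mean:.2f}")
print(f"relative noise, slow share: {rel_noise_share:.2f}")
```

With a tail this heavy, the share is the steadier of the two. Lower noise only translates into power if the change actually moves users across the threshold, so the cut-off should reflect what the intervention is meant to fix.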

## Targeting and segments

Define **indicative categories** (cohorts, intents, or lifecycle stages) so you can focus analysis on the population the change actually affects. A broad average can hide a clear win in the subgroup you care about—or hide harm outside it.

## Remove dilution with entry points

When assignment happens before exposure, including users who never see the treatment adds noise. Use [entry points](/guides/advanced-experimentation/entry_points) (qualifying events) so the experiment analyzes people who were actually eligible for the experience under test.
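
A back-of-the-envelope sketch of the cost of dilution follows. It assumes the effect is zero for users who never reach the experience and that outcome variance is similar for exposed and unexposed users; both are simplifications, and the function name is hypothetical. Results are expressed relative to the sample size needed if every assigned user were exposed.

```python
def dilution_cost(exposed_share: float) -> dict[str, float]:
    """Relative assigned traffic needed with and without filtering to exposure.

    Assumes no effect on unexposed users and similar outcome variance in
    both populations (a simplification).
    """
    if not 0 < exposed_share <= 1:
        raise ValueError("exposed_share must be in (0, 1]")
    # Averaged over all assigned users, the effect shrinks by the exposed
    # share, and required sample size scales with 1 / effect^2.
    assigned_needed_all = 1 / exposed_share ** 2
    # Filtering to exposed users keeps the full effect, but only a fraction
    # of assigned users qualify, so assigned traffic scales with 1 / share.
    assigned_needed_filtered = 1 / exposed_share
    return {
        "all_assigned": assigned_needed_all,
        "entry_point": assigned_needed_filtered,
        "savings_factor": assigned_needed_all / assigned_needed_filtered,
    }


print(dilution_cost(0.25))
```

If only a quarter of assigned users reach the experience, analyzing everyone needs about sixteen times the baseline traffic, while filtering to exposed users needs about four times, a fourfold saving under these assumptions.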

## A higher-conversion proxy metric

When a strategic outcome is rare, decompose the decision into **steps** with higher base rates: e.g. measuring whether users successfully used an improved search experience (queries reformulated, clicks on results) before you insist on revenue impact. Pair that with the [tactical vs. strategic](/guides/advanced-experimentation/running-well-powered-experiments/ambition-tactical-strategic-measurement/) framing so tactical A/B tests do not silently substitute for a strategic bet.
@@ -1,5 +1,5 @@
---
sidebar_position: 10
sidebar_position: 1
---

# Running well powered experiments with smaller sample size
@@ -71,6 +71,10 @@ This example focuses on a continuous variable (revenue), but the same concept ap

If we do not have any pre-period data to leverage, for example because we run an experiment on new users, assignment properties can still help you lower the variance in your experiment. Any property added to your assignment definition will also get included in the regression. For instance, if you add the source a user came from, their region, and their device, Eppo's CUPED++ model will use those values to reduce variance. Note that this can help speed up both new user experiments and existing user experiments.

For intuition when presenting CUPED to stakeholders: a **large adjustment** after CUPED is usually not arbitrary—it reflects that control and treatment were far apart on predictable dimensions before the test, and the model is doing the work of separating signal from that imbalance. When audiences are very small (on the order of dozens of subjects per variant), supplement numbers with qualitative conversations.

For a deeper discussion of **which covariates to pass** (assignment properties, geography, time, and other features) and how they help new-user experiments in particular, see [Rich covariates for CUPED++](/guides/advanced-experimentation/running-well-powered-experiments/covariates-for-cuped-plus-plus/).

## Choosing a statistical paradigm

This leaves one final lever: the statistical methodology we choose to use to analyze the results. In general, [there are no miracles](https://www.geteppo.com/blog/comparing-frequentist-vs-bayesian-approaches) here but certainly the choice will affect results.
@@ -81,6 +85,15 @@ However, particularly the t-test is susceptible to peeking. If this is a problem

We want to stay away from the fully sequential paradigm when we struggle to find enough power in the first place. We cannot afford the cost in width of the confidence intervals for the added flexibility. Furthermore, it is unlikely we would be able to stop the experiment early anyway.

## Beyond variance reduction and statistical methodology

Lowering metric variance (including via winsorization and CUPED++) and picking an analysis paradigm are powerful levers, but they are not the only ones. When power is constrained, also consider:

- **[Runtime, seasonality, and sample size](/guides/advanced-experimentation/running-well-powered-experiments/runtime-seasonality-sample-size/)** — Running longer is often the best response when the observed gap between variants is large relative to noise; longer runs also build baselines you can compare to holidays, campaigns, and external shocks.
- **[Targeting, dilution, and proximal metrics](/guides/advanced-experimentation/running-well-powered-experiments/design-targeting-dilution-proximal-metrics/)** — Who you measure, how you [filter to true exposure](/guides/advanced-experimentation/entry_points), and whether outcomes are close to the change.
- **[Rich covariates for CUPED++](/guides/advanced-experimentation/running-well-powered-experiments/covariates-for-cuped-plus-plus/)** — Passing informative assignment and context features into the model beyond pre-period activity alone.
- **[Ambition, disagreement, and tactical vs. strategic measurement](/guides/advanced-experimentation/running-well-powered-experiments/ambition-tactical-strategic-measurement/)** — Whether the idea is large enough to justify scarce traffic, how to learn from users who did not convert, and when A/B tests should be paired with holdouts or decomposed decisions.

## Conclusion

In certain situations, we really need to make the most out of a limited sample size. In this case, remember that it is all about optimizing the signal-to-noise ratio. First and foremost, we should make sure we choose our metrics carefully. With winsorization, CUPED++, and a choice of statistical methodology, Eppo helps you make the most out of our data.
In certain situations, we really need to make the most out of a limited sample size. In this case, remember that it is all about optimizing the signal-to-noise ratio. First and foremost, we should make sure we choose our metrics carefully. With winsorization, CUPED++, and a choice of statistical methodology, Eppo helps you make the most out of your data, and [design choices outside pure variance reduction](/guides/advanced-experimentation/running-well-powered-experiments/runtime-seasonality-sample-size/) matter just as much when samples are small.
@@ -0,0 +1,23 @@
---
sidebar_position: 2
---

# Runtime, seasonality, and sample size

When treatment and control look very different compared to what you would expect from random noise alone, the first question to ask is whether you can **collect more data**. A larger $N$ tightens intervals, and a large imbalance can also mean the variants were not comparable at the start; more time, paired with good covariate adjustment, helps average that imbalance out.

## Run longer when you can

Halving the standard error requires roughly **four times** the sample size, because the standard error shrinks in proportion to $1/\sqrt{N}$. Detecting an effect half as large therefore takes about four times the traffic. There is no substitute for duration or traffic when the effect you care about is subtle.
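
The four-times relationship can be checked with a standard sample-size approximation. The sketch below assumes a two-sided two-sample z-test on means with equal variance in both arms, a 5% significance level, and 80% power; the effect sizes and standard deviation are arbitrary illustrative values.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(effect: float, sd: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate n per variant for a two-sided, two-sample z-test on means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


n_full = sample_size_per_variant(effect=1.0, sd=10.0)
n_half = sample_size_per_variant(effect=0.5, sd=10.0)
print(n_full, n_half, round(n_half / n_full, 2))
```

Because the effect enters the formula squared, halving it multiplies the required sample by about four whatever the starting point, which is why small expected lifts are so expensive on low-traffic surfaces.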

Longer runs have benefits beyond raw power:

- You accumulate a **baseline** of behavior that you can compare to holidays, major marketing pushes, weather, or product incidents—so you are less likely to mistake a one-off context for a lasting lift.
- They **normalize** the idea that some questions need weeks or months, which in turn makes it acceptable to ship **more impactful** changes that would never clear a “result this week” bar.
- They allow **compounding** and habit formation to show up when the hypothesis is about sustained behavior change rather than a one-session tweak.

In highly nuanced cases—for example a subscription or premium tier where the value proposition plays out over seasons and repeated use—it can take a very long horizon to measure the trade-offs fairly. In those situations, the constraint is often not the statistical toolkit but the **business patience** to align the metric window with the actual decision.

## When runtime is not enough

If you cannot extend the test, return to the levers in the main guide: [variance reduction](/guides/advanced-experimentation/running-well-powered-experiments/#reducing-variance), [CUPED++ with rich covariates](/guides/advanced-experimentation/running-well-powered-experiments/covariates-for-cuped-plus-plus/), [targeting and entry points](/guides/advanced-experimentation/running-well-powered-experiments/design-targeting-dilution-proximal-metrics/), and [experiment ambition](/guides/advanced-experimentation/running-well-powered-experiments/ambition-tactical-strategic-measurement/). Runtime is the best fix when imbalance is the core issue; the other options help when time is capped.