An A/B test runs smoothly. Variant B looks better, even statistically significant. Decision made, test closed.
But shortly afterwards the result reverses and the improvement fizzles out. What went wrong?
Many people rely on the p-value alone. It only tells you whether a difference is unlikely to be pure chance, not how precisely the effect has been measured or how certain the result really is.
Confidence intervals help with this. They show how stable your test result is and how much uncertainty it contains.
Without this understanding, you will make decisions that will cost you money later on.

Confidence interval: what it really is and why you need it
A confidence interval tells you how precisely your conversion rate is estimated.
Example: You tested 2,000 users, 82 of whom converted. This results in a conversion rate of 4.1 %.
A statistical tool calculates a confidence interval of [3.3 % - 5.0 %], with a confidence level of 95 %.
This means: if you ran the same test 100 times with new users and calculated an interval each time, about 95 of those intervals would contain the true conversion rate.
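If you want to reproduce such an interval yourself, here is a minimal Python sketch using the normal approximation. Many tools use the slightly more robust Wilson interval instead, so their bounds can differ by about a tenth of a percentage point:

```python
from scipy.stats import norm

# Numbers from the example above
conversions, visitors = 82, 2000
rate = conversions / visitors                      # 0.041 -> 4.1 %

# 95 % confidence interval via the normal approximation (Wald interval)
z = norm.ppf(0.975)                                # ~1.96 for a 95 % level
standard_error = (rate * (1 - rate) / visitors) ** 0.5
lower, upper = rate - z * standard_error, rate + z * standard_error

print(f"Conversion rate: {rate:.1%}")              # 4.1%
print(f"95 % CI: [{lower:.1%} - {upper:.1%}]")     # roughly [3.2% - 5.0%]
```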
What is the confidence level?
The confidence level indicates how certain you can be that the interval contains the true value.
In practice, 95 % is used almost always: a good compromise between certainty and efficiency.
The higher the level, the wider the interval, but also the more cautious your assessment.
Why this is important
- A single percentage seems precise, but is only an estimate
- Only the interval shows how reliable this estimate is
- The smaller the sample, the greater the fluctuation
- The higher the confidence level, the more conservative the assessment
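The trade-off between confidence level and interval width is easy to see with the same numbers as above. A small sketch, again using the normal approximation, so the exact bounds may differ slightly from your tool:

```python
from scipy.stats import norm

conversions, visitors = 82, 2000
rate = conversions / visitors
standard_error = (rate * (1 - rate) / visitors) ** 0.5

# Same data, three confidence levels: the higher the level, the wider the interval
for level in (0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - level) / 2)
    print(f"{level:.0%}: [{rate - z * standard_error:.1%} - {rate + z * standard_error:.1%}]")
# 90%: [3.4% - 4.8%]   95%: [3.2% - 5.0%]   99%: [3.0% - 5.2%]
```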
How confidence intervals validate A/B tests
Imagine you are testing two variants of a landing page:
- Variant A: 4.1 % conversion
- Variant B: 4.9 % conversion
Without further information, B looks like the clear winner. But only a look at the confidence intervals shows whether you can rely on it:
A: [3.8 % - 4.4 %]
B: [4.6 % - 5.2 %]
The intervals do not overlap. This is a strong signal: the improvement is probably real.
Another scenario:
A: [3.6 % - 4.6 %]
B: [4.0 % - 5.3 %]
Now there is an overlap. This means that the two variants could actually perform equally well. The measured difference may have arisen by chance. A decision on this basis would be risky.
Rule of thumb:
- No overlap → a decision is possible
- Overlap → the result is uncertain; extend the test or collect more data
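The overlap check behind this rule of thumb is trivial to automate. A minimal sketch with the interval bounds from the two scenarios above, hard-coded purely for illustration:

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two confidence intervals share any common range."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Scenario 1: clearly separated intervals -> the uplift is probably real
print(intervals_overlap((0.038, 0.044), (0.046, 0.052)))   # False

# Scenario 2: overlapping intervals -> the difference may be pure chance
print(intervals_overlap((0.036, 0.046), (0.040, 0.053)))   # True
```

Note that this heuristic is deliberately cautious: non-overlapping 95 % intervals reliably indicate a significant difference, while overlapping intervals simply tell you to gather more data or run a direct significance test.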
What this gives you
- You can recognize whether a difference is statistically verified or only appears to exist
- You not only make decisions faster, but also with higher quality
- You reduce the risk of investing resources in a supposedly better variant
The underestimated risk zones: confidence level, Type I and Type II errors
An A/B test reports a 95 % confidence level. That sounds reliable, but what exactly does it mean?
It means that if you ran the same test a hundred times with other visitors and calculated an interval each time, the true value would lie inside the calculated confidence interval in around 95 of those cases. In five cases, however, it would not. These five percent are the error probability you accept with every test: the so-called Type I error.
Type I error: you think a random result is real
One example:
- Variant A: 4.1 % conversion (820 conversions with 20,000 visitors)
- Variant B: 4.6 % conversion (920 conversions with 20,000 visitors)
- p-value: 0.045
- Confidence intervals:
A: [3.8 % - 4.4 %]
B: [4.3 % - 4.9 %]
That looks convincing. B seems better, the intervals hardly overlap. Nevertheless, the result may have arisen by chance. In this case, the decision would be wrong, even though the test was formally correct.
Why? The two confidence intervals sit right next to each other. Variant A's interval reaches up to 4.4 percent, while variant B's already starts at 4.3 percent, so the two still overlap slightly. Such a marginal difference can easily arise by chance. In reality, both variants could perform equally well. The method reports "significance", but not the uncertainty behind the result. This is precisely the Type I error: you believe one variant is better, although the effect is not reliable.
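The logic behind the Type I error becomes tangible in a small simulation: if both variants in truth convert identically (an A/A situation), a test at the 5 % significance level will still declare a "winner" in roughly one out of twenty runs. A sketch of this idea in Python, with made-up settings (true rate, sample size, seed) chosen to mirror the example above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)                 # fixed seed, purely illustrative
n, true_rate, alpha = 20_000, 0.041, 0.05       # both variants share the same true rate
runs, false_positives = 10_000, 0

for _ in range(runs):
    conv_a = rng.binomial(n, true_rate)
    conv_b = rng.binomial(n, true_rate)
    p_a, p_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    p_value = 2 * norm.sf(abs(p_b - p_a) / se)  # two-sided z-test for two proportions
    false_positives += p_value < alpha

# Every "significant" result here is a false positive, i.e. a Type I error
print(f"False positive rate: {false_positives / runs:.1%}")   # close to 5 %
```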
Type II error: you overlook a variant that is actually better
Another scenario:
- Variant A: 4.1 % conversion (123 conversions with 3,000 visitors)
- Variant B: 4.8 % conversion (144 conversions with 3,000 visitors)
- p-value: 0.12
- Confidence intervals:
A: [3.4 % - 4.9 %]
B: [4.0 % - 5.7 %]
The figures for variant B are better, but the confidence intervals overlap considerably. The upper limit of A is 4.9 percent, the lower limit of B is 4.0 percent. This means that the difference is not clear enough.
Why is this a Type II error?
Because although the effect exists, it is not statistically verifiable, at least not with this amount of data. The statistical power of the test is not sufficient to make the difference visible. You reject variant B even though it is actually better. The error lies not in the interpretation but in the insufficient amount of data.
In such cases, only one thing really helps: more evidence. Extend the test duration, collect more data, or base your decision on additional criteria such as effect size, business impact or previous experience. If you settle for a blanket "not significant", you often miss out on real opportunities.
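You can also quantify this lack of test power. A hedged sketch using statsmodels, assuming the measured rates of 4.1 % and 4.8 % were the true rates and a two-sided test at alpha = 0.05:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Cohen's h for the two conversion rates from the example
effect = proportion_effectsize(0.048, 0.041)

# Probability of detecting this difference with 3,000 visitors per variant
power = NormalIndPower().solve_power(effect_size=effect, nobs1=3000,
                                     alpha=0.05, alternative='two-sided')
print(f"Power with 3,000 visitors per variant: {power:.0%}")   # roughly 26 %
```

With a power of roughly one in four, a non-significant result says very little: most runs of this test would miss a real 0.7-percentage-point uplift.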
How to plan test run time and sample size with confidence intervals
What influences the width of the confidence interval?
A confidence interval becomes narrower the more data you collect.
Three factors are decisive:
- Sample size: More users lead to less statistical noise
- Stability of conversion rates: Large fluctuations increase the interval
- Confidence level: Higher level means wider interval
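The effect of the first factor, sample size, is easy to see in numbers. A small sketch using the normal approximation, with the conversion rate held constant at 4.1 %:

```python
from scipy.stats import norm

z = norm.ppf(0.975)        # 95 % confidence level
rate = 0.041               # observed conversion rate held constant

# Identical rate, growing sample: the interval's half-width shrinks with more visitors
for visitors in (1_000, 5_000, 20_000):
    standard_error = (rate * (1 - rate) / visitors) ** 0.5
    print(f"n = {visitors:>6}: +/- {z * standard_error:.2%}")
# n =   1000: +/- 1.23%   n =   5000: +/- 0.55%   n =  20000: +/- 0.27%
```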
Example: How the expected difference influences your planning
You expect an improvement of around 1.5 percentage points.
How large does your sample have to be for each variant?
- At 4.0 % vs. 5.5 %: approx. 3,500 visitors per variant
- At 4.0 % vs. 4.5 %: approx. 19,000 visitors per variant
Conclusion: Small effects require large amounts of data. If you underestimate this, you will get confidence intervals that overlap considerably and results that you cannot rely on.
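This backwards planning is exactly what a sample size calculator does. A sketch with statsmodels, assuming 80 % power and a two-sided test at alpha = 0.05; calculators that assume a different power level or a one-sided test will land on somewhat different numbers, which is why the rounded figures above can deviate:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

analysis = NormalIndPower()

# Required visitors per variant for the two planning scenarios above
for baseline, target in [(0.040, 0.055), (0.040, 0.045)]:
    effect = proportion_effectsize(target, baseline)
    n = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.8,
                             alternative='two-sided')
    print(f"{baseline:.1%} vs. {target:.1%}: about {n:,.0f} visitors per variant")
```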
Recommendation for practice
Always plan tests backwards: define the minimum effect you want to detect and derive the necessary sample size from it, for example with a sample size or significance calculator. Don't start blindly, but with a clear target for duration, data volume and confidence level.
Without sound sample size planning, A/B tests often produce nothing but statistical noise.
Practical pitfalls: The most common errors in thinking about confidence intervals
Misconception 1: Confusing the confidence interval with certainty
Misconception 2: Stopping the test as soon as significance is reached
Misconception 3: Comparing confidence intervals as if they were fixed values
Misconception 4: Statistically significant = practically relevant
Misconception 5: Comparing several variants without adjusting for multiple comparisons
Conclusion & recommendations for practice: How to use statistics for better tests
Confidence intervals are not additional knowledge for statistics nerds. They are a key tool for anyone who wants to reliably evaluate A/B tests and make well-founded decisions.
Those who ignore them are flying blind. Those who use them correctly not only see whether a result is reliable, but also how reliable it is and how big the effect could really be.
Three key learnings
1. A single percentage value is not sufficient
Without a confidence interval, there is no framework for correctly classifying results.
2. Significance alone is not enough
Statistical significance does not automatically mean practical relevance. The width of the interval makes the difference.
3. Test quality depends on the preparation
If you don't do any size planning, you can't make any reliable statements even with clean statistics.
Three recommendations for practice
1. Check confidence intervals deliberately
In every test report, look at how narrow the intervals are and whether they overlap.
2. Carry out size planning before starting the test
Use a calculator to determine sample size and duration based on your expectations.
3. Don't accept tool results unchecked
Ask yourself what exactly your tool shows you and how it calculates the result.
Those who understand confidence intervals test with foresight and make decisions that work.
More articles about A/B testing
- A/B testing: how it works, tips & solutions
A comprehensive guide with 5-step instructions for effective A/B tests, from hypothesis to evaluation.
- User testing: methods, processes & metrics
Find out how real user feedback leads to better decisions through targeted user testing.
- Effective optimization through multivariate testing
Learn how to test several elements at the same time to identify the best combination.
- A/A tests explained: Validation for reliable data
Why A/A tests are important to validate your testing setup and ensure data quality.
- 10 red flags in A/B testing that you should avoid
The most common mistakes in A/B testing and how to avoid them.
- BigQuery A/B Testing
How to efficiently analyze A/B tests at data level with BigQuery and Varify.io.
- Server-side tracking with GTM & GA4
More control over your data through server-side tracking with Google Tag Manager and GA4.
- A/B testing for Shopify: everything you need to know
Smart strategies and technical tips for successful A/B testing in Shopify stores.
- Split tests explained simply: definition, application, implementation
This is how split tests work and how to use them specifically.
- WordPress A/B testing
How to effectively integrate A/B tests into your WordPress website.
- Shopify Themes A/B Testing
Optimization of Shopify themes through targeted A/B testing for better conversion rates.