An A/B test runs smoothly. Variant B looks better, even statistically significant. Decision made, test closed.
But shortly afterwards the result reverses and the improvement fizzles out. What went wrong?
Many people rely on the p-value alone. It only tells you whether a difference is unlikely to be pure chance, not how precisely the effect has been measured or how certain the result really is.
Confidence intervals help with this. They show how stable your test result is and how much uncertainty it contains.
Without this understanding, you will make decisions that will cost you money later on.

Confidence interval: what it really is and why you need it
A confidence interval tells you how precisely your conversion rate is estimated.
Example: You tested 2,000 users, 82 of whom converted. This results in a conversion rate of 4.1 %.
A statistical tool calculates a confidence interval of [3.3 % - 5.0 %], with a confidence level of 95 %.
This means: if you ran the same test 100 times with new users and calculated an interval each time, about 95 of those intervals would contain the true conversion rate.
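If you want to reproduce such an interval yourself, here is a minimal Python sketch using the normal approximation. Many tools use the slightly more robust Wilson interval instead, so their bounds can differ by about a tenth of a percentage point:

```python
from scipy.stats import norm

# Numbers from the example above
conversions, visitors = 82, 2000
rate = conversions / visitors                      # 0.041 -> 4.1 %

# 95 % confidence interval via the normal approximation (Wald interval)
z = norm.ppf(0.975)                                # ~1.96 for a 95 % level
standard_error = (rate * (1 - rate) / visitors) ** 0.5
lower, upper = rate - z * standard_error, rate + z * standard_error

print(f"Conversion rate: {rate:.1%}")              # 4.1%
print(f"95 % CI: [{lower:.1%} - {upper:.1%}]")     # roughly [3.2% - 5.0%]
```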
What is the confidence level?
The confidence level indicates how certain you can be that the interval contains the true value.
In practice, 95 % is used almost always: a good compromise between certainty and efficiency.
The higher the level, the wider the interval, but also the more cautious your assessment.
Why this is important
- A single percentage seems precise, but is only an estimate
- Only the interval shows how reliable this estimate is
- The smaller the sample, the greater the fluctuation
- The higher the confidence level, the more conservative the assessment
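The trade-off between confidence level and interval width is easy to see with the same numbers as above. A small sketch, again using the normal approximation, so the exact bounds may differ slightly from your tool:

```python
from scipy.stats import norm

conversions, visitors = 82, 2000
rate = conversions / visitors
standard_error = (rate * (1 - rate) / visitors) ** 0.5

# Same data, three confidence levels: the higher the level, the wider the interval
for level in (0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - level) / 2)
    print(f"{level:.0%}: [{rate - z * standard_error:.1%} - {rate + z * standard_error:.1%}]")
# 90%: [3.4% - 4.8%]   95%: [3.2% - 5.0%]   99%: [3.0% - 5.2%]
```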
How confidence intervals validate A/B tests
Imagine you are testing two variants of a landing page:
- Variant A: 4.1 % conversion
- Variant B: 4.9 % conversion
Without further information, B looks like the clear winner. But only a look at the confidence intervals shows whether you can rely on it:
A: [3.8 % - 4.4 %]
B: [4.6 % - 5.2 %]
The intervals do not overlap. This is a strong signal: the improvement is probably real.
Another scenario:
A: [3.6 % - 4.6 %]
B: [4.0 % - 5.3 %]
Now there is an overlap. This means that the two variants could actually perform equally well. The measured difference may have arisen by chance. A decision on this basis would be risky.
Rule of thumb:
- No overlap → a decision is possible
- Overlap → the result is uncertain; extend the test or collect more data
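The overlap check behind this rule of thumb is trivial to automate. A minimal sketch with the interval bounds from the two scenarios above, hard-coded purely for illustration:

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two confidence intervals share any common range."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Scenario 1: clearly separated intervals -> the uplift is probably real
print(intervals_overlap((0.038, 0.044), (0.046, 0.052)))   # False

# Scenario 2: overlapping intervals -> the difference may be pure chance
print(intervals_overlap((0.036, 0.046), (0.040, 0.053)))   # True
```

Note that this heuristic is deliberately cautious: non-overlapping 95 % intervals reliably indicate a significant difference, while overlapping intervals simply tell you to gather more data or run a direct significance test.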
What this gives you
- You can recognize whether a difference is statistically verified or only appears to exist
- You not only make decisions faster, but also with higher quality
- You reduce the risk of investing resources in a supposedly better variant
The underestimated risk zones: confidence level, Type I and Type II errors
An A/B test reports a 95 % confidence level. That sounds reliable, but what exactly does it mean?
It means that if you ran the same test a hundred times with other visitors and calculated an interval each time, the true value would lie inside the calculated confidence interval in around 95 of those cases. In five cases, however, it would not. These five percent are the error probability you accept with every test: the so-called Type I error.
Type I error: you think a random result is real
One example:
- Variant A: 4.1 % conversion (820 conversions with 20,000 visitors)
- Variant B: 4.6 % conversion (920 conversions with 20,000 visitors)
- p-value: 0.045
- Confidence intervals:
A: [3.8 % - 4.4 %]
B: [4.3 % - 4.9 %]
That looks convincing. B seems better, the intervals hardly overlap. Nevertheless, the result may have arisen by chance. In this case, the decision would be wrong, even though the test was formally correct.
Why? The two confidence intervals sit right next to each other. Variant A's interval reaches up to 4.4 percent, while variant B's already starts at 4.3 percent, so the two still overlap slightly. Such a marginal difference can easily arise by chance. In reality, both variants could perform equally well. The method reports "significance", but not the uncertainty behind the result. This is precisely the Type I error: you believe one variant is better, although the effect is not reliable.
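The logic behind the Type I error becomes tangible in a small simulation: if both variants in truth convert identically (an A/A situation), a test at the 5 % significance level will still declare a "winner" in roughly one out of twenty runs. A sketch of this idea in Python, with made-up settings (true rate, sample size, seed) chosen to mirror the example above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)                 # fixed seed, purely illustrative
n, true_rate, alpha = 20_000, 0.041, 0.05       # both variants share the same true rate
runs, false_positives = 10_000, 0

for _ in range(runs):
    conv_a = rng.binomial(n, true_rate)
    conv_b = rng.binomial(n, true_rate)
    p_a, p_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    p_value = 2 * norm.sf(abs(p_b - p_a) / se)  # two-sided z-test for two proportions
    false_positives += p_value < alpha

# Every "significant" result here is a false positive, i.e. a Type I error
print(f"False positive rate: {false_positives / runs:.1%}")   # close to 5 %
```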
Type II error: you overlook a variant that is actually better
Another scenario:
- Variant A: 4.1 % conversion (123 conversions with 3,000 visitors)
- Variant B: 4.8 % conversion (144 conversions with 3,000 visitors)
- p-value: 0.12
- Confidence intervals:
A: [3.4 % - 4.9 %]
B: [4.0 % - 5.7 %]
The figures for variant B are better, but the confidence intervals overlap considerably. The upper limit of A is 4.9 percent, the lower limit of B is 4.0 percent. This means that the difference is not clear enough.
Why is this a Type II error?
Because although the effect exists, it is not statistically verifiable, at least not with this amount of data. The statistical power of the test is not sufficient to make the difference visible. You reject variant B even though it is actually better. The error lies not in the interpretation but in the insufficient amount of data.
In such cases, only one thing really helps: more evidence. Extend the test duration, collect more data, or base your decision on additional criteria such as effect size, business impact or previous experience. If you settle for a blanket "not significant", you often miss out on real opportunities.
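You can also quantify this lack of test power. A hedged sketch using statsmodels, assuming the measured rates of 4.1 % and 4.8 % were the true rates and a two-sided test at alpha = 0.05:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Cohen's h for the two conversion rates from the example
effect = proportion_effectsize(0.048, 0.041)

# Probability of detecting this difference with 3,000 visitors per variant
power = NormalIndPower().solve_power(effect_size=effect, nobs1=3000,
                                     alpha=0.05, alternative='two-sided')
print(f"Power with 3,000 visitors per variant: {power:.0%}")   # roughly 26 %
```

With a power of roughly one in four, a non-significant result says very little: most runs of this test would miss a real 0.7-percentage-point uplift.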
How to plan test run time and sample size with confidence intervals
What influences the width of the confidence interval?
A confidence interval becomes narrower the more data you collect.
Three factors are decisive:
- Sample size: More users lead to less statistical noise
- Stability of conversion rates: Large fluctuations increase the interval
- Confidence level: Higher level means wider interval
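The effect of the first factor, sample size, is easy to see in numbers. A small sketch using the normal approximation, with the conversion rate held constant at 4.1 %:

```python
from scipy.stats import norm

z = norm.ppf(0.975)        # 95 % confidence level
rate = 0.041               # observed conversion rate held constant

# Identical rate, growing sample: the interval's half-width shrinks with more visitors
for visitors in (1_000, 5_000, 20_000):
    standard_error = (rate * (1 - rate) / visitors) ** 0.5
    print(f"n = {visitors:>6}: +/- {z * standard_error:.2%}")
# n =   1000: +/- 1.23%   n =   5000: +/- 0.55%   n =  20000: +/- 0.27%
```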
Example: How the expected difference influences your planning
You expect an improvement of around 1.5 percentage points.
How large does your sample have to be for each variant?
- At 4.0 % vs. 5.5 %: approx. 3,500 visitors per variant
- At 4.0 % vs. 4.5 %: approx. 19,000 visitors per variant
Conclusion: Small effects require large amounts of data. If you underestimate this, you will get confidence intervals that overlap considerably and results that you cannot rely on.
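This backwards planning is exactly what a sample size calculator does. A sketch with statsmodels, assuming 80 % power and a two-sided test at alpha = 0.05; calculators that assume a different power level or a one-sided test will land on somewhat different numbers, which is why the rounded figures above can deviate:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

analysis = NormalIndPower()

# Required visitors per variant for the two planning scenarios above
for baseline, target in [(0.040, 0.055), (0.040, 0.045)]:
    effect = proportion_effectsize(target, baseline)
    n = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.8,
                             alternative='two-sided')
    print(f"{baseline:.1%} vs. {target:.1%}: about {n:,.0f} visitors per variant")
```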
Recommendation for practice
Always plan tests backwards: define the minimum effect you want to detect and derive the necessary sample size from it, for example with a sample size or significance calculator. Don't start blindly, but with a clear target for duration, data volume and confidence level.
Without sound sample size planning, A/B tests often produce nothing but statistical noise.
Practical pitfalls: The most common errors in thinking about confidence intervals
Misconception 1: Confusing the confidence interval with certainty
Misconception 2: Stopping the test as soon as significance is reached
Misconception 3: Comparing confidence intervals as if they were fixed values
Misconception 4: Statistically significant = practically relevant
Misconception 5: Comparing several variants without adjusting for multiple comparisons
Conclusion & recommendations for practice: How to use statistics for better tests
Confidence intervals are not additional knowledge for statistics nerds. They are a key tool for anyone who wants to reliably evaluate A/B tests and make well-founded decisions.
Those who ignore them are flying blind. Those who use them correctly not only see whether a result is reliable, but also how reliable it is and how big the effect could really be.
Three key learnings
1. A single percentage value is not sufficient
Without a confidence interval, there is no framework for correctly classifying results.
2. Significance alone is not enough
Statistical significance does not automatically mean practical relevance. The width of the interval makes the difference.
3. Test quality depends on the preparation
If you don't do any size planning, you can't make any reliable statements even with clean statistics.
Three recommendations for practice
1. Check confidence intervals deliberately
In every test report, look at how narrow the intervals are and whether they overlap.
2. Carry out size planning before starting the test
Use a calculator to determine sample size and duration based on your expectations.
3. Don't accept tool results unchecked
Ask yourself what exactly your tool shows you and how it calculates the result.
Those who understand confidence intervals test with foresight and make decisions that work.
More articles about A/B testing
- A/B testing: how it works, tips & solutions
A comprehensive guide with 5-step instructions for effective A/B tests, from hypothesis to evaluation.
- User testing: methods, processes & metrics
Find out how real user feedback leads to better decisions through targeted user testing.
- Effective optimization through multivariate testing
Learn how to test several elements at the same time to identify the best combination.
- A/A tests explained: Validation for reliable data
Why A/A tests are important to validate your testing setup and ensure data quality.
- 10 red flags in A/B testing that you should avoid
The most common mistakes in A/B testing and how to avoid them.
- BigQuery A/B Testing
How to efficiently analyze A/B tests at data level with BigQuery and Varify.io.
- Server-side tracking with GTM & GA4
More control over your data through server-side tracking with Google Tag Manager and GA4.
- A/B testing for Shopify: everything you need to know
Smart strategies and technical tips for successful A/B testing in Shopify stores.
- Split tests explained simply: definition, application, implementation
This is how split tests work and how to use them specifically.
- WordPress A/B testing
How to effectively integrate A/B tests into your WordPress website.
- Shopify Themes A/B Testing
Optimization of Shopify themes through targeted A/B testing for better conversion rates.