Statistical significance
What is statistical significance?
Statistical significance is a measure of how unusual your experiment results would be if there were actually no difference in performance between your variation and baseline, and the observed lift were due to random chance alone.
It's become increasingly important for online businesses, marketers, and advertisers running A/B tests (such as testing conversion rates, ad copy, or email subject lines).
Achieving statistical significance helps ensure that conclusions drawn from experiments are reliable and not based on random fluctuations in data.
However, most experiments fail to reach statistical significance. Here's why:
- Changes are too small: Most changes to the visitor experience have little impact, and a small effect is hard to distinguish from sampling error, so the test fails to reach statistical significance.
- Low baseline conversion rates: Many experiments track metrics with low baseline conversion rates, which produce few conversion events; detecting a given relative lift on such metrics requires a much larger sample.
- Too many goals: Often, teams don't focus on the crucial metrics aligned with their hypothesis. Tracking many loosely related goals spreads evidence thin and leaves results short of the significance threshold.
Why is the concept of statistical significance important?
Statistical significance helps businesses make sound decisions based on data rather than random fluctuations. It relies on two key factors (a quick sketch of how they interact follows this list):
- Sample size: The number of participants in your experiment. Larger samples generally provide more reliable results. For website tests, more traffic means quicker, more accurate results.
- Effect size: The magnitude of difference between your test variations. It shows how much impact your changes have made.
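To see how these two factors interact, here is a minimal sketch using the standard two-proportion sample-size formula. The baseline rates, relative lift, and significance/power settings are illustrative assumptions, not Optimizely defaults:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, relative_lift,
                              alpha=0.05, power=0.80):
    """Visitors needed per variation to detect a given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# The same 10% relative lift is much harder to detect on a low baseline:
print(sample_size_per_variation(0.03, 0.10))  # ~53,000 visitors per variation
print(sample_size_per_variation(0.30, 0.10))  # ~3,800 visitors per variation
```

Note how the low-baseline metric needs over ten times more traffic to detect the same relative lift, which is why low baseline conversion rates so often stall experiments.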
Random sampling is crucial for getting accurate, unbiased results. If you don't distribute your test variations randomly among your audience, you might introduce bias. For example, if all men see version A and all women see version B, you can't compare results fairly, even with a 50-50 split: differences in behavior might be due to gender, not your test variations. One common way to randomize is sketched below.
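Here is a minimal sketch of deterministic hash-based bucketing, a generic randomization technique (this is not Optimizely's actual bucketing code; the experiment name and visitor ID are hypothetical):

```python
import hashlib

def assign_variation(visitor_id: str, experiment: str = "button-color") -> str:
    """Assign a visitor to a variation independently of traits
    like gender, device, or geography."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100        # uniform 0-99 bucket
    return "A" if bucket < 50 else "B"    # 50-50 split

print(assign_variation("visitor-123"))    # same visitor always gets
print(assign_variation("visitor-123"))    # the same variation
```

Because the bucket depends only on a hash of the visitor ID, assignment is effectively random across the audience yet stable for each returning visitor.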
Example of real-world impact: In industries like pharmaceuticals, statistical significance in clinical trials can determine a drug's effectiveness. This can influence investor funding and the success or failure of a product.
Overall, statistical significance helps you distinguish between real improvements and random chance, guiding better business decisions.
Testing your hypothesis
Statistical significance is most practically used in hypothesis testing. For example, suppose you want to know whether changing the color of a button on your website from red to green will result in more people clicking on it. The "null hypothesis" is that the color change makes no difference, so the current red button serves as your experiment baseline. The claim that the green button performs differently is your "alternative hypothesis."
To interpret the outcome of a statistical significance test, you will want to pay attention to two outputs: the p-value and the confidence interval.
- P-value: The probability of seeing evidence as strong as, or stronger than, the observed difference in performance between your variation and baseline, calculated under the assumption that there is actually no difference between them and any observed lift is owed entirely to random chance.
- Confidence interval: An estimated range of values that is likely, but not guaranteed, to contain the true value for your target population if the experiment were replicated numerous times.
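To make these two outputs concrete, here is a minimal sketch of a classical two-proportion z-test for the button example. The visitor and conversion counts are illustrative assumptions:

```python
from statistics import NormalDist

# Baseline (red button) vs. variation (green button)
n_a, conv_a = 10_000, 300     # 3.0% conversion rate
n_b, conv_b = 10_000, 360     # 3.6% conversion rate

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# P-value: probability of evidence at least this strong
# if there is no real difference (two-sided test)
pooled = (conv_a + conv_b) / (n_a + n_b)
se_null = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
z = diff / se_null
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval for the true difference in conversion rates
se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
z95 = NormalDist().inv_cdf(0.975)
ci = (diff - z95 * se, diff + z95 * se)

print(f"p-value: {p_value:.4f}")                     # ~0.018
print(f"95% CI for lift: [{ci[0]:.4%}, {ci[1]:.4%}]")
```

Here the p-value (~0.02) falls below the common 0.05 threshold and the confidence interval excludes zero, so this lift would typically be called statistically significant.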
Get always valid results with Stats Engine
A strict set of guidelines is required to get valid results from experiments run with classical statistics: set a minimum detectable effect and sample size in advance, don't peek at results, and don't test too many goals or variations at the same time. These guidelines can be cumbersome and, if not followed carefully, can produce severely distorted and misleading test results.
Fortunately, you can easily determine the statistical significance of your experiments using Stats Engine, the advanced statistical model built into Optimizely. Here's how to calculate the estimated duration of your experiment (a worked example follows the formulas):
- Total visitors needed = Sample size × Number of variations
- Estimated days to run = Total visitors needed ÷ Average daily visitors
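Here is a minimal sketch of that calculation; the sample size, variation count, and traffic figures are illustrative assumptions:

```python
from math import ceil

sample_size_per_variation = 15_000   # e.g., from a sample size calculator
num_variations = 2                   # baseline + one variation
avg_daily_visitors = 4_000

total_visitors_needed = sample_size_per_variation * num_variations
estimated_days = ceil(total_visitors_needed / avg_daily_visitors)

print(total_visitors_needed)  # 30000
print(estimated_days)         # 8 days
```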
Stats Engine combines sequential testing and false discovery rate control to give you trustworthy results faster, regardless of sample size and type of data. Updating in real time, this approach allows for:
- Real-time monitoring of results
- Adaptive testing that adjusts to true effect size
- Faster decision-making without sacrificing data integrity
With Stats Engine, statistical significance should generally increase over time as more evidence is collected. This evidence comes in two forms (illustrated in the sketch after this list):
- Larger conversion rate differences
- Conversion rate differences that persist over more visitors
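To illustrate the second form with classical statistics (this is not Stats Engine's actual sequential algorithm, just a fixed-horizon analogy with assumed conversion rates), note how the same observed lift becomes stronger evidence as it persists over more visitors:

```python
from statistics import NormalDist

def p_value(n_per_variation, rate_a, rate_b):
    """Two-sided two-proportion z-test p-value."""
    pooled = (rate_a + rate_b) / 2
    se = (2 * pooled * (1 - pooled) / n_per_variation) ** 0.5
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same 3.0% vs. 3.6% difference, observed over more and more visitors:
for n in (1_000, 5_000, 20_000):
    print(f"{n:>6} visitors per variation -> p = {p_value(n, 0.030, 0.036):.4f}")
# p shrinks (evidence strengthens) as the difference persists over more traffic
```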
Check out the full Stats Engine report.
Best practices for reaching statistical significance
When running statistical tests, you might encounter challenges in reaching statistical significance. Here are some best practices you can follow:
- Run tests for at least one business cycle (7 days)
- Choose primary and secondary metrics carefully
- Design experiments with significant potential impact on user behavior
Frequently asked questions
Q1: Why did my statistical significance go down?
A: Small fluctuations can occur due to data bucketing. Larger decreases might trigger a stats reset if Stats Engine detects seasonality or drift in conversion rates, maintaining experiment validity.
Q2: How long should I run my experiment?
A: Run your experiment until you reach statistical significance or for at least one full business cycle, whichever is longer.