Statistical significance for Facebook ads: why the calculator keeps crowning losers
Why a standard statistical significance test fails Facebook ad creative batches, and how FDR correction and shrinkage toward format priors fix it.
You run twelve creatives against a cold audience for five days. One pulls ahead, so you paste its clicks and conversions into a free significance calculator next to the runner-up. The page says 96% significant. You kill the other eleven and triple the budget. Ten days later the winner is performing like the account average, and you are wondering whether the calculator lied.
It did not lie. You asked it the wrong question. That calculator tested two ads in isolation. You tested twelve. Those are two different statistical situations, and the second one needs machinery that almost nobody running Facebook ads uses.
I built that machinery into Adscalr's scoring, so I can show you exactly what breaks and what the fix looks like.
The takeaways
A significance test answers one narrow question: if these two ads performed identically, how likely is a gap this large by chance alone? At 95% confidence you accept a 1-in-20 false-positive rate per comparison. That is the entire promise. It says nothing about how big the difference is, and nothing about whether it holds next week.
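For concreteness, here is roughly what a free calculator does with the numbers you paste in. This is a minimal sketch of a pooled two-proportion z-test, a common choice for such tools; the function name and the click and conversion counts are made up for illustration, and a given calculator may use a different test.

```python
import math

def two_proportion_p_value(conv_a, clicks_a, conv_b, clicks_b):
    """Two-sided p-value for a gap in conversion rate between two ads.

    Pooled z-test: assumes both ads share one true rate (the null),
    then asks how surprising the observed gap would be.
    Undefined when the pooled rate is exactly 0 or 1.
    """
    rate_a = conv_a / clicks_a
    rate_b = conv_b / clicks_b
    pooled = (conv_a + conv_b) / (clicks_a + clicks_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / clicks_a + 1 / clicks_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: 30 conversions vs 16 conversions on 1,000 clicks each.
p = two_proportion_p_value(30, 1000, 16, 1000)
print(f"p-value: {p:.3f}  ->  calculator reports {1 - p:.0%} 'significance'")
```

Note what the function never sees: how many other ads were in the batch. It only knows about the pair you handed it.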
Worth knowing before you trust any platform trophy icon: Meta's own A/B test documentation works with a lower bar than the textbook. It calls an A/B winner at 65% confidence, while its lift and holdout studies require 90%. So when Ads Manager declares a winner, the platform is making a much weaker claim than the 95% you would demand from a clinical trial, and it is still only a claim about two variants at a time.
The calculator keeps crowning losers because the 1-in-20 error rate applies per comparison, and a creative batch makes many comparisons at once. Test 20 creatives at 95% confidence and, if none of them is truly better and the comparisons are independent, the chance that at least one dud clears the bar is 1 minus 0.95 to the 20th power, about 64%. The more creatives you launch, the closer you get to guaranteeing a fake champion.
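The arithmetic is short enough to check yourself. A small sketch, assuming the comparisons are independent and that no creative is truly better than the rest (the function name is illustrative):

```python
def chance_of_false_winner(n_comparisons, alpha=0.05):
    """Probability that at least one of n null comparisons looks significant."""
    return 1 - (1 - alpha) ** n_comparisons

for n in (1, 5, 12, 20, 50):
    print(f"{n:>3} comparisons: {chance_of_false_winner(n):.0%} chance of a fake winner")
```

Real comparisons within one batch are rarely fully independent, so treat these figures as a rough guide to how fast the risk grows rather than exact odds.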
This shows up in field data. Ron Berman and co-authors analyzed thousands of real commercial A/B experiments and found that of all effects significant at the 5% level, roughly 1 in 5 were truly null. One declared winner in five did not exist.
Each pairwise test was honest on its own. The batch was not. I walked through the gut-level version of this trap in "Won, or just lucky?"; this post is about what the correction looks like in practice.
The standard tool is the Benjamini-Hochberg procedure, which controls the false discovery rate (FDR) across the whole batch rather than the error rate of each test. Mechanically: rank every p-value from smallest to largest, then compare each to a sliding threshold that is strictest for the smallest p-value and loosens as you move down the ranking (the i-th p-value is compared to i divided by m, times q, where m is the number of comparisons and q is your tolerated share of false discoveries). Find the largest p-value that still beats its own threshold; it and every smaller p-value count as discoveries.
The honest part is choosing q. In Adscalr's pattern mining I set q = 0.10, which means: of everything the system flags as a real pattern across many creatives, I accept that roughly 1 in 10 will still be noise. That sounds like defeat until you price the alternative. The older Bonferroni correction divides the threshold by the number of tests, so 20 creatives would each need p < 0.0025, and with ad-sized samples almost nothing ever clears that bar. You would sit on your hands forever. FDR control buys back the power to act, at a known and budgeted false-alarm rate.
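To make the contrast concrete, here is a minimal sketch of both corrections in plain Python. The function names and the eight p-values are invented for illustration; this shows the textbook procedures, not Adscalr's actual code.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return one True/False flag per p-value (True = counted as a discovery)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Threshold grows with rank: (i / m) * q.
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # keep the LARGEST rank that passes
    flagged = [False] * m
    for idx in order[:cutoff_rank]:
        flagged[idx] = True
    return flagged

def bonferroni(p_values, alpha=0.05):
    """Flag only p-values below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Hypothetical p-values from eight pairwise comparisons.
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.7]
print("BH (q=0.10):      ", sum(benjamini_hochberg(p_values)), "flagged")
print("Bonferroni (0.05):", sum(bonferroni(p_values)), "flagged")
```

One detail worth noticing: the loop keeps the largest rank that passes, so a p-value that misses its own threshold can still be flagged if a larger one further down the ranking clears its threshold. That step-up behaviour is where Benjamini-Hochberg gets its extra power.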
Most creative tests die before significance is even on the table. Nine conversions on a new static is not a sample, and Meta itself treats early data as unstable: its documentation puts the learning phase at roughly 50 conversion events before delivery settles.
The tool for this stage is shrinkage. A new ad's score gets blended with a prior: what ads of its format historically do. A static posting a 3.8 ROAS on nine conversions, in an account where statics average 1.6, gets pulled most of the way back toward 1.6. If it keeps converting and is still ahead at 400 events, the prior fades and its own data takes over. Lucky streaks get dampened early; sustained performance comes through untouched.
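The blending can be sketched as a weighted average in which the prior counts as a fixed number of pseudo-conversions. This is a deliberately simplified form of shrinkage, and the prior strength of 50 is an assumption picked for illustration, not the value Adscalr uses.

```python
def shrunk_roas(observed_roas, conversions, prior_roas, prior_strength=50):
    """Blend an ad's observed ROAS with its format's historical average.

    prior_strength is a hypothetical tuning knob: how many conversions'
    worth of evidence the format prior is treated as carrying.
    """
    weight_on_data = conversions / (conversions + prior_strength)
    return weight_on_data * observed_roas + (1 - weight_on_data) * prior_roas

# The static from the example: 3.8 ROAS, format average 1.6.
for n in (9, 50, 400):
    print(f"{n:>3} conversions -> shrunk ROAS {shrunk_roas(3.8, n, 1.6):.2f}")
```

With these assumed settings the nine-conversion static scores a little under 2, and the same ad at 400 conversions scores above 3.5. A larger prior strength makes the system more skeptical for longer; a smaller one lets new ads speak for themselves sooner.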
This is Bayesian shrinkage with format-specific priors, and in my own scoring it removed more fake winners than any other single change.
Here is an admission that surprises people: I left significance testing out of Adscalr's kill path on purpose. When a creative is burning budget, waiting for p < 0.05 has a price; you can spend hundreds of euros buying statistical certainty about a dud. So the kill path runs on hard safeguards instead: a learning-phase lockout (no kills under 5 days or 200 euros of spend) and a ROAS floor that pauses anything at 1.5x or better rather than killing it. The FDR correction lives where a false discovery is cheap to flag and expensive to scale: in ranking and pattern mining.
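As a rough sketch, the safeguards reduce to a few ordered checks. The thresholds come from the paragraph above; the function name, the `flagged_for_kill` input (whatever upstream signal marks a creative as a kill candidate), and the reading that the lockout holds until both the day and spend minimums are met are my assumptions for illustration.

```python
MIN_DAYS_LIVE = 5       # learning-phase lockout: days
MIN_SPEND_EUR = 200.0   # learning-phase lockout: spend
ROAS_FLOOR = 1.5        # at or above this, pause instead of kill

def kill_path_decision(days_live, spend_eur, roas, flagged_for_kill):
    """Return 'keep', 'pause', or 'kill' for one creative.

    flagged_for_kill is a hypothetical upstream signal; no p-value is involved.
    """
    if not flagged_for_kill:
        return "keep"
    # Assumed reading: locked out while EITHER minimum is still unmet.
    if days_live < MIN_DAYS_LIVE or spend_eur < MIN_SPEND_EUR:
        return "keep"
    if roas >= ROAS_FLOOR:
        return "pause"
    return "kill"

print(kill_path_decision(days_live=3, spend_eur=350, roas=0.4, flagged_for_kill=True))
print(kill_path_decision(days_live=7, spend_eur=350, roas=1.8, flagged_for_kill=True))
print(kill_path_decision(days_live=7, spend_eur=350, roas=0.4, flagged_for_kill=True))
```

The order of the checks is the design choice: the lockout runs first so that no early number, however ugly, can trigger a kill.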
Significance answers "is this difference real". A kill decision answers "how do I cap the damage while I wait to find out". Different jobs, different math.
Score each creative on a composite of six metrics so one twitchy number cannot carry a verdict, shrink early scores toward the format's prior, correct the ranking for how many creatives you compared, and let Thompson sampling pick the next test worth running. That is the statistics layer inside Adscalr's ad intelligence, and none of it is exotic. It is 1990s statistics applied with discipline. The exotic part is that most dashboards skip it.
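For readers who have not met Thompson sampling, here is a minimal sketch of the textbook Beta-Bernoulli version applied to conversion rate. The creative names and counts are invented, and Adscalr's version works on its own composite score, so read this as the general idea rather than the product's implementation.

```python
import random

def pick_next_creative(stats, rng=random):
    """Thompson sampling over conversion rate.

    stats maps creative name -> (conversions, clicks).
    Each creative gets one random draw from its Beta posterior
    (uniform Beta(1, 1) prior); the highest draw gets the next test slot.
    """
    best_name, best_draw = None, -1.0
    for name, (conversions, clicks) in stats.items():
        draw = rng.betavariate(1 + conversions, 1 + clicks - conversions)
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name

# Hypothetical batch: one proven ad, one weak ad, one barely tested ad.
stats = {"static_a": (40, 1000), "video_b": (12, 1000), "carousel_c": (1, 20)}
rng = random.Random(7)
picks = [pick_next_creative(stats, rng) for _ in range(1000)]
for name in stats:
    print(f"{name}: chosen {picks.count(name)} times out of 1000")
```

Because each pick is a random draw rather than a point estimate, a barely tested creative still wins some slots when its posterior is wide, while a creative with lots of data and a poor rate almost never does. That is how the method balances exploring new ads against exploiting known ones.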
This is the thinking behind Adscalr.