Stop testing wrong: how to run A/B tests that actually mean something

Marketing analytics
March 18, 2026

Most teams think they’re testing creative, personalization, or strategy. In reality, they’re often just measuring statistical noise.

After auditing hundreds of CRM programs every year in my role as strategy director at Underground Ecom, I see the same pattern again and again. Experiments look solid on the surface, but fall apart once you look closely. The ideas aren’t necessarily bad, but the tests themselves aren’t built to produce answers you can trust.

Most experiments fail, full stop.

Here’s how to stop testing wrong and start getting results you can actually trust.

What goes wrong in most A/B tests

Across industries, channels, and platforms, failed tests usually fall into 3 buckets:

1. Underpowered tests

Sample sizes are too small to detect meaningful lift. You might see “statistical significance” in a dashboard, but the result isn’t strong enough to guide real decisions. What you’re left with is an inconclusive outcome and wasted time.

2. Unbalanced audiences

Test groups differ in ways that matter: engagement level, geography, lifecycle stage, inbox provider. Those differences skew results before the test even begins.

3. Overinterpreted results

A tiny increase in click rate. A subject line that “won” because of machine opens. A result driven by random variation. These are vanity wins, and they don’t give you a clear picture of business impact.

When these issues stack up, teams either declare false winners or keep testing without ever acting on what they learn.

Step 1: Size your test for statistical power

Before you launch an A/B test, you need to know if it can actually succeed.

That starts with calculating the minimum sample size required to detect the lift you care about. Not what’s convenient, and not what fits into a short test window: the minimum sample size is what’s actually mathematically required.

To determine the minimum sample size, you need 4 inputs:

  • Baseline metric: Your current click rate or conversion rate. For email, these are the only metrics that consistently reflect real behavior.
  • Minimum detectable effect: The smallest lift worth detecting. In most real-world CRM programs, that’s around 20–30%.
  • Statistical power: The probability of detecting a real effect. Standard practice is 80%.
  • Significance level (alpha): Your false-positive threshold. Standard practice is 5%.

To put that in an equation (you can use pen and paper, or your favorite LLM):

n ≈ [ ( zα/2 × √(2 × p̄ × (1 − p̄)) + zβ × √( p1 × (1 − p1) + p2 × (1 − p2) ) )² ] ÷ ( p2 − p1 )²

Here, p1 is your baseline metric, p2 is the target metric after lift, p̄ is the average of p1 and p2, alpha is your significance level, and beta is the flip side of statistical power (power = 1 − beta). You plug in their z-scores: for a 5% alpha, zα/2 = 1.96, and for 80% power, zβ = 0.84.

For example: if your baseline conversion rate is 2% and you want to detect a 30% lift with 80% power at 95% confidence, you’ll need roughly 9,800 profiles per test arm. Anything less, and the results simply won’t be reliable.
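
If you’d rather script the calculation than run it by hand, here’s a minimal sketch in Python, assuming scipy is installed; the function name and defaults are illustrative, not a standard API:

```python
from math import sqrt, ceil
from scipy.stats import norm

def min_sample_size_per_arm(p1: float, lift: float,
                            power: float = 0.80, alpha: float = 0.05) -> int:
    """Minimum profiles per test arm for a two-proportion test."""
    p2 = p1 * (1 + lift)                # target rate after the relative lift
    p_bar = (p1 + p2) / 2               # average of the two rates
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a 5% alpha
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 2% baseline, 30% lift, 80% power, 95% confidence:
print(min_sample_size_per_arm(0.02, 0.30))  # ~9,787 profiles per arm
```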

If you can’t reach that volume within your planned window, you still have options. You can simplify the test, extend the timeline, or adjust expectations. What you can’t do is ignore the math and hope it works out.

Step 2: Balance your audience before you test

Even with the right sample size, biased group assignment will invalidate your results.

A clean split matters more than a clever hypothesis. Garbage in, garbage out.

Take these 3 steps to be intentional about audience design (a code sketch follows the list):

  1. Stratify your audience: Segment by variables that plausibly affect response, like engagement recency, lifecycle stage, geography, and inbox provider.
  2. Randomize within each stratum: Apply a 50/50 split inside each segment so both test arms include the same mix of profiles.
  3. Validate balance using standardized mean difference (SMD): An SMD of 0.10 or lower means your groups are comparable. Anything higher means the test is biased and needs rebalancing.
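
Here’s a minimal sketch of steps 1 and 2 in Python, assuming your profiles sit in a pandas DataFrame with the stratification variables already attached; the column names are illustrative, not a required schema:

```python
import pandas as pd

def stratified_split(profiles: pd.DataFrame, strata_cols: list[str],
                     seed: int = 42) -> pd.DataFrame:
    """Randomize 50/50 within each stratum so both arms share the same mix."""
    parts = []
    for _, group in profiles.groupby(strata_cols):
        shuffled = group.sample(frac=1, random_state=seed)  # shuffle within the stratum
        half = len(shuffled) // 2
        arms = ["test"] * half + ["control"] * (len(shuffled) - half)
        parts.append(shuffled.assign(arm=arms))
    return pd.concat(parts)

# e.g. split = stratified_split(df, ["lifecycle_stage", "inbox_provider"])
```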

To calculate your standardized mean difference (a measure of how similar two groups are based on key variables), you need two formulas: one if your segment uses numerical values only (e.g. clicks, opens), and another if it uses boolean or proportional values (e.g. whether a person placed an order).

For numerical values, use:

SMD = (X̄t − X̄c) ÷ S_pooled

X̄ is the average of the numerical value(s) you used to build out your test (t) and control (c) segments.

To find S_pooled (the combined standard deviation of the segments), use:

S_pooled = √( (St² + Sc²) ÷ 2 )

For boolean- and proportional-value based segments, use:

SMD = (Pt − Pc) ÷ √( (Pt × (1 − Pt) + Pc × (1 − Pc)) ÷ 2 )

Where the P values are the proportions of profiles with a particular value in the test (t) and control (c) segments.
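
Here’s a minimal sketch of both SMD calculations in Python, assuming numpy; the function names are illustrative:

```python
import numpy as np

def smd_numeric(x_t: np.ndarray, x_c: np.ndarray) -> float:
    """SMD for a numerical variable (e.g. clicks, opens)."""
    s_pooled = np.sqrt((np.var(x_t, ddof=1) + np.var(x_c, ddof=1)) / 2)
    return (x_t.mean() - x_c.mean()) / s_pooled

def smd_proportion(p_t: float, p_c: float) -> float:
    """SMD for a boolean/proportional variable (e.g. share who placed an order)."""
    return (p_t - p_c) / np.sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)

# abs(SMD) <= 0.10 on every key variable -> groups are comparable;
# anything higher -> rebalance before launching.
```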

People often skip this last step because it feels technical. But without it, you’re not running an experiment. You’re comparing two different audiences and calling it insight.

Step 3: Read results like a scientist

The goal of experimentation is to get a reliable answer you can scale with confidence.

That starts with measuring what actually matters:

  • Primary metric: Conversion rate is your north star for business impact.
  • Secondary metric: Click rate helps explain engagement quality and intent before conversion.
  • Guardrail metrics: Unsubscribes, spam complaints, and bounces protect deliverability and long-term performance.

Use the same measurement window you planned when sizing the test. Segment results by lifecycle stage to see where lift is actually coming from. Maintain a holdout group so you can catch false positives driven by external factors.

And remember, statistical significance alone isn’t enough. A lift can be real and still not worth operationalizing.
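
To make that distinction concrete, here’s a minimal sketch of reading a result on both axes, significance and size, assuming statsmodels is installed; the conversion counts below are placeholders, not real campaign data:

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [305, 245]   # test arm, control arm (placeholder counts)
profiles = [9800, 9800]    # per-arm sample size from Step 1

z_stat, p_value = proportions_ztest(conversions, profiles)
lift = (conversions[0] / profiles[0]) / (conversions[1] / profiles[1]) - 1

print(f"p-value: {p_value:.4f}, relative lift: {lift:.1%}")
# Worth operationalizing only if the lift is significant AND at or
# above the minimum detectable effect you sized the test for.
```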

Where AI fits into modern testing

No one’s calculating these formulas by hand anymore, and you don’t need to.

AI tools make this work practical at scale. You can use them to:

  • Calculate required sample sizes in seconds.
  • Sanity-check assumptions before a test launches.
  • Analyze exported results without bias.
  • Keep testing context and logic organized over time.

When you use it well, AI doesn’t replace your native intelligence or judgment. But it can remove friction, so your team can focus on designing better experiments instead of wrestling with spreadsheets.

A quick checklist before you launch your next test

Before you hit send, ask yourself:

  • Have I calculated the minimum sample size needed to detect meaningful lift?
  • Can I realistically reach that volume within my planned test window?
  • Am I testing for conversion rate, not vanity metrics?
  • Are my control and treatment groups balanced across key dimensions?
  • Have I validated balance using SMD (≤ 0.10)?
  • Do I have a primary metric, a secondary metric, and clear guardrails?
  • Have I defined what success looks like before seeing the results?

If you can’t confidently say yes to all of the above, the test probably isn’t ready yet.

When CRM experimentation becomes scientific, marketing decisions stop being guesses and start driving predictable, scalable growth.

The difference between good marketing and great marketing is knowing which experiments to trust and which to ignore.


Stefan Milicevic
Stefan Milicevic is a CRM and retention strategy expert with over 10 years of experience in email marketing. Based in Bosnia and Herzegovina and working with clients worldwide, he specializes in helping brands grow through data-driven lifecycle marketing, customer retention strategy, and performance analytics. Stefan holds a Master’s degree in corporate law and has worked across sales, operations, and executive leadership roles. His work focuses on bridging strategy, analytics, and execution to help businesses build scalable retention systems. Fluent in several languages and experienced working with international teams, Stefan brings a practical, strategic perspective to modern CRM, lifecycle marketing, and business growth.
