A/B Testing LinkedIn Outreach Properly (Most Teams Test Wrong)

Most "A/B tests" in outreach are noise. The sample-size math, what to test in what order, and the protocol that lets you actually learn from your campaigns.

The phrase "we A/B tested two openers and B won" is the most-repeated lie in outreach. Almost every such test runs on a sample too small to distinguish signal from noise, on a list that drifted between the two arms, or on a metric (reply rate) so naturally volatile that a 2-point difference is meaningless. The result is teams that "learn" things that are not true, then optimize their stack around them for months. This guide is the math, the protocol, and the discipline to actually learn from outreach experiments.

Why most outreach "A/B tests" are noise

Three failures recur:

  1. Sample too small. 50 prospects per arm cannot reliably detect a 3-point reply-rate difference. Confidence intervals overlap heavily; the "winner" is often statistical noise.
  2. Arms are not equivalent. Arm A ran Monday/Tuesday, arm B ran Thursday/Friday. The list segments differed. The senders differed. You are testing the variable plus a dozen confounds.
  3. Wrong metric, too soon. Reply rate at day 3 looks different from reply rate at day 14. Calling the test at day 3 can miss 40% or more of eventual replies.

Each failure individually invalidates a test. Combined, they make most outreach A/B testing a ritual that produces stories, not data.
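
To see how much noise a 50-per-arm test carries, the sketch below simulates A/A tests: both arms get the same true reply rate, so any gap between them is chance. The 10% rate, the 3-point threshold, the trial count, and the function name are illustrative assumptions, not figures from a real campaign.

```python
import random

def aa_false_gap_rate(true_rate=0.10, n_per_arm=50, gap_pp=3.0, trials=5_000, seed=7):
    """Share of A/A tests (identical arms) whose observed gap is >= gap_pp points."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both arms share the same true reply rate, so any gap is pure noise.
        replies_a = sum(rng.random() < true_rate for _ in range(n_per_arm))
        replies_b = sum(rng.random() < true_rate for _ in range(n_per_arm))
        gap = abs(replies_a - replies_b) / n_per_arm * 100
        if gap >= gap_pp:
            hits += 1
    return hits / trials

share = aa_false_gap_rate()
print(f"A/A tests showing a gap of 3+ points: {share:.0%}")
```

If a large share of identical-arm tests already show a "winner" by 3 points or more, a real test of the same size cannot tell a true effect from that background.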

Sample-size math you can actually use

For a binary outcome (reply / no reply) at typical outreach reply rates, you need surprisingly large samples to detect realistic differences. Practical rules of thumb:

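| Baseline reply rate | Lift you want to detect | Approx. sample / arm |
|---------------------|-------------------------|----------------------|
| 5%                  | +1pp (5% → 6%)          | ~7,000               |
| 5%                  | +2pp (5% → 7%)          | ~1,900               |
| 5%                  | +5pp (5% → 10%)         | ~430                 |
| 10%                 | +2pp (10% → 12%)        | ~3,100               |
| 10%                 | +5pp (10% → 15%)        | ~620                 |
| 20%                 | +5pp (20% → 25%)        | ~1,000               |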

(Assumes 95% confidence, 80% power, two-tailed.)
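
For baselines or lifts the table does not cover, a minimal sketch of the standard normal-approximation formula for comparing two proportions is below. It assumes an unpooled variance and no continuity correction; the function name is illustrative. The outputs should land in the same ballpark as the rounded figures above, though the exact numbers depend on which approximation is used.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, lift_pp, alpha=0.05, power=0.80):
    """Approximate prospects per arm for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline + lift_pp / 100
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # unpooled variance of the difference
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for baseline, lift_pp in [(0.05, 5), (0.10, 5), (0.20, 5)]:
    n = sample_size_per_arm(baseline, lift_pp)
    print(f"{baseline:.0%} baseline, +{lift_pp}pp: {n} per arm")
```

Note how the required sample scales with the inverse square of the lift: halving the lift you want to detect roughly quadruples the sample.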

Two implications: (1) testing small differences requires sample sizes most teams will not have for months; (2) only test things you expect to move reply rate by ~5 percentage points or more. Trying to detect +1pp on 50 prospects per arm is theatre, not science.
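
To make the first implication concrete, here is a rough sketch that converts a required sample into weeks of sending. The weekly volume of 400 new prospects is a hypothetical figure; the per-arm samples are the rounded values from the table.

```python
import math

def weeks_to_fill(sample_per_arm, weekly_new_prospects, arms=2):
    """Weeks of sending needed before every arm reaches its required sample."""
    total_needed = sample_per_arm * arms
    return math.ceil(total_needed / weekly_new_prospects)

# Hypothetical team sending to 400 new prospects per week.
for label, n in [("5% -> 6%", 7000), ("5% -> 7%", 1900), ("5% -> 10%", 430)]:
    print(f"{label}: about {weeks_to_fill(n, 400)} weeks of sending")
```

This counts sending time only; the reply window after the last touch still has to elapse before the test can be read.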

What to test (and in what order)

Test the variables that can move reply rate by ≥ 5pp. Most are upstream of the message itself.

  1. Targeting / ICP (ideal customer profile) segments. Highest variance; biggest learnings. Test a precise vs broad ICP definition on the same volume.
  2. Sequence length. 3 touches vs 5 touches on the same list. Easy to run, big effects.
  3. First-touch type. Connection request with note vs without; InMail vs connection request.
  4. Opener structure. Observation-based vs offer-based vs question-based.
  5. Sender persona. Founder/CEO sender vs SDR sender, on the same target list.
  6. Touch timing. Asymmetric (2/4/6 days) vs uniform (4/4/4).

Run them in this order — the higher items move more, and learnings from #1 inform every later test.

What NOT to test — the noise traps

  • Single-word variations ("Hi" vs "Hello"). Effects are smaller than your sample can detect.
  • Subject line micro-tweaks. Same problem.
  • Send time within the same window (10am vs 11am). The effect is real but tiny relative to the noise.
  • Anything you cannot fully isolate. If arms differ on the variable AND the list AND the senders, you are not testing anything.

If a stakeholder pushes for these tests, the right answer is: "the effect size is smaller than our sample noise — let's spend the test budget on something that can actually move."

A clean test protocol you can run weekly

The protocol below survives the failures most outreach tests have. Adopt it as a written checklist.

  1. State the hypothesis in one sentence with the expected lift in percentage points.
  2. Check the sample math using the table above. If you cannot hit the required sample within 2–3 weeks, do not run this test.
  3. Randomize at the prospect level, not at the day or sender level. Use a hash or coin flip per prospect.
  4. Equalize confounds — same list, same senders, same time window, same sequence except the variable under test.
  5. Run for the full reply window — typically 14–21 days from the last touch. Do not call the test on day 5.
  6. Pre-register the decision rule: "we accept the new variant if it beats the control by ≥ 3pp AND the 95% confidence interval for the difference excludes zero". Decide before you see the data.
  7. Document the result in a shared log — variant, sample size, result, decision. Build institutional memory; stop re-running the same tests.
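
Step 3 of the protocol (randomize per prospect with a hash) can be sketched as follows. The prospect IDs and test name are hypothetical; salting the hash with the test name is an assumption added here so that the same prospect does not land in the same arm in every test.

```python
import hashlib

def assign_arm(prospect_id, test_name, arms=("control", "variant")):
    """Deterministically map a prospect to an arm using a salted SHA-256 hash."""
    key = f"{test_name}:{prospect_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an integer and reduce it to an arm index.
    bucket = int.from_bytes(digest[:8], "big") % len(arms)
    return arms[bucket]

prospects = [f"prospect-{i}" for i in range(1000)]  # hypothetical IDs
counts = {"control": 0, "variant": 0}
for pid in prospects:
    counts[assign_arm(pid, "opener-structure-test")] += 1
print(counts)
```

Because the assignment depends only on the ID and the test name, it is reproducible: re-running the script or adding prospects later never reshuffles people who are already in an arm.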

Reading results without fooling yourself

Three discipline rules:

  • Confidence intervals matter more than point estimates. "Variant B was 12% vs control 9%" is a story; "Variant B 12% (CI 9–15%) vs control 9% (CI 7–11%)" is data, and it shows the intervals overlap, a warning that the result may be inconclusive. Overlap alone does not settle it: the decisive check is whether the 95% confidence interval for the difference between the arms excludes zero.
  • Cumulative reply rate vs single-touch reply rate. Measure replies attributable to the full sequence, not just the first message.
  • Beware the "winner that disappears next month". If a variant wins once and then loses, both results were probably noise. Demand replication before changing the standard.
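
A minimal sketch of reading a result through the interval on the difference, rather than through two point estimates, is below. It uses a simple Wald (normal-approximation) interval, which is an assumption made here for brevity; the counts are hypothetical, and the 3pp threshold mirrors the example decision rule in the protocol.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(replies_a, n_a, replies_b, n_b, confidence=0.95):
    """Wald interval for (rate_b - rate_a), in proportion units."""
    p_a, p_b = replies_a / n_a, replies_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se

def accept_variant(replies_a, n_a, replies_b, n_b, min_lift_pp=3.0):
    """Example rule: lift of at least min_lift_pp AND interval excludes zero."""
    diff, low, _ = diff_confidence_interval(replies_a, n_a, replies_b, n_b)
    return diff * 100 >= min_lift_pp and low > 0

# Hypothetical counts: control 45/500 (9%), variant 60/500 (12%).
diff, low, high = diff_confidence_interval(45, 500, 60, 500)
print(f"lift {diff * 100:.1f}pp, 95% CI {low * 100:.1f}pp to {high * 100:.1f}pp")
print("accept variant:", accept_variant(45, 500, 60, 500))
```

With these counts the interval on the difference spans zero, so a 12% vs 9% headline at 500 prospects per arm does not clear the rule. At very small samples or very low reply rates the Wald interval is a poor approximation, and an exact or score-based method would be a safer choice.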

Most teams are better off running fewer, larger, better-designed tests than running a dozen underpowered ones a quarter. The KPI infrastructure that makes this practical is in the outreach KPI dashboard.

Frequently asked questions

How big does my A/B test sample need to be?
At a 5% baseline reply rate, detecting a +2pp lift needs about 1,900 prospects per arm; +5pp needs about 430. Smaller samples cannot distinguish signal from noise, so do not run them as tests.

How long should I wait before calling a reply-rate test?
14–21 days from the last touch in the sequence. Calling on day 3–5 can miss 30–50% of eventual replies and can flip the apparent winner.

Should I test the opening line of my message?
Only if you expect a ≥ 5pp lift and have the sample for it. Single-word or micro-tweak tests are below sample noise: they look like wins, but the wins do not replicate.

What is the highest-leverage thing to test in LinkedIn outreach?
Targeting segments and sequence length, in that order. They typically move reply rate by 5–15 percentage points, well above the threshold a real test can detect.