Simpson’s Paradox: When Data Lies and How to Spot It

Here is something that will genuinely unsettle you the first time you see it: a medical treatment can appear to help patients in every single subgroup you examine, yet somehow harm patients overall. A university admission process can look fair — even favorable — toward a minority group in every department, yet show clear discrimination when you look at the whole institution. These are not hypothetical absurdities or statistical tricks for confusing students. They are real phenomena with real consequences, and they go by the name Simpson’s Paradox.

If you work with data in any professional capacity — analyzing marketing funnels, interpreting HR metrics, reading medical research, evaluating policy outcomes — you will encounter this paradox. The question is whether you will recognize it when you do.

What Simpson’s Paradox Actually Is

Simpson’s Paradox occurs when a trend that appears in several groups of data disappears or reverses when the groups are combined. The combined dataset tells a completely different story than any of the individual segments, and both stories are mathematically correct. That last part is what makes it so dangerous: no one is lying to you. The numbers are accurate. The interpretation is simply wrong.

The paradox was formally described by statistician Edward H. Simpson in 1951, though earlier work by Karl Pearson and Udny Yule had touched on the same phenomenon, which is why it is sometimes called the Yule–Simpson effect. The name stuck, and so did the confusion it creates.

To understand how this happens mechanically, consider a simplified example. Suppose two doctors each treat patients with a particular condition. Doctor A has a higher success rate with mild cases and a higher success rate with severe cases. Yet when you look at overall success rates, Doctor B looks better. How? Because Doctor A handles a disproportionate number of severe cases. The mix of cases — what statisticians call a confounding variable or a lurking variable — distorts the aggregate picture entirely (Pearl & Mackenzie, 2018).

The mathematics here is not complicated once you see it. Weighted averages do not behave the way our intuitions expect them to. When groups have different sizes, or when the cases within those groups are not evenly distributed, combining them can reverse every trend you observed. This is not a bug in mathematics. It is a feature of how aggregation works, and it exposes a real limitation in how human beings reason about proportions.
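
To make the weighting failure concrete, here is a short sketch of the two-doctor example in Python. The counts are hypothetical, chosen only so the reversal is visible:

```python
# Hypothetical case counts: (successes, patients) per doctor per severity.
cases = {
    "Doctor A": {"mild": (18, 20), "severe": (56, 80)},   # 90% mild, 70% severe
    "Doctor B": {"mild": (68, 80), "severe": (12, 20)},   # 85% mild, 60% severe
}

for doctor, groups in cases.items():
    total_success = sum(s for s, _ in groups.values())
    total_patients = sum(n for _, n in groups.values())
    per_group = ", ".join(
        f"{g}: {s}/{n} = {s / n:.0%}" for g, (s, n) in groups.items()
    )
    print(f"{doctor}: {per_group}, overall: {total_success}/{total_patients} "
          f"= {total_success / total_patients:.0%}")

# Doctor A wins both subgroups (90% > 85%, 70% > 60%) yet loses overall
# (74% < 80%), because A's caseload is weighted toward severe cases.
```

The reversal comes entirely from the weights: Doctor A's overall rate is dragged toward the severe-case rate because that is where most of A's patients are.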

The UC Berkeley Case: The Paradox in Real Life

The most famous real-world example comes from the University of California, Berkeley, in 1973. Researchers examined admissions data and found that the university as a whole admitted about 44% of male applicants but only 35% of female applicants. That is a substantial gap, and on the surface it looks like strong evidence of gender discrimination.

But when Bickel, Hammel, and O’Connell (1975) dug into the data department by department, they found something startling: in most individual departments, women were actually admitted at higher rates than men, or at comparable rates. There was no consistent pattern of discrimination at the departmental level. What was happening?

Women were disproportionately applying to departments with low overall admission rates — fields like English and social sciences, which were highly competitive and rejected most applicants regardless of gender. Men were applying in larger numbers to departments like engineering and chemistry that had higher acceptance rates. The aggregate numbers looked discriminatory because they blended two very different underlying distributions without accounting for where people were applying in the first place.
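
You can verify the reversal yourself. Here is a minimal sketch using the figures for the six largest departments as they are commonly reproduced from Bickel et al. (1975); the pooled six-department rates differ slightly from the university-wide 44% and 35%, so treat the output as illustrative:

```python
# (applicants, admitted) for the six largest departments, fall 1973,
# as commonly reproduced from Bickel, Hammel, & O'Connell (1975).
admissions = {
    "A": {"men": (825, 512), "women": (108, 89)},
    "B": {"men": (560, 353), "women": (25, 17)},
    "C": {"men": (325, 120), "women": (593, 202)},
    "D": {"men": (417, 138), "women": (375, 131)},
    "E": {"men": (191, 53),  "women": (393, 94)},
    "F": {"men": (373, 22),  "women": (341, 24)},
}

totals = {"men": [0, 0], "women": [0, 0]}
for dept, groups in admissions.items():
    rates = {}
    for sex, (applied, admitted) in groups.items():
        totals[sex][0] += applied
        totals[sex][1] += admitted
        rates[sex] = admitted / applied
    print(f"Dept {dept}: men {rates['men']:.0%}, women {rates['women']:.0%}")

for sex, (applied, admitted) in totals.items():
    print(f"Overall {sex}: {admitted}/{applied} = {admitted / applied:.0%}")

# Women match or beat men in most departments, yet the pooled rate
# favors men, because women applied mainly to the toughest departments.
```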

This is Simpson’s Paradox in action at institutional scale. The analysis that almost led to a major discrimination lawsuit was, in a narrow technical sense, not wrong. The overall admission gap was real. The interpretation — that it reflected bias — was unsupported once you controlled for the relevant variable.

Why Your Brain Does Not See This Coming

There is a good reason this paradox catches smart people off guard. Human cognition runs on heuristics, and one of the most powerful heuristics we have is the assumption that parts reflect the whole. If something is true in every group, we naturally assume it is true overall. This is usually a reasonable assumption. It just happens to be catastrophically wrong in cases where group sizes are unequal and a confounding variable is lurking in the structure of the data.

Research on statistical reasoning suggests that even trained analysts frequently fail to identify when aggregation is misleading without explicit prompting to look for confounders (Kahneman, 2011). Our working memory loads up with the numbers directly in front of us. We do not spontaneously ask “wait, how are these groups composed?” unless we have been explicitly trained to do so, or unless something about the result surprises us enough to trigger a second look.

There is also a narrative pull at work. When we see data, we immediately want to construct a story. The story that says “treatment A works better overall” is clean and actionable. The story that says “treatment A works better in every subgroup but we need to think carefully about the composition of those subgroups before drawing any conclusion” is messy and unsatisfying. We are drawn to clean stories even when the messy ones are more accurate.

This is compounded in professional settings, where there is often pressure to produce clear takeaways from data quickly. The person who says “here is a clear finding” gets rewarded. The person who says “here is a finding that might reverse depending on how we slice it” gets asked to come back with something more definitive. This institutional dynamic pushes analysts toward exactly the kind of interpretation that Simpson’s Paradox exploits.

A Medical Example That Could Have Cost Lives

The consequences of missing Simpson’s Paradox are not always limited to bad business decisions or flawed academic papers. In medical contexts, the stakes are considerably higher.

Consider the story of kidney stone treatments. In the 1980s, a study comparing two surgical methods — open surgery and a newer, less invasive percutaneous nephrolithotomy — appeared to show that the newer method had a higher overall success rate. Sounds straightforward: adopt the newer technology.

But when researchers broke the data down by kidney stone size, the picture reversed completely. For small stones, the old method was more effective. For large stones, the old method was again more effective. Yet somehow the overall numbers favored the new method. The reason was the same as always: the case mix was different. The new, less invasive procedure was more commonly used on smaller, easier-to-treat stones. When you averaged everything together without accounting for stone size, you got a misleading result (Charig et al., 1986).
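
The counts usually quoted from the study make the reversal easy to check. A minimal sketch, using those commonly reported figures:

```python
# (successes, patients) by stone size, as commonly reported from
# Charig et al. (1986). "open" is the older surgery; "PN" is
# percutaneous nephrolithotomy, the newer, less invasive method.
outcomes = {
    "open": {"small": (81, 87),   "large": (192, 263)},
    "PN":   {"small": (234, 270), "large": (55, 80)},
}

for method, groups in outcomes.items():
    s_all = sum(s for s, _ in groups.values())
    n_all = sum(n for _, n in groups.values())
    by_size = ", ".join(
        f"{size}: {s}/{n} = {s / n:.0%}" for size, (s, n) in groups.items()
    )
    print(f"{method}: {by_size}, overall: {s_all}/{n_all} = {s_all / n_all:.0%}")

# Open surgery wins for small stones (93% > 87%) and for large stones
# (73% > 69%), yet PN wins overall (83% > 78%): PN was mostly used on
# the easier small stones.
```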

Had clinicians adopted the newer method wholesale based solely on the aggregate data, they would have been giving patients inferior treatment for both categories of stones, while believing the data supported their decision. This is why understanding how to disaggregate data is not just an academic exercise. It is a clinical and ethical responsibility.

How to Spot It Before It Spots You

Recognizing Simpson’s Paradox requires building specific habits of mind around how you interrogate aggregate data. These are not complex statistical techniques. They are questions you need to train yourself to ask reflexively.

Ask what variables might determine group membership

Before accepting any aggregate finding, ask yourself: what factors determine which group a data point ends up in? In the Berkeley example, the lurking variable was which department someone applied to. In the kidney stone example, it was stone size. These variables were not hidden in the data — they were available. They just were not in the initial summary. Whenever you see an overall rate or proportion, ask what underlying factors might influence both the grouping and the outcome simultaneously.

Disaggregate proactively, not reactively

Most analysts disaggregate data when something looks surprising or when someone asks them to. The better approach is to make disaggregation part of your standard workflow. Break your data down by any variable that could plausibly be a confounder before you commit to an interpretation. If the subgroup trends and the overall trend tell the same story, you can report your finding with more confidence. If they diverge, you have found something worth investigating (Hernán, Clayton, & Keiding, 2011).
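
One way to make this proactive is to bake the check into your tooling. The helper below is a hypothetical sketch, not a standard library function; it compares two arms within every stratum and in the pooled data, and flags any disagreement:

```python
def flag_simpsons(strata):
    """Flag when the pooled comparison disagrees with the strata.

    strata maps stratum name -> {arm name: (successes, trials)}.
    Hypothetical helper; assumes exactly two arms per stratum.
    """
    pooled = {}
    winners = set()
    for groups in strata.values():
        (arm1, (s1, n1)), (arm2, (s2, n2)) = sorted(groups.items())
        winners.add(arm1 if s1 / n1 >= s2 / n2 else arm2)
        for arm, (s, n) in groups.items():
            ps, pn = pooled.get(arm, (0, 0))
            pooled[arm] = (ps + s, pn + n)
    (arm1, (s1, n1)), (arm2, (s2, n2)) = sorted(pooled.items())
    pooled_winner = arm1 if s1 / n1 >= s2 / n2 else arm2
    if winners == {pooled_winner}:
        return f"Consistent: {pooled_winner} leads in every stratum and overall."
    return (f"Check for Simpson's paradox: pooled winner is {pooled_winner}, "
            f"but stratum winners are {sorted(winners)}.")

# The kidney-stone counts from above reproduce the warning:
print(flag_simpsons({
    "small": {"open": (81, 87), "PN": (234, 270)},
    "large": {"open": (192, 263), "PN": (55, 80)},
}))
```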

Look at the weights, not just the rates

When comparing proportions across groups, always check the size of each group as well as the rate. A treatment can succeed more often in every subgroup and still post the worse overall number if its caseload is skewed toward the harder cases. Suppose treatment A succeeds in 90% of easy cases and 40% of hard cases, while treatment B succeeds in 80% of easy cases and 30% of hard cases. If A mostly handles hard cases and B mostly handles easy ones, B can come out ahead overall, say 75% to 45%, even though A is better in both categories. Rates without context are only half the story.
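
A few lines of arithmetic verify those hypothetical numbers:

```python
# Hypothetical counts: treatment A sees mostly hard cases, B mostly easy.
a_easy, a_hard = (90, 100), (360, 900)    # 90% and 40% success
b_easy, b_hard = (720, 900), (30, 100)    # 80% and 30% success

a_rate = (a_easy[0] + a_hard[0]) / (a_easy[1] + a_hard[1])
b_rate = (b_easy[0] + b_hard[0]) / (b_easy[1] + b_hard[1])
print(f"A overall: {a_rate:.0%}, B overall: {b_rate:.0%}")
# A overall: 45%, B overall: 75% -- A loses overall despite winning
# both subgroups, purely because of how the caseloads are weighted.
```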

Be suspicious of any finding that is especially clean

Real data is messy. When you get a very clean, dramatic finding from a complex dataset, that is actually a signal to pause rather than celebrate. It may simply mean you have not looked closely enough yet. Paradoxes and artifacts hide in aggregates precisely because clean summaries are what we are trained to produce and reward.

Think about causality, not just correlation

Pearl and Mackenzie (2018) argue that Simpson’s Paradox is fundamentally a problem of causal reasoning, not just statistical reasoning. The question of which level to analyze — subgroup or aggregate — cannot be answered by looking at the numbers alone. It requires a causal model: an understanding of the actual mechanisms linking the variables. If the third variable is a mediator, sitting on the causal pathway between your treatment and your outcome, conditioning on it can remove exactly the effect you are trying to measure. If it is a confounder, a background characteristic that influences both who receives treatment and how they fare, you generally should adjust for it. Statistical tools alone will not tell you which situation you are in. Your domain knowledge will.
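
When your causal model does say the third variable is a confounder, the textbook remedy is to adjust for it, for instance by direct standardization: take each arm’s rate within each stratum, then average those rates over one shared case mix. A minimal sketch reusing the kidney-stone counts from above; note that if stone size were instead a mediator, this adjustment would be the wrong move:

```python
# Direct standardization over a shared case mix (stone size).
outcomes = {
    "open": {"small": (81, 87),   "large": (192, 263)},
    "PN":   {"small": (234, 270), "large": (55, 80)},
}

# Shared weights: total patients per stratum across both methods.
weights = {
    size: sum(outcomes[m][size][1] for m in outcomes)
    for size in ("small", "large")
}
total = sum(weights.values())

for method, groups in outcomes.items():
    crude = sum(s for s, _ in groups.values()) / sum(n for _, n in groups.values())
    adjusted = sum(
        (s / n) * weights[size] / total for size, (s, n) in groups.items()
    )
    print(f"{method}: crude {crude:.0%}, size-adjusted {adjusted:.0%}")

# The size-adjusted rates favor open surgery (~83% vs ~78%), matching
# the per-stratum story; the crude rates favor PN. Which number answers
# your question depends on the causal model, not on the arithmetic.
```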

What This Means for Knowledge Work in Practice

If you manage people, interpret performance dashboards, read research studies, or make evidence-based recommendations, Simpson’s Paradox is relevant to your work right now. The effect shows up in A/B test results that look different when split by device type than they do overall. It shows up in employee performance ratings that look fair team by team but discriminatory at the company level. It shows up in educational outcome data that suggests one curriculum is better while obscuring which student populations drove the result.

The practical implication is not that you should distrust data — that is the wrong lesson. The right lesson is that you should distrust unexamined aggregates. Data is not lying to you when Simpson’s Paradox appears. The data is accurate. What is failing is the interpretive framework you are applying to it.

Developing fluency with this paradox does not require advanced statistics. It requires a particular kind of epistemic discipline: the willingness to slow down before an interesting finding, to ask what variables might be structuring the data in ways that are not visible in the summary, and to hold your conclusions loosely until you have checked whether they survive disaggregation.

That discipline is harder than it sounds. Especially under time pressure, with stakeholders waiting for a clear answer, the temptation to take the aggregate finding at face value is real and strong. But the cost of missing a Simpson’s Paradox can be significant — wasted resources, flawed policies, or in high-stakes domains like medicine, genuine harm to real people.

The statistician’s job — and increasingly the knowledge worker’s job — is not just to report what the numbers say. It is to understand why they say it, whether that story holds up when you look more carefully, and what alternative stories the same data could support. Simpson’s Paradox is one of the clearest reminders we have that this interpretive work is not optional. It is the whole point.

References

    • Berggren, M., et al. (2025). Simpson’s Gender-Equality Paradox. Proceedings of the National Academy of Sciences.
    • Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science, 187(4175), 398–404.
    • Charig, C. R., Webb, D. R., Payne, S. R., & Wickham, J. E. A. (1986). Comparison of Treatment of Renal Calculi by Open Surgery, Percutaneous Nephrolithotomy, and Extracorporeal Shockwave Lithotripsy. British Medical Journal, 292(6524), 879–882.
    • Hernán, M. A., Clayton, D., & Keiding, N. (2011). The Simpson’s Paradox Unraveled. International Journal of Epidemiology, 40(3), 780–785.
    • Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
    • Pearl, J. (1982). The Logic of Simpson’s Paradox. Synthese.
    • Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
    • Simpson, E. H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society, Series B, 13(2), 238–241.
    • Teng, X., et al. (2026). De-paradox Tree: Breaking Down Simpson’s Paradox via a Kernel-Based Partition Algorithm. arXiv.
    • Wagner, C. H. (1982). Simpson’s Paradox in Real Life. The American Statistician, 36(1), 46–48.
