A p-value is the probability of observing data as extreme as (or more extreme than) what you got, assuming the null hypothesis is true.

The fundamental misinterpretation

You test a coin and get 14 heads in 20 flips (one-sided p = 0.057). This means “if the coin were fair, you’d see results this extreme 5.7% of the time”, NOT “there’s a 5.7% chance the coin is fair”.
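
A minimal sketch of where that 0.057 comes from, using only the standard library (it’s the one-sided tail; a two-sided test would roughly double it):

    from math import comb

    # P(X >= 14) for X ~ Binomial(20, 0.5): the chance a fair coin
    # produces a result at least as extreme as 14 heads in 20 flips.
    n, k = 20, 14
    p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    print(f"p = {p_value:.4f}")  # p = 0.0577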

The actual probability the coin is fair depends on your prior beliefs:

  • Random quarter from someone’s pocket → probably still ~98% chance it’s fair
  • Coin from a magic shop → maybe only ~2% chance it’s fair

Same data, same p-value, completely different conclusions! To get P(null|data) you’d need Bayes’ theorem and a prior.
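
Here is a minimal sketch of that calculation. The alternative hypothesis (a coin biased to land heads 70% of the time) is an illustrative assumption, and the exact posteriors depend on it; the point is how strongly the prior drives the answer:

    from math import comb

    def posterior_fair(prior_fair, p_biased=0.7, n=20, k=14):
        # P(fair | 14 heads in 20 flips) via Bayes' theorem.
        # The biased alternative P(heads) = 0.7 is an illustrative choice.
        like_fair = comb(n, k) * 0.5**n
        like_biased = comb(n, k) * p_biased**k * (1 - p_biased)**(n - k)
        evidence = prior_fair * like_fair + (1 - prior_fair) * like_biased
        return prior_fair * like_fair / evidence

    print(f"pocket quarter  (prior 0.98): {posterior_fair(0.98):.2f}")   # ~0.90
    print(f"magic-shop coin (prior 0.02): {posterior_fair(0.02):.3f}")   # ~0.004

The informal ~98% and ~2% figures above would come out of the same machinery under different priors and alternatives; the mechanism, not the exact numbers, is the point.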

Common pitfalls

Arbitrary thresholds: The 0.05 cutoff is a historical convention, not a mathematically meaningful boundary.
P-hacking: Running many tests until p < 0.05, then reporting only the “significant” ones. This invalidates the entire interpretation, because the 5% false-positive rate applies per test, not per batch of tests (the simulation after this list shows it in action).
Binary thinking: Treating p=0.049 as fundamentally different from p=0.051, when they’re practically identical.
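
A quick simulation makes the p-hacking point concrete, reusing the same one-sided 20-flip test as above: test enough perfectly fair coins and some will clear the 0.05 bar by luck alone.

    import random
    from math import comb

    def one_sided_p(heads, flips):
        # P(at least this many heads | fair coin)
        return sum(comb(flips, i) for i in range(heads, flips + 1)) / 2**flips

    random.seed(0)  # arbitrary; counts vary with other seeds

    # Flip 100 perfectly fair coins 20 times each, then report only the
    # "significant" ones: textbook p-hacking on pure noise.
    significant = 0
    for _ in range(100):
        heads = sum(random.random() < 0.5 for _ in range(20))
        if one_sided_p(heads, 20) < 0.05:
            significant += 1
    print(f"{significant} of 100 fair coins look 'biased' at p < 0.05")
    # Each fair coin clears the bar ~2% of the time (the discrete test is
    # conservative), so a few false positives are nearly guaranteed.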

In 2016 the American Statistical Association took the unusual step of releasing an official statement warning about p-value misuse:

  1. P-values can indicate incompatibility between data and the null hypothesis
  2. P-values do NOT measure the probability the hypothesis is true
  3. Scientific conclusions shouldn’t be based solely on whether p < 0.05
  4. Proper inference requires full reporting (no cherry-picking)
  5. P-values don’t measure effect size or importance (see the sketch after this list)
  6. By itself, a p-value provides limited information
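
Point 5 deserves a worked example. Below, a coin biased by a trivial 0.1 percentage point, flipped a million times, produces a “significant” p-value; the sketch uses expected counts and a normal approximation rather than a random draw:

    from math import sqrt, erfc

    # A nearly-fair coin (P(heads) = 0.501) flipped 1,000,000 times.
    n, p_true, p_null = 1_000_000, 0.501, 0.5
    heads = n * p_true  # expected count: 501,000

    # One-sided p-value via the normal approximation to the binomial.
    z = (heads - n * p_null) / sqrt(n * p_null * (1 - p_null))
    p_value = 0.5 * erfc(z / sqrt(2))
    print(f"effect: {p_true - p_null:+.3f}, z = {z:.1f}, p = {p_value:.3f}")
    # effect: +0.001, z = 2.0, p = 0.023: "significant", yet utterly trivial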