Conditional independence
Two random variables $A$ and $B$ are conditionally independent given a third random variable $C$ if:
$$P(A, B \mid C) = P(A \mid C)\,P(B \mid C)$$
This means that once we know $C$, knowing $A$ doesn’t give us any more information about $B$ (and vice versa).
Independence != conditional independence
Independent but not conditionally independent (common effect):
Sprinkler and rain are independent, but given that the grass is wet, they become dependent: if the sprinkler was on, that already explains the wet grass, so it’s less likely that it rained.
Conditionally independent but not independent (common cause):
Shoe size and reading level are correlated across children. Both are caused by age. Condition on age and the correlation vanishes: among 8-year-olds specifically, shoe size tells you nothing about reading level.
Common cause vs Common effect
… common cause (“fork”) … $C$ causes both $A$ and $B$. They’re marginally correlated (shared cause), but conditionally independent given $C$.
… common effect (“collider”) … $C$ is caused by both $A$ and $B$. They’re marginally independent, but conditionally dependent given $C$!!
… chain (“mediator”) … $A$ causes $C$, which causes $B$. They’re marginally dependent, but conditionally independent given $C$.
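The three structures above can be checked with a small simulation. This is only a sketch under stated assumptions: every variable is a 0/1 coin, `c` is the middle variable and `a`, `b` are the outer two, and all the probabilities (0.8, 0.2, 0.9, 0.1) are made-up illustration values, not taken from any dataset. "Conditioning" here just means looking only at the rows where `c == 1`.

```python
import random


def flip(p, rng):
    """Return 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def corr(xs, ys):
    """Pearson correlation of two equal-length 0/1 sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Each sampler returns (a, b, c), where c is the middle variable.
# All probabilities below are hypothetical illustration values.
def fork(rng):          # c causes both a and b
    c = flip(0.5, rng)
    a = flip(0.8 if c else 0.2, rng)
    b = flip(0.8 if c else 0.2, rng)
    return a, b, c


def collider(rng):      # a and b both cause c
    a = flip(0.5, rng)
    b = flip(0.5, rng)
    c = flip(0.9 if (a or b) else 0.1, rng)
    return a, b, c


def chain(rng):         # a causes c, which causes b
    a = flip(0.5, rng)
    c = flip(0.8 if a else 0.2, rng)
    b = flip(0.8 if c else 0.2, rng)
    return a, b, c


def summarize(sampler, n=100_000, seed=0):
    """Return (corr(a, b) overall, corr(a, b) within the c == 1 stratum)."""
    rng = random.Random(seed)
    rows = [sampler(rng) for _ in range(n)]
    marginal = corr([r[0] for r in rows], [r[1] for r in rows])
    stratum = [r for r in rows if r[2] == 1]
    conditional = corr([r[0] for r in stratum], [r[1] for r in stratum])
    return marginal, conditional


results = {f.__name__: summarize(f) for f in (fork, collider, chain)}
for name, (marginal, conditional) in results.items():
    print(f"{name:9s} marginal={marginal:+.3f}  given c=1: {conditional:+.3f}")
```

With these particular numbers, the fork and the chain should show a clearly positive marginal correlation that shrinks to roughly zero inside the stratum, while the collider should show the reverse: roughly zero marginally and clearly negative once `c == 1` is fixed. Only the `c == 1` stratum is shown; the `c == 0` stratum could be checked the same way.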
TODO(distill): Why common effect is counterintuitive
The surprise is in the asymmetry. For chains and forks, conditioning on the middle variable destroys a dependence that was there. For colliders, conditioning creates a dependence that wasn’t there. Most people’s gut says conditioning can only ever add information and reduce uncertainty (and it does), but reducing uncertainty about $C$ can expose relationships among $C$’s causes that you couldn’t see when $C$ was unobserved.
The intuitive mechanism: explaining-away.
Pearl’s classic setup. Your alarm goes off. Two possible causes: burglary $B$ and earthquake $E$. Both are rare; both can independently trigger the alarm. In the general population, they’re uncorrelated.
Hear the alarm:
$$P(B \mid \text{alarm}) > P(B), \qquad P(E \mid \text{alarm}) > P(E)$$
The evidence raises both probabilities. Reasonable.
Now you turn on the radio and hear there was an earthquake. What happens to your belief in burglary?
$$P(B \mid \text{alarm}, E) < P(B \mid \text{alarm})$$
It drops. The earthquake already accounts for the alarm; the burglary hypothesis was only being supported because something needed to explain the noise, and now $E$ has stepped up. Knowing $E$ “explains away” the need for $B$. The two causes compete for explanatory work.
That competition is the dependence. Marginally none of this competition exists because there’s no shared evidence forcing the two hypotheses into the same room. Conditioning on the alarm puts them in the same room.
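The explaining-away pattern in the alarm story can be made concrete by exact enumeration. This is a rough sketch, not Pearl's own numbers: the priors `P_BURGLARY` and `P_QUAKE` and the alarm table `P_ALARM` are hypothetical values picked only so that both causes are rare and either one can trigger the alarm.

```python
from itertools import product

# Hypothetical parameters, chosen for illustration only.
P_BURGLARY = 0.01
P_QUAKE = 0.02
# P(alarm = 1 | burglary, quake), keyed by (burglary, quake).
P_ALARM = {(0, 0): 0.001, (1, 0): 0.95, (0, 1): 0.30, (1, 1): 0.97}


def joint(burglary, quake, alarm):
    """Joint probability of one world; burglary and quake are independent."""
    pb = P_BURGLARY if burglary else 1 - P_BURGLARY
    pq = P_QUAKE if quake else 1 - P_QUAKE
    pa = P_ALARM[(burglary, quake)]
    return pb * pq * (pa if alarm else 1 - pa)


def prob(query, given=lambda b, q, a: True):
    """P(query | given), summing the joint over all 8 worlds."""
    worlds = list(product((0, 1), repeat=3))
    den = sum(joint(*w) for w in worlds if given(*w))
    num = sum(joint(*w) for w in worlds if given(*w) and query(*w))
    return num / den


def burglary(b, q, a):
    return b == 1


prior = prob(burglary)
after_alarm = prob(burglary, given=lambda b, q, a: a == 1)
after_alarm_and_quake = prob(burglary, given=lambda b, q, a: a == 1 and q == 1)
quake_after_alarm = prob(lambda b, q, a: q == 1, given=lambda b, q, a: a == 1)

print(f"P(burglary)                = {prior:.4f}")
print(f"P(burglary | alarm)        = {after_alarm:.4f}")
print(f"P(burglary | alarm, quake) = {after_alarm_and_quake:.4f}")
print(f"P(quake | alarm)           = {quake_after_alarm:.4f}")
```

Under these assumed numbers the belief in burglary jumps from 0.01 to about 0.58 on hearing the alarm, then falls back to about 0.03 once the earthquake is known. The exact values depend entirely on the made-up table; the direction of the two moves is the point.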
The math, exactly.
Two fair independent coins $A, B \in \{0, 1\}$. Let $C = A \oplus B$ (XOR), so $C = 1$ exactly when the coins disagree.
Marginal:

$$P(A=1, B=1) = \tfrac{1}{4} = P(A=1)\,P(B=1) \quad \checkmark$$

Conditional on $C = 1$, only the outcomes $(0, 1)$ and $(1, 0)$ survive, each with probability $1/2$:

$$P(A=0 \mid C=1) = \tfrac{1}{2}, \quad P(B=0 \mid C=1) = \tfrac{1}{2}$$

$$P(A=0, B=0 \mid C=1) = 0$$

If $A$ and $B$ were conditionally independent given $C=1$, we'd need $P(A=0, B=0 \mid C=1) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}$. We get $0$ instead. The variables went from uncorrelated to perfectly anti-correlated by the act of conditioning. No causal arrow was added; the joint distribution alone, restricted to a slice, contains the dependence.

**Where this bites you in practice.** Most "controlling for X turned out to be wrong" stories in observational science are collider stories.

_Berkson's paradox_: in the general population, two diseases $A$ and $B$ are independent. Each independently raises the probability of hospitalization $C$. Among hospitalized patients, $A$ and $B$ are anti-correlated. Patients sick enough with $A$ to land in the hospital often don't _also_ have $B$, because $A$ alone was enough to get them admitted. A researcher studying only hospitalized patients sees a negative association and might conclude "$A$ protects against $B$". The association is real in the sample and absent in the population. The selection itself created it.

_Hollywood_: talent and looks are roughly independent in the population. Becoming famous requires talent OR looks (or both). Among the famous, talent and looks are anti-correlated. The data you see ("celebrities") is conditioned on a collider, so the world looks like it has a tradeoff between talent and looks that doesn't exist outside the sample.

_Stat-controlling gone wrong_: you're studying whether smoking causes lung cancer. Someone suggests "control for hospital admission" because admitted patients are easier to study. Smoking causes admission. Cancer causes admission. Admission is a collider. Conditioning on it induces a spurious negative correlation between smoking and cancer in the conditioned sample, which can mask or even reverse the true effect. This is why you can't just throw every available variable into a regression: controlling for a collider introduces bias rather than removing it.

**The deeper structural point.** Causal influence flows along arrows. _Bayesian information_ flows in both directions along arrows. The two graphs of "what causes what" and "what's evidence for what" aren't identical. In $A \to C \leftarrow B$ the causal graph has no path between $A$ and $B$. But once $C$ is observed, the _evidential_ graph has one: $A$'s value updates your belief about $C$ being the value it is, which updates your belief about $B$. That informational path didn't exist before observation; conditioning opened it. d-separation is the formal rule that tracks exactly when such paths open and close, and the collider rule is the part that captures this.

This is also why "explaining-away" feels like a non-physical interaction. It is non-physical. Nothing in the world changed when you learned the earthquake happened; the burglar wasn't deterred. What changed is your distribution over hypotheses, and that distribution lives entirely in your head (or in the inference machinery), not in the system being modeled.
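As a closing check, the two-coin calculation and the "talent OR looks" selection story can both be verified by brute-force enumeration with exact fractions. This is a minimal sketch with some assumptions: the collider in the coin example is taken to be the XOR of the two coins (which matches the surviving outcomes $(0, 1)$ and $(1, 0)$), the OR variant treats talent and looks as fair coins purely for illustration, and the helper name `cond_prob` is hypothetical. It checks the event $A=1, B=1$; for XOR this gives the same numbers as the $A=0, B=0$ check by symmetry.

```python
from fractions import Fraction
from itertools import product


def cond_prob(event, collider):
    """P(event | collider(A, B) == 1) for two fair, independent coins.

    All four (a, b) outcomes have probability 1/4, so conditioning
    reduces to counting the outcomes that survive the selection.
    """
    kept = [(a, b) for a, b in product((0, 1), repeat=2) if collider(a, b) == 1]
    hits = [(a, b) for a, b in kept if event(a, b)]
    return Fraction(len(hits), len(kept))


colliders = {
    "none": lambda a, b: 1,      # no selection: the marginal case
    "xor": lambda a, b: a ^ b,   # the two-coin example
    "or": lambda a, b: a | b,    # "talent OR looks" style selection
}

table = {}
for name, collider in colliders.items():
    p_a = cond_prob(lambda a, b: a == 1, collider)
    p_b = cond_prob(lambda a, b: b == 1, collider)
    p_ab = cond_prob(lambda a, b: a == 1 and b == 1, collider)
    table[name] = (p_a, p_b, p_ab)
    print(f"{name:4s} P(A=1)={p_a}  P(B=1)={p_b}  "
          f"P(A=1,B=1)={p_ab}  product={p_a * p_b}")
```

Independence holds exactly when the joint equals the product. Without selection it does (1/4 on both sides). Under XOR the joint is 0 against a product of 1/4, and under OR it is 1/3 against 4/9, so in both selected samples the two coins are negatively associated even though nothing links them causally.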