- Refactor from derivative
- Link to score function
- Read this after / while working through the optimization bootcamp
Starting from the chain rule applied to $\log p(x;\theta)$:
$$
\nabla_\theta \log p(x;\theta) = \frac{\nabla_\theta p(x;\theta)}{p(x;\theta)}
$$
Rearranging this gives us the fundamental form:
$$
\nabla_\theta p(x;\theta) = p(x;\theta) \cdot \nabla_\theta \log p(x;\theta)
$$
**Terminology note**: The term $\nabla_\theta \log p(x;\theta)$ is called the *score function* in statistics, hence "score function estimator."
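As a sanity check (not part of the derivation, just a numerical illustration), the identity can be verified with finite differences for a Gaussian whose mean is the parameter:

```python
import numpy as np
from scipy.stats import norm

# Check grad_theta log p(x; theta) = grad_theta p(x; theta) / p(x; theta)
# for p = N(theta, 1), using central finite differences.
theta, x, eps = 0.7, 1.3, 1e-5

p = lambda t: norm.pdf(x, loc=t, scale=1.0)

grad_log_p = (np.log(p(theta + eps)) - np.log(p(theta - eps))) / (2 * eps)
grad_p_over_p = (p(theta + eps) - p(theta - eps)) / (2 * eps) / p(theta)

# For N(x; theta, 1) the score w.r.t. theta is (x - theta), so both should be ~0.6.
print(grad_log_p, grad_p_over_p, x - theta)
```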
## Why This Matters: The Core Problem
Consider this ubiquitous problem in ML: we want to compute gradients of expectations:
$$
\nabla_\theta \mathbb{E}_{x \sim p(x;\theta)}[f(x)] = \nabla_\theta \int p(x;\theta)\, f(x)\, dx
$$
The challenge: the distribution $p(x;\theta)$ depends on the very parameters $\theta$ we're differentiating with respect to. Even after moving the gradient inside the integral, we are left with $\int \nabla_\theta p(x;\theta)\, f(x)\, dx$, which is not an expectation under $p(x;\theta)$ and therefore cannot be estimated by sampling from it.
Applying the identity $\nabla_\theta p(x;\theta) = p(x;\theta) \cdot \nabla_\theta \log p(x;\theta)$ resolves this:
$$
\nabla_\theta \mathbb{E}_{x \sim p(x;\theta)}[f(x)] = \int p(x;\theta)\, \nabla_\theta \log p(x;\theta)\, f(x)\, dx = \mathbb{E}_{x \sim p(x;\theta)}[f(x) \cdot \nabla_\theta \log p(x;\theta)]
$$
**Key insight**: We've transformed a gradient of an expectation (hard to compute) into an expectation of a gradient-weighted function (easy to estimate via sampling).
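A minimal sketch of the resulting estimator, assuming a toy setup where $p(x;\theta) = \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, so the true gradient $\nabla_\theta \mathbb{E}[f(x)] = 2\theta$ is known for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

def score_function_grad(theta, f, n_samples=200_000):
    """Monte Carlo estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)]
    via the score function estimator f(x) * grad_theta log p(x; theta)."""
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)
    score = x - theta              # score of N(theta, 1) w.r.t. the mean
    return np.mean(f(x) * score)

theta = 1.5
est = score_function_grad(theta, lambda x: x**2)
print(est, 2 * theta)  # E[x^2] = theta^2 + 1, so the true gradient is 2*theta = 3.0
```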
## Prerequisites & Mathematical Foundation
### Required Concepts
- **Gradient operator**: $\nabla_\theta = \left[\frac{\partial}{\partial \theta_1}, \ldots, \frac{\partial}{\partial \theta_n}\right]^T$
- **Chain rule**: $\frac{d}{dx} \log f(x) = \frac{1}{f(x)} \cdot \frac{df}{dx}$
- **Leibniz integral rule**: Under suitable conditions, $\frac{d}{d\theta} \int_a^b f(x,\theta)\, dx = \int_a^b \frac{\partial}{\partial \theta} f(x,\theta)\, dx$
### Regularity Conditions
- $p(x;\theta) > 0$ wherever $f(x) \neq 0$ (so the division by $p(x;\theta)$ is well-defined)
- Sufficient smoothness to interchange differentiation and integration
## Applications
### 1. Reinforcement Learning (Policy Gradients)
In RL, we maximize the expected return:
$$
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]
$$
where $\tau$ is a trajectory and $p_\theta(\tau)$ is the trajectory distribution under policy $\pi_\theta$.
**The gradient**:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau) \cdot \nabla_\theta \log p_\theta(\tau)]
$$
For a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$:
$$
\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T} \log P(s_{t+1}|s_t, a_t)
$$
Since the initial-state distribution and the transition dynamics don't depend on $\theta$:
$$
\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)
$$
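A sketch of how this gradient is typically formed with automatic differentiation, assuming a small categorical policy in PyTorch; the batch of states, actions, and returns below is placeholder data standing in for collected rollouts:

```python
import torch
import torch.nn as nn

# Toy categorical policy: logits = W s + b over n_actions, for illustration only.
n_states, n_actions = 4, 3
policy = nn.Linear(n_states, n_actions)

# Pretend we collected one trajectory; in a real agent these come from the environment.
states  = torch.randn(10, n_states)           # s_0 ... s_T (placeholder data)
actions = torch.randint(0, n_actions, (10,))  # a_0 ... a_T
returns = torch.randn(10)                     # return-to-go R_t per step (placeholder)

# Score-function (REINFORCE) surrogate: -sum_t R_t * log pi_theta(a_t | s_t).
# Minimizing it by gradient descent ascends the expected return.
log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -(returns.detach() * log_probs).sum()

loss.backward()  # policy parameter gradients now hold the REINFORCE estimate
```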
### 2. Variational Inference (ELBO Maximization)
In variational inference, we maximize the Evidence Lower BOund (ELBO):
$$
\mathcal{L}(\theta) = \mathbb{E}_{z \sim q_\theta(z)}[\log p(x,z) - \log q_\theta(z)]
$$
Because the expectation is taken under $q_\theta$, its gradient can again be estimated with the score function estimator; this is the basis of black-box variational inference.
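A rough sketch of the corresponding score-function (black-box VI) gradient estimator for a diagonal Gaussian $q_\theta$; the log joint `log_p_xz` below is a placeholder, not a model from this note:

```python
import torch

# Variational parameters of q_theta(z) = N(mu, exp(log_sigma)^2), illustrative only.
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)

def log_p_xz(z):
    # Placeholder log joint log p(x, z); a real model would go here.
    return -0.5 * ((z - 1.0) ** 2).sum(dim=-1)

q = torch.distributions.Normal(mu, log_sigma.exp())
z = q.sample((256,))                   # .sample() blocks any pathwise gradient
log_q = q.log_prob(z).sum(dim=-1)      # log q_theta(z), differentiable in (mu, log_sigma)

# Score-function ELBO gradient: E_q[(log p(x,z) - log q(z)) * grad log q(z)].
# Detaching the weight leaves exactly that estimator once autograd differentiates log_q.
weight = (log_p_xz(z) - log_q).detach()
surrogate = (weight * log_q).mean()
surrogate.backward()                   # mu.grad and log_sigma.grad hold the estimate
```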
## Variance Reduction
The main practical drawback of the score function estimator is high variance. Common remedies:
- **Control variates / baselines**: Use correlated variables with known expectations (see the sketch after this list)
- **Reparameterization trick**: When possible, use pathwise gradients instead (e.g., in VAEs)
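A small numerical illustration of the baseline idea, reusing the toy Gaussian setup from the earlier sketch ($p = \mathcal{N}(\theta, 1)$, $f(x) = x^2$):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n = 1.5, 5_000

x = rng.normal(theta, 1.0, size=n)
score = x - theta                 # grad_theta log N(x; theta, 1)
f = x ** 2

raw = f * score                   # plain score-function samples
baseline = theta**2 + 1.0         # known E[f(x)] here; in practice a learned/estimated baseline
cv = (f - baseline) * score       # E[baseline * score] = 0, so the mean is unchanged

# Both have mean ~2*theta = 3, but the baseline version has lower variance.
print(raw.mean(), cv.mean())
print(raw.var(), cv.var())
```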
## Connections to Related Concepts
### Fisher Information
The Fisher Information Matrix:
$$
\mathcal{I}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x) \cdot (\nabla_\theta \log p_\theta(x))^T\right]
$$
This measures the information content about parameters in the distribution.
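Because the Fisher matrix is itself an expectation of an outer product of scores, it can be estimated by sampling. A sketch for a Gaussian with parameters $(\mu, \sigma)$, where the closed form is known for comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 0.0, 2.0, 200_000

x = rng.normal(mu, sigma, size=n)

# Analytic score of N(mu, sigma^2) w.r.t. (mu, sigma), stacked as an (n, 2) array.
score = np.stack([(x - mu) / sigma**2,
                  ((x - mu)**2 - sigma**2) / sigma**3], axis=1)

# Monte Carlo estimate of the Fisher matrix E[score score^T].
fisher_mc = score.T @ score / n

# Closed form for this Gaussian: diag(1/sigma^2, 2/sigma^2).
print(fisher_mc)
print(np.diag([1 / sigma**2, 2 / sigma**2]))
```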
### Natural Gradients
Natural gradient descent uses:
$$
\tilde{\nabla}_\theta J(\theta) = \mathcal{I}(\theta)^{-1} \nabla_\theta J(\theta)
$$
This is intimately connected to the score function.
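A toy sketch of the preconditioning step, assuming the same Gaussian family as above (so the Fisher matrix is available in closed form); the gradient vector is a placeholder:

```python
import numpy as np

def natural_gradient(grad, fisher, damping=1e-4):
    """Precondition an ordinary gradient by the (damped) inverse Fisher matrix."""
    d = fisher.shape[0]
    return np.linalg.solve(fisher + damping * np.eye(d), grad)

# Gaussian N(mu, sigma^2) with theta = (mu, sigma): Fisher = diag(1/sigma^2, 2/sigma^2).
sigma = 2.0
fisher = np.diag([1 / sigma**2, 2 / sigma**2])
grad = np.array([0.3, -0.1])           # some ordinary gradient of an objective (placeholder)

print(natural_gradient(grad, fisher))  # larger steps along directions the distribution is less sensitive to
```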
### Importance Sampling
When we can't sample from $p_\theta$ directly:
$$
\mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim q}\left[f(x) \cdot \frac{p_\theta(x)}{q(x)}\right]
$$
The log derivative trick can be combined with importance sampling for off-policy learning.
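A sketch of that combination, again on the toy Gaussian setup: samples come from a fixed behavior distribution $q$, and the score-function estimate is importance-weighted back to $p_\theta$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
theta, n = 1.5, 200_000

# Behavior distribution q (fixed) vs. target p_theta = N(theta, 1).
q_mu, q_sigma = 0.0, 2.0
x = rng.normal(q_mu, q_sigma, size=n)

w = norm.pdf(x, theta, 1.0) / norm.pdf(x, q_mu, q_sigma)  # importance weights p_theta / q
score = x - theta                                         # grad_theta log p_theta(x)
f = x ** 2

# Off-policy score-function estimate of grad_theta E_{p_theta}[f(x)]; true value is 2*theta = 3.
print(np.mean(w * f * score))
```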
## Implementation Considerations
### Numerical Stability
- **Log-sum-exp trick**: When computing $\log p_\theta(x)$, use stable implementations
- **Gradient clipping**: Score functions can have large magnitudes
- **Standardization**: Often helpful to standardize advantages/returns (see the sketch after this list)
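A PyTorch-flavored sketch of these three points together; the policy, batch, and shapes are placeholders for illustration:

```python
import torch
import torch.nn as nn

# Placeholder policy and batch; in practice these come from rollouts.
policy = nn.Linear(4, 3)
states  = torch.randn(64, 4)
actions = torch.randint(0, 3, (64,))
returns = torch.randn(64) * 10 + 5        # raw returns, possibly large in scale

# Standardization: zero-mean, unit-variance returns reduce the scale (and variance) of the update.
adv = (returns - returns.mean()) / (returns.std() + 1e-8)

# Work with log-probabilities directly; log_softmax is the numerically stable route.
log_probs = torch.log_softmax(policy(states), dim=-1)
log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

loss = -(adv * log_pi_a).mean()
loss.backward()

# Gradient clipping keeps occasional huge score terms from destabilizing training.
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
```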
### Sample Efficiency
- More samples → lower variance in gradient estimates
- Trade-off between computation and variance reduction
- Batch size selection is critical
## Common Pitfalls & Clarifications
1. **Not zero-mean**: While $\mathbb{E}[\nabla_\theta \log p_\theta(x)] = 0$, the term $f(x) \cdot \nabla_\theta \log p_\theta(x)$ is generally not zero-mean (both facts are checked numerically in the sketch after this list)
2. **Biased with finite samples**: Individual gradient estimates are unbiased, but nonlinear functions of these estimates (e.g., Adam optimizer updates) can introduce bias
3. **Distribution support**: The trick assumes we can sample from $p_\theta$ - this isn't always tractable
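A quick numerical check of point 1, on the same toy Gaussian model used in earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 1.5
x = rng.normal(theta, 1.0, size=500_000)

score = x - theta          # grad_theta log N(x; theta, 1)
f = x ** 2

print(np.mean(score))      # ~0: the score itself is zero-mean under p_theta
print(np.mean(f * score))  # ~2*theta = 3: the weighted term is not
```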
## Advanced Extensions
### Rao-Blackwellization
If $x = (x_1, x_2)$ and we can compute $\mathbb{E}[f(x_1, x_2) \mid x_1]$ analytically:
$$
\mathbb{E}[f(x)] = \mathbb{E}_{x_1}\left[\mathbb{E}_{x_2 \mid x_1}[f(x_1, x_2)]\right]
$$
This reduces variance by computing part of the expectation exactly.
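A small illustration under an assumed model ($x_1 \sim \mathcal{N}(0,1)$, $x_2 \mid x_1 \sim \mathcal{N}(x_1, 1)$, $f = x_2^2$), where the inner conditional expectation $\mathbb{E}[x_2^2 \mid x_1] = x_1^2 + 1$ is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

x1 = rng.normal(0.0, 1.0, size=n)
x2 = rng.normal(x1, 1.0)                 # x2 | x1 ~ N(x1, 1)

naive = x2 ** 2                          # plain Monte Carlo samples of f(x1, x2)
rao_blackwell = x1 ** 2 + 1.0            # E[x2^2 | x1] computed analytically

# Same mean (E[f] = 2), but the Rao-Blackwellized samples have lower variance.
print(naive.mean(), rao_blackwell.mean())
print(naive.var(), rao_blackwell.var())
```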
## Summary: When to Use the Log Derivative Trick
**Use when**:
- The expectation's distribution depends on parameters you're optimizing
- The function $f(x)$ is non-differentiable or unknown
- You can sample from $p_\theta(x)$ but can't compute expectations analytically
- Working with discrete distributions (where reparameterization is impossible)
**Consider alternatives when**:
- Reparameterization is possible (continuous distributions with a differentiable sampling path, e.g., location-scale families)
- Analytical gradients are tractable
- Variance is prohibitively high even with reduction techniques
This technique bridges the gap between probabilistic modeling and gradient-based optimization, making it fundamental to modern ML approaches that involve stochastic computation graphs.