Refactor from derivative
Link to score function
Do this after / when reading optimization bootcamp.

The log derivative trick starts from the derivative of the log of a parameterized density,

$$
\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}
$$

Rearranging this gives us the fundamental form:

$$
\nabla_\theta p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)
$$

Terminology Note: The term $\nabla_\theta \log p_\theta(x)$ is called the score function in statistics, hence “score function estimator.”

Why This Matters: The Core Problem

Consider this ubiquitous problem in ML: we want to compute gradients of expectations:

$$
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)] = \nabla_\theta \int p_\theta(x) \, f(x) \, dx
$$

The challenge: the distribution itself depends on the parameters we’re differentiating with respect to, so naively pushing the gradient inside the integral gives $\int \nabla_\theta p_\theta(x) \, f(x) \, dx$, which is no longer an expectation under $p_\theta$ and therefore can’t be estimated by straightforward sampling.

The Trick in Action

Step-by-Step Derivation

Starting with our expectation gradient:

$$
\begin{aligned}
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)]
  &= \nabla_\theta \int p_\theta(x) \, f(x) \, dx \\
  &= \int \nabla_\theta p_\theta(x) \, f(x) \, dx && \text{(interchange, under regularity conditions)} \\
  &= \int p_\theta(x) \, \nabla_\theta \log p_\theta(x) \, f(x) \, dx && \text{(log derivative trick)} \\
  &= \mathbb{E}_{x \sim p_\theta}\!\left[ f(x) \, \nabla_\theta \log p_\theta(x) \right]
\end{aligned}
$$

Key Insight: We’ve transformed a gradient of an expectation (hard to compute) into an expectation of a gradient-weighted function (easy to estimate via sampling).
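To make this concrete, here is a minimal NumPy sketch; the Gaussian-with-unknown-mean setup and the particular choice of $f$ are illustrative assumptions, not fixed by the derivation above. It estimates $\nabla_\mu \, \mathbb{E}_{x \sim \mathcal{N}(\mu, 1)}[f(x)]$ with the score-function identity and checks it against a finite-difference estimate of the same expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Arbitrary "black-box" objective; it does not need to be differentiable.
    return np.maximum(x, 0.0)

def score_grad(mu, sigma=1.0, n=200_000):
    """Score-function estimate of d/d_mu E_{x ~ N(mu, sigma^2)}[f(x)]."""
    x = rng.normal(mu, sigma, size=n)
    score = (x - mu) / sigma**2          # grad_mu log N(x; mu, sigma^2)
    return np.mean(f(x) * score)

def finite_diff_grad(mu, sigma=1.0, n=200_000, eps=1e-2):
    """Central finite difference with common random numbers, as a check."""
    z = rng.normal(0.0, 1.0, size=n)
    return (np.mean(f(mu + eps + sigma * z)) - np.mean(f(mu - eps + sigma * z))) / (2 * eps)

mu = 0.3
print("score-function estimate:", score_grad(mu))        # both should be close
print("finite-difference check:", finite_diff_grad(mu))  # to Phi(0.3) ~= 0.618
```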

Prerequisites & Mathematical Foundation

Required Concepts

  1. Gradient operator: $\nabla_\theta = \left(\tfrac{\partial}{\partial \theta_1}, \ldots, \tfrac{\partial}{\partial \theta_d}\right)$ for parameters $\theta \in \mathbb{R}^d$

  2. Chain rule: $\nabla_\theta \log g(\theta) = \dfrac{\nabla_\theta g(\theta)}{g(\theta)}$

  3. Leibniz integral rule: Under suitable conditions, $\nabla_\theta \int f(x, \theta) \, dx = \int \nabla_\theta f(x, \theta) \, dx$

Regularity Conditions

  • $p_\theta(x) > 0$ wherever $f(x) \neq 0$ (to avoid division by zero)
  • Sufficient smoothness for interchange of differentiation and integration
  • The expectation must exist

Core Applications in ML/AI

1. Policy Gradient Methods (Reinforcement Learning)

In RL, we maximize expected return:

$$
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]
$$

where $\tau$ is a trajectory and $p_\theta(\tau)$ is the trajectory distribution under policy $\pi_\theta$.

**The gradient**:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau) \cdot \nabla_\theta \log p_\theta(\tau)]
$$

For a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T, s_{T+1})$:

$$
\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T} \log P(s_{t+1}|s_t, a_t)
$$

Since the transition dynamics $P(s_{t+1}|s_t, a_t)$ and the initial-state distribution $p(s_0)$ don't depend on $\theta$:

$$
\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)
$$
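A minimal REINFORCE-style sketch, assuming a toy one-step "bandit" environment with three actions and a softmax policy; the reward table, noise level, and learning rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step "bandit": 3 actions with fixed mean rewards (made up).
true_rewards = np.array([1.0, 2.0, 3.0])

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.zeros(3)   # policy logits, i.e. pi_theta(a) = softmax(theta)[a]
lr = 0.1

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = true_rewards[a] + rng.normal(0.0, 0.5)   # noisy sampled return R(tau)

    # For a softmax policy, grad_theta log pi_theta(a) = one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0

    # REINFORCE update: R(tau) * grad_theta log p_theta(tau)
    theta += lr * r * grad_log_pi

print(softmax(theta))   # should concentrate on the highest-reward action (index 2)
```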

2. Variational Inference (ELBO Maximization)

In variational inference, we maximize the Evidence Lower BOund (ELBO):

$$
\mathcal{L}(\theta) = \mathbb{E}_{z \sim q_\theta(z)}[\log p(x,z) - \log q_\theta(z)]
$$

The gradient:

$$
\nabla_\theta \mathcal{L} = \mathbb{E}_{z \sim q_\theta}[(\log p(x,z) - \log q_\theta(z)) \cdot \nabla_\theta \log q_\theta(z)]
$$

(The direct $-\nabla_\theta \log q_\theta(z)$ term from differentiating the integrand vanishes in expectation, because $\mathbb{E}_{z \sim q_\theta}[\nabla_\theta \log q_\theta(z)] = 0$.)

Alternative names: This is also called the REINFORCE gradient estimator in the VAE literature.
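A sketch of this estimator for a deliberately simple model, assuming $p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, and a Gaussian variational posterior $q_\theta(z) = \mathcal{N}(m, s^2)$ with $\theta = (m, \log s)$; all of these modeling choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = 1.5    # a single observed data point (made up)

def log_normal(z, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (z - mean)**2 / (2 * std**2)

def elbo_score_grad(m, log_s, n=50_000):
    """Score-function (REINFORCE) estimate of the ELBO gradient for
    p(z) = N(0, 1), p(x|z) = N(z, 1), q_theta(z) = N(m, exp(log_s)^2)."""
    s = np.exp(log_s)
    z = rng.normal(m, s, size=n)

    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x_obs, z, 1.0)
    log_q = log_normal(z, m, s)
    weight = log_joint - log_q                  # the bracketed term in the estimator

    score_m = (z - m) / s**2                    # grad_m log q_theta(z)
    score_log_s = (z - m)**2 / s**2 - 1.0       # grad_{log s} log q_theta(z)

    return np.mean(weight * score_m), np.mean(weight * score_log_s)

print(elbo_score_grad(0.0, 0.0))
# The exact posterior is N(0.75, 0.5); the exact ELBO gradient at (m, log_s) = (0, 0)
# is (1.5, -1.0), so the estimates should land near those values.
```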

3. Black-Box Optimization

For a non-differentiable objective $f(x)$:

$$
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}[f(x) \cdot \nabla_\theta \log p_\theta(x)]
$$

This enables gradient-based optimization even when $f$ itself isn't differentiable!

## Concrete Example: Gaussian Distribution

Let $p_\theta(x) = \mathcal{N}(x; \mu, \sigma^2)$ where $\theta = [\mu, \sigma]$. The log probability:

$$
\log p_\theta(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}
$$

Score functions:

$$
\nabla_\mu \log p_\theta(x) = \frac{x-\mu}{\sigma^2}, \qquad
\nabla_\sigma \log p_\theta(x) = -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3}
$$

**Verification of a key property**:

$$
\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0
$$

This always holds: it’s the derivative of the normalization identity $\int p_\theta(x) \, dx = 1$, which is constant in $\theta$.
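A quick numerical check of the zero-mean property, using the Gaussian score functions above; the particular $\mu$, $\sigma$, and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
x = rng.normal(mu, sigma, size=1_000_000)

# Score functions of N(mu, sigma^2) evaluated at samples drawn from it
score_mu = (x - mu) / sigma**2
score_sigma = -1.0 / sigma + (x - mu)**2 / sigma**3

print(score_mu.mean(), score_sigma.mean())   # both should be close to 0
```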

Variance Considerations

The Variance Problem

The log derivative trick produces unbiased but potentially high-variance gradient estimates: the variance of $f(x) \cdot \nabla_\theta \log p_\theta(x)$ grows with the magnitude of $f$ and the spread of the score, and it can easily drown out the signal when $f$ is large or poorly scaled.

Variance Reduction Techniques

  1. Baseline subtraction: Replace $f(x)$ with $f(x) - b$, where $b$ is a baseline that does not depend on $x$ (see the sketch after this list)

    • Maintains unbiasedness since $\mathbb{E}_{x \sim p_\theta}[b \cdot \nabla_\theta \log p_\theta(x)] = b \cdot \mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0$
    • Optimal baseline: $b^* = \dfrac{\mathbb{E}\left[f(x) \, \lVert \nabla_\theta \log p_\theta(x) \rVert^2\right]}{\mathbb{E}\left[\lVert \nabla_\theta \log p_\theta(x) \rVert^2\right]}$
  2. Control variates: Use correlated variables with known expectations

  3. Reparameterization trick: When possible, use pathwise gradients instead (e.g., in VAEs)
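A toy sketch of baseline subtraction, reusing the Gaussian-mean setup from earlier; the objective $f$ and the simple constant baseline (the mean of $f$ estimated from an independent batch) are illustrative choices, not the optimal baseline above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.3, 1.0, 100_000

def f(x):
    return (x - 2.0)**2 + 10.0     # arbitrary objective with a large positive offset

x = rng.normal(mu, sigma, size=n)
score = (x - mu) / sigma**2        # grad_mu log N(x; mu, sigma^2)

# Constant baseline estimated from an independent batch (keeps the estimator unbiased)
b = f(rng.normal(mu, sigma, size=n)).mean()

plain = f(x) * score               # vanilla score-function terms
centered = (f(x) - b) * score      # baseline-subtracted terms

print("estimate (plain)    :", plain.mean())      # both ~= 2*(mu - 2) = -3.4
print("estimate (baseline) :", centered.mean())
print("variance reduction  : %.1fx" % (plain.var() / centered.var()))
```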

Fisher Information

The Fisher Information Matrix:

$$
\mathcal{I}(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x) \cdot (\nabla_\theta \log p_\theta(x))^T\right]
$$

This measures how much information samples from $p_\theta$ carry about the parameters $\theta$.

### Natural Gradients

Natural gradient descent preconditions the ordinary gradient of an objective $J(\theta)$ with the inverse Fisher matrix:

$$
\tilde{\nabla}_\theta J = \mathcal{I}(\theta)^{-1} \nabla_\theta J
$$

This is intimately connected to the score function: the Fisher matrix is exactly the covariance of the score (which, as verified above, has zero mean).
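A small sketch that estimates the Fisher matrix for the Gaussian example by averaging outer products of the score and compares it with the known analytic Fisher for $\theta = (\mu, \sigma)$; the numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 2.0, 1_000_000
x = rng.normal(mu, sigma, size=n)

# Score vector for theta = (mu, sigma), one row per sample
scores = np.stack([(x - mu) / sigma**2,
                   -1.0 / sigma + (x - mu)**2 / sigma**3], axis=1)

fisher = scores.T @ scores / n    # Monte Carlo estimate of E[score score^T]
print(fisher)
# Analytic Fisher for N(mu, sigma^2) with theta = (mu, sigma) is
# diag(1/sigma^2, 2/sigma^2) = diag(0.25, 0.5).

# A natural-gradient step would precondition an ordinary gradient g:
#   theta_new = theta + lr * np.linalg.solve(fisher, g)
```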

Importance Sampling

When we can’t sample from $p_\theta$ directly:

$$
\mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim q}\!\left[f(x) \cdot \frac{p_\theta(x)}{q(x)}\right]
$$
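Combined with the score function, this weighting gives an off-policy gradient estimator. A minimal sketch, assuming a Gaussian target and a deliberately wider Gaussian proposal (both chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.3, 1.0       # target p_theta = N(mu, sigma^2)
q_mu, q_sigma = 1.0, 2.0   # proposal q = N(q_mu, q_sigma^2), deliberately wider

def f(x):
    return np.maximum(x, 0.0)

def normal_pdf(x, mean, std):
    return np.exp(-(x - mean)**2 / (2 * std**2)) / (std * np.sqrt(2 * np.pi))

x = rng.normal(q_mu, q_sigma, size=500_000)                   # sample from q, not p_theta
w = normal_pdf(x, mu, sigma) / normal_pdf(x, q_mu, q_sigma)   # ratio p_theta / q
score_mu = (x - mu) / sigma**2                                # grad_mu log p_theta(x)

print(np.mean(w * f(x) * score_mu))   # grad_mu E_{p_theta}[f] = Phi(0.3) ~= 0.618
```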

The log derivative trick can be combined with importance sampling for off-policy learning.

## Implementation Considerations

### Numerical Stability

- **Log-sum-exp trick**: When computing $\log p_\theta(x)$, use stable implementations
- **Gradient clipping**: Score functions can have large magnitudes
- **Standardization**: Often helpful to standardize advantages/returns

### Sample Efficiency

- More samples → lower variance in gradient estimates
- Trade-off between computation and variance reduction
- Batch size selection is critical

## Common Pitfalls & Clarifications

1. **Not zero-mean**: While $\mathbb{E}[\nabla_\theta \log p_\theta(x)] = 0$, the term $f(x) \cdot \nabla_\theta \log p_\theta(x)$ is generally not zero-mean
2. **Biased with finite samples**: Individual gradient estimates are unbiased, but nonlinear functions of these estimates (e.g., Adam optimizer updates) can introduce bias
3. **Distribution support**: The trick assumes we can sample from $p_\theta$; this isn't always tractable

## Advanced Extensions

### Rao-Blackwellization

If $x = (x_1, x_2)$ and we can compute $\mathbb{E}[f(x_1, x_2)|x_1]$ analytically:

$$
\mathbb{E}[f(x)] = \mathbb{E}_{x_1}\!\left[\mathbb{E}_{x_2|x_1}[f(x_1, x_2)]\right]
$$

This reduces variance by computing part of the expectation exactly.
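A tiny sketch of the idea, assuming a made-up two-part variable where $x_1$ is Bernoulli and $x_2 \mid x_1$ is Gaussian, so the inner conditional expectation is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# x1 ~ Bernoulli(0.3);  x2 | x1 ~ N(m[x1], 1);  f(x1, x2) = x2**2
m = np.array([0.0, 5.0])
x1 = rng.binomial(1, 0.3, size=n)
x2 = rng.normal(m[x1], 1.0)

naive = (x2**2).mean()                     # plain Monte Carlo over (x1, x2)
rao_blackwell = (m[x1]**2 + 1.0).mean()    # uses E[x2^2 | x1] = m[x1]^2 + 1 exactly

print(naive, rao_blackwell)   # both estimate E[f] = 0.7*1 + 0.3*26 = 8.5,
                              # but the second has much lower variance
```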

Multiple Sampling Distributions

The multiple importance sampling variant (with weights $w_i$ that sum to one):

$$
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)] = \sum_i w_i \, \mathbb{E}_{x \sim q_i}\!\left[f(x) \cdot \frac{p_\theta(x)}{q_i(x)} \cdot \nabla_\theta \log p_\theta(x)\right]
$$

## Summary: When to Use the Log Derivative Trick

**Use when**:

- The expectation's distribution depends on parameters you're optimizing
- The function $f(x)$ is non-differentiable or unknown
- You can sample from $p_\theta(x)$ but can't compute expectations analytically
- Working with discrete distributions (where reparameterization is impossible)

**Consider alternatives when**:

- Reparameterization is possible (continuous distributions with known inverse CDFs)
- Analytical gradients are tractable
- Variance is prohibitively high even with reduction techniques

This technique bridges the gap between probabilistic modeling and gradient-based optimization, making it fundamental to modern ML approaches that involve stochastic computation graphs.