Refactor from derivative
Link to score function
Do this after / when reading optimization bootcamp.

The log derivative trick starts from the derivative of the log of a parameterized density,

$$
\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}
$$

Rearranging this gives us the fundamental form:

$$
\nabla_\theta p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)
$$

Terminology Note: The term $\nabla_\theta \log p_\theta(x)$ is called the score function in statistics, hence “score function estimator.”

Why This Matters: The Core Problem

Consider this ubiquitous problem in ML: we want to compute gradients of expectations:

$$
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)] = \nabla_\theta \int p_\theta(x) \, f(x) \, dx
$$

The challenge: the distribution itself depends on the parameters we’re differentiating with respect to, so naively pushing the gradient inside the integral gives $\int \nabla_\theta p_\theta(x) \, f(x) \, dx$, which is no longer an expectation under $p_\theta$ and therefore can’t be estimated by straightforward sampling.

The Trick in Action

Step-by-Step Derivation

Starting with our expectation gradient:

$$
\begin{aligned}
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)]
  &= \nabla_\theta \int p_\theta(x) \, f(x) \, dx \\
  &= \int \nabla_\theta p_\theta(x) \, f(x) \, dx && \text{(interchange, under regularity conditions)} \\
  &= \int p_\theta(x) \, \nabla_\theta \log p_\theta(x) \, f(x) \, dx && \text{(log derivative trick)} \\
  &= \mathbb{E}_{x \sim p_\theta}\!\left[ f(x) \, \nabla_\theta \log p_\theta(x) \right]
\end{aligned}
$$

Key Insight: We’ve transformed a gradient of an expectation (hard to compute) into an expectation of a gradient-weighted function (easy to estimate via sampling).
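To make this concrete, here is a minimal NumPy sketch; the Gaussian-with-unknown-mean setup and the particular choice of $f$ are illustrative assumptions, not fixed by the derivation above. It estimates $\nabla_\mu \, \mathbb{E}_{x \sim \mathcal{N}(\mu, 1)}[f(x)]$ with the score-function identity and checks it against a finite-difference estimate of the same expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Arbitrary "black-box" objective; it does not need to be differentiable.
    return np.maximum(x, 0.0)

def score_grad(mu, sigma=1.0, n=200_000):
    """Score-function estimate of d/d_mu E_{x ~ N(mu, sigma^2)}[f(x)]."""
    x = rng.normal(mu, sigma, size=n)
    score = (x - mu) / sigma**2          # grad_mu log N(x; mu, sigma^2)
    return np.mean(f(x) * score)

def finite_diff_grad(mu, sigma=1.0, n=200_000, eps=1e-2):
    """Central finite difference with common random numbers, as a check."""
    z = rng.normal(0.0, 1.0, size=n)
    return (np.mean(f(mu + eps + sigma * z)) - np.mean(f(mu - eps + sigma * z))) / (2 * eps)

mu = 0.3
print("score-function estimate:", score_grad(mu))        # both should be close
print("finite-difference check:", finite_diff_grad(mu))  # to Phi(0.3) ~= 0.618
```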

Prerequisites & Mathematical Foundation

Required Concepts

  1. Gradient operator: $\nabla_\theta = \left(\tfrac{\partial}{\partial \theta_1}, \ldots, \tfrac{\partial}{\partial \theta_d}\right)$ for parameters $\theta \in \mathbb{R}^d$

  2. Chain rule: $\nabla_\theta \log g(\theta) = \dfrac{\nabla_\theta g(\theta)}{g(\theta)}$

  3. Leibniz integral rule: Under suitable conditions, $\nabla_\theta \int f(x, \theta) \, dx = \int \nabla_\theta f(x, \theta) \, dx$

Regularity Conditions

  • $p_\theta(x) > 0$ wherever $f(x) \neq 0$ (to avoid division by zero)
  • Sufficient smoothness for interchange of differentiation and integration
  • The expectation must exist

Core Applications in ML/AI

1. Policy Gradient Methods (Reinforcement Learning)

In RL, we maximize expected return:

$$
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]
$$

where $\tau$ is a trajectory and $p_\theta(\tau)$ is the trajectory distribution under policy $\pi_\theta$.

**The gradient**:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau) \cdot \nabla_\theta \log p_\theta(\tau)]
$$

For a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T, s_{T+1})$:

$$
\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T} \log P(s_{t+1}|s_t, a_t)
$$

Since the transition dynamics $P(s_{t+1}|s_t, a_t)$ and the initial-state distribution $p(s_0)$ don't depend on $\theta$:

$$
\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)
$$
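A minimal REINFORCE-style sketch, assuming a toy one-step "bandit" environment with three actions and a softmax policy; the reward table, noise level, and learning rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step "bandit": 3 actions with fixed mean rewards (made up).
true_rewards = np.array([1.0, 2.0, 3.0])

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.zeros(3)   # policy logits, i.e. pi_theta(a) = softmax(theta)[a]
lr = 0.1

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = true_rewards[a] + rng.normal(0.0, 0.5)   # noisy sampled return R(tau)

    # For a softmax policy, grad_theta log pi_theta(a) = one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0

    # REINFORCE update: R(tau) * grad_theta log p_theta(tau)
    theta += lr * r * grad_log_pi

print(softmax(theta))   # should concentrate on the highest-reward action (index 2)
```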

2. Variational Inference (ELBO Maximization)

In variational inference, we maximize the Evidence Lower BOund (ELBO):

$$
\mathcal{L}(\theta) = \mathbb{E}_{z \sim q_\theta(z)}[\log p(x,z) - \log q_\theta(z)]
$$

The gradient:

$$
\nabla_\theta \mathcal{L} = \mathbb{E}_{z \sim q_\theta}[(\log p(x,z) - \log q_\theta(z)) \cdot \nabla_\theta \log q_\theta(z)]
$$

(The direct $-\nabla_\theta \log q_\theta(z)$ term from differentiating the integrand vanishes in expectation, because $\mathbb{E}_{z \sim q_\theta}[\nabla_\theta \log q_\theta(z)] = 0$.)

Alternative names: This is also called the REINFORCE gradient estimator in the VAE literature.
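A sketch of this estimator for a deliberately simple model, assuming $p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, and a Gaussian variational posterior $q_\theta(z) = \mathcal{N}(m, s^2)$ with $\theta = (m, \log s)$; all of these modeling choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = 1.5    # a single observed data point (made up)

def log_normal(z, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (z - mean)**2 / (2 * std**2)

def elbo_score_grad(m, log_s, n=50_000):
    """Score-function (REINFORCE) estimate of the ELBO gradient for
    p(z) = N(0, 1), p(x|z) = N(z, 1), q_theta(z) = N(m, exp(log_s)^2)."""
    s = np.exp(log_s)
    z = rng.normal(m, s, size=n)

    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x_obs, z, 1.0)
    log_q = log_normal(z, m, s)
    weight = log_joint - log_q                  # the bracketed term in the estimator

    score_m = (z - m) / s**2                    # grad_m log q_theta(z)
    score_log_s = (z - m)**2 / s**2 - 1.0       # grad_{log s} log q_theta(z)

    return np.mean(weight * score_m), np.mean(weight * score_log_s)

print(elbo_score_grad(0.0, 0.0))
# The exact posterior is N(0.75, 0.5); the exact ELBO gradient at (m, log_s) = (0, 0)
# is (1.5, -1.0), so the estimates should land near those values.
```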

3. Black-Box Optimization

For a non-differentiable objective $f(x)$:

$$
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim p_\theta}[f(x) \cdot \nabla_\theta \log p_\theta(x)]
$$

This enables gradient-based optimization even when $f$ itself isn't differentiable!

## Concrete Example: Gaussian Distribution

Let $p_\theta(x) = \mathcal{N}(x; \mu, \sigma^2)$ where $\theta = [\mu, \sigma]$. The log probability:

$$
\log p_\theta(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}
$$

Score functions:

$$
\nabla_\mu \log p_\theta(x) = \frac{x-\mu}{\sigma^2}, \qquad
\nabla_\sigma \log p_\theta(x) = -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3}
$$

**Verification of a key property**:

$$
\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0
$$

This always holds: it’s the derivative of the normalization identity $\int p_\theta(x) \, dx = 1$, which is constant in $\theta$.
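A quick numerical check of the zero-mean property, using the Gaussian score functions above; the particular $\mu$, $\sigma$, and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
x = rng.normal(mu, sigma, size=1_000_000)

# Score functions of N(mu, sigma^2) evaluated at samples drawn from it
score_mu = (x - mu) / sigma**2
score_sigma = -1.0 / sigma + (x - mu)**2 / sigma**3

print(score_mu.mean(), score_sigma.mean())   # both should be close to 0
```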

Variance Considerations

The Variance Problem

The log derivative trick produces unbiased but potentially high-variance gradient estimates: the variance of $f(x) \cdot \nabla_\theta \log p_\theta(x)$ grows with the magnitude of $f$ and the spread of the score, and it can easily drown out the signal when $f$ is large or poorly scaled.

Variance Reduction Techniques

  1. Baseline subtraction: Replace $f(x)$ with $f(x) - b$, where $b$ is a baseline that does not depend on $x$ (see the sketch after this list)

    • Maintains unbiasedness since $\mathbb{E}_{x \sim p_\theta}[b \cdot \nabla_\theta \log p_\theta(x)] = b \cdot \mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0$
    • Optimal baseline: $b^* = \dfrac{\mathbb{E}\left[f(x) \, \lVert \nabla_\theta \log p_\theta(x) \rVert^2\right]}{\mathbb{E}\left[\lVert \nabla_\theta \log p_\theta(x) \rVert^2\right]}$
  2. Control variates: Use correlated variables with known expectations

  3. Reparameterization trick: When possible, use pathwise gradients instead (e.g., in VAEs)
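A toy sketch of baseline subtraction, reusing the Gaussian-mean setup from earlier; the objective $f$ and the simple constant baseline (the mean of $f$ estimated from an independent batch) are illustrative choices, not the optimal baseline above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.3, 1.0, 100_000

def f(x):
    return (x - 2.0)**2 + 10.0     # arbitrary objective with a large positive offset

x = rng.normal(mu, sigma, size=n)
score = (x - mu) / sigma**2        # grad_mu log N(x; mu, sigma^2)

# Constant baseline estimated from an independent batch (keeps the estimator unbiased)
b = f(rng.normal(mu, sigma, size=n)).mean()

plain = f(x) * score               # vanilla score-function terms
centered = (f(x) - b) * score      # baseline-subtracted terms

print("estimate (plain)    :", plain.mean())      # both ~= 2*(mu - 2) = -3.4
print("estimate (baseline) :", centered.mean())
print("variance reduction  : %.1fx" % (plain.var() / centered.var()))
```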

Fisher Information

The Fisher Information Matrix:

$$
\mathcal{I}(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x) \cdot (\nabla_\theta \log p_\theta(x))^T\right]
$$

This measures how much information samples from $p_\theta$ carry about the parameters $\theta$.

### Natural Gradients

Natural gradient descent preconditions the ordinary gradient of an objective $J(\theta)$ with the inverse Fisher matrix:

$$
\tilde{\nabla}_\theta J = \mathcal{I}(\theta)^{-1} \nabla_\theta J
$$

This is intimately connected to the score function: the Fisher matrix is exactly the covariance of the score (which, as verified above, has zero mean).
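A small sketch that estimates the Fisher matrix for the Gaussian example by averaging outer products of the score and compares it with the known analytic Fisher for $\theta = (\mu, \sigma)$; the numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 2.0, 1_000_000
x = rng.normal(mu, sigma, size=n)

# Score vector for theta = (mu, sigma), one row per sample
scores = np.stack([(x - mu) / sigma**2,
                   -1.0 / sigma + (x - mu)**2 / sigma**3], axis=1)

fisher = scores.T @ scores / n    # Monte Carlo estimate of E[score score^T]
print(fisher)
# Analytic Fisher for N(mu, sigma^2) with theta = (mu, sigma) is
# diag(1/sigma^2, 2/sigma^2) = diag(0.25, 0.5).

# A natural-gradient step would precondition an ordinary gradient g:
#   theta_new = theta + lr * np.linalg.solve(fisher, g)
```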

Importance Sampling

When we can’t sample from $p_\theta$ directly:

$$
\mathbb{E}_{x \sim p_\theta}[f(x)] = \mathbb{E}_{x \sim q}\!\left[f(x) \cdot \frac{p_\theta(x)}{q(x)}\right]
$$
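Combined with the score function, this weighting gives an off-policy gradient estimator. A minimal sketch, assuming a Gaussian target and a deliberately wider Gaussian proposal (both chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.3, 1.0       # target p_theta = N(mu, sigma^2)
q_mu, q_sigma = 1.0, 2.0   # proposal q = N(q_mu, q_sigma^2), deliberately wider

def f(x):
    return np.maximum(x, 0.0)

def normal_pdf(x, mean, std):
    return np.exp(-(x - mean)**2 / (2 * std**2)) / (std * np.sqrt(2 * np.pi))

x = rng.normal(q_mu, q_sigma, size=500_000)                   # sample from q, not p_theta
w = normal_pdf(x, mu, sigma) / normal_pdf(x, q_mu, q_sigma)   # ratio p_theta / q
score_mu = (x - mu) / sigma**2                                # grad_mu log p_theta(x)

print(np.mean(w * f(x) * score_mu))   # grad_mu E_{p_theta}[f] = Phi(0.3) ~= 0.618
```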

The log derivative trick can be combined with importance sampling for off-policy learning.

## Implementation Considerations

### Numerical Stability

- **Log-sum-exp trick**: When computing $\log p_\theta(x)$, use stable implementations
- **Gradient clipping**: Score functions can have large magnitudes
- **Standardization**: Often helpful to standardize advantages/returns

### Sample Efficiency

- More samples → lower variance in gradient estimates
- Trade-off between computation and variance reduction
- Batch size selection is critical

## Common Pitfalls & Clarifications

1. **Not zero-mean**: While $\mathbb{E}[\nabla_\theta \log p_\theta(x)] = 0$, the term $f(x) \cdot \nabla_\theta \log p_\theta(x)$ is generally not zero-mean
2. **Biased with finite samples**: Individual gradient estimates are unbiased, but nonlinear functions of these estimates (e.g., Adam optimizer updates) can introduce bias
3. **Distribution support**: The trick assumes we can sample from $p_\theta$; this isn't always tractable

## Advanced Extensions

### Rao-Blackwellization

If $x = (x_1, x_2)$ and we can compute $\mathbb{E}[f(x_1, x_2)|x_1]$ analytically:

$$
\mathbb{E}[f(x)] = \mathbb{E}_{x_1}\!\left[\mathbb{E}_{x_2|x_1}[f(x_1, x_2)]\right]
$$

This reduces variance by computing part of the expectation exactly.
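A tiny sketch of the idea, assuming a made-up two-part variable where $x_1$ is Bernoulli and $x_2 \mid x_1$ is Gaussian, so the inner conditional expectation is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# x1 ~ Bernoulli(0.3);  x2 | x1 ~ N(m[x1], 1);  f(x1, x2) = x2**2
m = np.array([0.0, 5.0])
x1 = rng.binomial(1, 0.3, size=n)
x2 = rng.normal(m[x1], 1.0)

naive = (x2**2).mean()                     # plain Monte Carlo over (x1, x2)
rao_blackwell = (m[x1]**2 + 1.0).mean()    # uses E[x2^2 | x1] = m[x1]^2 + 1 exactly

print(naive, rao_blackwell)   # both estimate E[f] = 0.7*1 + 0.3*26 = 8.5,
                              # but the second has much lower variance
```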

Multiple Sampling Distributions

The multiple importance sampling variant (with weights $w_i$ that sum to one):

$$
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[f(x)] = \sum_i w_i \, \mathbb{E}_{x \sim q_i}\!\left[f(x) \cdot \frac{p_\theta(x)}{q_i(x)} \cdot \nabla_\theta \log p_\theta(x)\right]
$$

## Summary: When to Use the Log Derivative Trick

**Use when**:

- The expectation's distribution depends on parameters you're optimizing
- The function $f(x)$ is non-differentiable or unknown
- You can sample from $p_\theta(x)$ but can't compute expectations analytically
- Working with discrete distributions (where reparameterization is impossible)

**Consider alternatives when**:

- Reparameterization is possible (continuous distributions with known inverse CDFs)
- Analytical gradients are tractable
- Variance is prohibitively high even with reduction techniques

This technique bridges the gap between probabilistic modeling and gradient-based optimization, making it fundamental to modern ML approaches that involve stochastic computation graphs.