EBMs stand in contrast with causal models

Goal: minimize the energy of a system according to some energy function
The core use-case is learning a good representation of something. A representation is good if you can do useful stuff with it.

… E(X, Y), where E is some sort of compatibility measure (I picked E arbitrarily as the symbol).

… energy is just like the negative loss. We use energy in the test-time-compute context; the difference with EBMs is that we learn the energy function E. So it’s like a (negative) reward model?
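A minimal sketch of what such a learned energy function could look like (PyTorch; the architecture and names are my own assumptions, not from the source): a network that maps a pair (X, Y) to a single scalar.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Maps a pair (x, y) to one scalar: low energy = compatible pair."""
    def __init__(self, x_dim: int, y_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),  # the scalar energy E(x, y)
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)
```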

… energy landscape, which you want to minimize.

EBMs are a generalization of diffusion: diffusion models essentially amortize the backward pass of an EBM.

The EBM outputs the “loss” (which you backpropagate through at inference time), while the diffusion model directly outputs the gradient that transforms the datapoint.
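To make the contrast concrete, a hedged sketch (reusing the EnergyNet above; score_net is a hypothetical diffusion-style network): the EBM derives its update direction by backpropagating through the scalar energy, while the diffusion model outputs the direction itself.

```python
import torch

def ebm_inference(energy_net, x, y_init, steps: int = 50, lr: float = 0.1):
    """Minimize E(x, y) over y by gradient descent. `steps` is a free
    test-time knob: more steps = more compute spent on the answer."""
    y = y_init.clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_net(x, y).sum()
        (grad,) = torch.autograd.grad(energy, y)  # backprop through the "loss"
        y = (y - lr * grad).detach().requires_grad_(True)
    return y.detach()

def diffusion_inference(score_net, x, y, steps: int = 50, lr: float = 0.1):
    """The network outputs the update direction directly (the score,
    roughly -grad E); no backward pass through an energy is needed."""
    for _ in range(steps):
        y = y + lr * score_net(x, y)
    return y
```

Note how `steps` in ebm_inference can be raised arbitrarily at test time, which is what the unbounded-compute quotes below are about.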

“Inference by energy minimization is intrinsically more powerful than inference by feed-forward propagation.” - LeCun

“EBTs can technically continue for an unbounded amount of time, meaning they can literally compute problems a feed-forward transformer cannot.” - Author

How to train an EBM?

0]

An input X with its corresponding answer Y (a matching pair) should produce 0 energy.

This alone would make it collapse (always predict 0 energy).
→ Need a way to make the energy higher everywhere else, i.e. for pairs (X, Y′) with Y′ ≠ Y
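A sketch of the degenerate objective (hypothetical, just to show the failure mode): nothing in this loss prevents the constant solution E ≡ 0.

```python
def naive_loss(energy_net, x, y):
    # Pull the energy of correct pairs to 0 ... and nothing else.
    # A network that outputs 0 everywhere minimizes this perfectly: collapse.
    return energy_net(x, y).pow(2).mean()
```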

1]

EBL (energy-based learning) using contrastive learning: show negative pairs, where the energy should go up.

Push E(X, Y) down for matching pairs; push E(X, Y′) up for unrelated Y′.
Curse of dimensionality: it is infeasible to collect enough negative examples to push the energy up everywhere in high-dimensional spaces.
You get a jagged energy landscape, with occasional troughs around datapoints.
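A hedged sketch of the contrastive variant (a margin hinge loss is a common choice; the margin value is an assumption): positives are pulled down relative to negatives, but the energy only rises at the points you actually sampled, hence the jaggedness.

```python
import torch

def contrastive_loss(energy_net, x, y_pos, y_neg, margin: float = 1.0):
    """Hinge loss: zero once the negative's energy exceeds the
    positive's by the margin; otherwise pull e_pos down, push e_neg up."""
    e_pos = energy_net(x, y_pos)
    e_neg = energy_net(x, y_neg)
    return torch.relu(margin + e_pos - e_neg).mean()
```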

2]

Regularizing methods that minimize the volume of space that can take on low energy (a sketch follows after this list).

  • LLMs actually do this too, but very implicitly: the softmax is a finite probability budget over tokens, so raising the probability of one continuation necessarily lowers it everywhere else.
    • There is no latent variable to vary / optimize over the joint probability of the symbols in a text; instead, LLMs factorize the joint probability into conditional probabilities over successive tokens.
    • What you want for LLMs is not to search for the string of tokens that minimizes the energy, but for the abstract representation that minimizes it, given a reward model for good answers.
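As a concrete instance of this regularized family (my example, not from the notes above): sparse coding. The L1 penalty on the latent code z limits which inputs can be reconstructed well, i.e. it shrinks the volume of low-energy space without needing any negative samples.

```python
import torch

def sparse_coding_energy(W, x, z, lam: float = 0.1):
    """E(x, z) = ||x - W z||^2 + lam * ||z||_1.
    The sparsity term is the regularizer: only inputs reachable from a
    sparse code can reach low energy."""
    recon = z @ W.T  # W: (x_dim, z_dim)
    return ((x - recon) ** 2).sum(-1) + lam * z.abs().sum(-1)

def infer_z(W, x, steps: int = 100, lr: float = 0.05, lam: float = 0.1):
    """Inference = minimizing the energy over the latent code z."""
    z = torch.zeros(x.shape[0], W.shape[1], requires_grad=True)
    for _ in range(steps):
        e = sparse_coding_energy(W, x, z, lam).sum()
        (g,) = torch.autograd.grad(e, z)
        z = (z - lr * g).detach().requires_grad_(True)
    return z.detach()
```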