Bootstrap your own latent - A new approach to self-supervised Learning

year: 2020
paper: bootstrap-your-own-latent-a-new-approach-to-self-supervised-learning
website: YK Youtube; understanding-self-supervised-and-contrastive-learning-with-bootstrap-your-own-latent-byol
code:
connections: self-supervised learning, representation learning, SimCLR

Self-supervised image representation learning without negative pairs.
Two augmented views $v, v^{'}$ of the same image are fed through two networks. The online network ( $θ$ ) is encoder → projector → predictor. The target network ( $ξ$ ) is encoder → projector only. The online network is trained to predict the target’s projection of the other view; the target’s parameters are a slow exponential moving average of the online’s, with stop-gradient on the target branch.

Slop below

Notation

$x$ … input image; $v = t (x), v^{'} = t^{'} (x)$ … two augmented views, with $t, t^{'}$ sampled from an augmentation distribution
$y_{θ} = f_{θ} (v) \in R^{2048}$ … representation (the only thing kept after training)
$z_{θ} = g_{θ} (y_{θ}) \in R^{256}$ … online projection
$z_{ξ}^{'} = g_{ξ} (f_{ξ} (v^{'})) \in R^{256}$ … target projection (stop-gradient applied, abbreviated sg)
$q_{θ} (z_{θ})$ … predictor’s guess of $z_{ξ}^{'}$
$\overline{\cdot}$ … L2-normalised vector

Methodology

The training signal is “the online net should predict, from view $v$ , what the target net produces from the other view $v^{'}$ ”. The loss measures how well that prediction lines up with the actual target projection:
$L_{θ, ξ} = ∥ \overline{q_{θ} (z_{θ})} - \overline{z_{ξ}^{'}} ∥_{2}^{2}$
i.e. squared Euclidean distance between prediction and target, after both have been L2-normalised. Equivalently $2 - 2 cos (q_{θ} (z_{θ}), z_{ξ}^{'})$ : zero when prediction and target point the same way, $4$ when opposite. So the loss only cares about the direction of the projection in feature space, not its magnitude. The L2-normalisation step is what enforces that: we strip the norms first, then measure distance, so two projections pointing the same way score zero loss regardless of how long they were before normalising.
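The identities in this paragraph are easy to check numerically. The following is a minimal sketch in plain Python; the function names and the two example vectors are made up for illustration and stand in for $q_{θ} (z_{θ})$ and $z_{ξ}^{'}$.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def byol_loss(prediction, target):
    """Squared Euclidean distance between the L2-normalised inputs."""
    p, t = l2_normalise(prediction), l2_normalise(target)
    return sum((a - b) ** 2 for a, b in zip(p, t))

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

q = [1.0, 2.0, -0.5]   # illustrative stand-in for q_theta(z_theta)
z = [0.3, 1.5, 0.2]    # illustrative stand-in for z'_xi

loss = byol_loss(q, z)
loss_rescaled = byol_loss([100 * x for x in q], z)  # same direction, 100x the norm
loss_opposite = byol_loss(q, [-x for x in q])       # exactly opposite directions
```

Here `loss` agrees with `2 - 2 * cosine(q, z)`, `loss_rescaled` equals `loss`, and `loss_opposite` comes out at 4 up to floating-point error.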

Why ignore magnitudes? In a learned representation, the direction of a feature vector is what carries semantic content. Two projections pointing the same way mean the same thing about the input, whether one has norm 1 and the other norm 100. Letting the loss penalise magnitude differences would just have the network spend capacity matching scales for no gain.

The loss above is one-directional (online sees $v$ , target sees $v^{'}$ ). BYOL symmetrises: it also pushes $v^{'}$ through the online branch and $v$ through the target, computes the same loss, and sums. Each view ends up acting as both the input and the prediction target, which doubles the training signal obtained from each image pair at the cost of one extra forward pass through each network. Gradients only flow through $θ$ . The target branch carries a stop-gradient and is never directly optimised.
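The symmetrisation can be sketched as follows. This is a toy illustration only: `online_predict` and `target_project` are hypothetical callables standing in for the full online branch (encoder, projector, predictor) and the target branch (encoder, projector), and plain Python has no autodiff, so the stop-gradient shows up only as a comment.

```python
import math

def normalised_sq_dist(u, v):
    """Squared distance between the L2-normalised versions of u and v."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum((a / norm_u - b / norm_v) ** 2 for a, b in zip(u, v))

def symmetrised_loss(online_predict, target_project, v, v_prime):
    """Sum of the two one-directional losses.

    online_predict: view -> prediction (gradients would flow through this branch)
    target_project: view -> projection (treated as a constant: stop-gradient)
    """
    forward = normalised_sq_dist(online_predict(v), target_project(v_prime))
    backward = normalised_sq_dist(online_predict(v_prime), target_project(v))
    return forward + backward

# Toy stand-ins: element-wise maps instead of real networks.
def online(view):
    return [2.0 * x + 0.1 for x in view]

def target(view):
    return [1.5 * x for x in view]

v, v_prime = [1.0, 0.5, -0.2], [0.9, 0.6, -0.1]
total = symmetrised_loss(online, target, v, v_prime)
```

Swapping `v` and `v_prime` leaves `total` unchanged, which is the point of summing both directions.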

The target’s parameters are an exponential moving average of the online’s:
$ξ \leftarrow τ ξ + (1 - τ) θ, τ \in [0, 1]$
with $τ$ cosine-scheduled from $0.996$ at the start of training up to $1$ at the end. Why a moving target instead of just copying $θ$ each step? You need the target to be two things at once: stable enough that the online net has a coherent signal to chase from one step to the next, and fresh enough that it keeps absorbing the online net’s improvements (otherwise you’re just predicting a fixed random initialisation, which only gets you to ~19% top-1). The EMA interpolates between those, and the schedule shifts the balance over time: early in training the online net is far from converged so we want the target to follow it relatively quickly (smaller $τ$ ); late in training the online net’s representations are close to good and we want the target to stop moving so the online net can settle (larger $τ$ , eventually frozen).
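A minimal sketch of the schedule and the update, assuming the cosine schedule takes the form $τ_{k} = 1 - (1 - τ_{base}) (cos (π k / K) + 1) / 2$ for step $k$ out of $K$ . This form matches the endpoints stated above. The function names and the two-element parameter lists are illustrative, not taken from any reference implementation.

```python
import math

def tau_schedule(step, total_steps, tau_base=0.996):
    """Cosine schedule rising from tau_base at step 0 to 1.0 at the final step."""
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0

def ema_update(target_params, online_params, tau):
    """xi <- tau * xi + (1 - tau) * theta, applied element-wise."""
    return [tau * xi + (1.0 - tau) * th for xi, th in zip(target_params, online_params)]

total_steps = 1000
taus = [tau_schedule(k, total_steps) for k in range(total_steps + 1)]

theta = [1.0, -2.0]  # hypothetical online parameters
xi = [0.0, 0.0]      # hypothetical target parameters

xi_early = ema_update(xi, theta, taus[0])   # tau = 0.996: target closes 0.4% of the gap
xi_late = ema_update(xi, theta, taus[-1])   # tau = 1: target stays where it is
xi_copy = ema_update(xi, theta, 0.0)        # tau = 0: target becomes a copy of theta
```

The two extremes are visible directly: `xi_copy` equals `theta` (no stability), while `xi_late` equals the old `xi` (no freshness).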

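The projector $g$ and the predictor $q$ are MLPs of the same form: Linear → BN → ReLU → Linear, with hidden width 4096 and output width 256. The projector maps 2048 → 4096 → 256; the predictor takes the 256-dim projection as input, so it maps 256 → 4096 → 256. The final 256-dim output is not batch-normalised, unlike in SimCLR (this seemingly small detail turns out to matter; see the BN callout below). Augmentations follow SimCLR’s set (random crop with resize, horizontal flip, colour jitter, greyscale conversion, Gaussian blur) plus solarisation.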

The reason the architecture is asymmetric (predictor on the online branch only, stop-gradient on the target) is that without those two choices, the trivial fixed point $z_{θ} = z_{ξ}^{'} = c$ for any constant $c$ would be a global minimum of the loss. Both networks could just learn to output the same constant vector for every input and the prediction error would be zero. The predictor breaks the symmetry between the two branches, and the stop-gradient prevents the target from gradient-descending into the constant solution alongside the online net. The next callout argues why this is enough to make the constant solution an unstable equilibrium rather than a stable one.

Why doesn't it collapse? (paper's hypothesis)

$ξ$ is not updated by gradient descent on $L$ . So there’s no joint loss being minimised over $(θ, ξ)$ , and no a priori reason the dynamics should converge to a minimum at all. Same shape as a GAN: alternating updates, no joint objective.

Assume the predictor is optimal, $q^{*} (z_{θ}) = E [z_{ξ}^{'} ∣ z_{θ}]$ . Then in expectation the online gradient becomes
$\nabla_{θ} E [\sum_{i} Var (z_{ξ, i}^{'} ∣ z_{θ})]$
i.e., $θ$ is pushed to make $z_{θ}$ maximally informative about $z_{ξ}^{'}$ . Conditioning on a constant gives the worst predictor ( $Var (z_{ξ}^{'} ∣ c) \geq Var (z_{ξ}^{'} ∣ z_{θ})$ for any $z_{θ}$ ), so collapse is an unstable equilibrium under this dynamic.
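
The step from the optimal predictor to the variance expression can be spelled out. This is a sketch that assumes squared error on the unnormalised projections and a predictor that attains the conditional expectation exactly:
$E [∥ q^{*} (z_{θ}) - z_{ξ}^{'} ∥_{2}^{2}] = E [\sum_{i} (E [z_{ξ, i}^{'} ∣ z_{θ}] - z_{ξ, i}^{'})^{2}] = E [\sum_{i} Var (z_{ξ, i}^{'} ∣ z_{θ})]$
The law of total variance then gives, for each coordinate $i$ :
$Var (z_{ξ, i}^{'}) = E [Var (z_{ξ, i}^{'} ∣ z_{θ})] + Var (E [z_{ξ, i}^{'} ∣ z_{θ}]) \geq E [Var (z_{ξ, i}^{'} ∣ z_{θ})]$
A constant $z_{θ}$ carries no information, so its conditional variance equals the unconditional $Var (z_{ξ, i}^{'})$ on the left, which is the largest value the objective can take. Read this way, the inequality in the text holds in expectation over $z_{θ}$ .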

The slow EMA’s role: keep the predictor near $q^{*}$ while $θ$ moves. A hard copy ( $τ = 0$ ) would break the optimal-predictor assumption, and empirically destroys training.

BatchNorm is doing implicit contrastive learning

From an imbue replication study: removing BN from the projector/predictor MLPs makes BYOL collapse to random performance. Replacing BN with LayerNorm (which doesn’t mix examples) also collapses. So it’s the cross-batch interaction in BN, not normalisation per se, that prevents collapse.

Mechanism: BN forces activations to have zero mean / unit variance across the mini-batch. A constant projection across the batch is exactly what BN subtracts away. Every sample is implicitly pushed away from the batch mean, which acts as a soft negative pulled from the rest of the batch. Contrast is happening, just routed through batch statistics rather than an explicit InfoNCE term.
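The "BN subtracts away a constant" step can be illustrated with a toy batch. This sketch uses a bare batch normalisation without learned scale and shift, in plain Python. The function name and the example batches are invented for illustration, and in BYOL itself BN sits in the hidden layer of the MLPs rather than on raw inputs like these.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalise each feature to zero mean / unit variance across the batch.

    batch: list of examples, each a list of features. No learned scale/shift.
    """
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]

# A collapsed batch: every example has the same activation vector.
collapsed = [[0.7, -1.2]] * 4
# A batch whose examples actually differ.
varied = [[0.7, -1.2], [0.9, -1.0], [0.1, 0.3], [1.5, -0.4]]

collapsed_out = batch_norm(collapsed)  # the shared constant is removed entirely
varied_out = batch_norm(varied)        # each example is expressed relative to the batch
```

After normalisation the collapsed batch is all (near-)zeros, so the constant carries nothing forward. In the varied batch each example's output depends on the other examples through the batch mean and variance, which is the cross-example coupling the paragraph describes.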

Caveat the BYOL authors raised in response: with LARS + weight decay (instead of plain SGD), BYOL can still learn without BN, though much worse and brittle to hyperparameters. Those tricks prevent collapse through different mechanisms (weight regularisation keeps the network from sliding into a degenerate constant solution). So BN isn’t the only anti-collapse mechanism, but it’s the most robust one in practice.

Robustness vs. SimCLR

No negatives → no dependence on a huge batch. BYOL is stable from batch 256 to 4096; SimCLR drops sharply below 4096.
Removing colour distortion: SimCLR loses about 22 points (it was solving the contrastive task via colour histograms alone), BYOL only about 9. With random crops as the only augmentation, the drops are about 27 and 13 points respectively. BYOL has to predict the target’s full projection, not just discriminate, so it can’t shortcut to a single statistic.
Any EMA decay $τ$ between 0.9 and 0.999 works. $τ = 0$ (instant copy) destroys training. $τ = 1$ (frozen random target) plateaus around 18.8% top-1, which is still well above the 1.4% an untrained encoder gets, so even predicting a fixed random target is a non-trivial pretext task.
