motivation / intuition

Learning the residual $F(x) = H(x) - x$ is easier than learning the full mapping $H(x)$.
Instead of learning the mapping $H(x)$ directly, we only learn the small difference / “residual” $F(x)$ and compute the output as $H(x) = F(x) + x$.

To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

  • Most of the time, the input will be close to the output (think resolution upscaling task).
  • The network does not need to worry about memorizing the input.
  • Vanishing gradients are mitigated, because the skip connection gives the gradient a direct path back to earlier layers.


Classic ResNet block, the spatial size is preserved.
Increasing the stride (and proportionally also the number of conv channels) reduces the spatial dimensions and is often done in classification networks.
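A minimal sketch of such a block in PyTorch (the name `ResidualBlock` is my own, not from any library); the 1×1 projection on the skip path is only needed when the stride or channel count changes:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Classic two-conv ResNet block: out = relu(F(x) + skip(x))."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # F(x): the residual branch
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # skip path: identity if the size is preserved, 1x1 conv if stride/channels change
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        residual = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(residual + self.skip(x))

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64, 64)(x).shape)              # size preserved:  (1, 64, 32, 32)
print(ResidualBlock(64, 128, stride=2)(x).shape)   # downsampled:     (1, 128, 16, 16)
```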

A ResNet is a discrete, less general case of a NODE: it is an Euler integrator, but there are much better integrators.

overview and intuition

A ResNet can be thought of as a discrete-time analog of an ODE: $h_{t+1} = h_t + f(h_t, \theta_t)$

You update the representation in discrete steps, layer by layer. A Neural ODE views this as the limit of infinitely many small steps in continuous time.
The regular NN has a discrete gradient field, whereas the gradient field of the ODE network is continuous. The black dots are data points.

Again, with a ResNet you only model the distance (residual) between consecutive hidden states:

$$h_{t+1} = h_t + f(h_t, \theta_t)$$

… this is a textbook Euler numerical integrator with step size 1 (however, Euler integration, especially with large time steps, is a poor numerical integration method). So the ResNet is doing Euler integration over a vector field $f$, and what the neural network is learning, $f(h, \theta)$, can also be thought of as a vector field, i.e. a differential equation.
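A toy sketch of this equivalence: stacking residual updates is exactly forward Euler integration of $dh/dt = f(h)$ with step size 1 (the small MLP `f` here is just a stand-in for whatever the blocks learn):

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 2))  # learned vector field

def resnet_forward(h, num_blocks=10):
    # each "block" is one residual update h_{t+1} = h_t + f(h_t)
    for _ in range(num_blocks):
        h = h + f(h)
    return h

def euler_integrate(h, t0=0.0, t1=10.0, steps=10):
    # the same update written as explicit Euler integration of dh/dt = f(h)
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h)
    return h

h0 = torch.randn(5, 2)
print(torch.allclose(resnet_forward(h0), euler_integrate(h0)))  # True: identical when dt = 1
```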

With a NODE, we model the differential equation itself, instead of just a one-timestep update (flow map) of the differential equation.

So it is a generalization of ResNets to continuous time, where you also use a fancier integrator and smaller or arbitrary time steps to model the vector field very accurately.
The mathematician’s solution to this would be:

$$h(t_1) = h(t_0) + \int_{t_0}^{t_1} f(h(t), t, \theta)\, dt$$

That’s hard to compute analytically, hence we approximate it numerically (and you can use any integrator for that).
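As a sketch of “any integrator”: the same learned vector field can be integrated with the classic 4th-order Runge–Kutta scheme instead of Euler (hand-rolled here purely for illustration; in practice you would use an ODE-solver library):

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 2))  # dh/dt = f(h)

def rk4_integrate(h, t0=0.0, t1=1.0, steps=20):
    """Approximate h(t1) = h(t0) + integral of f(h) dt using classic RK4 steps."""
    dt = (t1 - t0) / steps
    for _ in range(steps):
        k1 = f(h)
        k2 = f(h + 0.5 * dt * k1)
        k3 = f(h + 0.5 * dt * k2)
        k4 = f(h + dt * k3)
        h = h + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return h

h1 = rk4_integrate(torch.randn(5, 2))  # far more accurate than Euler for the same step count
```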


Classic dynamical system:

$$\frac{dx}{dt} = f(x(t), t)$$

With a NODE, $\frac{dh}{dt} = f(h(t), t, \theta)$, where $f$ is parametrized by a neural network.

Free parameters (illustrated in the sketch below):
  • network parameters $\theta$
  • initial condition $h(t_0)$
  • timestep(s) $t$ at which we evaluate the solution
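A hedged sketch of those three pieces, assuming the torchdiffeq package (presumably what the “Official pytorch code” link below refers to); `ODEFunc` is my own name, `odeint` is torchdiffeq’s solver interface:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # assumes the torchdiffeq package is installed

class ODEFunc(nn.Module):
    """f(h, t; theta): the right-hand side of dh/dt = f, parametrized by a small MLP."""
    def __init__(self, dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, h):           # torchdiffeq expects the signature f(t, h)
        return self.net(h)

func = ODEFunc()                        # free parameters: theta = func.parameters()
h0 = torch.randn(5, 2)                  # free parameter:  initial condition h(t_0)
t = torch.linspace(0.0, 1.0, 11)        # free parameter:  the times we integrate to
h_traj = odeint(func, h0, t)            # shape (11, 5, 2): h evaluated at every time in t
```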

Hidden state $h(t)$:
… data is sampled at discrete points in time (the sampling doesn’t need to be uniform in time), so between any two data / measurement points there is a continuous hidden state $h(t)$, and the flow map integrates along that hidden state from $t_0$ to $t_1$.
And we want information about this hidden state, which is not in the training data!
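For example (continuing the torchdiffeq sketch above, reusing `func` and `h0`): with measurements only at $t_0 = 0$ and $t_1 = 1$, we can still ask the solver for the hidden state on a dense grid in between:

```python
# continuing the sketch above (reuses func = ODEFunc() and h0)
import torch
from torchdiffeq import odeint

t_dense = torch.linspace(0.0, 1.0, 101)   # measurements exist only at t = 0 and t = 1
h_dense = odeint(func, h0, t_dense)        # continuous hidden state h(t) at 101 points in between
h_at_t1 = h_dense[-1]                      # the flow map applied from t_0 to t_1
```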

TODO

Steve skips over the details here (“Lagrange multipliers, adjoint equation”, “we’ve known how to do this for the past 20 years”, … TODO: Steve’s optimization bootcamp).

tl;dr: we can keep track of the adjoint variable $a(t) = \partial L / \partial h(t)$, which keeps track of how the loss depends on the hidden state, with AD; in the paper this is called reverse-mode differentiation of the ODE solver (the adjoint sensitivity method).
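In torchdiffeq terms this means swapping `odeint` for `odeint_adjoint`, which backpropagates by solving the adjoint ODE backwards in time instead of storing every intermediate solver step (again a toy sketch, reusing the `ODEFunc` class from above):

```python
# continuing the sketch above (reuses the ODEFunc class)
import torch
from torchdiffeq import odeint_adjoint     # backprop via the adjoint method, O(1) memory

func = ODEFunc()
h0 = torch.randn(5, 2, requires_grad=True)
t = torch.tensor([0.0, 1.0])

h1 = odeint_adjoint(func, h0, t)[-1]       # forward solve from t_0 to t_1
loss = h1.pow(2).sum()                     # toy loss on the final state
loss.backward()                            # solves the adjoint ODE backwards in time, yielding
                                           # dL/dtheta and dL/dh(t_0) without storing activations
```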



References

Official pytorch code
Residual Connection | PapersWithCode