year: 2022
paper: https://www.semanticscholar.org/reader/b618b2e463c1aedd7068bd8d064ff3c1d7c6dcc7
website:
code:
connections: continual learning, loss of plasticity, backpropagation


TLDR

CBP continually injects random features alongside SGD via a generate-and-test mechanism with two parts: a generator proposes new features, and a tester finds low-utility features and replaces them with the generator's proposals.
Note: Continual backprop in its current form does not tackle forgetting! The utility measure only considers the importance of units for current data. One could imagine a long-term utility that remembers usefulness.
Note: The utility in CBP is based on heuristics.

Is continual backprop ~= UPGD, just that they measure utility a little differently?

→ Indeed it is - UPGD is a better version - from the UPGD paper:

The generate-and-test method (Mahmood & Sutton 2013) is a method that finds better features using search, which, when combined with gradient descent (see Dohare et al. 2023a), is similar to a feature-wise variation of our method. However, this method only works with single-hidden-layer networks on single-output regression problems. It uses the weight magnitude as its measure of feature utility, which does not carry over to other objectives, such as classification (Elsayed 2022). On the contrary, our variation uses a better notion of utility that enables better search in the feature space and works with arbitrary network structures or objective functions, so that it can be seen as a generalization of the generate-and-test method.

Continual backprop

Continual backpropagation selectively reinitializes low-utility units in the network. Our utility measure, called the contribution utility, is defined for each connection or weight and each unit. The basic intuition behind the contribution utility is that the magnitude of the product of a unit's activation and its outgoing weight gives information about how valuable this connection is to its consumers. If the contribution of a hidden unit to its consumer is small, its contribution can be overwhelmed by contributions from other hidden units. In such a case, the hidden unit is not useful to its consumer. We define the contribution utility of a hidden unit as the sum of the utilities of all its outgoing connections. The contribution utility is measured as a running average of instantaneous contributions with a decay rate $\eta$, which is set to $0.99$ in all experiments. In a feed-forward neural network, the contribution utility, $\boldsymbol{u}_{l}[i]$, of the $i^{\text{th}}$ hidden unit in layer $l$ at time $t$ is updated as:

\boldsymbol{u}_{l}[i] = \eta \cdot \boldsymbol{u}_{l}[i] + (1 - \eta) \cdot |\boldsymbol{h}_{l,i,t}| \cdot \sum_{k=1}^{n_{l+1}} |\boldsymbol{w}_{l,i,k,t}|

… where $\boldsymbol{h}_{l,i,t}$ is the output of the $i^{\text{th}}$ hidden unit in layer $l$ at time $t$, $\boldsymbol{w}_{l,i,k,t}$ is the weight connecting the $i^{\text{th}}$ unit in layer $l$ to the $k^{\text{th}}$ unit in layer $l+1$ at time $t$, and $n_{l+1}$ is the number of units in layer $l+1$. When a unit is reinitialized, its input weights are sampled from the initial distribution and its outgoing weights are set to zero. However, initializing the outgoing weights to zero makes the new unit vulnerable to immediate reinitialization, as it has zero utility. To protect new units from immediate reinitialization, they are shielded from reinitialization for a maturity threshold of $m$ updates. We call a unit mature if its age exceeds $m$. At every step, a fraction $\rho$ of mature units, called the replacement rate, is reinitialized in every layer. The replacement rate is typically set to a very small value, meaning that only one unit is replaced after hundreds of updates.
Link to original