Goal: General purpose meta-learning, i.e. discovering learning algorithms that generalize very well.
The objective for human-engineered learning algorithms (LAs) is fixed, given by the algorithm.
For meta-learned algorithms, like MetaGenRL, it is parametrized by a neural network.
Meta-learning spectrum (more structure → less structure):
- Most structure: learned optimizers (cf. Adam, which adapts the gradient landscape during learning)
- Gradient-based optimization with learned initialization (MAML)
- Gradient-based optimization with learned objective functions (MetaGenRL, LPG)
- Black-box optimization with parameter sharing and symmetries (VSML, SymLA)
- Least structure: black-box (MetaRNN, RL², Transformers, GPICL)
Meta-learning RNNs have been proposed to learn the learning rules of a neural network from the reward or error signal, enabling meta-learners to solve problems outside of their original training domains. The goals are to let agents continually learn from their environments within a single lifetime episode, and to obtain much better data efficiency than conventional learning methods such as SGD. Meta-learned policies that adapt the weights of a neural network to its inputs at inference time have been proposed in fast weights, associative weights, hypernetworks, and Hebbian-learning approaches. Recent works (VSML, Meta-Learning Bidirectional Update Rules) combine ideas of self-organization with meta-learning RNNs, and have demonstrated that modular meta-learning RNN systems can not only learn to perform SGD-like learning rules, but can also discover more general learning rules that transfer to classification tasks on unseen datasets.
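A minimal sketch of the idea, assuming a PyTorch-style LSTM agent (names like `MetaRNNAgent` are illustrative, not from any of the cited papers): the previous action and reward are fed back as inputs, so within one lifetime all "learning" happens in the hidden state, while the LSTM weights play the role of the meta-learned learning algorithm.

```python
# Sketch of a meta-learning RNN (MetaRNN / RL^2 style), assuming PyTorch.
# The LSTM weights are the (slow) meta-variables; the hidden state carries the
# (fast) within-lifetime learning, driven by feeding back the previous reward.
import torch
import torch.nn as nn

class MetaRNNAgent(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Input = current observation + one-hot previous action + previous reward.
        self.core = nn.LSTMCell(obs_dim + n_actions + 1, hidden)
        self.policy = nn.Linear(hidden, n_actions)
        self.n_actions = n_actions

    def forward(self, obs, prev_action, prev_reward, state):
        prev_a = nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([obs, prev_a, prev_reward.unsqueeze(-1)], dim=-1)
        h, c = self.core(x, state)
        return torch.distributions.Categorical(logits=self.policy(h)), (h, c)

# One "lifetime": only the hidden state (not the weights) adapts to the task.
agent = MetaRNNAgent(obs_dim=4, n_actions=2)
state = (torch.zeros(1, 64), torch.zeros(1, 64))
obs, prev_action, prev_reward = torch.zeros(1, 4), torch.zeros(1, dtype=torch.long), torch.zeros(1)
for t in range(10):
    dist, state = agent(obs, prev_action, prev_reward, state)
    prev_action = dist.sample()
    # obs and prev_reward would come from the environment here.
```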
The variable ratio problem:
Meta RNNs are simple, but they have many more meta-variables (the RNN weights) than learned variables (the hidden state), leaving them overparametrized and prone to overfitting.
Learned learning rules / Fast Weight Networks have the opposite ratio (few meta-variables, many learned fast weights), but introduce a lot of complexity in the meta-learning network architecture etc.
→ A variable-sharing and sparsity principle can be used to unify these approaches into a simple framework: VSML.
… but transformers should improve on this, as they have a much larger effective-state-to-parameter ratio, as GPICL shows, right?
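A back-of-the-envelope comparison of that ratio (the numbers are illustrative assumptions, not taken from any of the papers above):

```python
# Illustration of the variable-ratio problem with made-up but plausible sizes.
def lstm_params(input_dim, hidden):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (hidden * input_dim + hidden * hidden + hidden)

hidden = 512
meta_vars = lstm_params(input_dim=64, hidden=hidden)   # weights = meta-variables
learned_vars = 2 * hidden                               # (h, c) state = learned variables
print(meta_vars / learned_vars)   # ≈ 1150x more meta- than learned variables

# A transformer doing in-context learning carries its "learned variables" in the
# KV cache, which grows with context length and can rival the parameter count.
layers, d_model, context = 12, 512, 8192
kv_cache = 2 * layers * context * d_model               # effective state
print(kv_cache)                                         # ≈ 100M activations
```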
General-purpose in-context learners need diverse task distributions! (duh)
If you meta-learn on enough tasks, suddenly you can generalize to all tasks.
Definition
Each task is defined by its dataset $\mathcal{D} = \{(x_i, y_i)\}$. The optimal model parameters are:
$$\theta^* = \arg\min_\theta \mathbb{E}_{\mathcal{D} \sim p(\mathcal{D})}\left[\mathcal{L}_\theta(\mathcal{D})\right]$$
Just like regular supervised learning, but instead of sampling examples from one dataset, we sample datasets (representing different tasks) from a distribution of datasets.
→ The goal is to achieve good performance on out of distribution tasks.
→ Kinda everything is meta-learning in that sense, on a continuous spectrum, but the bar is continually moving higher.
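A minimal sketch of the meta-training loop implied by this definition (function and argument names are assumptions for illustration): sample whole datasets (tasks) from a task distribution instead of examples from a single dataset, and minimize the expected loss across tasks.

```python
# Estimate E_{D ~ p(D)} [ L_theta(D) ] by averaging the loss over sampled tasks.
import random

def meta_train_step(task_distribution, model, loss_fn, n_tasks=8):
    total = 0.0
    for _ in range(n_tasks):
        dataset = random.choice(task_distribution)   # D ~ p(D)
        total += loss_fn(model, dataset)             # L_theta(D)
    return total / n_tasks                           # average over tasks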
Few-Shot Classification
Training a classifier
We train a classifier $f_\theta$ with parameters $\theta$ on a dataset $\mathcal{D}$ to output the probability of a data point belonging to the class $y$ given the feature vector $x$: $P_\theta(y \mid x)$.
The optimal parameters maximize the probability of the true labels:
$$\theta^* = \arg\max_\theta \mathbb{E}_{(x, y) \in \mathcal{D}}\left[P_\theta(y \mid x)\right]$$
When training with mini-batches $B \subset \mathcal{D}$:
$$\theta^* = \arg\max_\theta \mathbb{E}_{B \subset \mathcal{D}}\Big[\sum_{(x, y) \in B} \log P_\theta(y \mid x)\Big]$$
In few-shot classification, we sample two disjoint sets from $\mathcal{D}$ each epoch, restricted to a subset of labels (tasks) $L \subset \mathcal{L}$: a support set $S^L \subset \mathcal{D}$ and a training batch $B^L \subset \mathcal{D}$.
The support set is part of the model input, $P_\theta(y \mid x, S^L)$. The goal is that the model learns to generalize to other datasets by learning how to learn from the support set. How would such a network represent never-seen classes? It would need to output embeddings.
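A sketch of how such episodes could be sampled (N-way K-shot; the `data_by_class` structure and all names here are illustrative assumptions):

```python
# Sample one few-shot episode: a support set S^L and a query batch B^L,
# both restricted to a random subset L of the labels.
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, query_per_class=5):
    labels = random.sample(list(data_by_class), n_way)      # L: subset of labels
    support, query = [], []
    for new_label, label in enumerate(labels):
        examples = random.sample(data_by_class[label], k_shot + query_per_class)
        # Relabel to 0..n_way-1 so the model cannot memorize global class ids
        # and must instead learn from the support set.
        support += [(x, new_label) for x in examples[:k_shot]]
        query   += [(x, new_label) for x in examples[k_shot:]]
    return support, query   # (S^L, B^L)
```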
Viewing meta-learning as learner and meta-learner
The learner is trained to perform well on a given task.
The meta-learner $g_\phi$ learns how to update the learner's parameters $\theta$ via the support set $S$:
$$\theta' = g_\phi(\theta, S)$$
For a classification task, $\theta$ and $\phi$ are optimized to maximize the objective:
$$\mathbb{E}_{L \subset \mathcal{L}}\Big[\mathbb{E}_{S^L, B^L \subset \mathcal{D}}\Big[\sum_{(x, y) \in B^L} \log P_{g_\phi(\theta, S^L)}(y \mid x)\Big]\Big]$$
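A minimal sketch of this learner / meta-learner split, assuming PyTorch; here the meta-learner $g_\phi$ is just a learned step size, a stand-in for richer learned update rules:

```python
# theta' = g_phi(theta, S) on the support set, then evaluate on the batch B;
# the outer loss trains the meta-parameters phi (only phi is updated here).
import torch
import torch.nn as nn

learner = nn.Linear(8, 5)                       # f_theta
meta_lr = nn.Parameter(torch.tensor(0.1))       # phi: a single learned step size
meta_opt = torch.optim.Adam([meta_lr], lr=1e-3)

def g_phi(theta, grads):
    # A (very) simple learned update rule applied to the learner's weights.
    return [w - meta_lr * g for w, g in zip(theta, grads)]

def episode_loss(support, batch):
    xs, ys = support
    inner = nn.functional.cross_entropy(learner(xs), ys)
    grads = torch.autograd.grad(inner, list(learner.parameters()), create_graph=True)
    theta_prime = g_phi(list(learner.parameters()), grads)
    xq, yq = batch
    logits = nn.functional.linear(xq, theta_prime[0], theta_prime[1])
    return nn.functional.cross_entropy(logits, yq)   # outer loss

# Meta-update on one sampled task (random data as a placeholder):
support = (torch.randn(10, 8), torch.randint(0, 5, (10,)))
batch   = (torch.randn(10, 8), torch.randint(0, 5, (10,)))
loss = episode_loss(support, batch)
meta_opt.zero_grad(); loss.backward(); meta_opt.step()
```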
Common approaches to meta-learning – how to learn
- model-based meta-learning: $P_\theta(y \mid x, S)$ – the model conditions directly on the support set (e.g. via internal or external memory)
- metric-based meta-learning: $P_\theta(y \mid x, S) = \sum_{(x_i, y_i) \in S} k_\theta(x, x_i)\, y_i$ – metric learning, where $k_\theta$ is a similarity kernel
- optimization-based meta-learning: $P_{g_\phi(\theta, S)}(y \mid x)$ – gradient-descent updates of $\theta$ based on $S$
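For instance, the metric-based rule can be read directly as code (a sketch assuming PyTorch, with a cosine-softmax kernel standing in for $k_\theta$; the embedding network is a placeholder):

```python
# Metric-based prediction: weight the support labels by a learned similarity kernel.
import torch
import torch.nn.functional as F

def metric_predict(embed, x, support_x, support_y, n_classes):
    q = F.normalize(embed(x), dim=-1)                  # query embedding, (1, d)
    s = F.normalize(embed(support_x), dim=-1)          # support embeddings, (k, d)
    k_theta = F.softmax(q @ s.t(), dim=-1)             # similarity kernel k_theta(x, x_i)
    y_onehot = F.one_hot(support_y, n_classes).float() # (k, n_classes)
    return k_theta @ y_onehot                          # P_theta(y | x, S)

embed = torch.nn.Linear(16, 32)                        # stand-in embedding network
probs = metric_predict(embed, torch.randn(1, 16),
                       torch.randn(5, 16), torch.arange(5), n_classes=5)
```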
(Some) Meta-learning approaches touch all fundamental aspects of machine learning at once:
data : obtaining new tasks, …
model : data efficiency, …
loss : …
optimization : …
References
(2023) Louis Kirsch - Towards Automating ML Research with general-purpose meta-learners @ UCL DARK
(2021) Oriol Vinyals: Perspective and Frontiers of Meta-Learning
(2018) Lilian Weng – Meta-Learning: Learning to Learn Fast. https://lilianweng.github.io/posts/2018-11-30-meta-learning/