year: 2024/07
paper: https://arxiv.org/pdf/2407.04153
website:
code:
connections: Mixture of Experts, scaling laws, product key memory
Motivation
…increasing the number of experts is an effective way to improve performance without increasing the inference cost. However, their experiments showed that the efficiency gains provided by MoEs plateau after a certain model size is reached.
More recently, Krajewski et al. (2024) discovered that this plateau was caused by using a fixed number of training tokens. When the number of training tokens is compute-optimal, MoEs consistently outperform dense models in terms of FLOP efficiency. Moreover, they introduced granularity (the number of active experts) as a new scaling axis and empirically showed that using higher granularity improves performance. Extrapolating this fine-grained MoE scaling law suggests that continued improvement of model capacity will ultimately lead to a large model with high granularity, corresponding to an architecture of an immense number of tiny experts.

Beyond efficient scaling, another reason to have a vast number of experts is lifelong learning, where MoE has emerged as a promising approach. For instance, Chen et al. (2023) showed that, by simply adding new experts and regularizing them properly, MoE models can adapt to continuous data streams. Freezing old experts and updating only new ones prevents catastrophic forgetting and maintains plasticity by design. In lifelong learning settings, the data stream can be indefinitely long or never-ending (Mitchell et al., 2018), necessitating an expanding pool of experts.
Totally not inefficient, lol.
Results
This design is supported by the recent discovery of the fine-grained MoE scaling law. To overcome the computational overhead of routing to a large number of experts, we apply the product key technique to efficiently select a small subset of hidden neurons within a wide MLP layer. Empirical analysis using language modeling tasks demonstrates that given the same compute budget, PEER significantly outperforms dense transformers, coarse-grained MoEs and product key memory layers.
… get the top-$k$ out of $N$ experts by taking the dot product between their product keys and the query vector $q(x)$ computed from the input vector $x$.
… then take the softmax or sigmoid of those similarities, giving us the router scores (ofc we don’t recompute anything, just reuse the top-$k$ dot products).
… expert outputs are then aggregated via a weighted sum.
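A minimal sketch of this routing, in the naive form before the product-key trick below; tensor names (`keys`, `query`, `K`) and the random placeholders are mine, not the paper’s code:

```python
import torch
import torch.nn.functional as F

# Naive routing over N experts (no product keys yet): score all keys, keep top-K,
# softmax the kept scores, and take a weighted sum of the retrieved expert outputs.
N, d, K = 1024, 256, 16            # number of experts, key/query dim, experts per token

keys = torch.randn(N, d)           # one key per expert
x = torch.randn(d)                 # token representation entering the PEER layer
query = torch.randn(d, d) @ x      # stand-in for the learned query network q(x)

sims = keys @ query                            # q(x)·k_i for every expert, shape (N,)
top_vals, top_idx = torch.topk(sims, K)        # indices of the K best-matching experts
router_scores = F.softmax(top_vals, dim=-1)    # softmax only over the K retrieved scores

# Placeholder expert outputs e_i(x); in PEER each e_i is the singleton MLP defined below.
expert_outputs = torch.stack([torch.randn(d) for _ in top_idx])
y = (router_scores.unsqueeze(-1) * expert_outputs).sum(dim=0)   # weighted sum, shape (d,)
```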
Problem: Large number of experts $N$, each with a $d$-dim key; computing every dot product takes $O(Nd)$, very slow.
Solution:
We create two sub-key matrices $C, C'$ to replace the key matrix $K$, which have a cardinality of $\sqrt{N}$ and dimensionality $d/2$.
Every expert key is a concatenation $k_{i,j} = [c_i; c'_j]$, so $q(x)^\top k_{i,j} = q_1(x)^\top c_i + q_2(x)^\top c'_j$, where $q_1(x), q_2(x)$ are the first and second half of the query vector $q(x)$. So in this equation, we are doing the dot products with halved dimensionality ($d/2$) and only $\sqrt{N}$ times each.
Taking the top-$k$ against each sub-key set creates $k \times k$ candidate keys (i.e. containing all the combinations), which contain the $k$ most similar keys from $K$ to $q(x)$. We can then simply apply the top-$k$ operator again on this set of candidates, leaving us with a runtime complexity of $O((\sqrt{N} + k^2)d)$, as opposed to the original $O(Nd)$, which is much better for large $N$ - remember, $k$ is just the number of experts we choose, so way less.
The top-$k$ over the $k^2$ candidate sums could even be done faster with a priority queue, but that is not so GPU-friendly.
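To make the savings concrete: with $N = 1024^2 \approx 10^6$ experts, $d = 256$ and $k = 16$, the naive scan costs $Nd \approx 2.7 \times 10^8$ multiply-adds per token, while $(\sqrt{N} + k^2)d \approx 3.3 \times 10^5$, roughly three orders of magnitude fewer. A hedged sketch of the product-key retrieval itself, with shapes and names of my choosing (not the paper’s code):

```python
import torch

# Product-key top-k: score two sqrt(N)-sized sub-key sets with half-queries,
# combine the k best of each into k*k candidates, then top-k over the candidates.
n_sub, d, k = 1024, 256, 16            # sqrt(N) sub-keys per codebook, key dim, top-k
# N = n_sub**2 experts in total, each key k_{i,j} = [c_i ; c'_j]

C  = torch.randn(n_sub, d // 2)        # first-half sub-keys  c_i
C2 = torch.randn(n_sub, d // 2)        # second-half sub-keys c'_j
q  = torch.randn(d)                    # query q(x) from the query network
q1, q2 = q[: d // 2], q[d // 2:]       # split the query into two halves

# Top-k against each sub-key set: only 2 * sqrt(N) dot products of dim d/2.
s1, i1 = torch.topk(C  @ q1, k)
s2, i2 = torch.topk(C2 @ q2, k)

# k*k candidate scores: q·k_{i,j} = q1·c_i + q2·c'_j. The true top-k over all N experts
# is contained in this candidate set.
cand_scores = s1[:, None] + s2[None, :]            # shape (k, k)
top_vals, flat_idx = torch.topk(cand_scores.flatten(), k)

# Map each flat candidate index back to the expert index i * sqrt(N) + j.
row, col = flat_idx // k, flat_idx % k
expert_idx = i1[row] * n_sub + i2[col]             # indices into the N = n_sub**2 experts
```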
In this paper, “each expert is only a single neuron”:
$e_i(x) := \sigma(u_i^\top x)\, v_i$, where $u_i, v_i$ are vectors with the same dim as $x$, and $\sigma$ is ReLU / GELU.
???? Single neuron but the $u_i, v_i$ are vectors and there are two of them ???
Ig there is a single hidden neuron per expert: the whole input is squashed to one scalar activation $\sigma(u_i^\top x)$ → multiplied with the second vector $v_i$ to pull up the dimensionality again (prlly $v_i$ stores the information and the scalar activation is like a gate???)
That’s prlly it - matches the PEER-layer figure in the paper. The “singleton” is the squashing of the input to just the scalar activation.
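A tiny sketch to sanity-check that reading (names `u`, `v` and the dim are mine):

```python
import torch
import torch.nn.functional as F

# One "singleton MLP" expert: a single hidden neuron, down-project to a scalar,
# gate it with a nonlinearity, then project back up with a second vector.
d = 256
x = torch.randn(d)
u = torch.randn(d)                 # down-projection vector (same dim as x)
v = torch.randn(d)                 # up-projection vector   (same dim as x)

a = F.relu(torch.dot(u, x))        # scalar hidden activation: the "single neuron"
e_x = a * v                        # expert output e_i(x) = sigma(u_i . x) * v_i, shape (d,)
```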
They increase the expressiveness of the network not by adjusting the size of the experts, but by adding multiple query heads (similar to MHA / Product Key Memories), just running them in parallel - the results get summed:

$$f(x) = \sum_{h=1}^{H} \sum_{i \in \mathcal{T}_h} g_{h,i}(x)\, e_i(x),$$

where $\mathcal{T}_h$ is the top-$k$ set retrieved by head $h$ (each head has its own query network but shares the same pool of experts) and $g_{h,i}$ are the corresponding router scores.
One such “PEER” layer with $h$ heads (each retrieving $k$ experts) is equivalent to using one expert with $hk$ hidden neurons:

$$\sum_{i} g_i(x)\, e_i(x) = \sum_{i} g_i(x)\, \sigma(u_i^\top x)\, v_i = V^\top \big(g(x) \odot \sigma(U x)\big),$$

where $U, V$ stack the retrieved $u_i, v_i$ as rows and $g(x)$ holds the corresponding router scores.
→ PEER dynamically assembles an MLP with $hk$ neurons by aggregating singleton MLPs retrieved from a shared repository.
Allows sharing of hidden neurons among experts, enhancing knowledge transfer and parameter efficiency.
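Putting it together, a hedged sketch of one PEER forward pass for a single token: the retrieval is stubbed out with random indices here (the real layer would use the product-key top-$k$ above), and the pool size, dims and tensor names are illustrative, not the paper’s code:

```python
import torch
import torch.nn.functional as F

# A PEER layer with h heads and k experts per head behaves like an MLP with h*k hidden
# neurons assembled per token from a shared pool of singleton experts.
N, d, h, k = 65_536, 256, 8, 16    # expert pool size (the paper scales this to ~1M), dim, heads, top-k

U = torch.randn(N, d)              # all down-projection vectors u_i (shared expert pool)
V = torch.randn(N, d)              # all up-projection vectors  v_i (same pool, shared by heads)
x = torch.randn(d)

idx = torch.randint(0, N, (h, k))              # stand-in for per-head top-k expert indices
scores = F.softmax(torch.randn(h, k), dim=-1)  # stand-in for per-head router scores

U_sel = U[idx].reshape(h * k, d)   # rows of the assembled down-projection, (h*k, d)
V_sel = V[idx].reshape(h * k, d)   # rows of the assembled up-projection,   (h*k, d)
g = scores.reshape(h * k)          # per-neuron gates from the routers

hidden = F.relu(U_sel @ x)         # h*k scalar activations: the assembled hidden layer
y = (g * hidden) @ V_sel           # gated sum of the v_i's = output of the assembled MLP, (d,)
```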