If its input is permuted, a permutation equivariant function applies the same permutation to its output:
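$$f(\pi(x)) = \pi(f(x)),$$

where $\pi$ denotes a permutation of the element positions.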
Unlike with permutation invariance, the input and output must have the same dimensions.
MHA is permutation equivariant (without positional encodings)
One crucial characteristic of multi-head attention is that it is permutation-equivariant with respect to its inputs. This means that if we switch two input elements in the sequence, e.g. $x_1 \leftrightarrow x_2$ (neglecting the batch dimension for now), the output is exactly the same except that elements 1 and 2 are switched. Hence, multi-head attention effectively treats its input not as a sequence, but as a set of elements.
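To make this concrete, here is a minimal sketch of the property, assuming PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the layer discussed here: permuting the positions of the input sequence permutes the output in exactly the same way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dimensions; any values with embed_dim divisible by num_heads work.
embed_dim, num_heads, seq_len = 16, 4, 5
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
mha.eval()  # ensure dropout is off so the comparison is deterministic

x = torch.randn(1, seq_len, embed_dim)  # (batch, sequence, features)
perm = torch.randperm(seq_len)          # a random permutation of the positions
x_perm = x[:, perm]                     # permuted input sequence

with torch.no_grad():
    out, _ = mha(x, x, x)                      # self-attention on the original input
    out_perm, _ = mha(x_perm, x_perm, x_perm)  # self-attention on the permuted input

# Equivariance: f(pi(x)) == pi(f(x)), up to floating-point tolerance
print(torch.allclose(out_perm, out[:, perm], atol=1e-5))  # True
```

Adding positional encodings to `x` before the attention layer would break this check, which is exactly why the property only holds without them.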