The only difference to a “regular” transformer is what nodes attend over:
Transformer: score(i, j) = qᵢ · kⱼ for every pair of tokens (all tokens, fully connected)
Graph transformer: score(i, j) = qᵢ · kⱼ only for j ∈ N(i) (neighbors only, via adjacency mask)
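A minimal NumPy sketch of the two variants above — the only change is an adjacency mask applied before the softmax. Function names and the ring-graph example are illustrative, not any library's API:

```python
import numpy as np

def attention_weights(H, W_Q, W_K, adj_mask=None):
    """Scaled dot-product attention weights; adj_mask=None means full attention."""
    d = W_Q.shape[1]
    S = (H @ W_Q) @ (H @ W_K).T / np.sqrt(d)   # (n, n) pairwise scores
    if adj_mask is not None:
        S = np.where(adj_mask, S, -np.inf)      # block non-neighbor pairs
    S = S - S.max(axis=1, keepdims=True)        # numerically stable softmax
    A = np.exp(S)
    return A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d_in, d_k = 4, 8, 8
H = rng.normal(size=(n, d_in))                  # node/token features
W_Q = rng.normal(size=(d_in, d_k))
W_K = rng.normal(size=(d_in, d_k))

# Toy ring graph with self-loops: node i attends to {i-1, i, i+1}.
adj = np.eye(n, dtype=bool)
for i in range(n):
    adj[i, (i - 1) % n] = adj[i, (i + 1) % n] = True

A_full  = attention_weights(H, W_Q, W_K)        # regular transformer
A_graph = attention_weights(H, W_Q, W_K, adj)   # graph transformer
```

Masked entries receive exactly zero attention weight; everything else (projections, softmax, aggregation) is unchanged.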
Some graph transformers (e.g. Graphormer) skip the masking entirely and do full attention over all nodes, injecting structure through positional/structural encodings instead (Laplacian eigenvectors, random walk probabilities, graph distance biases).
These are literally transformers with extra input features.
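As a sketch of the "extra input features" route, Laplacian-eigenvector positional encodings can be computed and concatenated onto the node features roughly like this (the helper name and shapes are illustrative assumptions, not a specific library's API):

```python
import numpy as np

def laplacian_pe(adj, k):
    """First k non-trivial eigenvectors of the normalized graph Laplacian."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    w, v = np.linalg.eigh(L)        # eigenvalues in ascending order
    return v[:, 1:k + 1]            # skip the trivial constant eigenvector

# 6-cycle graph; the PE columns are its low-frequency "Fourier" modes.
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[i, (i - 1) % n] = 1.0

pe = laplacian_pe(adj, k=2)
node_feats = np.random.default_rng(0).normal(size=(n, 4))
H = np.concatenate([node_feats, pe], axis=1)   # structure enters via features only
```

`H` then feeds a completely standard, fully connected transformer: the graph is visible to the model only through those appended columns.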
Expressiveness of dot-product scoring
Dot-product attention computes e(h_i, h_j) = (W_Q h_i)ᵀ (W_K h_j), a bilinear map in h_i and h_j.
This limits the scoring function to pairwise multiplicative interactions between input features. In contrast, GATv2's MLP scorer is a universal approximator, making it strictly more expressive as a single-layer scoring function. The GATv2 paper proves that DPGAT (dot-product graph attention) is strictly weaker than GATv2.
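The two scoring functions side by side, as a minimal sketch (function names and shapes are illustrative; the GATv2 form follows the paper's aᵀ LeakyReLU(W [h_i ‖ h_j]) scorer):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def dp_score(h_i, h_j, W_Q, W_K):
    """Dot-product scorer: bilinear in h_i and h_j, multiplicative interactions only."""
    return (W_Q @ h_i) @ (W_K @ h_j)

def gatv2_score(h_i, h_j, W, a):
    """GATv2 scorer: a one-hidden-layer MLP over the concatenated pair."""
    return a @ leaky_relu(W @ np.concatenate([h_i, h_j]))

rng = np.random.default_rng(1)
d, hidden = 4, 8
h_i, h_j = rng.normal(size=d), rng.normal(size=d)
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W, a = rng.normal(size=(hidden, 2 * d)), rng.normal(size=hidden)

s_dp  = dp_score(h_i, h_j, W_Q, W_K)
s_gat = gatv2_score(h_i, h_j, W, a)
```

The nonlinearity between the weight layers is what breaks the bilinear form: with the LeakyReLU removed, `gatv2_score` collapses back to a linear function of the pair and loses the universal-approximation property.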
The post-attention MLP in a transformer block does not compensate: it transforms the already-aggregated output, after the attention weights have been decided. Deep stacking can compensate across layers (later layers can reshape representations so that subsequent dot-products route better), but it does not change the fundamental limitation of the scoring function within each layer.