The only difference from a “regular” transformer is which nodes attend to which:
Transformer: (all tokens, fully connected)
Graph transformer: (neighbors only, via adjacency mask)
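The contrast above can be shown in a minimal NumPy sketch (illustrative only, not the linked repo's implementation): the same scaled dot-product attention, with the adjacency matrix used as a boolean mask in the graph case.

```python
import numpy as np

def attention_weights(Q, K, mask=None):
    """Scaled dot-product attention weights.
    mask: boolean (n, n) array, True = attention allowed."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # masked pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 8
H = rng.standard_normal((n, d))
Q, K = H, H  # identity projections, for brevity

# Toy adjacency matrix (path graph with self-loops)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=bool)

full = attention_weights(Q, K)           # transformer: all pairs
graph = attention_weights(Q, K, mask=A)  # graph transformer: neighbors only

assert np.allclose(graph[0, 2:], 0.0)  # node 0 ignores non-neighbors
```

Everything else (projections, softmax, value aggregation) is unchanged; the mask is the entire difference.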

Some graph transformers (e.g. Graphormer) skip the masking entirely and do full attention over all nodes, injecting structure through positional/structural encodings instead (Laplacian eigenvectors, random walk probabilities, graph distance biases).
These are literally transformers with extra input features.
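As an example of one of those structural encodings, here is a sketch of Laplacian eigenvector positional encodings computed from the adjacency matrix (a common choice; the function name and the toy graph are mine, and this is not Graphormer's exact recipe, which uses distance and degree biases):

```python
import numpy as np

def laplacian_pe(A, k):
    """First k non-trivial eigenvectors of the symmetric normalized
    Laplacian L = I - D^{-1/2} A D^{-1/2}, used as per-node positional
    encodings that are concatenated to (or added into) node features."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]            # drop the trivial constant mode

# 4-cycle graph
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
pe = laplacian_pe(A, k=2)
```

The resulting `(n, k)` matrix gives each node a coordinate that reflects its position in the graph, which is exactly the kind of "extra input feature" that lets unmasked attention recover structure.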

Expressiveness of dot-product scoring

Dot-product attention computes e(h_i, h_j) = (W_Q h_i)^T (W_K h_j), a bilinear map in h_i and h_j.
This limits the scoring function to pairwise multiplicative interactions between input features.

In contrast: GATv2’s MLP scorer is a universal approximator, making it strictly more expressive as a single-layer scoring function. The GATv2 paper (Brody et al., “How Attentive are Graph Attention Networks?”) proves DPGAT (dot-product graph attention) is strictly weaker than GATv2.
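The structural difference between the two scorers is small but decisive: GAT-style scoring applies its nonlinearity after a single linear map, while GATv2 puts it between two linear maps, yielding a 1-hidden-layer MLP. A NumPy sketch (weights and graph are made up; this follows the form of the scorers, not any library's API):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
d, dh = 8, 8
W = rng.standard_normal((dh, 2 * d))  # shared linear map on [h_i || h_j]
a = rng.standard_normal(dh)           # attention vector

def gat_score(hi, hj):
    # GAT-style: LeakyReLU AFTER the linear map -> linear in (hi, hj),
    # then a monotone function. Every query ranks keys identically
    # ("static attention" in the GATv2 paper's terminology).
    return leaky_relu(a @ (W @ np.concatenate([hi, hj])))

def gatv2_score(hi, hj):
    # GATv2-style: nonlinearity BETWEEN the two linear maps -> a
    # 1-hidden-layer MLP scorer, whose key ranking can depend on the query.
    return a @ leaky_relu(W @ np.concatenate([hi, hj]))

# Demonstrate the static-attention property of the GAT-style scorer:
H = rng.standard_normal((5, d))
rankings = [np.argsort([gat_score(hi, hj) for hj in H]).tolist() for hi in H]
assert all(r == rankings[0] for r in rankings)  # one global key ranking
```

The assertion holds for any weights: gat_score is LeakyReLU(u·h_i + v·h_j), and a monotone function of v·h_j ranks keys the same way for every query. GATv2's scorer has no such constraint.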

The post-attention MLP in a transformer block does not compensate: it transforms the already-aggregated output, after the attention weights have been decided. Deep stacking can compensate across layers (reshaping representations so that later dot-products route better), but it does not change the fundamental limitation of the scoring function at each layer.

Link to original
https://github.com/lucidrains/graph-transformer-pytorch/blob/main/graph_transformer_pytorch/graph_transformer_pytorch.py