The only difference from a “regular” transformer is which nodes attend to which:
Transformer: (all tokens, fully connected)
Graph transformer: (neighbors only, via adjacency mask)
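The contrast above can be shown in a minimal NumPy sketch (illustrative only, not the linked repo's implementation): the same scaled dot-product attention, with the adjacency matrix used as a boolean mask in the graph case.

```python
import numpy as np

def attention_weights(Q, K, mask=None):
    """Scaled dot-product attention weights.
    mask: boolean (n, n) array, True = attention allowed."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # masked pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 8
H = rng.standard_normal((n, d))
Q, K = H, H  # identity projections, for brevity

# Toy adjacency matrix (path graph with self-loops)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=bool)

full = attention_weights(Q, K)           # transformer: all pairs
graph = attention_weights(Q, K, mask=A)  # graph transformer: neighbors only

assert np.allclose(graph[0, 2:], 0.0)  # node 0 ignores non-neighbors
```

Everything else (projections, softmax, value aggregation) is unchanged; the mask is the entire difference.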

Some graph transformers (e.g. Graphormer) skip the masking entirely and do full attention over all nodes, injecting structure through positional/structural encodings instead (Laplacian eigenvectors, random walk probabilities, graph distance biases).
These are literally transformers with extra input features.
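As an example of one of those structural encodings, here is a sketch of Laplacian eigenvector positional encodings computed from the adjacency matrix (a common choice; the function name and the toy graph are mine, and this is not Graphormer's exact recipe, which uses distance and degree biases):

```python
import numpy as np

def laplacian_pe(A, k):
    """First k non-trivial eigenvectors of the symmetric normalized
    Laplacian L = I - D^{-1/2} A D^{-1/2}, used as per-node positional
    encodings that are concatenated to (or added into) node features."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]            # drop the trivial constant mode

# 4-cycle graph
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
pe = laplacian_pe(A, k=2)
```

The resulting `(n, k)` matrix gives each node a coordinate that reflects its position in the graph, which is exactly the kind of "extra input feature" that lets unmasked attention recover structure.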

Expressiveness of dot-product scoring

Dot-product attention computes e(h_i, h_j) = (W_Q h_i)^T (W_K h_j), a bilinear map in h_i and h_j.
This limits the scoring function to pairwise multiplicative interactions between input features.

In contrast: GATv2’s MLP scorer is a universal approximator, making it strictly more expressive as a single-layer scoring function. The GATv2 paper (Brody et al., “How Attentive are Graph Attention Networks?”) proves DPGAT (dot-product graph attention) is strictly weaker than GATv2.
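The structural difference between the two scorers is small but decisive: GAT-style scoring applies its nonlinearity after a single linear map, while GATv2 puts it between two linear maps, yielding a 1-hidden-layer MLP. A NumPy sketch (weights and graph are made up; this follows the form of the scorers, not any library's API):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
d, dh = 8, 8
W = rng.standard_normal((dh, 2 * d))  # shared linear map on [h_i || h_j]
a = rng.standard_normal(dh)           # attention vector

def gat_score(hi, hj):
    # GAT-style: LeakyReLU AFTER the linear map -> linear in (hi, hj),
    # then a monotone function. Every query ranks keys identically
    # ("static attention" in the GATv2 paper's terminology).
    return leaky_relu(a @ (W @ np.concatenate([hi, hj])))

def gatv2_score(hi, hj):
    # GATv2-style: nonlinearity BETWEEN the two linear maps -> a
    # 1-hidden-layer MLP scorer, whose key ranking can depend on the query.
    return a @ leaky_relu(W @ np.concatenate([hi, hj]))

# Demonstrate the static-attention property of the GAT-style scorer:
H = rng.standard_normal((5, d))
rankings = [np.argsort([gat_score(hi, hj) for hj in H]).tolist() for hi in H]
assert all(r == rankings[0] for r in rankings)  # one global key ranking
```

The assertion holds for any weights: gat_score is LeakyReLU(u·h_i + v·h_j), and a monotone function of v·h_j ranks keys the same way for every query. GATv2's scorer has no such constraint.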

The post-attention MLP in a transformer block does not compensate: it transforms the already-aggregated output, after the attention weights have been decided. Deep stacking can compensate across layers (reshaping representations so that later dot-products route better), but it does not change the fundamental limitation of the scoring function at each layer.

Link to original
https://github.com/lucidrains/graph-transformer-pytorch/blob/main/graph_transformer_pytorch/graph_transformer_pytorch.py