1-WL Expressiveness Is (Almost) All You Need

Markus Zopf
NEC Laboratories Europe GmbH
Heidelberg, Germany
markus.zopf@neclab.eu

Abstract—It has been shown that a message passing neural networks (MPNNs), a popular family of neural networks for graph-structured data, are at most as expressive as the first-order Weisfeiler-Leman (1-WL) graph isomorphism test, which has motivated the development of more expressive architectures. In this work, we analyze if the limited expressiveness is actually a limiting factor for MPNNs and other WL-based models in standard graph datasets. Interestingly, we find that the expressiveness of WL is sufficient to identify almost all graphs in most datasets. Moreover, we find that the classification accuracy upper bounds are often close to 100%. Furthermore, we find that simple WL-based neural networks and several MPNNs can be fitted to several datasets. In sum, we conclude that the performance of WL/MPNNs is not limited by their expressiveness in practice.

I. Introduction

Graph-structured data can be found in a wide range of domains such as chemistry, biology, physics, and computer science, where graphs are used, for instance, to model social networks, scientific citation graphs, protein structures, syntax trees, and the world wide web. Consequently, developing and understanding machine learning methods for graph-structured data has gained a lot of attention from the machine learning community [1]prior work. Many neural networks for graph-based datasets belong to the family of Message Passing Neural Networks (MPNNs) [2]. Examples of popular MPNNs are the GraphSAGE model [3], Graph Convolutional Networks (GCNs) [4], Graph Attention Networks (GATs) [5], and Graph Isomorphism Networks (GINs) [6].

Recently, it has been shown that the Weisfeiler-Leman (WL) graph isomorphism test [7] provides an upper bound on the expressiveness of MPNNs [6], [8], which means that MPNNs cannot distinguish graphs that cannot be distinguished by the WL graph isomorphism test even if they are not isomorphic. One example of two non-isomorphic graphs that cannot be distinguished by WL, and hence, MPNNs can be found in Figure 1. Non-isomorphic indistinguishable graphs can be problematic for machine learning models since it is not possible for deterministic models to generate different predictions for indistinguishable graphs.

Two non-isomorphic graphs and their identical color histograms Fig. 1. Example of two non-isomorphic graphs (top) that cannot be distinguished by the 1-WL graph isomorphism test since they have the same color/label histogram (bottom).

The limited expressiveness of WL and MPNNs has increased the interested in more expressive GNNs. For insistence, models that explicitly encode more expressive features into the graph representation [9], higher-order networks [8]–[10], approaches based on sub-graph sampling [11]–[15], and non-anonymous graphs [16]–[20] have been developed.

However, a crucial question that remains open is whether the theoretically limited expressiveness of WL/MPNNs is actually a limiting factor for the generalization performance in real-world datasets [21] or if synthetic counter-examples as illustrated in Figure 1 are rather rare cases and not relevant in practice. In this work, we address this question and find that the limited expressiveness is actually not a limiting factor in practice. More specifically, we find that a large fraction of graphs can be distinguished in most datasets by the WL test and that the theoretical upper bound performance is close to or even reaches a classification accuracy of 100%. Moreover, we show that a simple WL-based machine learning model and several MPNNs can be fitted to the datasets, meaning that their limited expressiveness is not relevant in practice.

In sum, we conclude that WL and MPNNs are sufficiently expressive for many real-world datasets and that the development of new GNNs that are at most as expressive as the WL test should not be abandoned because of their theoretically limited expressiveness. Our findings align with the experimental results in [22]prior work, where the authors find that MPNNs often outperform theoretically more expressive GNNs. It is noteworthy to mention that a sufficiently expressive model is no guarantee for a good generalization performance. Hence, it remains an open research questions which network architectures and training procedures will lead to a strong generalization performance for graph-structured data.

Diagram of the Weisfeiler-Leman graph color refinement algorithm showing nodes being re-labeled over iterations k=0, 1, and 2, resulting in a final color histogram. Fig. 2. Illustration of the Weisfeiler-Leman graph color refinement algorithm. Initially ( $k = 0$ k equals zero), all nodes are labeled with the same color/label. In each iteration, the nodes are re-labeled according to an injective node labeling function based on the neighborhood nodes. For instance, the green and blue nodes obtained different colors in iteration $k = 1$ k equals one since they have different neighborhoods (2 orange nodes vs. 3 orange nodes, respectively) in iteration $k = 0$ k equals zero. After the maximum number of re-labeling iterations, the algorithm stops and a histogram based on the node color distributions in all iterations can be generated.

II. NOTATION

An undirected graph is a pair $G = (V_{G}, E_{G})$ G equals V G, E G, where $V_{G}$ V G is a finite set of vertices (also called nodes), and $E_{G} \subseteq {{u, v} : u, v \in V_{G}, u \neq = v}$ E G is a subset of the set of all unordered pairs u, v of distinct vertices is a symmetric, irreflexive, binary relation on $V_{G}$ V G. The elements in $E_{G}$ E G are called edges.

A colored undirected graph is a pair $(G, λ)$ G, lambda of an undirected graph $G$ G and a coloring function $λ : V_{G} \to C$ lambda from V G to C, where $C$ C is a set of colors (also called labels). Without loss of generality, we often use $C = N_{0}$ C equals the set of natural numbers including zero. The coloring function $λ$ lambda induces a partition $π (λ)$ pi of lambda of the graph nodes $V_{G}$ V G. For a color $c \in C$ c in C, $π^{- 1} (c) \subseteq V_{G}$ the inverse image of c is a subset of V G denotes the color class of $c$ c, i.e. the set of nodes that are colored with $c$ c.

$N (v) = {u : {u, v} \in V}$ N of v, the set of neighbors u of v denotes the neighborhood of $v$ v, $∣ \cdot ∣$ the absolute value symbol denotes the size of a set, and ${{\cdot}}$ double braces denotes a multiset, i.e. a set that may contain the same element multiple times.

III. THE WEISFEILER-LEMAN GRAPH COLORING ALGORITHM AND GRAPH ISOMORPHISM TEST

In the following, we first discuss the Weisfeiler-Leman graph coloring algorithm and then show how it can be used to implement a simple yet effective graph isomorphism test [7].

A. The Weisfeiler-Leman graph coloring algorithm

The first-order Weisfeiler-Leman graph coloring algorithm (1-WL, or WL for short) is an approach to generate a canonical representation of a graph as illustrated in Figure 2. For an undirected graph $G$ G, the algorithm performs the following steps. First, all nodes in $G$ G are assigned to the same initial color $c$ c. Without loss of generality, we use $0$ zero to indicate the first color and denote the initial coloring function by $λ^{(0)} (v)$ lambda zero of v and set $λ^{(0)} (v) = 0, \forall v \in G_{V}$ lambda zero of v equals zero for all v in G V. For each iteration $k$ k, a coloring is defined according to

λ^{(k)} (v) = hash (λ^{(k - 1)} (v), {{λ^{(k - 1)} (u), u \in N (v)}}), (1)

lambda k of v equals the hash of v’s previous color and the multiset of the previous colors of its neighbors

where hash is a function that maps $v$ ’s previous color and the multiset of the previous colors of $v$ ’s neighbors to a color in $C$ C. As illustrated in Figure 2, the coloring algorithm refines the node partition in each iteration (i.e. each color class can either be divided into multiple new color classes or stays constant). The coloring $π (λ^{(k)})$ pi of lambda k is called stable coloring if $π (λ^{(k)}) = π (λ^{(k + 1)})$ pi of lambda k equals pi of lambda k plus one. In Figure 2, a stable coloring has been generated for $k = 2$ k equals two. The algorithm terminates as soon as a stable coloring has been generated or a maximum number of $k$ k iterations has been executed. While a naive implementation of the color refinement has quadratic running time of $O (mn)$ big O of m times n, there are more efficient refinement strategies that run in $O (m lo g n)$ big O of m log n, where $n = ∣ E_{G} ∣$ n equals the number of edges and $m = ∣ V_{G} ∣$ m equals the number of vertices [23]. Instead of using $λ^{(0)} (v) = 0, \forall v \in G_{V}$ initializing with zero for all nodes as initialization, predefined node labels can also be used as initialization if a dataset already provides labels for nodes (see Section VI for details).

Based on the obtained node colors, for each iteration $k$ k, a histogram of the node colors can be obtained according to

h_{G}^{(k)} = {{λ^{(k)} (v), v \in G_{V}}} . (2)

h k of G is the multiset of colors lambda k of v for all vertices v

The histogram $h_{G}^{(k)}$ h k of G can then be used as canonical graph representation. In practice, $h_{G}^{(k)}$ h k of G can be converted into a $d$ -dimensional vector $v_{G}^{(k)}$ v k of G where $d$ d equals the number of used colors. The entry in dimension $j$ j is set according to

v_{G, j}^{(k)} = ∣ {{v : λ^{(k)} (v) = j, v \in G_{V}}} ∣ (3)

v k of G in dimension j is the count of nodes with color j

For a graph $G$ G, we denote the representation obtained by the first order WL coloring in iteration $k$ k as $1 -WL_{G}^{(k)}$ 1 WL k of G, or $WL^{(k)}$ WL k for short. Consequently, the coloring in iteration $k = 2$ k equals two for the graph in Figure 2 can be converted into the 5-dimensional vector $v_{G}^{(2)} = (0, 0, 2, 1, 2)^{T}$ v 2 of G equals the vector zero, zero, two, one, two.

B. The Weisfeiler-Leman graph isomorphism test

The first-order Weisfeiler-Leman graph isomorphism test uses the previously defined WL-based canonical graph representation to test if two graphs are isomorphic. Two undirected graphs $G = (V_{G}, E_{G})$ G and $H = (V_{H}, E_{H})$ H are said to be isomorphic if there is a bijection $φ : E_{G} \to E_{H}$ phi from E G to E H such that

{v, u} \in E_{G} ⟺ {φ (v), φ (u)} \in E_{H} \forall v, u \in V_{G} (4)

the pair v, u is an edge in G if and only if the pair phi of v, phi of u is an edge in H for all nodes

Two colored undirected graphs $(G, λ_{G})$ G with coloring lambda G and $(H, λ_{H})$ H with coloring lambda H are isomorphic if $G$ G and $H$ H are isomorphic and $λ_{G} (v) = λ_{H} (φ (v)) \forall v \in V_{H}$ the colors of corresponding nodes are identical.

The Weisfeiler-Leman graph isomorphism test iteratively computes graph representations for two graphs $G$ G and $H$ H and iteration $k \in {1, \dots, K}$ k from one to K according to the WL graph coloring

algorithm. If the graph representations $h_{G}^{(k)}$ h G k and $h_{H}^{(k)}$ h H k differ in iteration $k$ k, the algorithm terminates. In this case, it is guaranteed that $G$ G and $H$ H are not isomorphic. The algorithm also terminates if a stable coloring has been generated for both graphs (i.e. the coloring does not change anymore) or the number of iteration reaches a predefined maximum number of iterations $k$ k. In these cases, it is not guaranteed that the graphs are not isomorphic, since the WL algorithm is known to not be able to distinguish all non-isomorphic graphs. In this case, the graphs are said to be indistinguishable by the WL test and we write $G \approx H$ G is approximately equal to H to express that $G$ and $H$ cannot be distinguished by the WL test. Given a set of graphs $G$ G, a graph $G \in G$ G in G is said to be identifiable by WL if there is no other graph $H \in G, H \neq = G$ H in G, H not equal to G that is not distinguishable from $G$ by the WL test.

IV. EXPRESSIVENESS OF WL AND MPNNS

Several machine learning method that are based on WL-representations have already been proposed [24]. When WL-representations are used as a basis for machine learning models, the fact WL cannot distinguish all non-isomorphic graphs can become problematic since it can lower the best achievable accuracy of the model.

In general, the expressiveness of a machine learning model (also called model capacity) is a crucial property that describes how well a model can fit a given dataset, for instance expressed in terms of the Vapnik–Chervonenkis (VC) dimension [25]. If a model is not sufficiently expressive for a specific dataset, it is impossible to properly fit it to the dataset. A prominent example of insufficient expressiveness are linear classification models that are trained on data that requires a non-linear decision boundary.

The expressiveness of a model needs to be distinguished from the generalization abilities of a model, which describes the ability of a model to generalize to unseen data instances and is usually estimated by applying a trained model on an unseen test dataset. While a sufficient expressiveness is necessary to achieve a good generalization performance (i.e. a good performance on an unseen test set), it is not sufficient to achieve a good generalization performance. Hence, a strong expressiveness does not guarantee a good generalization performance. In fact, it is easy in many cases to create sufficiently expressive models that do not generalize well.

As already mentioned, the WL test is not able to distinguish all non-isomorphic graphs. In the context of WL-based graph representations, this means that non-isomorphic graphs may be represented with the same color histogram / vector representation. The graphs illustrated in Figure 1 are an example for this situation. In general, it has been shown that WL cannot distinguish non-isomorphic graphs is several scenarios [26], [27].

However, there are also some positive results about the expressiveness of WL. For instance, it has been shown that WL can distinguish all non-isomorphic trees [26]. Moreover, it has been shown that WL is powerful enough to distinguish almost all pairs of randomly generated non-isomorphic graphs except for rare counterexamples. This means that the fraction of graphs that cannot be identified in a set of randomly generated graphs with $n$ n nodes tends to $0$ zero when $n \to \infty$ n goes to infinity [28]. Even stronger expressiveness has been shown for higher-dimensional WL algorithms. For instance, it has been shown that the 2-WL test identifies almost all $d$ -regular graphs for every degree $d$ d-regular graphs for every degree d [29].

The limited expressiveness of the WL test and the graph representations generated by the WL graph coloring algorithm are not only relevant for graph isomorphism testing but also when the WL graph representations are used in machine learning models [24], for instance in kernel machines [30], since their generalization performance can be limited by the expressiveness of the WL representations. More specifically, (deterministic) machine learning models that use WL-based representations to represent graphs for graph classification cannot generate different predictions for graphs that cannot be distinguished by the WL test. More formally, for two (potentially non-isomorphic) graphs $G$ and $H$ with $G \approx H$ G and H with G approximately equal to H (i.e. $h_{G} = h_{H}$ h G equals h H), it is not possible to learn a function $f$ f such that $f (h_{G}) \neq = f (h_{H})$ f of h G does not equal f of h H, which is problematic when non-isomorphic graphs have the same 1-WL representation but a different class label. For instance, if graphs with a triangular pattern are labeled positive and graphs without a triangular pattern are labeled negative, no WL-based machine learning function can obtain a better prediction accuracy than 0.5 for the two graphs in Figure 1, since both graphs are either labeled positive or negative, but it is impossible to label one graph positive and the other graph negative. Hence, we obtain an upper bound of 0.5 for the prediction accuracy.

Recently, it has been shown that the limited expressiveness of 1-WL also limits the expressiveness of a popular family of graph neural networks (GNNs). More specifically, it has been shown that GNNs that are based on the message passing approach (i.e. MPNNs [2])namely message passing neural networks are at most as expressive as the 1-WL test [6], [8]. Key to this insight is the fact that the message passing algorithm can be interpreted as a continuous version of the 1-WL graph coloring algorithm. Among the MPNNs with limited expressiveness are popular GNNs such as GraphSAGE, GCN, GIN, and GAT. Among these networks, the GIN architecture is specifically interesting since it has been shown to be exactly as expressive as 1-WL, while other MPNNs are potentially even less expressive [6]. Moreover, [31]prior work showed that several graph properties such as shortest/longest cycle, diameter, and certain motifs cannot be computed by GNNs that rely entirely on local information. They also provide the first data dependent generalization bounds for MPNNs. [32]Other work studies limitations of MPNNs to count graph substructures. [33]Further studies draw a connection between graph isomorphism testing and the approximation of permutation-invariant functions.

V. TOWARDS MORE EXPRESSIVE ARCHITECTURES

The limited expressiveness of the WL isomorphism test and GNNs based on the message passing algorithm has motivated the development of new approaches for machine learning on graph-structured data that are more expressive, i.e. that do

not suffer from the limited expressiveness of the WL test. In the following, we describe several works and categorize them into four different groups, namely methods that use more expressive features, GNNs that use higher-order structures, graph sub-sampling based approaches, and non-anonymous networks.

The first set of approaches to extend the expressiveness of MPNNs beyond WL expressiveness explicitly encode more expressive features into the graph representation. For instance, a motif [34] detector can be used to analyze input graphs and to add information about the presence or absence of specific motifs or the number of specific motifs to the graph representation. A simple way to achieve this is to combine the canonical representation obtained by the WL coloring algorithm and the additionally obtained motif features. An example of such a combination of different features can be found in [35]. The obtained representation is more expressive than the initial WL representation and can, for instance, distinguish the two graphs in Figure 1. [9]Prior work propose an extension of this idea by learning a weighted sum of motif features up to a specific size.

The second way to improve the 1-WL test is to use higher-order structures such as pairs of nodes, or even larger groups. The resulting architectures are called higher-order networks. [10]Researchers show that k-order graph neural networks are as expressive as the k-WL test. For instance, a 3-order GNNs is as expressive as the 3-WL test and, hence, more expressive than 1-WL [36]. However, higher-order GNNs come with a higher computational cost since they process higher order tensors and the number of different higher-order structures can grow quickly when larger structures are considered. Moreover, [10] present an extended 2-order GNN (which is computationally easier to handle) that is as expressive as 3-WL. [8] propose $k$ -GNNsk G N N s which are based on $k$ -WLk W L and take higher-order structures into account.

Another direction to obtain more expressive GNNs is to sample sub-graph structures, i.e. by removing edges and/or nodes from the original input graph. For instance, Equivariant Subgraph Aggregation Networks (ESAN) [11] improve the expressiveness of MPNNs by modeling each graph as a set of subgraphs. [12] propose Graph Substructure Networks (GSNs) that use additional substructure-based features during the message passing. DropGNNs [13] can also be grouped into the sub-graph structure methods since they sample sub-graphs by randomly dropping some nodes. Message Passing Simplicial Networks (MPSNs) perform message passing over simplicial complexes (SCs), a specific class of sub-graph structures. CW Networks [14] extend MPSNs [15] as they use the broader class of closure-finite weak (CW) complexes.

Another direction to extend the expressiveness of GNNs is to convert graphs into non-anonymous networks, i.e. graphs in which each node has an attribute that uniquely identifies it. Examples of such attributes are unique node ids or (unique) random features. [16] shows that GNNs are Turing universal under sufficient conditions on their architecture (depth and width), node attributes (i.e. (partial) non-anonymity), and layer expressiveness, which means that they can solve the graph isomorphism problem. [17] introduced the communication capacity measure that measures how much information the nodes of a network can exchange during the forward pass. Moreover, it introduces communication complexity, which addresses the question how much information needs to be exchanged, e.g. to distinguish (connected) graphs and trees. Moreover, it has been shown that GNNs with random node initialization (RNI) are universal [18], meaning that they can approximate any function on graphs of any fixed order. Consequently, GNNs with RNI are more expressive than WL [19].

TABLE I
OVERVIEW OF POPULAR GRAPH DATASETS

Dataset	Size	Node labels	Classes	Unique
DD	1,178	89	2	1.00
ENZYMES	600	3	6	0.98
MUTAG	188	7	2	0.87
NCI1	4,110	37	2	0.97
NCI109	4,127	38	2	0.97
PROTEINS	1,113	3	2	0.94
COLLAB	5,000	0	3	0.78
IMDB-B	1,000	0	2	0.42
IMDB-M	1,500	0	3	0.19
REDDIT-B	2,000	0	2	1.00
REDDIT-M-5K	4,999	0	5	1.00

VI. Analysis of WL-Expressiveness in Real-World Datasets

In the following, we analyze if and to which extend the limited expressiveness of the first-order Weisfeiler-Leman graph isomorphism test (i.e. 1-WL), and hence, the limited expressiveness of message passing neural networks is a limiting factor in popular graph datasets that have been used by prior works that proposed more expressive GNNs [13], [15]. To this end, we compute the fraction of graphs that can be identified by the 1-WL test with different number of iterations $k$ k (i.e. $1 - W L^{(k)}$ one W L k). Moreover, we compute an upper bound for the classification accuracy that can be achieved with machine learning models that are based on 1-WL representations and message passing neural networks.

Table I provides an overview of the datasets that are used in this study. We report the size of the datasets in terms of number of graphs in the dataset. Moreover, we report the number of different node labels in case that the nodes in the graph have been initially labeled. Initial node labels are relevant in the study since they can be used as initialization when generating graph representations with WL. We also report how large the fraction of unique graphs is, i.e. for how many graphs no other isomorphic graph is present in the dataset. The fraction of unique graphs is important since it is a dataset-specific limitation for the fraction of identifiable graphs. Interestingly, we found that only 19% and 42% of the graphs in IMDB-M and IMDB-N are unique, respectively.

A. Fraction of WL-indentifiable graphs

In the following, we evaluate how severe the limited expressiveness of the 1-WL representations, and thus also

Dataset	k=1	k=2	k=3
DD	100.00	100.00	100.00
ENZYMES	100.00	100.00	100.00
MUTAG	32.32	92.68	96.34
NCI1	94.18	99.47	100.00
NCI109	94.91	99.40	100.00
PROTEINS	100.00	100.00	100.00
COLLAB	100.00	100.00	100.00
IMDB-B	100.00	100.00	100.00
IMDB-M	100.00	100.00	100.00
REDDIT-B	100.00	100.00	100.00
REDDIT-M-5K	100.00	100.00	100.00

**TABLE II** FRACTION OF IDENTIFIABLE GRAPHS FOR $k$ WL ITERATIONS

Dataset	k=1	k=2	k=3
DD	100.00	100.00	100.00
ENZYMES	100.00	100.00	100.00
MUTAG	95.74	99.47	100.00
NCI1	99.54	99.81	99.83
NCI109	99.73	99.85	99.88
PROTEINS	99.73	99.73	99.73
COLLAB	97.98	97.98	97.98
IMDB-B	88.60	88.60	88.60
IMDB-M	63.27	63.27	63.27
REDDIT-B	100.00	100.00	100.00
REDDIT-M-5K	100.00	100.00	100.00

**TABLE III** UPPER BOUND ACCURACY FOR $k$-LAYER WL REPRESENTATION.

how severe the limited expressiveness of MPNNs, is. To this end, we compute the fraction of non-isomorphic graphs in several commonly used datasets that can be identified by WL representations. More specifically, we compute for each graph WL-based representations $1 -WL^{(k)}$ one W L of k, where $k$ k indicates the number of iterations of the WL algorithm. A small fraction of identifiable graphs means that the WL algorithm cannot distinguish many graphs. In the extreme case, a fraction of 0% means that all graphs have the same 1-WL encoding. Vice versa, a large fraction of identifiable graphs shows that only a few graphs cannot be distinguished. A fraction of 100% means that each unique graph has an individual representation.

We report the results of the experiment in Table II and observe that a large fraction of graphs are identifiable in many datasets. More specifically, more than 94% of the graphs can be identified in 10 out of 11 datasets with just a single iteration of WL. In 8 out of 11 datasets, already all non-isomorphic graphs can be identified with one iteration of the WL graph coloring algorithm. Only MUTAG is an outlier with only 32.32% of identifiable non-isomorphic graphs. More iterations (i.e. $k > 1$ )where k is greater than one increase the fraction of identifiable graphs in 3 dataset (MUTAG, NCI1, and NCI109). Interestingly, the number of identifiable graphs is already at its maximum in all datasets that do not have node labels. One explanation for this observation is the fact that the variety of different graph representations is smaller when all nodes are initialized with the same color in contrast to graphs that are initialized with a large set of different node labels/colors. In sum, we conclude that the limited expressiveness of WL is not a crucial limitation in most of the datasets. Hence, the expressiveness of WL is usually not a limitation for the generalization performance.

B. Upper performance bounds for WL-based representations

In the previous section, we computed the fraction of WL-identifiable graphs, which provides insights on the expressiveness of WL without considering the class labels of the graphs. However, in practice, it is more relevant to estimate an upper bound or the classification accuracy, which indicates the best possible performance a deterministic model can achieve. As already discussed above, no deterministic¹ MPNN with the expressiveness of the corresponding WL representation can achieve a better performance than a WL-based model. The classification upper performance bound is often higher than the fraction of identifiable instances. The reason for this is the fact that graphs that have the same label do not necessarily have to be identified by a machine learning model in order to correctly classify the graphs. It can even be desired that graphs that belong to the same class are represented in the same way to obtain a higher generalization performance.

In the following, we estimate an upper bound for the classification accuracy of machine learning models that are based on WL representations. To this end, we compute the WL representation for all graphs for $k$ k iterations. For each cluster of graphs with the same WL representation, we perform a majority voting to obtain the best possible prediction for this cluster. As already discussed above, the same class needs to be predicted for all graphs in a specific cluster.

Table III reports the results of the experiment. We observe that a very high accuracy can in theory be achieved by WL-based models even for $k = 1$ k equals one. In 4 datasets (DD, ENZYMES, REDDIT-B, and REDDIT-M-5K), the upper performance bound for $k = 1$ k equals one is even 100%, which means that a WL-based model with $k = 1$ k equals one is sufficiently expressive to perfectly classify the dataset. In the MUTAG dataset, the difference between the fraction of identifiable graphs and the upper performance bound is most prominent. Even though only a small fraction of 32.32% of the graphs are WL identifiable, a WL-based model can still achieve an accuracy of 95.74%. Important to note is that the highest achievable performance is lower than 100% in datasets with non-unique graphs such as IMDB and COLLAB [37].

C. Results without using initial node labels

As mentioned before, some datasets come with predefined node features. However, it is also common to evaluate the performance of GNNs if the node features are not used. In this case, all node features that are present in the dataset are replaced with the same constant label. Hence, it is also

Bar chart showing the percentage of identifiable graphs across datasets for k=1, comparing with and without node features. Fig. 3. Fraction of identifiable graphs with and w/o node features for $k = 1$ k equals 1.

Bar chart showing the percentage of identifiable graphs across datasets for k=2, comparing with and without node features. Fig. 4. Fraction of identifiable graphs with and w/o node features for $k = 2$ k equals 2.

TABLE IV
WL UPPER BOUND PERFORMANCE WITHOUT USING NODE FEATURES
Dataset	\multicolumn{3}{c	}{Upper Bound Performance}
	k=1	k=2	k=3
DD	100.00	100.00	100.00
ENZYMES	98.83	100.00	100.00
MUTAG	91.49	96.28	96.81
NCI1	85.21	99.22	99.42
NCI109	85.51	98.91	99.32
PROTEINS	95.24	97.48	97.48

TABLE V
IDENTIFIABLE GRAPHS AND UPPER BOUND PERFORMANCE FOR $k = 0$
Dataset	Node Labels	Identifiable	Upper Bound Acc.
DD	yes	99.32	100.00
ENZYMES	yes	43.39	81.33
MUTAG	yes	25.00	93.09
NCI1	yes	56.34	91.34
NCI109	yes	56.98	91.66
PROTEINS	yes	52.86	91.91
DD	no	15.11	83.70
ENZYMES	no	3.05	38.50
MUTAG	no	0.92	86.17
NCI1	no	0.28	63.70
NCI109	no	0.33	63.46
PROTEINS	no	5.51	73.23
COLLAB	no	2.22	60.70
IMDB-B	no	3.80	60.60
IMDB-M	no	3.82	44.13
REDDIT-B	no	19.39	83.85
REDDIT-M-5K	no	9.06	55.17

interesting to evaluate if WL representations are sufficiently expressive in this case. To this end, we perform the same analysis as before, but replace the provided node labels with the same constant label such that all nodes are labeled identically.

A comparison of the fraction of identifiable graphs with and without node features for $k = 1$ k equals 1 and $k = 2$ k equals 2 can be found in Figure 3 and Figure 4, respectively. For $k = 1$ k equals 1, we can see that replacing the original node labels reduces the fraction of identifiable graphs in several datasets. However, the situation already changes when the number of iterations is increased to $k = 2$ k equals 2. In this case, we see that the differences between using and not using node labels decreases. There are also datasets such as DD and ENZYMES where the node labels do not contribute to the identifiability of the graphs.

Moreover, we report the upper bound for the classification accuracy without using node features in Table IV. Even though the number of identifiable graphs decreases substantially when no node features are used for $k = 1$ k equals 1, the upper bound performance is still very high, which means that a very high classification accuracy can be achieved even without using node features. For $k = 2$ k equals 2, the difference becomes even smaller.

D. WL expressiveness and upper accuracy bounds for $k = 0$ k equals zero

Prior works have shown that a good generalization performance can be achieved for graph classification even if the structure of the graph is not considered [38]. A simple classification model that does not use the graph structure can be obtained when all node features are aggregated without performing any message passing operations. This idea corresponds to the WL algorithm for $k = 0$ k equals zero since no re-coloring is performed for $k = 0$ k equals zero.

We report the fraction of identifiable graphs and the upper bound performance with and without using node features for $k = 0$ k equals zero in Table V. We can see that the fraction of identifiable graphs decreases substantially for $k = 0$ k equals zero when no node features are available, which is reasonable since the WL representation reduces to a representation based on the number of nodes in the graphs. The effect is less severe when node features are available, but still clearly observable when compared with the results for $k = 1$ k equals one. Hence, we conclude that using initial node features and at least one iteration of the WL algorithm have a substantial positive effect on the WL identifiability. However, the effect on the upper bound classification performance is less strong even for $k = 0$ k equals zero, specifically when node features are used. DD is an extreme case in which the upper performance bound does not drop below 100%. In ENZYMES, the effect is stronger and the upper bound drops from 100% to 81.33%. In sum, we conclude that upper performance bound is still reasonably high for $k = 0$ k equals zero when using node features.

E. WL-MLP and GNN upper performance bounds

So far, we have focused on the WL expressiveness and the upper bound performance which answers the question if we

TABLE VI ACCURACY UPPER BOUND FOR WL-MLP AND DIFFERENT GNNS

Dataset	WL-MLP	GCN	GAT	GIN
DD	100.00	100.00	100.00	100.00
ENZYMES	100.00	90.67	84.83	99.50
MUTAG	95.74	90.96	91.49	100.00
NCI1	99.54	97.27	96.23	99.34
NCI109	99.69	76.86	78.63	88.08
PROTEINS	99.73	76.55	76.91	92.63
COLLAB	97.92	71.70	52.00	76.18
IMDB-B	88.60	66.00	51.00	73.70
IMDB-M	63.20	37.93	34.33	59.53
REDDIT-B	100.00	92.45	51.00	93.45
REDDIT-M-5K	59.67	50.59	21.06	57.33

need more expressive GNNs for standard graph datasets from a theoretical point of view. In the following, we also evaluate how well popular GNNs can fit the graph datasets, which not only considers the expressiveness of the GNNs but also their trainability. Trainability of a machine learning model describes how easy it is to fit a model to a specific dataset. Since the models are trained with gradient-based approaches that do not guarantee to find the global optimum in the parameter space, it is unlikely that we obtain the optimal upper bound performance but only an approximation thereof. Similarly, we only obtain an approximation of the expressiveness.

We report the empirical upper bound performance of the models in Table VI. In this experiments, we used only one iteration of WL and single layer GNNs. Hence, the results can be compared with the results in Table III, column $k = 1$ k equals one. First, we find that the WL-based multi-layer perpetration baseline (WL-MLP) can successfully learn the datasets with node features, meaning that WL is sufficiently expressive for these datasets. Interestingly, the results are very close to the previously computed upper bound results in Table III, meaning that the WL-MLP is not only very expressive, but can also be fitted easily. The results for the GNNs are not as high as WL-MLP, but they still demonstrate a strong expressiveness. In the datasets without node features, it is more difficult to observe a high expressiveness of the machine learning models in many cases. This aligns with the results in Table III, in which we found that the limited expressiveness is most severe in the IMDB-B and IMDB-M datasets. The performance in REDDIT-M-5K is surprisingly low for all models. As mentioned above, the obtained results are approximations of the upper bound performance and better results might be achieved with an improved training procedure.

F. Motif histograms

Motif-based representations are a popular alternative to WL-based representation since they can capture properties of graphs that cannot be captured by WL representations. In the following, we evaluate motif-based representations to see if they are a reasonable choice in standard graph datasets. However, due to the high computational cost to count motifs, counting them is not feasible in several datasets. Hence, we only provide results for a subset of the datasets in Table VII.

TABLE VII IDENTIFIABLE GRAPHS AND UPPER BOUND PERFORMANCE FOR MOTIFS UP TO SIZE 4

Dataset	Identifiable	Upper Bound Acc.
ENZYMES	99.49	99.83
MUTAG	15.85	92.55
NCI1	17.12	75.52
NCI109	17.08	75.09
IMDB-B	100.00	88.60
IMDB-M	100.00	63.27

The results show that the fraction of unique identifiable graphs and the upper bound performance for motif-based representations is lower than WL representations with $k = 1$ k equals one for the datasets with node features. For datasets without node features, we don’t observe substantial differences between WL representations and motif-based representations. Similarly, we observe that the upper performance bounds for datasets with node features is lower than for WL-based representations with $k = 1$ k equals one in NCI1 and NCI109. However, the performance is only slightly lower in ENZYMES and MUTAG. Since motif-based representation are usually shorter, presuming a better generalization performance could be reasonable.

VII. CONCLUSIONS

A main motivation for the development of several new machine learning methods for graph-structured data is the limited expressiveness of the WL graph isomorphism test and MPNNs. However, our experiments show that WL-based representations are sufficiently expressive to distinguish most graphs in several datasets. Already one iteration of WL was sufficient to obtain a large fraction of WL-identifiable graphs. Moreover, we found that WL is sufficiently expressiveness to achieve a near perfect classification accuracy in almost all datasets. Moreover, a simple WL-based neural network and popular GNNs such as the GIN can be fitted to most datasets. Hence, we conclude that the limited expressiveness of MPNNs is not a limiting factor for their generalization performance and more expressive GNNs are not necessary to achieve a good performances. Hence, our findings suggest that the development of new MPNNs whose expressiveness is limited by the WL is still relevant and should not be abandoned. Moreover, observing a higher generalization performance of more expressive GNNs cannot be attributed to their stronger expressiveness, but can likely be explained by other factors.

Furthermore, we found that two 1-WL iterations can lead to a larger fraction of identifiable graphs than only one iteration in a few cases, but the added value of a third iteration is marginal. Furthermore, we found that only one 1-WL iteration is sufficient to achieve a high upper bound of the classification accuracy. Both results indicate that one or two iterations of WL are already sufficiently expressive in practice. We also found that motif-based representations are less strong than one iteration of WL and also have a lower upper bound accuracy, which is a disadvantage of motif-based representations in addition to their much higher computational costs.

References

[references omitted]

Models with a non-deterministic behavior may achieve higher performance. However, GNNs are usually implemented as deterministic functions. ↩

Graph View

TTS

1-WL Expressiveness Is (Almost) All You Need

I. Introduction

A. The Weisfeiler-Leman graph coloring algorithm

B. The Weisfeiler-Leman graph isomorphism test

IV. EXPRESSIVENESS OF WL AND MPNNS

V. TOWARDS MORE EXPRESSIVE ARCHITECTURES

VI. Analysis of WL-Expressiveness in Real-World Datasets

A. Fraction of WL-indentifiable graphs

B. Upper performance bounds for WL-based representations

C. Results without using initial node labels

D. WL expressiveness and upper accuracy bounds for $k = 0$ k equals zero

E. WL-MLP and GNN upper performance bounds

VII. CONCLUSIONS

References

Graph View

TTS

1-WL Expressiveness Is (Almost) All You Need

I. Introduction

A. The Weisfeiler-Leman graph coloring algorithm

B. The Weisfeiler-Leman graph isomorphism test

IV. EXPRESSIVENESS OF WL AND MPNNS

V. TOWARDS MORE EXPRESSIVE ARCHITECTURES

VI. Analysis of WL-Expressiveness in Real-World Datasets

A. Fraction of WL-indentifiable graphs

B. Upper performance bounds for WL-based representations

C. Results without using initial node labels

D. WL expressiveness and upper accuracy bounds for k=0k equals zero

E. WL-MLP and GNN upper performance bounds

VII. CONCLUSIONS

References

Footnotes

D. WL expressiveness and upper accuracy bounds for $k = 0$ k equals zero