# On a two-truths phenomenon in spectral graph clustering

^{a}Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218;^{b}Center for Imaging Science, Johns Hopkins University, Baltimore, MD 21218;^{c}Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218;^{d}Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218;^{e}Institute for Defense Analyses, Center for Computing Sciences, Bowie, MD 20715;^{f}Department of Mathematics and Statistics, University of Massachusetts, Amherst, MA 01003;^{g}Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21218

Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved February 8, 2019 (received for review August 21, 2018)

## Significance

Spectral graph clustering—clustering the vertices of a graph based on their spectral embedding—is of significant current interest, finding applications throughout the sciences. But as with clustering in general, what a particular methodology identifies as “clusters” is defined (explicitly, or, more often, implicitly) by the clustering algorithm itself. We provide a clear and concise demonstration of a “two-truths” phenomenon for spectral graph clustering in which the first step—spectral embedding—is either Laplacian spectral embedding, wherein one decomposes the normalized Laplacian of the adjacency matrix, or adjacency spectral embedding given by a decomposition of the adjacency matrix itself. The two resulting clustering methods identify fundamentally different (true and meaningful) structure.

## Abstract

Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering—clustering the vertices of a graph based on their spectral embedding—is commonly approached via *K*-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian spectral embedding (LSE) or adjacency spectral embedding (ASE). Recent theoretical results provide deeper understanding of the problem and solutions and lead us to a “two-truths” LSE vs. ASE spectral graph clustering phenomenon convincingly illustrated here via a diffusion MRI connectome dataset: The different embedding methods yield different clustering results, with LSE capturing left hemisphere/right hemisphere affinity structure and ASE capturing gray matter/white matter core–periphery structure.

The purpose of this paper is to cogently present a “two-truths” phenomenon in spectral graph clustering, to understand this phenomenon from a theoretical and methodological perspective, and to demonstrate the phenomenon in a real-data case consisting of multiple graphs each with multiple categorical vertex class labels.

A graph or network consists of a collection of vertices or nodes V representing n entities together with edges or links E representing the observed subset of the n(n − 1)/2 possible pairwise relationships among these entities.

It is often the case that practitioners cluster the vertices of a graph—say, via K-means clustering composed with Laplacian spectral embedding—and pronounce the method as having performed either well or poorly based on whether the resulting clusters correspond well or poorly with some known or preconceived notion of “correct” clustering. Indeed, such a procedure may be used to compare two clustering methods and to pronounce that one works better (on the particular data under consideration). However, clustering is inherently ill-defined, as there may be multiple meaningful groupings, and two clustering methods that perform differently with respect to one notion of truth may in fact be identifying inherently different, but perhaps complementary, underlying structure. With respect to graph clustering, ref. 1 shows that there can be no algorithm that is optimal for all possible community detection tasks (Fig. 1).

We compare and contrast Laplacian and adjacency spectral embedding as the first step in spectral graph clustering and demonstrate that the two methods, and the two resulting clusterings, identify different—but both meaningful—graph structure. We trust that this simple, clear explication will contribute to an awareness that connectivity-based structure discovery via spectral graph clustering should consider both Laplacian and adjacency spectral embedding and the development of new methodologies based on this awareness.

## Spectral Graph Clustering

Given a simple graph on n vertices, we work with its n × n symmetric adjacency matrix A, where A_ij = 1 if vertices i and j are joined by an edge and A_ij = 0 otherwise.

The first step of spectral graph clustering (2, 3) involves embedding the graph into Euclidean space via an eigendecomposition. We consider two options: Laplacian spectral embedding (LSE), wherein we decompose the normalized Laplacian of the adjacency matrix, and adjacency spectral embedding (ASE) given by a decomposition of the adjacency matrix itself. With target dimension d, either spectral embedding method produces n points in d-dimensional Euclidean space, one per vertex, collected as the rows of an n × d matrix X.

Spectral graph clustering concludes via classical Euclidean clustering of the rows of X. As described below, central limit theorems for spectral embedding of the (sufficiently dense) stochastic block model via either LSE or ASE suggest Gaussian mixture modeling (GMM) for this clustering step. Thus, we consider spectral graph clustering to be GMM composed with LSE or ASE: GMM ∘ LSE vs. GMM ∘ ASE.
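As a concrete sketch of the two pipelines, the following illustrative Python (numpy and scikit-learn assumed; the function names `ase`, `lse`, and `spectral_cluster` are ours, not from the paper) computes both embeddings of an adjacency matrix and clusters the embedded points with a Gaussian mixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ase(A, d):
    """Adjacency spectral embedding: scaled top-d eigenvectors of A."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]      # d largest-magnitude eigenvalues
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def lse(A, d):
    """Laplacian spectral embedding: ASE applied to the normalized
    Laplacian D^{-1/2} A D^{-1/2} (assumes no isolated vertices)."""
    deg = A.sum(axis=1)
    return ase(A / np.sqrt(np.outer(deg, deg)), d)

def spectral_cluster(A, d, K, embed):
    """GMM composed with a spectral embedding: returns K cluster labels."""
    X = embed(A, d)
    return GaussianMixture(n_components=K, random_state=0).fit_predict(X)
```

Working on the largest connected component, as the paper does, guarantees the positive degrees that `lse` requires.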

## Stochastic Block Model

The random graph model we use to illustrate our phenomenon is the stochastic block model (SBM), introduced in ref. 4. This model is parameterized by (*i*) a block membership probability vector ρ = (ρ_1, …, ρ_K) in the unit (K − 1)-simplex and (*ii*) a symmetric K × K block connectivity probability matrix B with entries in (0, 1). Each vertex is independently assigned to one of the K blocks according to ρ, and, conditional on the block assignments, edges arise independently, with probability B_kl for a pair of vertices in blocks k and l.

For sufficiently dense graphs, both LSE and ASE have a central limit theorem (6–8) demonstrating that, for large n, embedding via the top d eigenvectors of an SBM whose block connectivity matrix B has rank d yields n points in d-dimensional space that behave approximately as a random sample from a mixture of K Gaussians.

We make significant conceptual use of the positive definite two-block SBM, with block membership probability vector ρ = (ρ_1, ρ_2) and block connectivity probability matrix B = [[a, b], [b, c]] satisfying ac > b². Two special cases structure our analysis.

### Affinity: a, c ≫ b.

An SBM with a, c ≫ b exhibits affinity structure: the within-block connectivity probabilities a and c dominate the between-block connectivity probability b, so each block is comparatively densely connected internally and only weakly connected to the other.

### Core–periphery: a ≫ b, c.

An SBM with a ≫ b, c exhibits core–periphery structure: the first block (the core) is densely connected internally, while the second block (the periphery) is only weakly connected, both internally and to the core.
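To make the two regimes tangible, here is a short sketch (numpy assumed; the block matrices below are illustrative values we chose, not estimates from the paper) that samples simple undirected graphs from a two-block SBM in each regime:

```python
import numpy as np

def sample_sbm(z, B, rng):
    """Sample a simple undirected graph from an SBM with block
    memberships z and block connectivity probability matrix B."""
    P = B[np.ix_(z, z)]                # per-pair edge probabilities
    A = np.triu(rng.random(P.shape) < P, k=1)   # sample upper triangle only
    return (A + A.T).astype(float)     # symmetrize; zero diagonal (no self-loops)

rng = np.random.default_rng(42)
z = np.repeat([0, 1], 150)
B_affinity = np.array([[0.30, 0.05],
                       [0.05, 0.25]])  # a, c >> b
B_coreperiph = np.array([[0.30, 0.05],
                         [0.05, 0.02]])  # a >> b, c
A1 = sample_sbm(z, B_affinity, rng)
A2 = sample_sbm(z, B_coreperiph, rng)
```

Sampling only the upper triangle and symmetrizing keeps the graph simple (undirected, no self-loops, no double sampling of a pair).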

The relative performance of LSE and ASE for these two cases provides the foundation for our analyses. Informally, LSE outperforms ASE for affinity, and ASE is the better choice for core–periphery. We make this clustering performance assessment analytically precise via Chernoff information, and we demonstrate this in practice via the adjusted Rand index.

## Clustering Performance Assessment

We consider two approaches to assessing the performance of a given clustering, defined to be a partition of the n vertices into clusters: a limit-theoretic criterion based on Chernoff information and an empirical criterion based on the adjusted Rand index.

### Chernoff Information.

Comparing and contrasting the relative performance of LSE vs. ASE via the concept of Chernoff information (9, 10), in the context of their respective central limit theorems (CLTs), provides a limit theorem notion of superiority. Thus, in the SBM, we allude to the GMM provided by the CLT for either LSE or ASE.

The Chernoff information between two distributions is the exponential rate at which the decision-theoretic Bayes error decreases as a function of sample size. In the two-block SBM, with the true clustering of the vertices given by the block memberships, we are interested in the large-sample optimal error rate for recovering the underlying block memberships after the spectral embedding step has been carried out. Thus, we require the Chernoff information between the two limiting Gaussian distributions given by the appropriate CLT; the embedding method yielding the larger Chernoff information enjoys the better large-sample error rate.
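For two multivariate Gaussians, the Chernoff information reduces to a one-dimensional maximization over t ∈ (0, 1), which can be approximated on a grid. A minimal sketch (numpy assumed; `chernoff_information` is our illustrative name):

```python
import numpy as np

def chernoff_information(mu1, S1, mu2, S2, grid=200):
    """Chernoff information between N(mu1, S1) and N(mu2, S2),
    maximizing the Chernoff exponent over t in (0, 1) on a grid."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    d = mu2 - mu1
    best = 0.0
    for t in np.linspace(1e-3, 1 - 1e-3, grid):
        St = (1 - t) * S1 + t * S2
        quad = 0.5 * t * (1 - t) * d @ np.linalg.solve(St, d)
        logdet = 0.5 * (np.log(np.linalg.det(St))
                        - (1 - t) * np.log(np.linalg.det(S1))
                        - t * np.log(np.linalg.det(S2)))
        best = max(best, quad + logdet)
    return best
```

For equal covariances the maximizer is t = 1/2 and the value reduces to the familiar squared-Mahalanobis-distance-over-8 expression, a handy sanity check.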

### Adjusted Rand Index.

In practice, we wish to empirically assess the performance of a particular clustering algorithm on a given graph. There are numerous cluster assessment criteria available in the literature: the Rand index (RI) (12), normalized mutual information (NMI) (13), variation of information (VI) (14), Jaccard (15), etc. These are typically used to compare either an empirical clustering against a “truth” or two separate empirical clusterings. For concreteness, we consider the well-known adjusted Rand index (ARI), popular in machine learning, which normalizes the RI so that expected chance performance is zero: The ARI is the adjusted-for-chance probability that two partitions of a collection of data points will agree for a randomly chosen pair of data points, putting the pair into the same partition cell in both clusterings or splitting the pair into different cells in both clusterings. (Our empirical connectome results are essentially unchanged when using other cluster assessment criteria.)
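The adjusted-for-chance computation described above can be written directly from the contingency table of the two partitions. A self-contained sketch (numpy assumed; in practice one might simply call `sklearn.metrics.adjusted_rand_score`):

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """ARI from the contingency table of two partitions of the same items."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    C = np.zeros((ia.max() + 1, ib.max() + 1), dtype=np.int64)
    np.add.at(C, (ia, ib), 1)                 # contingency counts
    comb2 = lambda x: x * (x - 1) // 2        # pairs within a count
    sum_cells = comb2(C).sum()
    sum_rows = comb2(C.sum(axis=1)).sum()
    sum_cols = comb2(C.sum(axis=0)).sum()
    n_pairs = comb2(len(a))
    expected = sum_rows * sum_cols / n_pairs  # chance agreement
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)
```

Identical partitions (up to label names) score 1, and the adjustment makes chance agreement score approximately 0.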

In the context of spectral clustering via GMM ∘ LSE or GMM ∘ ASE, we use ARI to compare the empirical clustering produced by each method against a true partition of the vertices.

In the context of two truths, we consider the ARI of each empirical clustering against each of the two true partitions; whichever truth an embedding method's clustering agrees with more closely is the structure that method is identifying.

## Model Selection × 2

To perform the spectral graph clustering GMM ∘ LSE or GMM ∘ ASE in practice, two model selection problems must be confronted: choosing the embedding dimension d̂ and choosing the number of clusters K̂.

### SBM vs. Network Histogram.

If the SBM were actually true, then as n → ∞ both the rank of B and the number of blocks K could be consistently estimated. We instead view SBM fitting in the spirit of the network histogram of ref. 5: An SBM with sufficiently many blocks can approximate a rich class of graph distributions, so the model is a practical approximation device rather than a believed truth.

The bias–variance tradeoff demonstrates that any quest for a universally optimal methodology for choosing the “best” dimension and number of clusters, in general, for finite n, is a losing proposition. Even for a low-rank model, subsequent inference may be optimized by choosing a dimension smaller than the true signal dimension, and even for a mixture of K Gaussians, inference performance may be optimized by choosing a number of clusters smaller than the true cluster complexity. In the case of semiparametric SBM fitting, wherein low rank and finite mixtures are used as a practical modeling convenience rather than a believed true model, one presumes that both d̂ and K̂ will be chosen smaller than any nominal truth.

For our connectome analyses, we proceed with the following choices of d̂ and K̂.

### Choosing the Embedding Dimension d̂.

A ubiquitous and principled general methodology for choosing the number of dimensions in eigendecompositions and SVDs (e.g., principal components analysis, factor analysis, spectral embedding, etc.) is to examine the so-called scree plot and look for “elbows” defining the cutoff between the top (signal) dimensions and the noise dimensions. There are a plethora of variations for automating this singular value thresholding (SVT); section 2.8 of ref. 16 provides a comprehensive discussion in the context of principal components, and ref. 17 provides a theoretically justified (but perhaps practically suspect, for small n) universal SVT. We consider the profile-likelihood SVT method of ref. 18. Given the decreasingly ordered singular values, this method chooses d̂ as the partition point that maximizes a profile likelihood under a simple two-group model of signal and noise singular values.
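One common formalization of this elbow criterion, in the spirit of the profile-likelihood method of ref. 18, models the two groups of singular values as Gaussians with separate means and a common variance and maximizes the resulting profile log-likelihood over the split point. An illustrative sketch (numpy assumed; details of ref. 18 may differ):

```python
import numpy as np

def profile_likelihood_elbow(svals):
    """Choose the elbow in a scree plot: split the sorted singular values
    into 'signal' and 'noise' groups, each Gaussian with its own mean and
    a common variance, and return the split maximizing the profile
    log-likelihood."""
    s = np.sort(np.asarray(svals, float))[::-1]
    n = len(s)
    best_q, best_ll = 1, -np.inf
    for q in range(1, n):
        s1, s2 = s[:q], s[q:]
        rss = ((s1 - s1.mean()) ** 2).sum() + ((s2 - s2.mean()) ** 2).sum()
        var = max(rss / n, 1e-12)          # pooled variance MLE; guard degenerate splits
        ll = -0.5 * n * np.log(2 * np.pi * var) - 0.5 * rss / var
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q
```

Under this model, maximizing the profile likelihood is equivalent to minimizing the pooled within-group variance over split points.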

### Choosing the Number of Clusters K̂.

Choosing the number of clusters in Gaussian mixture models is most often addressed by maximizing a fitness criterion penalized by model complexity. Common approaches include the Akaike information criterion (AIC) (19), the Bayesian information criterion (BIC) (20), minimum description length (MDL) (21), etc. We consider penalized likelihood via the BIC (22). Given n points in d̂-dimensional space, we fit GMMs for a range of candidate values of K and choose K̂ to optimize BIC.
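A minimal sketch of BIC-based selection of K̂ (scikit-learn assumed; note that scikit-learn's `bic` method is defined so that lower is better, so we minimize):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_K_by_bic(X, K_max=10):
    """Fit GMMs for K = 1..K_max and return the K minimizing BIC."""
    bics = []
    for K in range(1, K_max + 1):
        gmm = GaussianMixture(n_components=K, n_init=3, random_state=0).fit(X)
        bics.append(gmm.bic(X))
    return int(np.argmin(bics)) + 1
```

The `n_init=3` restarts hedge against poor EM local optima; larger values trade compute for stability.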

## Connectome Data

We consider for illustration a diffusion MRI dataset consisting of 114 connectomes (57 subjects, two scans each) with 72,783 vertices each and both left/right/other hemispheric and gray/white/other tissue attributes for each vertex. Graphs were estimated using the NeuroData’s MR Graphs pipeline (23), with vertices representing subregions defined via spatial proximity and edges defined by tensor-based fiber streamlines connecting these regions (Fig. 2).

The actual graphs we consider are the largest connected component (LCC) of the induced subgraph on the vertices labeled as both left or right and gray or white. This yields, for each of the 114 scans, a connected graph in which every vertex carries both a {Left, Right} and a {Gray, White} label.

### Sparsity.

The only notions of sparsity relevant here are linear algebraic: whether there are enough edges in the graph to support spectral embedding and whether there are few enough to allow for sparse matrix computations. We have a collection of observed connectomes and we want to cluster the vertices in these graphs, as opposed to in an unobserved sequence with the number of vertices tending to infinity. Our connectomes have, on average, ample edges to support informative spectral embeddings while remaining sparse enough for efficient sparse matrix computations.
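For graphs of this scale, the embedding step is feasible precisely because the adjacency matrix can be held in sparse form and only a few leading eigenpairs are needed. An illustrative sketch using scipy's sparse Lanczos solver (`ase_sparse` is our name for this ASE variant, not from the paper):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def ase_sparse(A, d):
    """ASE for a large sparse symmetric adjacency matrix: top-d eigenpairs
    by magnitude via the implicitly restarted Lanczos method (eigsh)."""
    vals, vecs = eigsh(A, k=d, which='LM')        # d largest-magnitude eigenpairs
    order = np.argsort(np.abs(vals))[::-1]
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.abs(vals))
```

Only d eigenpairs are ever computed or stored, so the cost scales with the number of edges rather than with n².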

### Synthetic Analysis.

We consider a synthetic data analysis via a priori projections onto the SBM—block model estimates based on known or assumed block memberships. Averaging the collection of 114 connectomes yields a composite connectome, from which these a priori block model projections are estimated.

### Two-Block Projections.

A priori projections onto the two-block SBM, using the {Left, Right} and {Gray, White} truths separately, yield an affinity structure for the hemispheres and a core–periphery structure for the tissue types. We investigate these two-block projections first theoretically in *Theoretical Results*, then via Monte Carlo in *Simulation Results*, and then empirically on the original connectomes in *Connectome Results*.

### Theoretical Results.

Analysis using the large-sample Gaussian mixture model approximations from the LSE and ASE CLTs shows that the 2D embedding of the four-block model, when clustered into two clusters, will yield {{LG,LW}, {RG,RW}} (i.e., {Left, Right}) when embedding via LSE and {{LG,RG}, {LW,RW}} (i.e., {Gray, White}) when using ASE. That is, numerical integration for the relevant Chernoff informations shows that LSE is superior for recovering the hemispheric affinity structure while ASE is superior for recovering the tissue core–periphery structure.

### Simulation Results.

We augment the Chernoff limit theory via Monte Carlo simulation, sampling graphs from the four-block model and running the GMM ∘ LSE and GMM ∘ ASE algorithms on each sample. The simulations agree with the limit theory: LSE recovers the {Left, Right} structure and ASE recovers the {Gray, White} structure.

### Connectome Results.

Figs. 5–7 present empirical results for the connectome dataset, demonstrating the two-truths phenomenon across the 114 individual connectomes.

First, in Fig. 5, we consider a priori projections of the individual connectomes, analogous to the Fig. 4 projections of the composite connectome. Letting each individual connectome play the role of the composite, the projections again exhibit affinity structure for {Left, Right} and core–periphery structure for {Gray, White} across essentially the entire dataset.

In Figs. 6 and 7 we present the results of the complete spectral clustering pipelines, with d̂ and K̂ chosen as described above, applied to each connectome: The LSE clusterings align better with the {Left, Right} truth, and the ASE clusterings align better with the {Gray, White} truth.

## Conclusion

The results presented herein demonstrate that practical spectral graph clustering exhibits a two-truths phenomenon with respect to Laplacian vs. adjacency spectral embedding. This phenomenon can be understood theoretically from the perspective of affinity vs. core–periphery stochastic block models and via consideration of the two a priori projections of a four-block two-truths SBM onto the two-block SBM. For connectomics, this phenomenon manifests itself via LSE better capturing the left hemisphere/right hemisphere affinity structure and ASE better capturing the gray matter/white matter core–periphery structure. This suggests that a connectivity-based parcellation based on spectral clustering should consider both LSE and ASE, as the two spectral embedding approaches facilitate the identification of different and complementary connectivity-based clustering truths.

## Acknowledgments

The authors thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, United Kingdom, for support and hospitality during the program Theoretical Foundations for Statistical Network Analysis (Engineering and Physical Sciences Research Council Grant EP/K032208/1), where a portion of the work on this paper was undertaken, and the University of Haifa, where these ideas were conceived in June 2014. This work is partially supported by Defense Advanced Research Projects Agency (XDATA, GRAPHS, SIMPLEX, D3M), Johns Hopkins University Human Language Technology Center of Excellence, and the Acheson J. Duncan Fund for the Advancement of Research in Statistics.

## Footnotes

- ^{1}To whom correspondence should be addressed. Email: cep@jhu.edu.

Author contributions: C.E.P., J.T.V., J.M.C., and V.L. designed research; C.E.P., V.L., M.T., A.A., and J.C. performed research; C.E.P., Y.P., and E.B. analyzed data; and C.E.P. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

- Copyright © 2019 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution License 4.0 (CC BY).

## References

- [1] Peel L, Larremore DB, Clauset A
- [5] Olhede SC, Wolfe PJ
- [6] Athreya A, et al.
- [7] Tang M, Priebe CE
- [8] Rubin-Delanchy P, Priebe CE, Tang M, Cape J
- [9] Chernoff H
- [10] Chernoff H
- [11] Cape J, Tang M, Priebe CE, *Network Science*, in press
- [16] Jackson JE
- [17] Chatterjee S
- [23] Kiar G, et al.
## Article Classifications

- Physical Sciences
- Statistics