Hypothesis Testing for Automated Community Detection in Networks

Peter J. Bickel1, Purnamrita Sarkar2
1University of California at Berkeley USA
2University of Texas at Austin USA

Tóm tắt

Summary Community detection in networks is a key exploratory tool with applications in a diverse set of areas, ranging from finding communities in social and biological networks to identifying link farms in the World Wide Web. The problem of finding communities or clusters in a network has received much attention from statistics, physics and computer science. However, most clustering algorithms assume knowledge of the number of clusters k. We propose to determine k automatically in a graph generated from a stochastic block model by using a hypothesis test of independent interest. Our main contribution is twofold; first, we theoretically establish the limiting distribution of the principal eigenvalue of the suitably centred and scaled adjacency matrix and use that distribution for our test of the hypothesis that a random graph is of Erdős–Rényi (noise) type. Secondly, we use this test to design a recursive bipartitioning algorithm, which naturally uncovers nested community structure. Using simulations and quantifiable classification tasks on real world networks with ground truth, we show that our algorithm outperforms state of the art methods.

Từ khóa


Tài liệu tham khảo

Adamic, 2005, Proc. 3rd Int. Wrkshp Link Discovery

Airoldi, 2008, Mixed membership stochastic blockmodels, J. Mach. Learn. Res., 9, 1981

Airoldi, 2013, Advances in Neural Information Processing Systems, 692

Amini, 2013, Pseudo-likelihood methods for community detection in large sparse networks, Ann. Statist., 41, 2097, 10.1214/13-AOS1138

Bartlett, 1937, Properties of sufficiency and statistical tests, Proc. R. Soc. Lond., 160, 268

Bickel, 2009, A nonparametric view of network models and Newman Girvan and other modularities, Proc. Natn. Acad. Sci. USA, 106, 21068, 10.1073/pnas.0907096106

Bloemendal, 2014, Isotropic local laws for sample covariance and generalized Wigner matrices, Electron. J. Probab., 19, 1

Chatterjee, 2015, Matrix estimation by universal singular value thresholding, Ann. Statist., 43, 177, 10.1214/14-AOS1272

Erdős, 2012, Rigidity of eigenvalues of generalized Wigner matrices, Adv. Math., 229, 1435, 10.1016/j.aim.2011.12.010

Füredi, 1981, The eigenvalues of random symmetric matrices, Combinatorica, 1, 233, 10.1007/BF02579329

Hamerly, 2003, Advances in Neural Information Processing Systems

Handcock, 2007, Model-based clustering for social networks (with discussion), J. R. Statist. Soc. A, 170, 301, 10.1111/j.1467-985X.2007.00471.x

Holland, 1983, Stochastic blockmodels: first steps, Socl Netwrks, 5, 109, 10.1016/0378-8733(83)90021-7

Karrer, 2011, Stochastic blockmodels and community structure in networks, Phys. Rev. E, 83, 016107, 10.1103/PhysRevE.83.016107

Larsen, 1999, Proc. 5th Int. Conf. Knowledge Discovery and Data Mining, 16

Latouche, 2011, Overlapping stochastic block models with application to the French political blogosphere, Ann. Appl. Statist, 5, 309, 10.1214/10-AOAS382

Lee, 2014, A necessary and sufficient condition for edge universality of Wigner matrices, Duke Math. J., 163, 117, 10.1215/00127094-2414767

Lei, 2014, A goodness-of-fit test for stochastic block models

McAuley, 2012, Advances in Neural Information Processing Systems, 539

Newman, 2006, Finding community structure in networks using the eigenvectors of matrices, Phys. Rev. E, 74, 036104, 10.1103/PhysRevE.74.036104

Olhede, 2014, Network histograms and universality of blockmodel approximation, Proc. Natn. Acad. Sci. USA, 111, 14722, 10.1073/pnas.1400374111

Oliveira, 2009, Concentration of the adjacency matrix and of the laplacian in random graphs with independent edges

Patterson, 2006, Population structure and eigenanalysis, PLOS Genet, 2, 2074, 10.1371/journal.pgen.0020190

Pelleg, 2000, Proc. 17th Int. Conf. Machine Learning, 727

Raftery, 2002, Latent space approaches to social network analysis, J. Am. Statist. Ass., 97, 1090, 10.1198/016214502388618906

Resnick, 1997, Protecting adolescents from harm: findings from the national longitudinal study on adolescent health, J. Am. Med. Ass., 278, 823, 10.1001/jama.1997.03550100049038

Rohe, 2011, Spectral clustering and the high-dimensional stochastic blockmodel, Ann. Statist., 39, 1878, 10.1214/11-AOS887

Snijders, 1997, Estimation and prediction for stochastic blockmodels for graphs with latent block structure, J. Classificn, 14, 75, 10.1007/s003579900004

Soshnikov, 1999, Universality at the edge of the spectrum in Wigner random matrices, Communs Math. Phys., 207, 697, 10.1007/s002200050743

Tracy, 1994, Level-spacing distributions and the airy kernel, Communs Math. Phys., 159, 151, 10.1007/BF02100489

Wigner, 1958, On the distribution of the roots of certain symmetric matrices, Ann. Math., 67, 325, 10.2307/1970008

Zachary, 1977, An information flow model for conflict and fission in small groups, J. Anthr. Res., 33, 452

Zhao, 2011, Community extraction for social networks, Proc. Natn. Acad. Sci. USA, 108, 7321, 10.1073/pnas.1006642108

Zhao, 2012, Consistency of community detection in networks under degree-corrected stochastic block models, Ann. Statist., 40, 2266, 10.1214/12-AOS1036