Simultaneous clustering and variable selection: A novel algorithm and model selection procedure
Abstract
The growing availability of high-dimensional data sets offers behavioral scientists an unprecedented opportunity to integrate the information hidden in novel types of data (e.g., genetic data, social media data, and GPS tracks) and thereby obtain a more detailed and comprehensive view of their research questions. In the context of clustering, analyzing this large volume of variables could potentially result in a more accurate estimation, or a novel discovery, of underlying subgroups. A unique challenge, however, is that high-dimensional data sets are likely to contain a substantial number of irrelevant variables. These irrelevant variables do not contribute to the separation of clusters and may even mask the cluster partition. The current paper addresses this challenge by introducing a new clustering algorithm, called Cardinality K-means (CKM), and by proposing a novel model selection strategy. CKM performs simultaneous clustering and variable selection with high stability. In two simulation studies and an empirical demonstration with genetic data, CKM consistently outperformed competing methods in recovering cluster partitions and identifying signaling variables. Meanwhile, our novel model selection strategy determines the number of clusters based on a subset of variables that are most likely to be signaling variables. In a simulation study, this strategy yielded a more accurate estimate of the number of clusters than the conventional strategy that uses the full set of variables. Our proposed CKM algorithm, together with the novel model selection strategy, has been implemented in a freely accessible R package.
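To make the general idea of simultaneous clustering and variable selection concrete, the following is a minimal Python sketch of a cardinality-constrained K-means procedure: cluster centroids are allowed to differ from the grand mean on only a fixed number of variables, and cluster assignment alternates with selecting those variables by their between-cluster sum of squares. The function name cardinality_kmeans_sketch and its arguments are illustrative assumptions; this is not the authors' CKM implementation, and the accompanying R package should be consulted for the actual algorithm and the model selection procedure.

import numpy as np

def cardinality_kmeans_sketch(X, n_clusters, n_signal_vars, n_iter=100, seed=0):
    # Illustrative sketch only: alternate between (a) K-means style cluster
    # assignment and (b) keeping cluster-specific means for only the
    # n_signal_vars variables with the largest between-cluster sum of squares,
    # shrinking the remaining variables' centroids to the grand mean.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    grand_mean = X.mean(axis=0)
    labels = rng.integers(n_clusters, size=n)          # random initial partition
    for _ in range(n_iter):
        # Cluster means from the current partition (empty clusters fall back
        # to the grand mean).
        means = np.vstack([X[labels == k].mean(axis=0) if np.any(labels == k)
                           else grand_mean for k in range(n_clusters)])
        # Between-cluster sum of squares per variable.
        sizes = np.bincount(labels, minlength=n_clusters)[:, None]
        bss = (sizes * (means - grand_mean) ** 2).sum(axis=0)
        # Keep cluster-specific means only for the top variables (the
        # candidate "signaling" variables); the rest revert to the grand mean.
        signal = np.argsort(bss)[-n_signal_vars:]
        centroids = np.tile(grand_mean, (n_clusters, 1))
        centroids[:, signal] = means[:, signal]
        # Reassign observations to the nearest constrained centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, np.sort(signal)

In practice one would compare the recovered partition against a known or hypothesized grouping (for example with the adjusted Rand index) and choose the number of clusters and the cardinality through a model selection procedure such as the one proposed in the paper.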