Convex clustering for binary data

Advances in Data Analysis and Classification - Tập 13 - Trang 991-1018 - 2018

Hosik Choi¹, Seokho Lee²

¹Kyonggi University, Suwon, Republic of Korea

²Hankuk University of Foreign Studies, Seoul, Republic of Korea

Tóm tắt

We present a new clustering algorithm for multivariate binary data. The new algorithm is based on the convex relaxation of hierarchical clustering, which is achieved by considering the binomial likelihood as a natural distribution for binary data and by formulating convex clustering using a pairwise penalty on prototypes of clusters. Under convex clustering, we show that the typical $$\ell _1$$ pairwise fused penalty results in ineffective cluster formation. In an attempt to promote the clustering performance and select the relevant clustering variables, we propose the penalized maximum likelihood estimation with an $$\ell _2$$ fused penalty on the fusion parameters and an $$\ell _1$$ penalty on the loading matrix. We provide an efficient algorithm to solve the optimization by using majorization-minimization algorithm and alternative direction method of multipliers. Numerical studies confirmed its good performance and real data analysis demonstrates the practical usefulness of the proposed method.

Tài liệu tham khảo

Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2:183–202 Böhning D (1992) Multinomial logistic regression algorithm. Ann Inst Stat Math 44:197–200 Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trend${}^{\textregistered }$ Mach Learn 3:1–122 Chi EC, Lange K (2015) Splitting methods for convex clustering. J Comput Graph Stat 24:994–1013 Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32:407–499 Finch H (2005) Comparison of distance measures in cluster analysis with dichotomous data. J Data Sci 3:85–100 Goldstein T, O’Donoghue B, Setzer S, Baraniuk R (2014) Fast alternating direction optimization methods. SIAM J Imaging Sci 7:1588–1623 Golub GH, Van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore Hallac D, Leskovec J, Boyd S (2015) Network lasso: clustering and optimization in large graphs. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp 387–396 Hocking TD, Joullin A, Bach F, Vert J-P (2011) Cluterpath: an algorithm for clustering using convex fusion penalties. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 754–762 Jolliffe IT (2012) Principal component analysis, 2nd edn. Springer, New York Lange K (2004) Optimization. Springer, New York Lee S, Huang JZ (2014) A biclustering algorithm for binary matrices based on penalized Bernoulli likelihood. Stat Comput 24:429–441 Lee S, Huang JZ, Hu J (2010) Sparse logistic principal component analysis for binary data. Ann Appl Stat 4:1579–1601 Li T (2006) A unified view on clustering binary data. Mach Learn 62:199–215 Lichman M (2013) UCI machine learning repository [http://archive.ics.uci.edu/ml]. University of California, School of Information and Computer Science, Irvine Pan W, Shen X, Liu B (2013) Cluster analysis: unsupervised learning via supervised learning with a non-convex penalty. J Mach Learn Res 14:1865–1889 Polson NG, Scott JG, Willard BT (2015) Proximal algorithms in statistics and machine learning. Stat Sci 30:559–581 Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850 Shen X, Huang HC (2010) Grouping pursuit through a regularization solution surface. J Am Stat Assoc 105:727–739 Shen X, Pan W (2012) Simultaneous supervised clustering and feature selection over a graph. Biometrika 99:899–914 Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B 63:411–423 Turlach BA, Venables W (2005) Simultaneous variable selection. Technometrics 47:349–363 Witten DM, Tibshirani R (2010) A framework for feature selection in clustering. J Am Stat Assoc 105:713–726 Wu C, Kwon S, Shen X, Pan W (2016) A new algorithm and theory for penalized regression-based clustering. J Mach Learn Res 17:1–25 Yang H, Liu X (2017) Studies on the clustering algorithm for analyzing gene expression data with a bidirectional penalty. J Comput Biol 24:689–698 Yang Y, Guan X, You J (2002) CLOPE: a fast and effective clustering algorithm for transactional data. In: SIGKDD ’02 Edmonton, Alberta, Canada, pp 682–687 Zhang Z, Li T, Ding C, Zhang X (2007) Binary matrix factorization with applications. In: IEEE international conference on data mining, pp 391–400

Scholar Hub - Công cụ hỗ trợ trích dẫn và phân tích khoa học Việt Nam

Về chúng tôi

Scholar Hub là công cụ hỗ trợ trích dẫn và phân tích các bài báo, công bố khoa học Việt Nam. Công cụ trợ giúp người nghiên cứu, tạp chí, đơn vị nghiên cứu tra cứu, phân tích và thống kê dữ liệu nghiên cứu khoa học tại Việt Nam và quốc tế.
ScholarHub KHÔNG đăng thông tin tổng hợp, KHÔNG đăng lại nội dung từ các trang báo chí Việt Nam hoặc trang thông tin điện tử khác tại Việt Nam.

Thông tin, cập nhật

Đăng ký Tạp chí tham gia vào Scholar Hub

Phản hồi ý kiến về Scholar Hub

Bài viết, nội dung cập nhật

Chủ đề khoa học

Website liên kết

Hệ thống CSDL Khoa học & Công nghệ

Phần mềm kiểm tra trùng lặp Kiểm Tra Tài Liệu

Phần mềm xuất bản tạp chí điện tử VOJS

Nền tảng trắc nghiệm và đề thi đa lĩnh vực LetQA