Clustering with Instance and Attribute Level Side Information

Jinlong Wang1,2, Shunyao Wu1, Gang Li3
1School of Computer Engineering, Qingdao Technological University, Qingdao, China
2Medical College of Qingdao University, Qingdao, China
3School of Information Technology, Deakin University, Victoria, Australia

Abstract

Selecting a suitable proximity measure is one of the fundamental tasks in clustering. An essential problem in metric learning is how to make effective use of all available side information, including instance-level information in the form of pairwise constraints and attribute-level information in the form of attribute order preferences. In this paper, we propose a learning framework that incorporates both pairwise constraints and attribute order preferences simultaneously. The underlying theory and the associated parameter adjustment technique are described in detail. Experimental results on benchmark data sets demonstrate the effectiveness of the proposed method.
