A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification

Springer Science and Business Media LLC - Volume 25 - Pages 1077-1094 - 2019
Vinay Kumar Kotte1,2, Srinivasan Rajavelu3, Elijah Blessing Rajsingh3
1Department of CSE, Karunya Institute of Technology and Sciences (Deemed to be University), Coimbatore, India
2Department of CSE, Kakatiya Institute of Technology and Science, Warangal, India
3Karunya Institute of Technology and Sciences (Deemed to be University), Coimbatore, India

Abstract

Text document classification and clustering are important learning tasks that belong to both data mining and machine learning. They pose several challenges when the text documents to be processed are high dimensional. The distribution of words in text documents plays a key role in the learning process. Research on high dimensional text document classification and clustering is usually limited to applying traditional distance functions, and most contributions in the existing literature do not consider the word distribution in documents. In this research, we propose a novel similarity function for feature pattern clustering and high dimensional text classification. The proposed similarity function is used to carry out dimensionality reduction based on supervised learning. The important feature of this work is that the word distribution is the same before and after dimensionality reduction. Experimental results show that the proposed approach achieves dimensionality reduction, retains the word distribution, and obtains better classification accuracy than other measures.
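The general idea of supervised dimensionality reduction through feature pattern clustering can be illustrated with a minimal sketch. The abstract does not give the proposed similarity function, so the code below substitutes a plain Gaussian similarity as a stand-in. The function names, the threshold, the `sigma` value, and the toy documents are all assumptions made for illustration and are not taken from the paper. Each word is described by its pattern of occurrence across class labels, words with similar patterns are grouped, and each group becomes one reduced feature.

```python
import math

def word_patterns(docs, labels, vocab, classes):
    """For each word, compute its share of occurrences in each class."""
    patterns = {}
    for w in vocab:
        counts = [0.0] * len(classes)
        for doc, y in zip(docs, labels):
            counts[classes.index(y)] += doc.get(w, 0)
        total = sum(counts)
        patterns[w] = [c / total if total else 0.0 for c in counts]
    return patterns

def gaussian_similarity(p, q, sigma=0.5):
    """Stand-in similarity in (0, 1]; NOT the function proposed in the paper."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (2 * sigma ** 2))

def cluster_words(patterns, threshold=0.8):
    """Greedy one-pass clustering (hypothetical): a word joins the first
    cluster whose seed pattern is similar enough, else it starts a new one."""
    clusters = []  # list of (seed_pattern, member_words)
    for w in sorted(patterns):
        for seed, members in clusters:
            if gaussian_similarity(patterns[w], seed) >= threshold:
                members.append(w)
                break
        else:
            clusters.append((patterns[w], [w]))
    return [members for _, members in clusters]

def reduce_docs(docs, clusters):
    """Replace per-word counts with per-cluster summed counts."""
    return [[sum(doc.get(w, 0) for w in members) for members in clusters]
            for doc in docs]

# Toy corpus (assumed): documents as word-count dictionaries with class labels.
docs = [
    {"goal": 2, "match": 1},
    {"match": 2, "goal": 1},
    {"vote": 3, "party": 1},
    {"party": 2, "vote": 1},
]
labels = ["sport", "sport", "politics", "politics"]
vocab = ["goal", "match", "vote", "party"]
classes = ["sport", "politics"]

patterns = word_patterns(docs, labels, vocab, classes)
clusters = cluster_words(patterns)
reduced = reduce_docs(docs, clusters)
print(clusters)
print(reduced)
```

In this toy case four word features collapse to two cluster features. Summing counts within a cluster keeps each document's total word count unchanged, which is only a loose analogue of the distribution-preserving property the abstract describes.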
