A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification

Springer Science and Business Media LLC - Tập 25 - Trang 1077-1094 - 2019

Vinay Kumar Kotte^1,2, Srinivasan Rajavelu³, Elijah Blessing Rajsingh³

¹Department of CSE, Karunya Institute of Technology and Sciences (Deemed to be university), Coimbatore, India

²Department of CSE, Kakatiya Institute of Technology and Science, Warangal, India

³Karunya Institute of Technology and Sciences (Deemed to be University), Coimbatore, India

Tóm tắt

Text document classification and clustering is an important learning task which fits to both data mining and machine learning areas. The learning task throws several challenges when it is required to process high dimensional text documents. Word distribution in text documents plays a very key role in learning process. Research related to high dimensional text document classification and clustering is usually limited to application of traditional distance functions and most of the research contributions in the existing literature did not consider the word distribution in documents. In this research, we propose a novel similarity function for feature pattern clustering and high dimensional text classification. The similarity function proposed is used to carry supervised learning based dimensionality reduction. The important feature of this work is that the word distribution before and after dimensionality reduction is the same. Experiment results prove the proposed approach achieves dimensionality reduction, retains the word distribution and obtained better classification accuracies compared to other measures.

Tài liệu tham khảo

Abualigah, L. M., Khader, A. T., Al-Betar, M. A., & Alomari, O. A. (2017). Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering. Expert Systems with Applications, 84, 24–36.

Adeli, E., Shi, F., An, L., Wee, C. Y., Wu, G. R., Wu, T., et al. (2016). Joint feature-sample selection and robust diagnosis of Parkinson’s disease from mri data. Neuroimage, 141, 206–219. https://doi.org/10.1016/j.neuroimage.2016.05.054.

Aggarwal, C. (2007). Data streams models and algorithms. Cham: Springer.

Aggarwal, C., Han, J., Wang, J., & Yu, P. (2003). A framework for clustering evolving data streams. In VLDB conference.

Aljawarneh, S., Radhakrishna, V., Kumar, P. V., & Janaki, V. (2016). A similarity measure for temporal pattern discovery in time series data generated by IoT. In 2016 international conference on engineering & MIS (ICEMIS), Agadir (pp. 1–4).

Aljawarneh, S. A., Radhakrishna, V., & Cheruvu, A. (2017a). Extending the Gaussian membership function for finding similarity between temporal patterns. In 2017 international conference on engineering & MIS (ICEMIS), Monastir (pp. 1–6).

Aljawarneh, S. A., Radhakrishna, V., Kumar, P. V., & Janaki, V. (2017b). G-SPAMINE: An approach to discover temporal association patterns and trends in internet of things. Future Generation Computer Systems, 74, 430–443. https://doi.org/10.1016/j.future.2017.01.013.

Aljawarneh, S. A., & Vangipuram, R. (2018). GARUDA: Gaussian dissimilarity measure for feature representation and anomaly detection in Internet of things. The Journal of Super Computing. https://doi.org/10.1007/s11227-018-2397-3.

Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. In Proceedings of PODS.

Bartell, B. T., Cottrell, G. W., & Belew, R. K. (1992). Latent semantic indexing is an optimal special case of multidimensional scaling. In Proceedings of SIGIR, ACM, USA (pp. 161–167).

Berka, T., & Vajteršic, M. (2011). Dimensionality reduction for information retrieval using vector replacement of rare terms. In Proceedings of TM.

Berka, T., & Vajteršic, M. (2013). Parallel rare term vector replacement: Fast and effective dimensionality reduction for text. Journal of Parallel and Distributed Computing, 73(3), 341–351.

Bharti, K. K., & Singh, P. K. (2014). A three-stage unsupervised dimension reduction method for text clustering. Journal of Computational Science, 5(2), 156–169.

Bingham, E., & Mannila, H. (2001). Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’01) (pp. 245–250). New York, NY: ACM. http://dx.doi.org/10.1145/502512.502546D.

Chang, J. H., & Lee, W. S. (2005). estWin: Online data stream mining of recent frequent item sets by sliding window method. Journal of Information Science, 31(2), 7690.

Charikar, M., Callaghan, L., & Panigrahy, R. (2003). Better streaming algorithms for clustering problems. In Proceedings of 35th ACM symposium on theory of computing.

Chen, Y. C., Peng, W. C., & Lee, S. Y. (2015). Mining temporal patterns in time interval-based data. IEEE Transactions on Knowledge and Data Engineering, 27(12), 3318–3331.

Cox, T. F., & Cox, M. A. A. (2001). Multidimensional scaling. Boca Raton: Chapman & Hall.

Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2004). Towards an adaptive approach for mining data streams in resource constrained environments. In Proceedings of sixth international conference on data warehousing and knowledge discovery. Lecture Notes in Computer Science (LNCS). Cham: Springer.

Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2005). Mining data streams—A review. SIGMODC Record, 34(2), 18–26.

Gama, J. (2013). Knowledge discovery from databases. Boca Raton: CRC Press.

Ganguly, D., Leveling, J., & Jones, G. J. F. (2015). Context-driven dimensionality reduction for clustering text documents. In P. Majumder, M. Mitra, M. Agrawal, & P. Mehta (Eds.), Proceedings of the 7th forum for information retrieval evaluation (FIRE’15) (pp. 1–7). New York, NY: ACM.

Han, J., Kamber, M., & Pei, J. (Eds.). (2012a). Advanced cluster analysis. In The morgan kaufmann series in data management systems, data mining (3rd ed., pp. 497–541). Morgan Kaufmann. https://doi.org/10.1016/B978-0-12-381479-1.00011-3.

Han, J., Kamber, M., & Pei, J. (Eds.). (2012b). Classification: Basic concepts. In The morgan kaufmann series in data management systems, data mining (3rd ed., pp. 327–391). Morgan Kaufmann. https://doi.org/10.1016/B978-0-12-381479-1.00008-3.

Hanneke, S. (2016). The optimal sample complexity OF PAC learning. Journal of Machine Learning Research, 17(38), 1–15.

He, C., Dong, Z., Li, R., & Zhong, Y. (2008). Dimensionality reduction for text using LLE. In 2008 international conference on natural language processing and knowledge engineering, Beijing (pp. 1–7).

Hyvarinen, A., Karhunen, J., & Oja, E. (2004). Independent component analysis. Hoboken: Wiley.

Jiang, J. Y., Cheng, W. H., Chiou, Y. S., & Lee, S. J. (2011a). A similarity measure for text processing. In 2011 international conference on machine learning and cybernetics, Guilin (pp. 1460–1465). https://doi.org/10.1109/icmlc.2011.6016998.

Jiang, J. Y., Liou, R. J., & Lee, S. J. (2011b). A fuzzy self-constructing feature clustering algorithm for text classification. IEEE Transactions on Knowledge and Data Engineering, 23(3), 335–349. https://doi.org/10.1109/TKDE.2010.122.

Johnson, R. A., & Wichern, D. W. (2007). Applied multivariate statistical analysis (6th ed.). Upper Saddle River: Prentice Hall.

Lin, Y. S., Jiang, J. Y., & Lee, S. J. (2014). A similarity measure for text classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 26(7), 1575–1590. https://doi.org/10.1109/TKDE.2013.19.

Lin, Y., et al. (2013). A similarity measure for text classification and clustering. IEEE Transactions of Knowledge and Data Engineering, 26, 1575–1590.

Mallick, K., & Bhattacharyya, S. (2012). Uncorrelated local maximum margin criterion: An efficient dimensionality reduction method for text classification. Procedia Technology, 4, 370–374.

Neagoe, V. E., & Neghina, E. C. (2016). Feature selection with ant colony optimization and its applications for pattern recognition in space imagery. In IEEE ICC.

Paatero, P., & Tapper, U. (1994). Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2), 111–126.

Pang, G., Jin, H., & Jiang, S. (2013). An effective class-centroid-based dimension reduction method for text classification. In Proceedings of the 22nd international conference on World Wide Web (WWW’13 Companion) (pp. 223–224). New York, NY: ACM.

Phridviraj, M. S. B., Srinivas, C., & GuruRao, C. V. (2014). Clustering text data streams. A tree based approach with ternary function and ternary feature vector. Procedia Computer Science, 31, 976–984.

Radhakrishna, V., Aljawarneh, S. A., Janaki, V., & Kumar, P. V. (2017b). Looking into the possibility for designing normal distribution based dissimilarity measure to discover time profiled association patterns. In 2017 international conference on engineering & MIS (ICEMIS), Monastir (pp. 1–5).

Radhakrishna, V., Aljawarneh, S. A., Kumar, P. V, et al. (2017c). ASTRA—A novel interest measure for unearthing latent temporal associations and trends through extending basic Gaussian membership function. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-017-5280-y.

Radhakrishna, V., Aljawarneh, S. A., Kumar, P. V., & Choo, K.-K. R. (2016a). A novel fuzzy Gaussian-based dissimilarity measure for discovering similarity temporal association patterns. Soft Computing, 1, 1. https://doi.org/10.1007/s00500-016-2445-y.

Radhakrishna, V., Aljawarneh, S. A., Kumar, P. V., & Janaki, V. (2017a). A novel fuzzy similarity measure and prevalence estimation approach for similarity profiled temporal association pattern mining. Future Generation Computer Systems, 1, 1. https://doi.org/10.1016/j.future.2017.03.016.

Radhakrishna, V., Kumar, P. V., Aljawarneh, S. A., & Janaki, V. (2017e). Design and analysis of a novel temporal dissimilarity measure using Gaussian membership function. In 2017 international conference on engineering & MIS (ICEMIS), Monastir (pp. 1–5).

Radhakrishna, V., Kumar, P. V., & Janaki, V. (2015). An approach for mining similarity profiled temporal association patterns using Gaussian based dissimilarity measure. In Proceedings of the international conference on engineering & MIS 2015 (ICEMIS’15).

Radhakrishna, V., Kumar, P. V., & Janaki, V. (2016b). A computationally optimal approach for extracting similar temporal patterns. In 2016 international conference on engineering & MIS (ICEMIS), Agadir (pp. 1–6).

Radhakrishna, V., Kumar, P. V., & Janaki, V. (2016c). Looking into the possibility of novel dissimilarity measure to discover similarity profiled temporal association patterns in IoT. In 2016 international conference on engineering & MIS (ICEMIS), Agadir (pp. 1–6).

Radhakrishna, V., Kumar, P. V., & Janaki, V. (2017a). A computationally efficient approach for mining similar temporal patterns. In Matoušek R (Eds.), Recent advances in soft computing. ICSC-MENDEL 2016. Advances in intelligent systems and computing (Vol. 576). Springer, Cham.

Radhakrishna, V., Kumar, P. V., & Janaki, V. (2017e). SRIHASS—A similarity measure for discovery of hidden time profiled temporal associations. Multimedia Tools and Applications.. https://doi.org/10.1007/s11042-017-5185-9.

Radhakrishna, V., Kumar, P. V., & Janaki, V. (2017g). Design and analysis of similarity measure for discovering similarity profiled temporal association patterns. IADIS International Journal on Computer Science and Information Systems, 12(1), 45–60.

Radhakrishna, V., Kumar, P. V., & Janaki, V. (2017h). Normal distribution based similarity profiled temporal association pattern mining (N-SPAMINE). Database Systems Journal, 7(3), 22–33.

Radhakrishna, V., Kumar, P. V., & Janaki, V. (2018). Krishna Sudarsana: A Z-space similarity measure. In Proceedings of the fourth international conference on engineering & MIS 2018 (ICEMIS’18). New York, NY: ACM, Article 44, 4 pp.

Radhakrishna, V., Kumar, P. V., Janaki, V., & Aljawarneh, S. (2016d). A similarity measure for outlier detection in timestamped temporal databases. In 2016 international conference on engineering & MIS (ICEMIS), Agadir (pp. 1–5).

Radhakrishna, V., Kumar, P. V., Janaki, V., & Aljawarneh, S. (2016e). A computationally efficient approach for temporal pattern mining in IoT. In 2016 international conference on engineering & MIS (ICEMIS), Agadir (pp. 1–4).

Radhakrishna, V., Kumar, P. V., Janaki, V., & Cheruvu, A. (2017i). A dissimilarity measure for mining similar temporal association patterns. IADIS International Journal on Computer Science and Information Systems, 12(1), 126–142.

Sammulal, P., Usha Rani, Y., & Yepuri, A. (2017). A class based clustering approach for imputation and mining of medical records (CBC-IM). IADIS International Journal on Computer Science & Information Systems, 12(1), 61–74.

SureshReddy, G., Rajinikanth, T. V., & Ananda Rao, A. (2014). Design and analysis of novel similarity measure for clustering and classification of high dimensional text documents. In B. Rachev & A. Smrikarov (Eds.), Proceedings of the 15th international conference on computer systems and technologies (CompSysTech’14) (pp. 194–201). New York, NY: ACM. http://dx.doi.org/10.1145/2659532.2659615.

Tatbul, N., & Zdonik, S. (2006). A subset-based load shedding approach for aggregation queries over data streams. In Proceedings of international conference on very large data bases (VLDB).

Tsai, S. C., Jiang, J. Y., Wu, C., & Lee, S. J. (2009). A fuzzy similarity-based approach for multi-label document classification. In 2009 second international workshop on computer science and engineering, Qingdao (pp. 59–63). https://doi.org/10.1109/wcse.2009.766.

Uguz, H. (2012). A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowledge-Based Systems, 24, 1024–1032.

Unler, A., Murat, A., & Chinnam, R. (2011). mr2pso: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification. Information Sciences, 181, 4625–4641.

Usha Rani, Y., & Sammulal, P. (2017). An approach for imputation of medical records using novel similarity measure. In R. Matoušek (Eds.), Recent advances in soft computing. ICSC-MENDEL 2016. Advances in Intelligent Systems and Computing (Vol. 5). Cham: Springer.

Usha Rani, Y., Sammulal, P., & Golla, M. (2018). An efficient approach for imputation and classification of medical data values using class-based clustering of medical records. Computers & Electrical Engineering, 66, 487–504.

VinayKumar, K., Srinivasan, R., & Singh, E. B. (2015). A feature clustering approach for dimensionality reduction and classification. In R. Matoušek (Eds.), Mendel 2015. ICSC-MENDEL 2016. Advances in intelligent systems and computing (Vol. 378). Cham: Springer.

Xu, X., Liang, T., Zhu, J., Zheng, D., & Sun, T. (2018). Review of classical dimensionality reduction and sample selection methods for large-scale data processing. Neurocomputing. ISSN 0925-2312.

Scholar Hub - Công cụ hỗ trợ trích dẫn và phân tích khoa học Việt Nam

Về chúng tôi

Scholar Hub là công cụ hỗ trợ trích dẫn và phân tích các bài báo, công bố khoa học Việt Nam. Công cụ trợ giúp người nghiên cứu, tạp chí, đơn vị nghiên cứu tra cứu, phân tích và thống kê dữ liệu nghiên cứu khoa học tại Việt Nam và quốc tế.
ScholarHub KHÔNG đăng thông tin tổng hợp, KHÔNG đăng lại nội dung từ các trang báo chí Việt Nam hoặc trang thông tin điện tử khác tại Việt Nam.

Thông tin, cập nhật

Đăng ký Tạp chí tham gia vào Scholar Hub

Phản hồi ý kiến về Scholar Hub

Bài viết, nội dung cập nhật

Chủ đề khoa học

Website liên kết

Hệ thống CSDL Khoa học & Công nghệ

Phần mềm kiểm tra trùng lặp Kiểm Tra Tài Liệu

Phần mềm xuất bản tạp chí điện tử VOJS

Nền tảng trắc nghiệm và đề thi đa lĩnh vực LetQA