Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec

Information Sciences - Volume 477 - Pages 15-29 - 2019
Dong‐Hwa Kim¹, Deokseong Seo¹, Suhyoun Cho¹, Pilsung Kang¹
¹School of Industrial Management Engineering, Korea University, Seoul, Republic of Korea

References

Amin, 2014, Customer churn prediction in telecommunication industry: With and without counter-example, 134

Aung, 2009, Random forest classifier for multi-category classification of web pages, 372

Bíró, 2008, Latent Dirichlet allocation in web spam filtering, 29

Blei, 2003, Latent Dirichlet allocation, J. Mach. Learn. Res., 3, 993

Blum, 1998, Combining labeled and unlabeled data with co-training, 92

Borko, 1963, Automatic document classification, J. ACM (JACM), 10, 151, 10.1145/321160.321165

Bouguelia, 2013, A stream-based semi-supervised active learning approach for document classification, 611

Chapelle, 2010

Druck, 2007, Semi-supervised classification with hybrid generative/discriminative methods, 280

Glorot, 2011, Domain adaptation for large-scale sentiment classification: A deep learning approach, 513

Go, 2009, Twitter sentiment classification using distant supervision, CS224N Project Report, Stanford, 1, 12

Harish, 2010, Representation and classification of text documents: a brief review, IJCA, Special Issue on RTIPPR, 110

Khan, 2010, A review of machine learning algorithms for text-documents classification, J. Adv. Inf. Technol., 1, 4

Kim, 2006, Some effective techniques for naive Bayes text classification, IEEE Trans. Knowl. Data Eng., 18, 1457, 10.1109/TKDE.2006.180

Lau, 2016, An empirical evaluation of doc2vec with practical insights into document embedding generation, arXiv:1607.05368

Le, 2014, Distributed representations of sentences and documents, 14, 1188

Mikolov, 2013, Efficient estimation of word representations in vector space, arXiv:1301.3781

Nigam, 2000, Analyzing the effectiveness and applicability of co-training, 86

Nigam, 2000, Text classification from labeled and unlabeled documents using EM, Mach. Learn., 39, 103, 10.1023/A:1007692713085

Pang, 2002, Thumbs up?: sentiment classification using machine learning techniques, 79

Qiu, 2014, Collapsed Gibbs sampling for latent Dirichlet allocation on Spark, J. Mach. Learn. Res., 36, 17

Ranjan, 2017, Document classification using LSTM neural network, J. Data Mining Manage., 2

Robertson, 2004, Understanding inverse document frequency: on theoretical arguments for IDF, J. Document., 60, 503, 10.1108/00220410410560582

Rosenberg, 2005, Semi-supervised self-training of object detection models, 29

Sabbah, 2017, Modified frequency-based term weighting schemes for text classification, Appl. Soft Comput., 58, 193, 10.1016/j.asoc.2017.04.069

Tong, 2001, Support vector machine active learning with applications to text classification, J. Mach. Learn. Res., 2, 45

Wang, 2012, Semi-supervised latent Dirichlet allocation and its application for document classification, 306

Xing, 2014, Document classification with distributions of word vectors, 1

Xu, 2016, Bayesian naïve Bayes classifiers to text classification, J. Inf. Sci.

Yun-tao, 2005, An improved TF-IDF approach for text classification, J. Zhejiang Univ. Sci. A, 6, 49, 10.1631/jzus.2005.A0049

Zhang, 2011, A comparative study of TF*IDF, LSI and multi-words for text classification, Expert. Syst. Appl., 38, 2758, 10.1016/j.eswa.2010.08.066

Zhu, 2005