Classification of heterogeneous text data for robust domain-specific language modeling

Ján Staš, Jozef Juhár, Daniel Hládek
Department of Electronics and Multimedia Communications, Technical University of Košice, Park Komenského 13, 041 20, Košice, Slovakia

Abstract

Keywords


References

Juhár J, Staš J, Hládek D: Recent progress in development of language model for Slovak large vocabulary continuous speech recognition. In New Technologies - Trends, Innovations and Research. Edited by: Volosencu C. InTech Open Access, Rijeka; 2012:261-276.

Juhár J, Trnka M, Darjaa S, Hládek D, Sabo R, Pleva M, Rusko M: Recent advances in the Slovak dictation system for the judicial domain. In Proceedings of the 6th Language and Technology Conference on HLT. Poznań, LTC; 2013:555-560.

Huang A: Similarity measures for text document clustering. In Proceedings of the 6th New Zealand Computer Science Research Student Conference. Christchurch, NZCSRSC; 2008:49-56.

Yue L, Xiao S, Lv X, Wang T: Topic detection based on keyword. In Proceedings of 2011 International Conference on Mechatronic Science, Electric Engineering and Computer. Jilin, MEC; 2011:464-467.

Manning CD, Raghavan P, Schütze H: Introduction to Information Retrieval. Cambridge: Cambridge University Press; 2009.

Peng F, Schuurmans D, Wang S: Augmenting naïve Bayes classifiers with statistical language models. Inf. Retr. 2004, 7(3–4):317-345.

Tan S: An effective refinement strategy for KNN text classifier. Expert Syst. Appl. 2006, 30(2):290-298. 10.1016/j.eswa.2005.07.019

Remeikis N, Skučas I, Melninkaitė V: Text categorization using neural networks initialized with decision trees. Informatica 2004, 15(4):551-564.

Joachims T: Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on ML. Chemnitz, ECML; 1998:137-142.

Zhang W, Yoshida T, Tang X: Text classification using semi-supervised clustering. In Proceedings of the 2nd International Conference on Business Intelligence and Financial Engineering. Beijing, BIFE; 2009:197-200.

Darjaa S, Cerňak M, Trnka M, Rusko M: Effective triphone mapping for acoustic modeling in speech recognition. In Proceedings of INTERSPEECH 2011. Florence, INTERSPEECH; 2011:1717-1720.

Pleva M, Juhár J: Building of broadcast news database for evaluation of the automated subtitling service. Communications 2013, 15(2A):124-128.

Hládek D, Juhár J, Staš J: Dagger: the Slovak morphological classifier. In Proceedings of the 54th International Symposium ELMAR 2012. Zadar, ELMAR; 2012:195-198.

Garabík R: Slovak morphology analyzer based on Levenshtein edit operations. In Proceedings of the 1st Workshop on Intelligent and Knowledge Oriented Technologies. Bratislava, WIKT; 2006:2-5.

Hládek D, Juhár J, Ološtiak M, Staš J: Automatic extraction of multiword units from Slovak text corpora. In Proceedings of the 7th International Conference on Natural Language Processing, Corpus Linguistics and E-learning. Bratislava, SLOVKO; 2013:228-237.

Reed JW, Jiao Y, Potok TE, Klump BA, Elmore MT, Hurson AR: TF-ICF: a new term weighting scheme for clustering dynamic data sets. In Proceedings of the 5th International Conference on Machine Learning and Applications. Orlando, ICMLA; 2006:258-263.

Zlacký D, Staš J, Juhár J, Čižmár A: Term weighting schemes for Slovak text document clustering. J. Electr. Electron. Eng. 2013, 6:163-166.

Jin R, Faloutsos C, Hauptmann AG: Meta-scoring: automatically evaluating term weighting schemes in IR without precision-recall. In Proceedings of the 24th Annual International ACM Conference on Research and Development in Information Retrieval. New Orleans, USA, SIGIR ACM, New York; 2001:83-89.

Robertson SE, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M: Okapi at TREC-3. In Proceedings of the 3rd Text Retrieval Conference. Gaithersburg, TREC-3; 1996:109-126.

Whissell JS, Clarke CLA: Improving document clustering using Okapi BM25 feature weighting. Inf. Retr. 2011, 14(5):466-487. 10.1007/s10791-011-9163-y

Singhal A: AT&T at TREC-6. In Proceedings of the 6th Text Retrieval Conference. Gaithersburg, TREC-6; 1998:215-226.

Lee S, Song J, Kim Y: An empirical comparison of four text mining methods. J. Comp. Inf. Sys. 2010, 51(1):1-10.

Cha SH: Comprehensive survey on distance/similarity measures between probability density functions. Intl. J. Math. Model. Methods Appl. Sci. 2007, 1(4):300-307.

Rosin PL: Edges: saliency measures and automatic thresholding. Technical Note No. I.95.58. Institute for Remote Sensing Applications; 1995.

Lee A, Kawahara T: Recent development of open-source speech recognition engine Julius. In Proceedings of the 2009 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Sapporo, APSIPA ASC; 2009:131-137.

Stolcke A, Zheng J, Wang W, Abrash V: SRILM at sixteen: update and outlook. In Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop. Waikoloa, ASRU; 2011 (5 pages).