Parallel rare term vector replacement: Fast and effective dimensionality reduction for text

Journal of Parallel and Distributed Computing - Volume 73 - Pages 341-351 - 2013
T. Berka¹, M. Vajteršic¹,²
¹Department of Computer Sciences, University of Salzburg, Salzburg, Austria
²Department of Informatics, Mathematical Institute, Slovak Academy of Sciences, Bratislava, Slovak Republic

References

Aggarwal, 2010, The generalized dimensionality reduction problem, 607
Aizerman, 1964, Theoretical foundations of the potential function method in pattern recognition learning, Automat. Remote Control, 25, 821
Bartell, 1992, Latent semantic indexing is an optimal special case of multidimensional scaling, 161
T. Berka, M. Vajteršic, Dimensionality reduction for information retrieval using vector replacement of rare terms, in: Proc. TM, 2011.
Berry, 1993, Massively-parallel implementations of Lanczos algorithms for computing the SVD of large sparse matrices, 437
Berry, 1999, Matrices, vector spaces, and information retrieval, SIAM Rev., 41, 335, 10.1137/S0036144598347035
Berry, 2006, vol. 184, 117
Campoy, 2009, Dimensionality reduction by self organizing maps that preserve distances in output space, 2976
Cancho, 2003, Least effort and the origins of scaling in human language, Proc. Natl. Acad. Sci. USA, 100, 788, 10.1073/pnas.0335980100
Chen, 2009, Lanczos vectors versus singular vectors for effective dimension reduction, IEEE Trans. Knowl. Data Eng., 21, 1091, 10.1109/TKDE.2008.228
Cox, 2001
Cuzzocrea, 2006, Accuracy control in compressed multidimensional data cubes for quality of answer-based OLAP tools, 301
Cuzzocrea, 2006, A hierarchy-driven compression technique for advanced OLAP visualization of multidimensional data cubes, vol. 4081, 106
Deerwester, 1990, Indexing by latent semantic analysis, J. Soc. Inf. Sci., 41, 391, 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
Dhillon, 2001, Concept decompositions for large sparse text data using clustering, Mach. Learn., 42, 143, 10.1023/A:1007612920971
Eckart, 1936, The approximation of one matrix by another of lower rank, Psychometrika, 1, 211, 10.1007/BF02288367
Faloutsos, 1995, Fastmap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets, 163
E. Gallopoulos, D. Zeimpekis, CLSI: a flexible approximation scheme from clustered term-document matrices, in: Proc. SDM, 2005, pp. 631–635.
Hofmann, 1999, Probabilistic latent semantic indexing, 50
M.P. Holmes, A.G. Gray, C.L. Isbell, G. Tech, QUIC-SVD: fast svd using cosine trees, in: Proc. NIPS, 2009, pp. 673–680.
Hussain, 2010, Text categorization using word similarities based on higher order co-occurrences, 1
Hyvarinen, 2001
Janecek, 2010, Utilizing nonnegative matrix factorization for e-mail classification problems
Johnson, 2007
Jolliffe, 2002
Karypis, 2000, Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval, 12
Kobayashi, 2002, Matrix computations for information retrieval and major and outlier cluster detection, J. Comput. Appl. Math., 149, 119, 10.1016/S0377-0427(02)00524-1
Lan, 2005, A comprehensive comparative study on term weighting schemes for text categorization with support vector machines, 1032
Langville, 2008, Nonnegative matrix factorization for document classification, 339
Lewis, 2004, RCV1: a new benchmark collection for text categorization research, J. Mach. Learn. Res., 5, 361
Mao, 2007, The phrase-based vector space model for automatic retrieval of free-text medical documents, Data Knowl. Eng., 61, 76, 10.1016/j.datak.2006.02.008
MPI forum, MPI: a message-passing interface standard, Tech. Rep., Knoxville, TN, USA, 1994.
Paatero, 1994, Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 5, 111, 10.1002/env.3170050203
Powers, 1998, Applications and explanations of Zipf’s law, 151
Roweis, 2000, Nonlinear dimensionality reduction by locally linear embedding, Science, 290, 2323, 10.1126/science.290.5500.2323
Sakellaridi, 2008, Graph-based multilevel dimensionality reduction with applications to eigenfaces and latent semantic indexing, 194
Schölkopf, 1998, Nonlinear component analysis as a Kernel eigenvalue problem, Neural Comput., 10, 1299, 10.1162/089976698300017467
Tenenbaum, 2000, A global geometric framework for nonlinear dimensionality reduction, Science, 290, 2319, 10.1126/science.290.5500.2319
Tsatsaronis, 2009, A generalized vector space model for text retrieval based on semantic relatedness, 70
Wong, 1985, Generalized vector spaces model in information retrieval, 18