A segment-based approach to clustering multi-topic documents
Abstract
Document clustering is widely recognized as a central problem in text data management. The problem becomes particularly challenging when documents contain subtopical discussions that are not necessarily related to one another. Existing document clustering methods have traditionally treated a document as an indivisible unit for text representation and similarity computation, an assumption that may be inappropriate for documents covering multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents into text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We provide empirical evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it with conventional document clustering methods.
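The general idea of clustering segments first and then deriving a document organization from the segment groups can be illustrated with a small sketch. It is not the framework proposed in the paper, whose details the abstract does not give. It assumes that blank-line-separated paragraphs stand in for topically coherent segments, that segments are compared by cosine similarity over raw term counts, and that a simple single-pass (leader) clustering with a hypothetical `threshold` parameter replaces whatever clustering algorithm the authors use. All function names are illustrative.

```python
import math
import re
from collections import Counter


def split_segments(text):
    """Assumption: blank-line-separated paragraphs approximate topical segments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def bag_of_words(segment):
    """Term-frequency vector over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", segment.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_segments(vectors, threshold=0.3):
    """Single-pass clustering (a stand-in technique, not the paper's).

    Each segment joins the most similar existing cluster if the similarity
    to that cluster's summed term vector reaches `threshold`; otherwise it
    starts a new cluster.
    """
    centroids, labels = [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            centroids[best] = centroids[best] + vec  # accumulate term counts
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels


def cluster_documents(docs, threshold=0.3):
    """Map each document id to the set of segment clusters it touches."""
    owners, vectors = [], []
    for doc_id, text in docs.items():
        for segment in split_segments(text):
            owners.append(doc_id)
            vectors.append(bag_of_words(segment))
    labels = cluster_segments(vectors, threshold)
    assignment = {doc_id: set() for doc_id in docs}
    for doc_id, label in zip(owners, labels):
        assignment[doc_id].add(label)
    return assignment


# Toy collection: d1 discusses two unrelated subtopics, d2 and d3 one each.
docs = {
    "d1": "stock market prices fell as stock traders sold shares\n\n"
          "the football team won the football match",
    "d2": "traders bought shares as stock market prices rose",
    "d3": "the football match ended and the team lost the match",
}
example = cluster_documents(docs)
```

In this toy collection the two paragraphs of `d1` fall into different segment clusters, so `d1` is associated with both the finance group and the sports group, while `d2` and `d3` each belong to one. A whole-document representation would instead force `d1` into a single cluster, which is the limitation the abstract describes.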