Latent entity space: a novel retrieval approach for entity-bearing queries

Springer Science and Business Media LLC - Volume 18 - Pages 473-503 - 2015
Xitong Liu1, Hui Fang1
1Department of Electrical and Computer Engineering, University of Delaware, Newark, USA

Abstract

Analysis of Web search query logs has revealed that a large portion of queries are entity-bearing, reflecting users' increasing demand for relevant information about entities such as persons, organizations, and products. Meanwhile, significant progress has been made in Web-scale information extraction, which enables efficient entity extraction from free text. Since an entity is expected to capture the semantic content of documents and queries more accurately than a term, it is worth studying whether leveraging information about entities can improve retrieval accuracy for entity-bearing queries. In this paper, we propose a novel retrieval approach, latent entity space (LES), which models relevance by leveraging entity profiles to represent the semantic content of documents and queries. In LES, each entity corresponds to one dimension, representing one semantic relevance aspect. We propose a formal probabilistic framework to model relevance in the high-dimensional entity space. Experimental results over TREC collections show that the proposed LES approach is effective in capturing latent semantic content and can significantly improve the search accuracy of several state-of-the-art retrieval models for entity-bearing queries.
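The abstract does not spell out the scoring formula, so the following is only a minimal sketch of how relevance might be computed in an entity space, not the authors' implementation. It assumes a mixture of the form score(q, d) = sum over entities e of P(q | e) * P(e | d), where each entity profile is a Dirichlet-smoothed unigram language model. The smoothing parameter `mu`, the length-normalised projection in `project`, and the toy profiles and documents are all illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_background(texts):
    """Collection-level term probabilities, used for smoothing."""
    counts = Counter(t for text in texts for t in tokenize(text))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def profile_lm(profile_text, background, mu=5.0):
    """Dirichlet-smoothed unigram model of one entity profile (assumed form)."""
    counts = Counter(tokenize(profile_text))
    total = sum(counts.values())

    def prob(term):
        # Unseen terms fall back to a small floor so log() stays defined.
        return (counts[term] + mu * background.get(term, 1e-6)) / (total + mu)

    return prob


def log_likelihood(text, lm):
    return sum(math.log(lm(t)) for t in tokenize(text))


def project(text, entity_lms):
    """Map a text to a distribution over entities, one dimension per entity.

    Assumption: P(e | text) is proportional to the per-term geometric mean
    likelihood of the text under the entity's profile model.
    """
    n = max(len(tokenize(text)), 1)
    weights = {e: math.exp(log_likelihood(text, lm) / n)
               for e, lm in entity_lms.items()}
    z = sum(weights.values())
    return {e: w / z for e, w in weights.items()}


def les_score(query, doc, entity_lms):
    """Assumed mixture: sum over entities of P(q | e) * P(e | d)."""
    p_e_d = project(doc, entity_lms)
    return sum(math.exp(log_likelihood(query, lm)) * p_e_d[e]
               for e, lm in entity_lms.items())


# Hypothetical toy data, for illustration only.
profiles = {
    "apple_inc": "apple technology company iphone mac cupertino",
    "apple_fruit": "apple fruit tree orchard sweet",
}
docs = {
    "d1": "new iphone released by the company in cupertino",
    "d2": "orchard owners harvest sweet fruit from every tree",
}
background = build_background(list(profiles.values()) + list(docs.values()))
entity_lms = {e: profile_lm(text, background) for e, text in profiles.items()}

query = "apple iphone"
scores = {name: les_score(query, text, entity_lms) for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this sketch neither document contains the query term "apple", yet the document about the company can still be ranked above the one about orchards, because both the query and that document place most of their weight on the same entity dimension. A score like this would typically be combined with a term-based retrieval score rather than used alone.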
