Distance matters! Cumulative proximity expansions for ranking documents
Abstract
In information retrieval, functions that rank documents according to their estimated relevance to a query typically treat query terms as independent. However, it is often the joint presence of query terms that interests the user, and this is overlooked when terms are matched independently. One feature that can express the relatedness of co-occurring terms is their proximity in text. In past research, models trained on the proximity information in a collection have performed better than models that are not estimated on data. We analyzed how co-occurring query terms can be used to estimate the relevance of documents based on their distance in text. The result is used to extend a unigram ranking function with a proximity model that accumulates the scores of all occurring term combinations. This proximity model is more practical than existing models: it does not require any co-occurrence statistics, it obviates the need to tune additional parameters, and its retrieval speed is close to that of competing models. We show that this approach is more robust than existing models on both Web and newswire corpora, and that on average it performs as well as or better than existing proximity models across collections.
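The abstract does not state the scoring formula, so the following is only a minimal sketch of the general idea: a unigram score extended with a proximity score that accumulates a contribution from every co-occurrence of two distinct query terms. The inverse-distance weight `1 / distance`, the restriction to term pairs, the relative-frequency unigram score, and all function names are assumptions made for illustration, not the authors' model.

```python
from itertools import combinations


def term_positions(tokens, query_terms):
    """Map each query term to the list of token offsets where it occurs."""
    positions = {term: [] for term in query_terms}
    for offset, token in enumerate(tokens):
        if token in positions:
            positions[token].append(offset)
    return positions


def proximity_score(tokens, query_terms):
    """Accumulate an inverse-distance weight over all co-occurring term pairs.

    Assumption: each pair of occurrences of two distinct query terms
    contributes 1 / distance, so closer occurrences contribute more.
    """
    positions = term_positions(tokens, set(query_terms))
    score = 0.0
    for term_a, term_b in combinations(sorted(positions), 2):
        for i in positions[term_a]:
            for j in positions[term_b]:
                # Distinct terms never share an offset, so the distance is >= 1.
                score += 1.0 / abs(i - j)
    return score


def unigram_score(tokens, query_terms):
    """Placeholder unigram score: relative frequency of the query terms."""
    if not tokens:
        return 0.0
    return sum(tokens.count(term) for term in set(query_terms)) / len(tokens)


def rank_score(tokens, query_terms):
    """Unigram score extended with the accumulated proximity score."""
    return unigram_score(tokens, query_terms) + proximity_score(tokens, query_terms)


query = ["information", "retrieval"]
near = "information retrieval systems rank documents".split()
far = "information about systems that support document retrieval".split()
```

In this sketch, `near` receives a higher `rank_score` than `far` for the same query, because its two query terms are adjacent rather than six tokens apart. No co-occurrence statistics from the collection and no tunable weights are involved, which mirrors the practical properties the abstract claims, although the actual model may weight distances differently.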