Probabilistic models in IR and their relationships
Abstract
A solid research path towards new information retrieval models is to further develop the theory behind existing models. A profound understanding of these models is therefore essential. In this paper, we revisit probability ranking principle (PRP)-based models, probability of relevance (PR) models, and language models, finding conceptual differences in their definitions and interrelationships. The probabilistic model of the PRP has not been explicitly defined previously, and doing so leads to the formulation of two distinct principles with different objectives. The first is the belief probability ranking principle (BPRP), which considers uncertain relevance between known documents and the current query. The second is the popularity probability ranking principle (PPRP), which considers the probability of relevance of documents among multiple queries with the same features. Our analysis shows that some of the discussed PR models implement the BPRP or the PPRP while others do not. For some models, however, parameter estimation is challenging. Finally, language models are often presented as related to PR models. We find, however, that language models differ from PR models in every aspect of a probabilistic model, and that the effectiveness of language models cannot be explained by the PRP.