Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval
Abstract
As retrieval systems become more complex, learning to rank approaches are being developed to automatically tune their parameters. Using online learning to rank, retrieval systems can learn directly from implicit feedback inferred from user interactions. In such an online setting, algorithms must obtain feedback for effective learning while simultaneously utilizing what has already been learned to produce high-quality results. We formulate this challenge as an exploration–exploitation dilemma and propose two methods for addressing it. By adding mechanisms for balancing exploration and exploitation during learning, each method extends a state-of-the-art learning to rank method, one based on listwise learning and the other on pairwise learning. Using a recently developed simulation framework that allows assessment of online performance, we empirically evaluate both methods. Our results show that balancing exploration and exploitation can substantially and significantly improve the online retrieval performance of both listwise and pairwise approaches. In addition, the results demonstrate that such a balance affects the two approaches in different ways, especially when user feedback is noisy, yielding new insights relevant to making online learning to rank effective in practice.
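To make the notion of balancing concrete, the sketch below illustrates one simple way a result list shown to users could mix documents from an exploitative ranking (produced by the current best ranker) and an exploratory ranking (produced by a candidate being evaluated), with a single probability controlling the balance. This is an illustration only, not the specific listwise or pairwise mechanisms evaluated in the paper; the function name blend_rankings and the explore_prob parameter are assumptions made for this example.

import random

def blend_rankings(exploit_list, explore_list, explore_prob, length):
    """Fill each rank position from the exploratory ranking with probability
    explore_prob and from the exploitative ranking otherwise, skipping
    documents that have already been placed in the result list."""
    result = []
    for _ in range(length):
        if random.random() < explore_prob:
            first, second = explore_list, exploit_list
        else:
            first, second = exploit_list, explore_list
        # Take the highest-ranked unused document from the chosen list;
        # fall back to the other list if it is exhausted.
        for candidate in (first, second):
            doc = next((d for d in candidate if d not in result), None)
            if doc is not None:
                result.append(doc)
                break
    return result

# explore_prob = 0.0 reproduces the purely exploitative ranking;
# larger values inject more documents from the exploratory ranking.
exploit = ["d1", "d2", "d3", "d4", "d5"]
explore = ["d9", "d8", "d7", "d6", "d1"]
print(blend_rankings(exploit, explore, explore_prob=0.3, length=5))

Setting explore_prob closer to 1 gathers more feedback about the exploratory ranker at the cost of immediate result quality, which is the trade-off the abstract describes.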