Efficient query processing techniques for next-page retrieval

Springer Science and Business Media LLC - Tập 25 - Trang 27-43 - 2022
Joel Mackenzie1,2, Matthias Petri3, Alistair Moffat1
1The University of Melbourne, Melbourne, Australia
2University of Queensland, Brisbane, Australia
3Amazon Alexa, Manhattan Beach, USA

Tóm tắt

In top-k ranked retrieval the goal is to efficiently compute an ordered list of the highest scoring k documents according to some stipulated similarity function such as the well-known BM25 approach. In most implementation techniques a min-heap of size k is used to track the top scoring candidates. In this work we consider the question of how best to retrieve the second page of search results, given that a first page has already been computed; that is, identification of the documents at ranks $$k+1$$ to 2k for some query. Our goal is to understand what information is available as a by-product of the first-page scoring, and how it can be employed to accelerate the second-page computation, assuming that the second-page of results is required for only a fraction of the query load. We propose a range of simple, yet efficient, next-page retrieval techniques which are suitable for accelerating Document-at-a-Time mechanisms, and demonstrate their performance on three large text collections.

Tài liệu tham khảo

Allan, J., Carterette, B., Aslam, J. A., Pavlu, V., Dachev, B., & Kanoulas, E. (2007). Million query track 2007 overview. In Proceedings of the text retrieval conference (TREC). Allan, J., Aslam, J. A., Carterette, B., Pavlu, V., & Kanoulas, E. (2008). Million query track 2008 overview. In Proceedings of the text retrieval conference (TREC). Azzopardi, L., & Zuccon, G. (2016). Two scrolls or one click: A cost model for browsing search results. In Proceedings of the European conference on information retrieval (ECIR) (pp. 696–702). Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., & Zien, J. (2003). Efficient query evaluation using a two-level retrieval process. In Proceedings of the ACM international conference on information and knowledge management (CIKM) (pp. 426–434). Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009). Million query track 2009 overview. In Proceedings of the text retrieval conference (TREC). Chen, R.-C., Gallagher, L., Blanco, R., & Culpepper, J. S. (2017). Efficient cost-aware cascade ranking in multi-stage retrieval. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 445–454). Costa, M., & Silva, M. A. (2010). Search log analysis of a Portuguese web search engine. In INForum – Simpósio de Informática (pp. 525–536). Crane, M., Culpepper, J. S., Lin, J., Mackenzie, J., & Trotman, A. (2017). A comparison of Document-at-a-Time and Score-at-a-Time query evaluation. In Proceedings of the ACM international conference on web search and data mining (WSDM) (pp. 201–210). de Carvalho, L. L. S., de Moura, E. S., Daoud, C. M., & da Silva, A. S. (2015). Heuristics to improve the BMW method and its variants. Journal of Information and Data Management, 6(3), 178–191. Dhulipala, L., Kabiljo, I., Karrer, B., Ottaviano, G., Pupyrev, S., & Shalita, A. (2016). Compressing graphs and indexes with recursive graph bisection. In Proceedings of the ACM international conference on knowledge discovery and data mining (KDD)(pp. 1535–1544). Ding, S., & Suel, T. (2011). Faster top-\(k\) document retrieval using block-max indexes. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 993–1002). Fontoura, M., Josifovski, V., Liu, J., Venkatesan, S., Zhu, X., & Zien, J. (2011). Evaluation strategies for top-\(k\) queries over memory-resident inverted indexes. Proceedings of the Very Large Databases (VLDB), 4(12), 1213–1224. Jansen, B. J., & Spink, A. (2006). How are we searching the world wide web? A comparison of nine search engine transaction logs. Information Processing & Management, 42(1), 248–263. Jansen, B. J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: a study and analysis of user queries on the web. Information Processing & Management, 36(2), 207–227. Kamphuis, C., de Vries, A. P., Boytsov, L., & Lin, J. (2020). Which BM25 do you mean? A large-scale reproducibility study of scoring variants. In Proceedings of the European conference on information retrieval (ECIR) (pp 28–34). Kane, A., & Tompa, F. W. (2018). Split-lists and initial thresholds for WAND-based search. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 877–880). Kelly, D., & Azzopardi, L. (2015). How many results per page? A study of SERP size, search behavior and user experience. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 183–192). Lin, J., & Yang, P. (2019). The impact of score ties on repeatability in document ranking. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 1125–1128). Lin, J., Mackenzie, J., Kamphuis, C., Macdonald, C., Mallia, A., Siedlaczek, M., Trotman, A., & de Vries, A. (2020). Supporting interoperability between open-source search engines with the common index file format. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 2149–2152). Mackenzie, J., & Moffat, A. (2020). Examining the additivity of top-\(k\) query processing innovations. In Proceedings of the ACM international conference on information and knowledge management (CIKM) (pp. 1085–1094). Mackenzie, J., Mallia, A., Petri, M., Culpepper, J. S., & Suel, T. (2019). Compressing inverted indexes with recursive graph bisection: A reproducibility study. In Proceedings of the European conference on information retrieval (ECIR) (pp. 339–352). Mackenzie, J., Benham, R., Petri, M., Trippas, J. R., Culpepper, J. S., & Moffat, A. (2020). CC-News-En: A large English news corpus. In Proceedings of the ACM international conference on information and knowledge management (CIKM) (pp. 3077–3084). Mallia, A., Ottaviano, G., Porciani, E., Tonellotto, N., & Venturini, R. (2017). Faster BlockMax WAND with variable-sized blocks. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 625–634). Mallia, A., Siedlaczek, M., Mackenzie, J., & Suel, T. (2019a). PISA: Performant indexes and search for academia. Proceedings of the OSIRRC workshop at SIGIR, 2019, 50–56. Mallia, A., Siedlaczek, M., & Suel, T. (2019b). An experimental study of index compression and DAAT query processing methods. In Proceedings of the European conference on information retrieval (ECIR) (pp. 353–368). Mallia, A., Siedlaczek, M., Sun, M., & Suel, T. (2020). A comparison of top-k threshold estimation techniques for disjunctive query processing. In Proceedings of the ACM international conference on information and knowledge management (CIKM) (pp. 2141–2144). Mansouri, B., Zahedi, M. S., Campos, R., & Farhoodi, M. (2018). Online job search: Study of users’ search behavior using search engine query logs. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 1185–1188). Matveeva, I., Burges, C., Burkard, T., Laucius, A., & Wong, L. (2006). High accuracy retrieval with multiple nested ranker. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 437–444). Petri, M., Moffat, A., Mackenzie, J., Culpepper, J. S., & Beck, D. (2019). Accelerated query processing via similarity score prediction. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 485–494). Robertson, S. E., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3, 333–389. Silverstein, C., Marais, H., Henzinger, M., & Moricz, M. (1999). Analysis of a very large web search engine query log. SIGIR Forum, 33(1), 6–12. Spina, D., Maistro, M., Ren, Y., Sadeghi, S., Wong, W., Baldwin, T., Cavedon, L., Moffat, A., Sanderson, M., Scholer, F., & Zobel, J. (2017). Understanding user behavior in job and talent search: An initial investigation. In Proceedings of the SIGIR workshop on eCommerce. http://ceur-ws.org/Vol-2311/paper_2.pdf Spink, A., Wolfram, D., Jansen, B. J., & Saracevic, T. (2001). Searching the web: The public and their queries. Journal of the American Society for Information Science, 52(3), 226–234. Trotman, A., Jia, X.-F., & Crane, M. (2012). Towards an efficient and effective search engine. Proceedings of the OSIR workshop at SIGIR, 2012, 40–47. Trotman, A., Puurula, A., & Burgess, B. (2014). Improvements to BM25 and language models examined. In Proceedings of the Australasian document computing symposium (ADCS) (pp. 58–65). Wang, L., Lin, J., & Metzler, D. (2011). A cascade ranking model for efficient ranked retrieval. In Proceedings of the ACM international conference on research and development in information retrieval (SIGIR) (pp. 105–114). White, R. W., Diaz, F., & Guo, Q. (2017). Search result prefetching on desktop and mobile. ACM Transactions on Information Systems, 35(3), 1–34. Wicaksono, A. F., & Moffat, A. (2018). Empirical evidence for search effectiveness models. In Proceedings of the ACM international conference on information and knowledge management (CIKM) (pp. 1571–1574). Yafay, E., & Altingovde, I. S. (2019). Caching scores for faster query processing with dynamic pruning in search engines. In Proceedings of the ACM international conference on information and knowledge management (CIKM) (pp. 2457–2460). Yang, P., Fang, H., & Lin, J. (2018). Anserini: Reproducible ranking baselines using Lucene. ACM Journal of Data and Information Quality, 10(4), 1–20. Yang, Z., Moffat, A., & Turpin, A. (2016). How precise does document scoring need to be? In Proceedings of the Asia information retrieval societies conference (AIRS) (pp. 279–291). Zhang, Y., & Moffat, A. (2006). Some observations on user search behavior. In Proceedings of the Australasian document computing symposium (ADCS) (pp. 1–8). Zobel, J., & Moffat, A. (2006). Inverted files for text search engines. ACM Computing Surveys, 38(2), 6.1-6.56.