Retrievability in an integrated retrieval system: an extended study

Dwaipayan Roy1, Zeljko Carevic2, Philipp Mayr2
1Indian Institute of Science Education and Research, Kolkata, India
2GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany

Abstract

Retrievability measures the influence a retrieval system has on access to the items in a given collection. The measure can inform an evaluation of the search system and yield insights into its behaviour. In this paper, we investigate retrievability in an integrated search system consisting of items from various categories, focussing on datasets, publications and variables in a real-life digital library. The traditional metrics, that is, the Lorenz curve and Gini coefficient, are employed to visualise the diversity in retrievability scores of these three retrievable document types. Our results show a significant popularity bias, with certain items being retrieved more often than others. In particular, certain datasets are more likely to be retrieved than other datasets in the same category. In contrast, the retrievability scores of items in the variable and publication categories are more evenly distributed. Overall, the distribution of document retrievability is more diverse for datasets than for publications and variables.
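As an illustration of the two metrics named in the abstract, the following minimal sketch computes retrievability scores from ranked result lists and summarises them with a Lorenz curve and a Gini coefficient. It assumes a simple cumulative scoring model, in which an item scores one point each time it appears within a rank cutoff; this is a common choice but is not confirmed here as the configuration used in the study. The function names, item identifiers and toy rankings are hypothetical.

```python
def retrievability(rankings, collection, cutoff):
    """Count how often each item appears in the top `cutoff` results.

    Assumption: a cumulative scoring model (1 point per appearance).
    Items never retrieved keep a score of 0 so they still count
    towards the inequality measures.
    """
    scores = dict.fromkeys(collection, 0)
    for ranked_ids in rankings:
        for item_id in ranked_ids[:cutoff]:
            if item_id in scores:
                scores[item_id] += 1
    return scores


def lorenz_curve(values):
    """Cumulative share of total retrievability, items sorted ascending."""
    ordered = sorted(values)
    total = sum(ordered)
    if total == 0:
        return [0.0] * (len(ordered) + 1)
    points = [0.0]
    running = 0
    for value in ordered:
        running += value
        points.append(running / total)
    return points


def gini(values):
    """Gini coefficient: 0 means perfectly even, values near 1 mean
    retrievability is concentrated in very few items."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(ordered, start=1))
    return weighted / (n * total)


# Hypothetical toy data: four items, three queries, rank cutoff of 2.
collection = ["d1", "d2", "d3", "d4"]
rankings = [["d1", "d2"], ["d1", "d3"], ["d1", "d2"]]
scores = retrievability(rankings, collection, cutoff=2)
curve = lorenz_curve(scores.values())
g = gini(scores.values())
```

Computing the coefficient separately for each document type (datasets, publications, variables) would allow the kind of per-category comparison the abstract describes; a higher value for one category indicates a more uneven distribution of retrievability within it.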
