Improving accountability in recommender systems research through reproducibility

User Modeling and User-Adapted Interaction - Volume 31 - Pages 941-977 - 2021
Alejandro Bellogín1, Alan Said2
1Universidad Autónoma de Madrid, Madrid, Spain
2University of Gothenburg, Gothenburg, Sweden

Abstract

Reproducibility is a key requirement for scientific progress. It allows the works of others to be reproduced and, as a consequence, their reported claims and results to be fully trusted. In this work, we argue that, by facilitating reproducibility of recommender systems experimentation, we indirectly address the issues of accountability and transparency in recommender systems research from the perspectives of practitioners, designers, and engineers aiming to assess the capabilities of published research works. These issues have become increasingly prevalent in recent literature. Reasons for this include societal movements around intelligent systems and artificial intelligence striving toward fair and objective use of human behavioral data (as in Machine Learning, Information Retrieval, or Human–Computer Interaction). Society has grown to expect explanations and transparency standards regarding the underlying algorithms making automated decisions for and around us. This work surveys existing definitions of these concepts and proposes a coherent terminology for recommender systems research, with the goal of connecting reproducibility to accountability. We achieve this by introducing several guidelines and steps that lead to reproducible and, hence, accountable experimental workflows and research. We additionally analyze several instantiations of recommender system implementations available in the literature and discuss the extent to which they fit within the introduced framework. With this work, we aim to shed light on this important problem and facilitate progress in the field by increasing the accountability of research.
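The abstract refers to guidelines and steps that make experimental workflows reproducible. As a rough, hypothetical illustration of the kind of bookkeeping such a workflow involves (it is not taken from the paper), the Python sketch below fixes a random seed, hashes the input data, evaluates a trivial popularity baseline, and writes an experiment manifest so the run can be audited and repeated; all names (make_split, precision_at_k, the manifest fields) and the synthetic data are assumptions made for the example.

```python
# Minimal sketch of a reproducible recommendation experiment (illustrative only):
# fix the seed, hash the data, log the environment, and report the metric in a manifest.
import hashlib, json, platform, random

SEED = 42  # fixed seed so the data, split, and any sampling are deterministic
random.seed(SEED)

# Hypothetical interaction log: (user, item) pairs standing in for a real dataset.
interactions = [(u, random.randrange(50))
                for u in range(100)
                for _ in range(random.randrange(1, 8))]

def make_split(data, ratio=0.8):
    """Deterministic holdout split, given the fixed seed above."""
    shuffled = data[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

train, test = make_split(interactions)

# A trivial popularity recommender: rank items by training-set frequency.
popularity = {}
for _, item in train:
    popularity[item] = popularity.get(item, 0) + 1
ranked = sorted(popularity, key=popularity.get, reverse=True)

def precision_at_k(recommended, test_pairs, k=10):
    """Fraction of the top-k recommended items that appear in the test interactions."""
    test_items = {item for _, item in test_pairs}
    return sum(1 for item in recommended[:k] if item in test_items) / k

# Manifest capturing what is needed to audit or repeat this run.
manifest = {
    "seed": SEED,
    "python": platform.python_version(),
    "data_sha256": hashlib.sha256(json.dumps(interactions).encode()).hexdigest(),
    "split_ratio": 0.8,
    "metric": "precision@10",
    "result": precision_at_k(ranked, test, k=10),
}
print(json.dumps(manifest, indent=2))
```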
