Using crowdsourcing for TREC relevance assessment

Information Processing & Management - Volume 48 - Pages 1053-1066 - 2012
Omar Alonso1, Stefano Mizzaro2
1Microsoft Corp., 1065 La Avenida, Mountain View, CA 94043, USA
2Dept. of Mathematics and Computer Science, University of Udine, Via delle Scienze, 206, 33100 Udine, Italy

References

Alonso, O., & Baeza-Yates, R. (2011). Design and implementation of relevance assessments using crowdsourcing. In Proceedings of the European Conference on Information Retrieval (ECIR) (pp. 153–164).
Alonso, O., & Lease, M. (2011). Crowdsourcing for information retrieval: Principles, methods and applications (SIGIR tutorial). In Proceedings of the 34th ACM SIGIR conference (pp. 1299–1300).
Alonso, O., & Mizzaro, S. (2009a). Relevance criteria for e-commerce: A crowdsourcing-based experimental analysis. In Proceedings of the 32nd ACM SIGIR conference (pp. 760–761).
Alonso, O., & Mizzaro, S. (2009b). Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In Proceedings of the 32nd ACM SIGIR workshop on the future of IR evaluation (pp. 15–16).
Alonso, O. (2008). Crowdsourcing for relevance evaluation. SIGIR Forum, 42, 9. doi:10.1145/1480506.1480508
Alonso, O. (2010). Crowdsourcing assessments for XML ranked retrieval. In Proceedings of the European Conference on Information Retrieval (ECIR), 623.
Aslam, J. A., Pavlu, V., & Yilmaz, E. (2006). A statistical method for system evaluation using incomplete judgments. In Proceedings of the 29th ACM SIGIR conference (pp. 541–548).
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A. P., & Yilmaz, E. (2008). Relevance assessment: Are judges exchangeable and does it matter? In Proceedings of the 31st ACM SIGIR conference (pp. 667–674).
Bradburn, N. (2004).
Callan, J. (2007). Meeting of the MINDS: An information retrieval research agenda. SIGIR Forum, 41, 25. doi:10.1145/1328964.1328967
Callison-Burch, C. (2009). Fast, cheap, and creative: Evaluating translation quality using Amazon's Mechanical Turk. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 286–295).
Carterette, B., & Soboroff, I. (2010). The effect of assessor error on IR system evaluation. In Proceedings of the 33rd ACM SIGIR conference (pp. 539–546).
Carterette, B. (2008). Here or there: Preference judgments for relevance. In Proceedings of the European Conference on Information Retrieval (ECIR), 16.
Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2008). Evaluation over thousands of queries. In Proceedings of the 31st ACM SIGIR conference (pp. 651–658).
Carvalho, V., Lease, M., & Yilmaz, E. (Eds.) (2010). Proceedings of the 32nd ACM SIGIR workshop on crowdsourcing for relevance evaluation.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37. doi:10.1177/001316446002000104
Cormack, G. V., Palmer, C. R., & Clarke, C. L. A. (1998). Efficient construction of large test collections. In Proceedings of the 21st ACM SIGIR conference (pp. 282–289).
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378. doi:10.1037/h0031619
Grady, C., & Lease, M. (2010). Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon's Mechanical Turk (pp. 172–179).
Guiver, J. (2009). A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Transactions on Information Systems, 27, 1. doi:10.1145/1629096.1629099
Howe, J. (2008).
Kazai, G., Milic-Frayling, N., & Costello, J. (2009). Towards methods for the collective gathering and quality control of relevance assessments. In Proceedings of the 32nd ACM SIGIR conference (pp. 452–459).
Kazai, G., Kamps, J., Koolen, M., & Milic-Frayling, N. (2011). Crowdsourcing for book search evaluation: Impact of HIT design on comparative system ranking. In Proceedings of the 34th ACM SIGIR conference (pp. 205–214). Beijing, China: ACM.
Kittur, A., Chi, E. H., & Suh, B. (2008). Crowdsourcing user studies with Mechanical Turk. In CHI '08: Proceedings of the 26th ACM SIGCHI conference (pp. 453–456).
Krippendorff, K. (1970). Estimating the reliability, systematic error, and random error of interval data. Educational and Psychological Measurement, 30, 61. doi:10.1177/001316447003000105
Lease, M., Sorokin, A., & Yilmaz, E. (Eds.) (2011). Proceedings of the 33rd ACM SIGIR workshop on crowdsourcing for information retrieval.
Lease, M. (2011).
McCreadie, R., Macdonald, C., & Ounis, I. (2010). Crowdsourcing a news query classification dataset. In Proceedings of the CSE 2010 workshop at SIGIR.
McCreadie, R., Macdonald, C., & Ounis, I. (2011). Crowdsourcing blog track top news judgments at TREC. In Proceedings of the CSDM workshop at WSDM 2011.
Nielsen, J. (1993).
Nowak, S., & Rüger, S. (2010). How reliable are annotations via crowdsourcing? A study about inter-annotator agreement for multi-label image annotation. In Proceedings of the international ACM conference on multimedia information retrieval (pp. 557–566).
Sanderson, M., & Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the 28th ACM SIGIR conference (pp. 162–169).
Sanderson, M. (2010). Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4, 247. doi:10.1561/1500000009
Smucker, M., & Prakash Jethani, C. (2011). Measuring assessor accuracy: A comparison of NIST assessors and user study participants. In Proceedings of the 34th ACM SIGIR conference (pp. 1231–1232).
Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing (pp. 254–263).
Soboroff, I., Nicholas, C., & Cahan, P. (2001). Ranking retrieval systems without relevance judgments. In Proceedings of the 24th ACM SIGIR conference (pp. 66–73).
Sormunen, E. (2002). Liberal relevance criteria of TREC: Counting on negligible documents? In Proceedings of the 25th ACM SIGIR conference (pp. 324–330).
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). <http://PAREonline.net/getvn.asp?v=9&n=4> Retrieved 01.10.10.
Voorhees, E. (2000). Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing and Management, 36, 697. doi:10.1016/S0306-4573(00)00010-8
Voorhees, E. (2001). The philosophy of information retrieval evaluation. In CLEF '01 proceedings (pp. 355–370).