Using crowdsourcing for TREC relevance assessment
References
Alonso, O., & Baeza-Yates, R. (2011). Design and implementation of relevance assessments using crowdsourcing. In Proceedings of the European conference on Information Retrieval (ECIR) (pp. 153–164).
Alonso, O., & Lease, M. (2011). Crowdsourcing for information retrieval: Principles, methods and applications, SIGIR tutorial. In Proceedings of the 34th ACM SIGIR conference (pp. 1299–1300).
Alonso, O., & Mizzaro, S. (2009a). Relevance criteria for e-commerce: A crowdsourcing-based experimental analysis. In Proceedings of the 32nd ACM SIGIR conference (pp. 760–761).
Alonso, O., & Mizzaro, S. (2009b). Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In Proceedings of the 32nd ACM SIGIR workshop on the future of IR evaluation (pp. 15–16).
Alonso, O. (2008). Crowdsourcing for relevance evaluation. SIGIR Forum, 42, 9. https://doi.org/10.1145/1480506.1480508
Alonso, O. (2010). Crowdsourcing assessments for XML ranked retrieval. In Proceedings of the European conference on Information Retrieval (ECIR) (p. 623).
Aslam, J.A., Pavlu, V., & Yilmaz, E. (2006). A statistical method for system evaluation using incomplete judgments. In Proceedings of the 29th ACM SIGIR conference (pp. 541–548).
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A.P., & Yilmaz, E. (2008). Relevance assessment: Are judges exchangeable and does it matter? In Proceedings of the 31st ACM SIGIR conference (pp. 667–674).
Bradburn, 2004
Callan, J. (2007). Meeting of the MINDS: An information retrieval research agenda. SIGIR Forum, 41, 25. https://doi.org/10.1145/1328964.1328967
Callison-Burch, C. (2009). Fast, cheap, and creative: Evaluating translation quality using Amazon’s mechanical turk. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 286–295).
Carterette, B., & Soboroff, I. (2010). The effect of assessor error on IR system evaluation. In Proceedings of the 33rd ACM SIGIR conference (pp. 539–546).
Carterette, B. (2008). Here or there: Preference judgments for relevance. In Proceedings of the European conference on Information Retrieval (ECIR) (p. 16).
Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J.A., & Allan, J. (2008). Evaluation over thousands of queries. In Proceedings of the 31st ACM SIGIR conference (pp. 651–658).
Carvalho, V., Lease, M., & Yilmaz, E. (Eds.) (2010). Proceedings of the ACM SIGIR 2010 workshop on crowdsourcing for relevance evaluation.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37. https://doi.org/10.1177/001316446002000104
Cormack, G.V., Palmer, C.R., & Clarke, C.L.A. (1998). Efficient construction of large test collections. In Proceedings of the 21st ACM SIGIR conference (pp. 282–289).
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378. https://doi.org/10.1037/h0031619
Grady, C., & Lease, M. (2010). Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk (pp. 172–179).
Guiver, J. (2009). A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Transactions on Information Systems, 27, 1. https://doi.org/10.1145/1629096.1629099
Howe, J. (2008). Crowdsourcing: Why the power of the crowd is driving the future of business. Crown Business.
Kazai, G., Milic-Frayling, N., & Costello, J. (2009). Towards methods for the collective gathering and quality control of relevance assessments. In Proceedings of the 32nd ACM SIGIR conference (pp. 452–459).
Kazai, G., Kamps, J., Koolen, M., & Milic-Frayling, N. (2011). Crowdsourcing for book search evaluation: Impact of HIT design on comparative system ranking. In Proceedings of the 34th ACM SIGIR conference (pp. 205–214). Beijing, China, ACM.
Kittur, A., Chi, E.H., & Suh, B. (2008). Crowdsourcing user studies with Mechanical Turk. In CHI ’08: Proceedings of the 26th ACM SIGCHI conference (pp. 453–456).
Krippendorff, K. (1970). Estimating the reliability, systematic error, and random error of interval data. Educational and Psychological Measurement, 30, 61. https://doi.org/10.1177/001316447003000105
Lease, M., Sorokin, A., & Yilmaz, E. (Eds.) (2011). Proceedings of the ACM SIGIR 2011 workshop on crowdsourcing for information retrieval.
Lease, 2011
McCreadie, R., Macdonald, C., & Ounis, I. (2010). Crowdsourcing a News query classification dataset. In Proceedings of CSE 2010 workshop at SIGIR.
McCreadie, R., Macdonald, C., & Ounis, I. (2011). Crowdsourcing blog track top news judgments at TREC. In Proceedings of CSDM workshop at WSDM 2011.
Nielsen, J. (1993). Usability engineering. Academic Press.
Nowak, S., & Rüger, S. (2010). How reliable are annotations via crowdsourcing? A study about inter-annotator agreement for multi-label image annotation. In Proceedings of the international ACM conference on multimedia information retrieval (pp. 557–566).
Sanderson, M. & Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the 28th ACM SIGIR conference (pp. 162–169).
Sanderson, M. (2010). Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4, 247. https://doi.org/10.1561/1500000009
Smucker, M., & Prakash Jethani, C. (2011). Measuring assessor accuracy: A comparison of NIST assessors and user study participants. In Proceedings of the 34th ACM SIGIR conference (pp. 1231–1232).
Snow, R., O’Connor, B., Jurafsky, D., & Ng, A.Y. (2008). Cheap and fast but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing (pp. 254–263).
Soboroff, I., Nicholas, C., & Cahan, P. (2001). Ranking retrieval systems without relevance judgments. In Proceedings of the 24th ACM SIGIR conference (pp. 66–73).
Sormunen, E. (2002). Liberal relevance criteria of TREC: Counting on negligible documents? In Proceedings of the 25th ACM SIGIR conference (pp. 324–330).
Stemler, S.E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). <http://PAREonline.net/getvn.asp?v=9&n=4> Retrieved 01.10.10.
Voorhees, E. (2000). Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing and Management, 36, 697. https://doi.org/10.1016/S0306-4573(00)00010-8
Voorhees, E. (2001). The philosophy of information retrieval evaluation. In CLEF ’01 proceedings (pp. 355–370).