Scaling up the learning-from-crowds GLAD algorithm using instance-difficulty clustering

Progress in Artificial Intelligence - Volume 8 - Pages 389-399 - 2019
Enrique González Rodrigo1, Juan A. Aledo1, Jose A. Gamez1
1Albacete, Spain

Abstract

The main goal of this article is to improve the results obtained by the GLAD algorithm on large datasets. GLAD learns from instances labeled by multiple annotators, taking into account both the quality of the annotators and the difficulty of the instances. Despite its many advantages, this study shows that GLAD does not scale well to a large number of instances, as it estimates one parameter per instance of the dataset. Clustering is an alternative that reduces the number of parameters to be estimated, making the learning process more efficient. However, as the features of crowdsourced datasets are usually not available, classical clustering procedures cannot be applied directly. To solve this issue, we propose clustering over vectors created by matrix factorization. Our analysis shows that this clustering process improves the results obtained by GLAD in terms of both accuracy and execution time, especially in large-data scenarios. We also compare this approach against other algorithms with a similar goal.
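The core idea can be sketched in a few lines: since crowdsourced datasets usually lack per-instance features, features are derived from the annotation matrix itself via a low-rank factorization, and the resulting instance vectors are clustered so that one difficulty parameter is estimated per cluster rather than per instance. The sketch below is a minimal illustration of that pipeline, not the paper's implementation; the matrix sizes, the use of truncated SVD as the factorization, and the plain Lloyd's k-means are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances, n_annotators, rank, k = 8, 5, 2, 2

# Annotation matrix Y (illustrative): entry (i, j) is annotator j's
# binary label for instance i, with 0.5 marking "not annotated".
Y = rng.choice([0.0, 0.5, 1.0], size=(n_instances, n_annotators))

# Low-rank factorization via truncated SVD: each instance i gets a
# short vector summarizing how the crowd labeled it.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
instance_vecs = U[:, :rank] * s[:rank]          # shape (n_instances, rank)

# Plain Lloyd's k-means on the instance vectors.
centers = instance_vecs[rng.choice(n_instances, k, replace=False)]
for _ in range(20):
    dists = np.linalg.norm(instance_vecs[:, None] - centers[None], axis=2)
    assign = dists.argmin(axis=1)
    for c in range(k):
        if (assign == c).any():
            centers[c] = instance_vecs[assign == c].mean(axis=0)

# A GLAD-style model would now estimate k difficulty parameters
# (one per cluster) instead of n_instances of them.
```

With k much smaller than the number of instances, the EM-style estimation inside GLAD has far fewer parameters to fit, which is where the reported efficiency gains on large data come from.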
