Neuro-Symbolic Reasoning for Multimodal Referring Expression Comprehension in HMI Systems
New Generation Computing - Pages 1-20 - 2024
Abstract
Conventional Human–Machine Interaction (HMI) interfaces have predominantly relied on GUIs and voice commands. However, natural human communication also includes non-verbal cues, such as hand gestures like pointing. Recent work on HMI systems has therefore sought to incorporate pointing gestures as an input modality, making significant progress in recognizing them and integrating them with voice commands. However, existing approaches often treat these input modalities independently, limiting their capacity to handle complex multimodal instructions that require intricate joint reasoning over language and gestures. Meanwhile, multimodal tasks requiring complex reasoning are actively being tackled in the vision-and-language domain, but these typically do not include gestures such as pointing. To bridge this gap, we explore one of these challenging multimodal tasks, Referring Expression Comprehension (REC), within multimodal HMI systems that incorporate pointing gestures. We present a virtual setup in which a robot shares an environment with a user and is tasked with identifying objects based on the user's verbal and gestural instructions. To address this challenge, we propose a hybrid neuro-symbolic model that combines the versatility of deep learning with the interpretability of symbolic reasoning. Our contributions include a challenging multimodal REC dataset for HMI systems, an interpretable neuro-symbolic model, and an assessment of its ability to generalize its reasoning to unseen environments, complemented by an in-depth qualitative analysis of the model's inner workings.
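To make the neuro-symbolic idea in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes that neural components supply a symbolic scene description (detected objects with attributes) and a parse of the instruction into filtering steps, while a symbolic executor resolves the referred object and uses the pointing gesture as a simple spatial prior. All names (SceneObject, execute_program, the operation labels) are hypothetical.

```python
# Hypothetical sketch of neuro-symbolic referring expression comprehension with pointing.
# Neural modules are assumed to have produced `scene` (detections) and `program` (parsed steps).
from dataclasses import dataclass
from math import dist

@dataclass
class SceneObject:
    name: str         # category from a (hypothetical) object detector
    color: str        # attribute from a (hypothetical) attribute classifier
    position: tuple   # (x, y) location in the shared workspace

def execute_program(objects, program, pointing_target=None):
    """Apply symbolic filtering steps to the detected objects.

    `program` is a list of (operation, argument) pairs, e.g. from a neural
    semantic parser; `pointing_target` is the estimated (x, y) the user
    points at, used here as a nearest-object prior.
    """
    candidates = list(objects)
    for op, arg in program:
        if op == "filter_name":
            candidates = [o for o in candidates if o.name == arg]
        elif op == "filter_color":
            candidates = [o for o in candidates if o.color == arg]
        elif op == "relate_pointing" and pointing_target is not None:
            # Keep the candidate closest to the pointed-at location.
            candidates = sorted(candidates,
                                key=lambda o: dist(o.position, pointing_target))[:1]
    return candidates

# Example instruction: "the red cup I am pointing at"
scene = [SceneObject("cup", "red", (0.2, 0.1)),
         SceneObject("cup", "red", (0.8, 0.9)),
         SceneObject("box", "blue", (0.5, 0.5))]
program = [("filter_name", "cup"), ("filter_color", "red"), ("relate_pointing", None)]
print(execute_program(scene, program, pointing_target=(0.75, 0.85)))
```

The intermediate candidate sets produced at each step are what makes such a pipeline interpretable: one can inspect which objects survived each filtering operation and how the pointing prior disambiguated the final choice.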