Neuro-Symbolic Reasoning for Multimodal Referring Expression Comprehension in HMI Systems

New Generation Computing - Pages 1-20 - 2024
Aman Jain1,2, Anirudh Reddy Kondapally1,2, Kentaro Yamada2, Hitomi Yanaka1
1The University of Tokyo, Tokyo, Japan
2Honda R&D Co., Ltd., Tokyo, Japan

Abstract

Conventional Human–Machine Interaction (HMI) interfaces have predominantly relied on GUIs and voice commands. However, natural human communication also includes non-verbal cues such as hand gestures like pointing. Thus, recent work on HMI systems has tried to incorporate pointing gestures as an input, making significant progress in recognizing them and integrating them with voice commands. However, existing approaches often treat these input modalities independently, limiting their capacity to handle complex multimodal instructions that require intricate reasoning over language and gestures. Meanwhile, multimodal tasks requiring complex reasoning are being tackled in the language and vision domain, but these typically do not include gestures like pointing. To bridge this gap, we explore one of these challenging multimodal tasks, Referring Expression Comprehension (REC), within multimodal HMI systems that incorporate pointing gestures. We present a virtual setup in which a robot shares an environment with a user and is tasked with identifying objects based on the user's language and gestural instructions. To address this challenge, we propose a hybrid neuro-symbolic model that combines the versatility of deep learning with the interpretability of symbolic reasoning. Our contributions include a challenging multimodal REC dataset for HMI systems, an interpretable neuro-symbolic model, and an assessment of its ability to generalize reasoning to unseen environments, complemented by an in-depth qualitative analysis of the model's inner workings.
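To make the task setup concrete, the sketch below illustrates, at a toy level, how a neuro-symbolic REC pipeline of the kind described above might combine the two modalities: a symbolic filter over object attributes (assumed to come from parsing the language instruction) narrows the candidate set, and a pointing-direction likelihood ranks the survivors. This is a minimal illustration, not the paper's implementation; the Gaussian angular model, the 2D scene, and all names (`SceneObject`, `resolve_reference`, etc.) are assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SceneObject:
    """A candidate referent with symbolic attributes and a 2D position (toy scene, assumed)."""
    name: str
    attributes: Dict[str, str]      # e.g. {"color": "red", "category": "cup"}
    position: Tuple[float, float]   # (x, y) in the shared workspace


def symbolic_filter(objects: List[SceneObject],
                    constraints: Dict[str, str]) -> List[SceneObject]:
    """Symbolic step: keep only objects that satisfy every parsed language constraint."""
    return [o for o in objects
            if all(o.attributes.get(k) == v for k, v in constraints.items())]


def pointing_likelihood(obj: SceneObject,
                        origin: Tuple[float, float],
                        direction: Tuple[float, float],
                        sigma_deg: float = 15.0) -> float:
    """Gestural step: score an object by the angular deviation between the pointing
    ray and the origin->object vector, using a Gaussian over the angle (assumed model)."""
    vx, vy = obj.position[0] - origin[0], obj.position[1] - origin[1]
    dot = vx * direction[0] + vy * direction[1]
    norm = math.hypot(vx, vy) * math.hypot(direction[0], direction[1])
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    return math.exp(-0.5 * (angle / sigma_deg) ** 2)


def resolve_reference(objects: List[SceneObject],
                      constraints: Dict[str, str],
                      origin: Tuple[float, float],
                      direction: Tuple[float, float]) -> SceneObject:
    """Combine both modalities: the symbolic filter narrows candidates,
    the pointing score ranks whatever remains."""
    candidates = symbolic_filter(objects, constraints) or objects  # fall back if language over-constrains
    return max(candidates, key=lambda o: pointing_likelihood(o, origin, direction))


if __name__ == "__main__":
    scene = [
        SceneObject("cup_1", {"color": "red", "category": "cup"}, (1.0, 2.0)),
        SceneObject("cup_2", {"color": "blue", "category": "cup"}, (2.0, 0.5)),
        SceneObject("box_1", {"color": "red", "category": "box"}, (0.5, 1.0)),
    ]
    # "That red cup" + a pointing ray from the user toward the upper part of the table.
    target = resolve_reference(scene, {"color": "red", "category": "cup"},
                               origin=(0.0, 0.0), direction=(0.5, 1.0))
    print(target.name)  # -> cup_1
```

The separation mirrors the appeal of hybrid approaches in general: the symbolic stage is inspectable (one can read off which constraints eliminated which objects), while learned components would, in a full system, supply the attribute predictions, the parsed constraints, and the pointing-direction estimate.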
