A system for the generation of in-car human body pose datasets

Machine Vision and Applications - Volume 32 - Pages 1-15 - 2020
João Borges1, Sandro Queirós1,2,3, Bruno Oliveira1, Helena Torres1, Nelson Rodrigues1, Victor Coelho4, Johannes Pallauf5, José Henrique Brito6, José Mendes1, Jaime C. Fonseca1
1Algoritmi Center, University of Minho, Guimarães, Portugal
2Life and Health Sciences Research Institute (ICVS), School of Medicine, University of Minho, Braga, Portugal
3ICVS/3B’s - PT Government Associate Laboratory, Braga/Guimarães, Portugal
4Bosch, Braga, Portugal
5Bosch, Abstatt, Germany
62Ai - Polytechnic Institute of Cávado and Ave, Barcelos, Portugal

Abstract

With the advent of autonomous vehicles, detection of the occupants’ posture is crucial to meet the needs of infotainment interaction and passive safety systems. Generative approaches have recently been proposed for in-car human body pose detection, but such approaches require a large training dataset to achieve feasible accuracy. This requirement poses a difficulty, given the substantial time needed to annotate such a large amount of data. In the in-car scenario, the difficulty increases even further, since a robust human body pose ground-truth system capable of working in that environment is needed but does not exist. Currently, the gold standard for human body pose capture is based on optical systems, which require up to 39 visible markers for a plug-in-gait model and are therefore not feasible given the occlusions inside the car. Other solutions, such as inertial suits, also have limitations linked to magnetic sensitivity and global positioning drift. In this paper, a system for the generation of images for human body pose detection in an in-car environment is proposed. To this end, we propose to combine inertial and optical systems in a way that suppresses their individual limitations: by fusing the global positions of three visible head markers provided by the optical system with the inertial suit’s relative human body pose, we obtain an occlusion-ready, drift-free, full-body global positioning system. This system is then spatially and temporally calibrated with a time-of-flight sensor, automatically producing in-car image data with (multi-person) pose annotations. Besides quantifying the inertial suit’s inherent sensitivity and accuracy, the feasibility of the overall system for human body pose capture in the in-car scenario was demonstrated. Our results quantify the errors associated with the inertial suit, pinpoint some sources of the system’s uncertainty, and propose ways to minimize some of them. Finally, we demonstrate the feasibility of using the system-generated data (which has been made publicly available), on its own or mixed with two publicly available generic (not in-car) datasets, to train two machine learning algorithms, showing the improvement in their accuracy for the in-car scenario.
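The paper publishes a dataset rather than code, but the core fusion idea above can be illustrated with a short sketch: estimate a rigid transform from the three optically tracked head markers (here via the standard Kabsch algorithm) and apply it to the whole inertial skeleton, anchoring the suit's drift-prone relative pose in the optical system's global frame. All names, coordinates, and the 23-joint skeleton below are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): anchoring the inertial
# suit's relative body pose in the optical system's global frame via a rigid
# transform estimated from the three head markers. Marker coordinates, the
# 23-joint skeleton, and all names are hypothetical.
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Estimate rotation R and translation t with dst ~= src @ R.T + t
    (Kabsch algorithm); src and dst are (N, 3) corresponding points, N >= 3."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Three head markers expressed in the suit's (drift-prone) frame.
head_suit = np.array([[ 0.00, 0.10, 1.70],
                      [ 0.05, 0.02, 1.75],
                      [-0.05, 0.02, 1.75]])

# The same markers as tracked by the optical system (toy ground truth:
# a 30-degree yaw plus a translation standing in for accumulated drift).
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([2.0, -0.5, 0.0])
head_optical = head_suit @ R_true.T + t_true

R, t = rigid_transform(head_suit, head_optical)

# Re-express every suit joint in the drift-free global frame.
joints_suit = np.random.rand(23, 3)                # hypothetical skeleton
joints_global = joints_suit @ R.T + t
```

In the full system, this per-frame alignment would precede the spatial and temporal calibration with the time-of-flight sensor, after which the globally positioned joints can be projected onto the depth images to produce the pose annotations.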
