Temporally Consistent Depth Map Prediction Using Deep Convolutional Neural Network and Spatial-Temporal Conditional Random Field
Abstract
Methods based on deep convolutional neural networks (DCNNs) have recently set new records on the task of predicting depth maps from monocular images. In video-based applications such as 2D-to-3D video conversion, however, these approaches tend to produce temporally inconsistent depth maps, because their CNN models are optimized over single frames. In this paper, we address this problem by introducing a novel spatial-temporal conditional random field (CRF) model into the DCNN architecture, which enforces temporal consistency between the depth maps estimated for consecutive video frames. In our approach, a temporally consistent superpixel (TSP) method is first applied to an image sequence to establish the correspondence of targets across consecutive frames. A DCNN is then used to regress the depth value of each temporal superpixel, followed by a spatial-temporal CRF layer that models the relationships among the estimated depths in both the spatial and temporal domains. The parameters of the DCNN and CRF models are jointly optimized with back-propagation. Experimental results show that our approach not only significantly enhances the temporal consistency of the estimated depth maps compared with existing single-frame-based approaches, but also improves depth estimation accuracy under various evaluation metrics.
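As a rough illustration of how a spatial-temporal CRF layer can refine per-superpixel depths, the sketch below assumes a Gaussian CRF with a quadratic unary term and quadratic pairwise terms, for which MAP inference has a closed form. The energy function, the name `crf_map_depths`, the edge lists, and the weights are all illustrative assumptions, not the formulation used in the paper; in the described approach the unary depths would come from the DCNN and the CRF parameters would be learned jointly with it.

```python
import numpy as np

def crf_map_depths(unary, edges, weights):
    """Closed-form MAP estimate for an assumed Gaussian CRF energy
        E(y) = sum_p (y_p - z_p)^2 + sum_{(p,q)} w_pq * (y_p - y_q)^2,
    where z holds the regressed (unary) depths. Setting the gradient to
    zero gives (I + D - W) y = z, with W the symmetric weight matrix and
    D its diagonal degree matrix.
    """
    z = np.asarray(unary, dtype=float)
    n = z.shape[0]
    W = np.zeros((n, n))
    for (p, q), w in zip(edges, weights):
        W[p, q] += w
        W[q, p] += w
    D = np.diag(W.sum(axis=1))
    A = np.eye(n) + D - W  # symmetric positive definite for w >= 0
    return np.linalg.solve(A, z)

# Hypothetical toy case: two frames with two superpixels each.
# Nodes 0, 1 belong to frame t; nodes 2, 3 are the same superpixels in frame t+1.
z = np.array([2.0, 2.2, 2.6, 2.1])      # made-up per-superpixel depth regressions
spatial_edges = [(0, 1), (2, 3)]        # neighbours within a frame
temporal_edges = [(0, 2), (1, 3)]       # same temporal superpixel across frames
edges = spatial_edges + temporal_edges
weights = [1.0, 1.0, 5.0, 5.0]          # assumed: temporal links weighted more strongly

y = crf_map_depths(z, edges, weights)
print("frame-to-frame gap before:", abs(z[0] - z[2]), "after:", abs(y[0] - y[2]))
```

In this toy case the strong temporal weights pull the two estimates of each superpixel toward each other across frames, while the unary term keeps the result anchored to the regressed depths.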