3DNN: 3D Nearest Neighbor
Tóm tắt
In this paper, we describe a data-driven approach to leverage repositories of 3D models for scene understanding. Our ability to relate what we see in an image to a large collection of 3D models allows us to transfer information from these models, creating a rich understanding of the scene. We develop a framework for auto-calibrating a camera, rendering 3D models from the viewpoint an image was taken, and computing a similarity measure between each 3D model and an input image. We demonstrate this data-driven approach in the context of geometry estimation and show the ability to find the identities, poses and styles of objects in a scene. The true benefit of 3DNN compared to a traditional 2D nearest-neighbor approach is that by generalizing across viewpoints, we free ourselves from the need to have training examples captured from all possible viewpoints. Thus, we are able to achieve comparable results using orders of magnitude less data, and recognize objects from never-before-seen viewpoints. In this work, we describe the 3DNN algorithm and rigorously evaluate its performance for the tasks of geometry estimation and object detection/segmentation, as well as two novel applications: affordance estimation and photorealistic object insertion.
Tài liệu tham khảo
Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(5), 898–916.
Baatz, G., Saurer, O., Köser, K., & Pollefeys, M. (2012). Large scale visual geo-localization of images in mountainous terrain. Proceedings of the European Conference on Computer Vision (ECCV).
Baboud, L., Cadik, M., Eisemann, E., & Seidel, H.-P. (2011). Automatic photo-to-terrain alignment for the annotation of mountain pictures. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Bao, S. Y., Sun, M., & Savarese, S. (2010). Toward coherent object detection and scene layout understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Brooks, R. (1981). Symbolic reasoning among 3D models and 2D images. Artificial Intelligence, Vol. 17.
Choi, W., Pantofaru, C., & Savarese, S. (2013). Understanding indoor scenes using 3D geometric phrases. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Colburn, A., Agarwala, A., Hertzmann, A., Curless, B., & Cohen, M. F. (2013). Image-based remodeling. IEEE Transactions on Visualization and Computer Graphics, 19(1), 56–66.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Debevec, P. (1998). Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. Proceedings of the 25th annual conference on Computer graphics and interactive techniques, SIGGRAPH ’98.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Fisher, M., & Hanrahan, P. (2010). Context-based search for 3D models. ACM Transactions on Graphics, 29, 182:1–182:10.
Fisher, M., Savva, M., & Hanrahan, P. (2011). Characterizing structural relationships in scenes using graph Kernels. ACM Transactions on Graphics, 30, 34:1–34:12.
Fouhey, D., Gupta, A., & Hebert, M. (2013). Data-driven 3d primatives. Submission to the Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Fouhey, D.F., Delaitre, V., Gupta, A., Efros, A. A., Laptev, I., & Sivic, J. (2012). People watching: Human actions as a cue for single-view geometry. Proceedings of the European Conference on Computer Vision (ECCV).
Girshick, R.B., Felzenszwalb, P.F., & McAllester, D. (2012). Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/rbg/latent-release5/.
Google Inc. (2000). Google SketchUp. http://sketchup.google.com/.
Grabner, H., Gall, J., & Gool, L.J.V. (2011). What makes a chair a chair? Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1529–1536.
Grimson, W., Lozano-Pérez, T., & Huttenlocher, D. (1990). Object recognition by computer. Cambridge: MIT Press.
Grosse, R., Johnson, M. K., Adelson, E. H., & Freeman, W. T. (2009). Ground truth dataset and baseline evaluations for intrinsic image algorithms. Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2335–2342.
Guo, R., Dai, Q., & Hoiem, D. (2011). Single-image shadow detection and removal using paired regions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Gupta, A., Efros, A. A., & Hebert, M. (2010). Blocks world revisited: Image understanding using qualitative geometry and mechanics. Proceedings of the European Conference on Computer Vision (ECCV).
Gupta, A., Satkin, S., Efros, A. A., & Hebert, M. (2011). From 3D scene geometry to human workspace. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Hays, J., & Efros, A. A. (2007). Scene completion using millions of photographs. ACM Transactions on Graphics (SIGGRAPH 2007). doi:10.1145/1275808.1276382.
Hays, J., & Efros, A. A. (2008). im2gps: estimating geographic information from a single image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Hedau, V., Hoiem, D., & Forsyth, D. (2010). Thinking inside the box: Using appearance models and context based on room geometry. Proceedings of the European Conference on Computer Vision (ECCV).
Hedau, V., Hoiem, D., & Forsyth, D. (2012). Recovering free space of indoor scenes from a single image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Herbrich, R., Graepel, T., & Obermayer, K. (1999). Support vector learning for ordinal regression. Proceedings of the International Conference on Artificial Neural Networks (ANN), 1, 97–102.
Hoiem, D., Efros, A. A., & Hebert, M. (2007). Recovering surface layout from an image. International Journal of Computer Vision (IJCV), 75(1), 151–172.
Hoiem, D., Efros, A. A., & Hebert, M. (2008). Closing the loop on scene interpretation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
IKEA (2013). IKEA Catalog.
Karsch, K., Hedau, V., Forsyth, D., & Hoiem, D. (2011). Rendering synthetic objects into legacy photographs. Proceedings of the 2011 SIGGRAPH Asia Conference. ACM.
Karsch, K., Sunkavalli, K., Hadap, S., Carr, N., Jin, H., Fonte, R., et al. (2014). Automatic scene inference for 3D object compositing. ACM Transactions on Graphics (to appear).
Lai, K., & Fox, D. (2009). 3D laser scan classification using web data and domain adaptation. Proceedings of Robotics: Science and Systems, Seattle, USA.
Lalonde, J.-F., Hoiem, D., Efros, A. A., Rother, C., Winn, J., & Criminisi, A. (2007). Photo clip art. ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), 3.
Lee, D., Gupta, A., Hebert, M., & Kanade, T. (2010). Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. Proceedings of the Conference on Neural Information Processing Systems (NIPS).
Lee, D., Hebert, M., & Kanade, T. (2009). Geometric reasoning for single image structure recovery. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Levin, A., Rav-Acha, A., & Lischinski, D. (2008). Spectral matting. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(10), 1699–1712.
Li, L.-J., Su, H., Xing, E. P., & Fei-Fei, L. (2010). Object bank: A high-level image representation for scene classification & semantic feature sparsification. Neural Information Processing Systems (NIPS), 2(3), 5.
Lim, J. J., Pirsiavash, H., & Torralba, A. (2013). Parsing ikea objects: Fine pose estimation. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Liu, C., Yuen, J., Torralba, A., Sivic, J., & Freeman, W. T. (2008). SIFT flow: Dense correspondence across different scenes. Proceedings of the European Conference on Computer Vision (ECCV).
Liu, M.-Y., Tuzel, O., Veeraraghavan, A., & Chellappa, R. (2010). Fast directional chamfer matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Longuet-Higgins, H. C. (1981). A computer algorithm for reconstructing a scene from two projections. Nature, 293, 133–135.
Lowe, D. (1987). Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31(3), 355–395.
Microsoft Corporation (2010). Kinect for Xbox 360.
Munoz, D., Bagnell, J. A., & Hebert, M. (2010). Stacked hierarchical labeling. Proceedings of the European Conference on Computer Vision (ECCV).
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision (IJCV), 42, 145–175.
Oliva, A., & Torralba, A. (2006). Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research, 42, 145–175.
Pero, L. D., Bowdish, J., Fried, D., Kermgard, B., Hartley, E. & Barnard, K. (2012). Bayesian geometric modeling of indoor scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Pero, L. D., Bowdish, J., Hartley, E., Kermgard, B., & Barnard, K. (2013). Understanding bayesian rooms using composite 3D object models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Pero, L. D., Guan, J., Brau, E., Schlecht, J., & Barnard, K. (2011). Sampling bedrooms. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Polantis (2010). Polantis 3D Catalog. http://www.polantis.com/ikea.
Ramalingam, S., Bouaziz, S., Sturm, P., & Brand, M. (2010). SKYLINE2GPS: Localization in Urban Canyons using Omni-Skylines. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
Satkin, S. (2013). Data-Driven Geometric Scene Understanding. PhD thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, August.
Satkin, S., & Hebert, M. (2013). 3DNN: Viewpoint invariant 3D geometry matching for scene understanding. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Satkin, S., Lin, J., & Hebert, M. (2012). CMU 3D-Annotated Scene Database. http://cmu.satkin.com/bmvc2012/.
Satkin, S., Lin, J., & Hebert, M. (2012). Data-driven scene understanding from 3D models. Proceedings of the 23rd British Machine Vision Conference (BMVC).
Saxena, A., Sun, M., & Ng, A. Y. (2009). Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 31(5), 824–840.
Schwing, A. G., & Urtasun, R. (2012). Efficient Exact Inference for 3D Indoor Scene Understanding. Proceedings of the European Conference on Computer Vision (ECCV).
Shao, T., Xu, W., Zhou, K., Wang, J., Li, D., & Guo, B. (2012). An interactive approach to semantic modeling of indoor scenes with an rgbd camera. ACM Transactions on Graphics, 31(6), 136:1–136:11.
Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from rgbd images. Proceedings of the European Conference on Computer Vision (ECCV).
Sing, G., & Košecká, J. (2013). Nonparametric scene parsing with adaptive feature relevance and semantic context. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tighe, J., & Lazebnik, S. (2010). Superparsing: Scalable nonparametric image parsing with superpixels. Proceedings of the European Conference on Computer Vision (ECCV).
Torralba, A., Fergus, R., & Freeman, W. T. (Nov. 2008). 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(11), 1958–1970.
Torralba, A., Russell, B. C., & Yuen, J. (2010). Labelme: Online image annotation and applications. Proceedings of the IEEE, 98(8), 1467–1484.
Trimble Inc. (2012). Trimble 3D Warehouse. http://sketchup.google.com/3dwarehouse.
Vondrick, C., Khosla, A., & Malisiewicz, T., & Torralba, A. (2013). Inverting and visualizing features for object detection. MIT Technical Report.
Wang, H., Gould, S., & Koller, D. (2010). Discriminative learning with latent variables for cluttered indoor scene understanding. Proceedings of the European Conference on Computer Vision (ECCV).
Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Yu, S., Zhang, H., & Malik, J. (2008). Inferring spatial layout from a single image via depth-ordered grouping. Proceedings of the IEEE Workshop on Perceptual Organization in Computer Vision (CVPR-POCV).
Zhao, Y., & Zhu, S.-C. (2013). Scene parsing by integrating function, geometry and appearance models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).