Scale-invariant localization using quasi-semantic object landmarks
Abstract
This work presents Object Landmarks, a new type of visual feature designed for visual localization over major changes in distance and scale. An Object Landmark consists of a bounding box $\mathbf{b}$ defining an object, a descriptor $\mathbf{q}$ of that object produced by a Convolutional Neural Network, and a set of classical point features within $\mathbf{b}$. We evaluate Object Landmarks on visual odometry and place-recognition tasks, and compare them against several modern approaches. We find that Object Landmarks enable superior localization over major scale changes, reducing error by as much as 18% and increasing robustness to failure by as much as 80% versus the state of the art. They allow localization under scale-change factors of up to 6, whereas state-of-the-art approaches break down at factors of 3 or more.
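To make the structure concrete, the following is a minimal Python sketch of an Object Landmark as described above. The class name, field names, and types are illustrative assumptions for exposition, not the paper's implementation.

# A minimal sketch of the Object Landmark structure described in the abstract.
# All names and types here are assumptions chosen for illustration.
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np


@dataclass
class ObjectLandmark:
    """One landmark: an object bounding box, a CNN descriptor of the object,
    and classical point features detected inside the box."""
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    descriptor: np.ndarray                   # CNN feature vector q for the cropped object
    keypoints: List[Tuple[float, float]] = field(default_factory=list)  # point-feature locations inside bbox
    point_descriptors: List[np.ndarray] = field(default_factory=list)   # e.g. SIFT/ORB descriptors, one per keypoint

In such a sketch, matching two landmarks could first compare the CNN descriptors for coarse, scale-tolerant association and then match the enclosed point features for precise geometry, which mirrors the two-level design the abstract describes.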