The Costs and Benefits of Goal-Directed Attention in Deep Convolutional Neural Networks

Xiaoliang Luo1, Brett D. Roads1, Bradley C. Love1
1Department of Experimental Psychology, University College London, 26 Bedford Way, London WC1H 0AP, UK

Abstract

People deploy top-down, goal-directed attention to accomplish tasks, such as finding lost keys. By tuning the visual system to relevant information sources, object recognition can become more efficient (a benefit) and more biased toward the target (a potential cost). Motivated by selective attention in categorisation models, we developed a goal-directed attention mechanism that can process naturalistic (photographic) stimuli. Our attention mechanism can be incorporated into any existing deep convolutional neural network (DCNN). The processing stages in DCNNs have been related to the ventral visual stream. In that light, our attention mechanism incorporates top-down influences from the prefrontal cortex (PFC) to support goal-directed behaviour. Akin to how attention weights in categorisation models warp representational spaces, we introduce a layer of attention weights at the mid-level of a DCNN that amplifies or attenuates activity to further a goal. We evaluated the attention mechanism on photographic stimuli while varying the attentional target. We found that increasing goal-directed attention has benefits (increasing hit rates) and costs (increasing false alarm rates). At a moderate level, attention improves sensitivity (i.e. increases $d'$) at only a moderate increase in bias for tasks involving standard images, blended images and natural adversarial images chosen to fool DCNNs. These results suggest that goal-directed attention can reconfigure general-purpose DCNNs to better suit the current task goal, much like PFC modulates activity along the ventral stream. In addition to being more parsimonious and brain-consistent, the mid-level attention approach performed better than a standard machine learning approach to transfer learning, namely retraining the final network layer to accommodate the new task.
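As a concrete illustration of the mechanism described above, the following is a minimal PyTorch sketch, assuming a VGG-16 backbone split at a mid-level layer. The split index, the `strength` parameter and all identifiers are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of goal-directed channel-wise attention, assuming a
# torchvision VGG-16 backbone split at a mid-level layer. The split index,
# parameter names and the strength knob are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class GoalDirectedAttention(nn.Module):
    def __init__(self, split=24, n_channels=512, strength=1.0):
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")
        self.lower = vgg.features[:split]                  # bottom-up pathway below attention
        self.upper = nn.Sequential(vgg.features[split:],   # pathway above attention
                                   vgg.avgpool, nn.Flatten(), vgg.classifier)
        for p in self.parameters():                        # freeze the pretrained network
            p.requires_grad = False
        # One multiplicative weight per feature channel, initialised to 1
        # (no modulation); only these weights are tuned for the current goal.
        self.attn = nn.Parameter(torch.ones(n_channels))
        self.strength = strength                           # how far weights may stray from 1

    def forward(self, x):
        h = self.lower(x)
        # Amplify or attenuate each channel, broadcasting over spatial locations.
        w = 1.0 + self.strength * (self.attn - 1.0)
        return self.upper(h * w.clamp(min=0.0).view(1, -1, 1, 1))
```

Because only the attention weights are trainable (e.g. against a binary target-present loss), the general-purpose network stays intact and the same backbone can be reconfigured for each new goal by swapping in a different weight vector.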
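The sensitivity and bias measures referred to above are the standard signal-detection quantities, $d' = z(H) - z(F)$ and criterion $c = -\tfrac{1}{2}(z(H) + z(F))$. A small helper, with hypothetical example rates, makes the benefit/cost trade-off concrete:

```python
# Signal-detection measures: sensitivity d' and criterion (bias) c,
# computed from hit and false-alarm rates via the inverse normal CDF.
from scipy.stats import norm

def dprime_and_bias(hit_rate, fa_rate):
    z_h, z_f = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_h - z_f            # higher = better discrimination
    c = -0.5 * (z_h + z_f)         # negative = liberal bias toward "target present"
    return d_prime, c

# Hypothetical illustration: attention that raises hits from 0.60 to 0.85
# while false alarms rise from 0.10 to 0.20 still improves d'
# (about 1.53 -> 1.88) while shifting the criterion in the liberal direction.
print(dprime_and_bias(0.60, 0.10))   # ~(1.53, 0.51)
print(dprime_and_bias(0.85, 0.20))   # ~(1.88, -0.10)
```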
