Trans-SVNet: hybrid embedding aggregation Transformer for surgical workflow analysis

Springer Science and Business Media LLC - Volume 17, Issue 12, pp. 2193-2202 - 2022
Jin, Yueming1, Long, Yonghao2, Gao, Xiaojie2, Stoyanov, Danail1, Dou, Qi2,3, Heng, Pheng-Ann2,3
1Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS), Department of Computer Science, University College London, London, UK
2Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, China
3Institute of Medical Intelligence and XR, The Chinese University of Hong Kong, Shatin, China

Abstract

Real-time surgical workflow analysis has been a key component of computer-assisted intervention systems for improving cognitive assistance. Most existing methods rely solely on conventional temporal models and encode features in a successive spatial–temporal arrangement, so the supportive benefits of intermediate features are partially lost from both the visual and temporal aspects. In this paper, we rethink feature encoding to attend to and preserve the critical information needed for accurate workflow recognition and anticipation. We introduce the Transformer into surgical workflow analysis to reconsider the complementary effects of spatial and temporal representations. We propose a hybrid embedding aggregation Transformer, named Trans-SVNet, that effectively interacts with the designed spatial and temporal embeddings by employing the spatial embedding to query the temporal embedding sequence. We jointly optimize the model with loss objectives from both analysis tasks to leverage their high correlation. We extensively evaluate our method on three large surgical video datasets. Our method consistently outperforms state-of-the-art methods on the workflow recognition task across all three datasets. When jointly learned with anticipation, recognition results gain a large improvement, and our approach also achieves promising anticipation performance. Our model reaches a real-time inference speed of 0.0134 s per frame. Experimental results demonstrate the efficacy of our hybrid embedding integration in rediscovering crucial cues from complementary spatial–temporal embeddings. The improved performance under multi-task learning indicates that the anticipation task brings additional knowledge to the recognition task. The effectiveness and efficiency of our method also indicate its potential for use in the operating room.
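The core aggregation step described above — a per-frame spatial embedding querying a sequence of temporal embeddings — can be sketched as plain scaled dot-product attention. This is a minimal NumPy illustration of that mechanism, not the paper's implementation; the function names, shapes, and single-head formulation are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(spatial_emb, temporal_seq):
    """Hypothetical single-head sketch: the spatial embedding of the
    current frame acts as the query; the temporal embedding sequence
    supplies both keys and values.

    spatial_emb:  (d,)   embedding of the current frame
    temporal_seq: (T, d) temporal embeddings of the preceding frames
    returns:      (d,)   attention-fused hybrid embedding
    """
    d = spatial_emb.shape[-1]
    scores = temporal_seq @ spatial_emb / np.sqrt(d)  # (T,) similarity of query to each step
    weights = softmax(scores)                         # (T,) attention distribution over time
    return weights @ temporal_seq                     # (d,) weighted sum of temporal embeddings
```

In the full model these attention outputs would feed the recognition and anticipation heads, which are trained jointly; here only the query/key/value interaction is shown.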
