Surgical workflow recognition with temporal convolution and transformer for action segmentation
Abstract
Automatic surgical workflow recognition enabled by computer vision algorithms plays a key role in enhancing the learning experience of surgeons. It also supports building context-aware systems that allow better surgical planning and decision making, which may in turn improve outcomes. Utilizing temporal information is crucial for recognizing context; hence, various recent approaches use recurrent neural networks or transformers to recognize actions. We design and implement a two-stage method for surgical workflow recognition. In the first stage, we use R(2+1)D for video clip modeling. In the second stage, we propose the Action Segmentation Temporal Convolutional Transformer (ASTCFormer) network for full video modeling. ASTCFormer combines action segmentation transformers (ASFormers) and temporal convolutional networks (TCNs) to build a temporally aware surgical workflow recognition system. We compare the proposed ASTCFormer with recurrent neural networks, multi-stage TCN, and ASFormer approaches on a dataset comprising 207 robotic and laparoscopic cholecystectomy surgical videos annotated for 7 surgical phases. The proposed method outperforms the compared methods, achieving a 2.7% relative improvement in the average segmental F1-score over the state-of-the-art ASFormer method. Moreover, our proposed method achieves state-of-the-art results on the publicly available Cholec80 dataset. These improvements suggest that temporal context is captured better when information from a TCN is added to the ASFormer paradigm, which in turn leads to better surgical workflow recognition.
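The abstract reports results as a segmental F1-score without defining it. The following is a minimal sketch of how that metric is commonly computed in the action segmentation literature, assuming per-frame phase labels, an intersection-over-union (IoU) threshold for matching predicted segments to ground-truth segments, and at most one match per ground-truth segment. The function names, the default threshold of 0.5, and the toy label sequences are illustrative assumptions, not the authors' evaluation code or data.

```python
def get_segments(labels):
    """Collapse per-frame labels into (label, start, end) runs; end is exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def segmental_f1(predicted, truth, iou_threshold=0.5):
    """Segmental F1 at one IoU threshold (assumed convention, see lead-in)."""
    pred_segs = get_segments(predicted)
    true_segs = get_segments(truth)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, p_start, p_end in pred_segs:
        best_iou, best_idx = 0.0, -1
        for idx, (t_label, t_start, t_end) in enumerate(true_segs):
            if t_label != label:
                continue
            inter = max(min(p_end, t_end) - max(p_start, t_start), 0)
            union = max(p_end, t_end) - min(p_start, t_start)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        # A prediction is a true positive only if it overlaps enough with a
        # ground-truth segment of the same label that is not already matched.
        if best_idx >= 0 and best_iou >= iou_threshold and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new_score, baseline_score):
    """Relative (not absolute) gain of new_score over baseline_score."""
    return (new_score - baseline_score) / baseline_score


# Toy example with three hypothetical phases (0, 1, 2) over ten frames.
truth = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
smooth = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]        # one boundary shifted by a frame
fragmented = [0, 0, 1, 0, 1, 1, 1, 1, 2, 2]    # spurious short segments

f1_smooth = segmental_f1(smooth, truth)
f1_fragmented = segmental_f1(fragmented, truth)
```

In this toy example the fragmented prediction scores lower than the smooth one because each spurious short segment counts as a false positive. That sensitivity to over-segmentation is why a segment-level metric is a natural fit for comparing temporal models such as TCNs and ASFormer. An average segmental F1 would be obtained by averaging over several IoU thresholds; which thresholds the authors used is not stated in the abstract.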