Abstract
Purpose
Even though workflow analysis in the operating room has come a long way, current systems are still limited to research settings. In the quest for a robust, universal setup, hardly any attention has been paid to the dimension of audio, despite its numerous advantages, such as low cost, independence from location and line of sight, and low processing-power requirements.
Methodology
We present an approach for audio-based event detection that relies solely on two microphones capturing the sound in the operating room. To this end, a new data set was created with over 63 h of audio recorded and annotated at the University Hospital rechts der Isar. Sound files were labeled, preprocessed, augmented, and subsequently converted to log-mel-spectrograms that served as visual input for event classification using pretrained convolutional neural networks.
Results
Comparing multiple architectures, we were able to show that even lightweight models, such as MobileNet, can already provide promising results. Data augmentation additionally improved the classification of 11 defined classes, including, inter alia, different types of coagulation, operating table movements, and an idle class. With the newly created audio data set, an overall accuracy of 90%, a precision of 91%, and an F1-score of 91% were achieved, demonstrating the feasibility of audio-based event recognition in the operating room.
Conclusion
With this first proof of concept, we demonstrated that audio events can serve as a meaningful source of information that goes beyond spoken language and can easily be integrated into future workflow recognition pipelines using computationally inexpensive architectures.
Introduction
Workflow analysis in the operating room has been subject to research for many years [1,2,3]. Continuously collecting information about the ongoing surgery is essential to create a digital representation of the current procedural tasks, the surgical phase, and the overall situation. Almost all kinds of sophisticated assistive technologies that are foreseen to actively participate in surgical procedures in the future rely on the creation of a digital representation of the intervention [1, 3]. If, for example, a cognitive OR is to dim the lights on time when a laparoscopic phase starts, or an autonomously acting robotic circulator is to fetch sterile goods when additional material is needed, reliable situational awareness of the digital system is an essential prerequisite.
In order to create a system of this kind, previously proposed concepts mainly relied on the combination of multiple information sources, integrating signals of various surgical devices for the recognition of patterns, e.g., the activation of energy devices or the movement of an operating table [4, 5]. However, a major challenge is that today’s OR hardware lacks standardized communication protocols and interfaces. Moreover, data acquisition mostly relies on physical inter-device connections, which is cumbersome in the OR environment, as devices are often shared and moved between several OR suites and additional cables have to be run across the floor. Hence, comprehensive data acquisition setups can mainly be found in research projects and are rather a juxtaposition of workarounds than a holistic solution with access to all devices at the push of a button. Efforts are made by consortia like OR.net to harmonize the interfaces of upcoming products, and first demonstrators of connected operating rooms, such as the IVAP2025 in Munich, Germany, or the SCOT in Tokyo, Japan, have been presented [4, 6, 7]. Unfortunately, those setups are individual implementations that go along with high initial costs and will not be integrated into a large number of ORs in the foreseeable future. This, however, contrasts with the need for multicentric large-scale data acquisition, which is required for building robust and reliable cognitive OR systems.
Due to the advancements in machine learning techniques in recent years, the focus of research on workflow analysis has shifted heavily toward the processing of intraoperative images as an essential component. Instrument detection within laparoscopic videos and semantic understanding of the surgical scene from the outside perspective using RGB(-Depth) cameras have strongly contributed to the progress of workflow analysis [8, 9]. The main drawbacks of these video-based approaches are the need for high-quality images, the diverse appearance of surgical scenes, and privacy concerns. With only a fraction of today’s interventions being performed minimally invasively, the availability of intraabdominal video signals can also not be taken for granted [10]. Conversely, external cameras require at least a direct line of sight, which turns out to be challenging in real-life OR environments due to multiple equipment booms and moving staff obstructing the view [11,12,13]. Especially in larger scenes, smaller objects can be hard to detect, and the highly dynamic lighting situation poses an additional difficulty.
While all concepts contributed substantially to this field, an optimal solution for a simple yet robust data acquisition setup has not been found so far. In the following, we therefore propose a new way of collecting intraoperative data using the previously neglected dimension of audio. This proof of concept thereby combines both aforementioned approaches by gathering information from a diversified set of medical devices—via their sounds—in conjunction with well-established deep-learning-based image classification techniques transferred to the domain of audio.
Methodology
As the literature has shown, gathering status information from a larger set of surgical devices can be sufficient for phase detection, even without the use of any video signals [4, 5]. Building upon this idea, but avoiding the cumbersome process of establishing an individual connection to each and every device, we base our setup merely on audio signals. During an intervention, the visual and haptic perception of surgeons is already utilized to capacity. That is why the sound design of technical equipment is an important part of the development process of almost every device nowadays. Audio icons are created to provide characteristic feedback on current settings and deployment status for the OR staff. Capturing those signals via microphones mirrors the auditory perception of a surgeon and reduces a complex sensor setup to a low-cost, potentially even wireless, solution that is spatially flexible and does not require any line of sight. Apart from medical devices and their signature sounds, other distinct environmental sounds, such as the opening of wrapped sterile instruments, the mechanical noise of table movements, or the opening and closing of doors, can be used as well to spot process-related events.
In the following, we demonstrate these possibilities in two steps. First, we describe the creation of a data set and the recording setup in the operating room. Second, we explain the computational processing of the data.
Data acquisition
Despite extensive research in the field of surgical data science, no publicly available, broad collection of surgical audio recordings exists. That is why, as an initial step, creating a high-quality audio data set was indispensable.
Recording took place in multiple operating rooms of the University Hospital rechts der Isar in Munich, Germany, during 23 surgeries. The interventions comprised laparoscopic hemicolectomies and sigmoid colectomies performed on a robotic system (Da Vinci ** the class phone, which was based to a large extent on augmented samples.
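The conversion of raw audio into the log-mel-spectrograms that serve as image-like network input can be sketched in plain numpy as follows; the sample rate, FFT size, hop length, and number of mel bands are illustrative assumptions, not the parameters used in this study.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the perceptual mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def log_mel_spectrogram(x, sr=16000, n_fft=512, hop=256, n_mels=64):
    # Frame the signal, apply a Hann window, take the power spectrum,
    # project onto the mel filterbank, and compress with a log.
    n_frames = 1 + (len(x) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # shape: (n_frames, n_mels)

# A one-second 440 Hz test tone yields a (61, 64) image-like array.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = log_mel_spectrogram(tone)
```

The resulting two-dimensional array can then be treated exactly like a grayscale image by a pretrained CNN.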
In the predictions, the network seems to slightly mix up the two coagulation classes with each other, as well as the upward and downward table movements. The audio comments of the DaVinci robot, which form a quite diverse group of samples within their own class, are sometimes confused with the idle class, which also contains spoken language of the medical staff.
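Such per-class confusions are typically read off a confusion matrix; the following numpy sketch shows how overall accuracy, macro-averaged precision, and macro-averaged F1-score can be derived from one. The 3-class example matrix is purely illustrative, not the study's results.

```python
import numpy as np

def metrics_from_confusion(cm):
    # Rows = true class, columns = predicted class.
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)  # per predicted class
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)     # per true class
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    # Macro averaging weights every class equally, which is useful
    # when class frequencies in the test set are imbalanced.
    return accuracy, precision.mean(), f1.mean()

# Illustrative 3-class matrix (e.g., two coagulation types and idle).
cm = [[8, 1, 1],
      [0, 9, 1],
      [1, 0, 9]]
acc, prec, f1 = metrics_from_confusion(cm)
```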
Discussion
As a proof of concept, this initial proposal is meant to introduce the domain of audio as a valuable source of information for future workflow analysis systems in the OR. Even though the pipeline was deliberately kept general and simple, the results are very promising.
Within the scope of this paper, OR conditions were taken as given for the recordings. Thus, there is a lot of background noise in the data set, ranging from air-conditioning to radios playing in the background. Here, digital filtering in the preprocessing step could further improve future results. In an advanced setup, the positioning of the microphones can also be optimized, thereby physically reducing the dominance of individual audio sources close to the respective microphone’s position, such as the suction device in Fig. 1a. Hence, microphones mounted on the OR’s ceiling, even as an array for added spatial resolution, should be evaluated.
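As a first step toward such digital filtering, a simple Butterworth band-pass could attenuate low-frequency rumble such as air-conditioning noise; the cutoff frequencies and filter order below are illustrative assumptions rather than tuned values.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandpass(x, sr, low=100.0, high=7000.0, order=4):
    # Second-order-section form is numerically stable for higher orders.
    sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfilt(sos, x)

# Synthetic example: 50 Hz hum plus a 1 kHz tone.
# After filtering, the hum is strongly attenuated while the tone passes.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
y = bandpass(x, sr)
Y = np.abs(np.fft.rfft(y))  # 1 s of audio -> bin k corresponds to k Hz
```

More elaborate approaches, such as spectral gating against a recorded noise profile, would be a natural next step.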
For this initial proof of concept, we limited the scope to 10 different events that can help identify the timestamps currently documented manually by the circulator, such as a patient entering the OR or the start of a surgery. However, for an actual deduction of surgical phases, the data set has to be extended to a more comprehensive list of entities. In conjunction with a phase annotation, patterns in the temporal relation of individual audio events can then be analyzed and surgical phases identified via a second classification step.
Also, an optimal sample length, as a tradeoff between real-time capability and precision, has yet to be determined. Furthermore, the presented event classification relies only on classic CNNs in order to provide baseline results and prove the feasibility of event recognition using computationally inexpensive methods for a later implementation on edge devices. However, more advanced machine learning techniques, such as transformer networks, could further improve the results and include temporal information.
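The sample-length tradeoff can be made concrete with a sliding-window segmentation of the audio stream: shorter windows reduce detection latency but give the classifier less acoustic context per prediction. The window and hop durations below are illustrative assumptions, not the settings used in this study.

```python
import numpy as np

def sliding_windows(stream, sr, win_s=1.0, hop_s=0.5):
    """Split an audio stream into fixed-length, overlapping analysis
    windows; each window would then be converted to a spectrogram and
    classified independently."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    n = 1 + max(len(stream) - win, 0) // hop
    return np.stack([stream[i * hop:i * hop + win] for i in range(n)])

# Five seconds of audio at 16 kHz -> 9 overlapping one-second windows.
sr = 16000
stream = np.random.default_rng(0).standard_normal(5 * sr)
windows = sliding_windows(stream, sr)
```

Halving `hop_s` doubles the prediction rate at the cost of proportionally more classifier invocations, which matters on edge hardware.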
From a medical point of view, our setup has the advantage of not being limited to a defined set of medical devices equipped with sensors or dedicated software interfaces, but being deployable in every possible OR configuration and device combination. Even though direct data from medical devices would be preferable in the long run, audio recognition can already contribute to the fast scaling of workflow data libraries in the research community today. It even goes beyond the devices and can recognize distinct environmental sounds, such as dropping instruments or the unwrapping of sterile drapes. This advantage is amplified by the fact that a created data set is not limited to training on one type of intervention only, unlike widespread image-based data sets such as Cholec80 [23] for cholecystectomies. As the audio data itself is universal, it can be reused as training data for all kinds of interventions and even across various surgical disciplines. Furthermore, the sound signatures are completely patient-independent and do not relate to age, sex, or ethnicity. As all predictions can run offline on edge devices, due to the lightweight audio data, confidential information does not need to leave the OR and can be processed as a stream with only identified events being logged.
While the temporal resolution of phase detection based entirely on audio events has yet to be evaluated, we are convinced that the combination with other modalities and techniques, such as keyword recognition, would elevate current workflow detection pipelines to the next level in terms of prediction accuracy.
Conclusion
With an intelligent OR in mind, we presented a new approach for the future development of workflow recognition systems by incorporating audio signals. Microphones as a low-cost sensor, in conjunction with log-mel-spectrogram-based signal analysis using deep convolutional neural networks, showed promising results when applied to a unique data set of more than 20,000 individually gathered OR audio samples. Using the MobileNet architecture, we were able to achieve up to 90% accuracy for the recognition of 10 classes, including audio events such as directed table movements or instrumentation with specific energy devices.
Next, we plan to expand our data set, including the variety of detectable events, and connect single predictions with a temporal context model. Meanwhile, we encourage the community to build upon this new approach and consider the dimension of audio as a highly informative data source for future workflow recognition systems.
References
Maier-Hein L, Vedula SS, Speidel S, Navab N, Kikinis R, Park A, Eisenmann M, Feussner H, Forestier G, Giannarou S, Hashizume M, Katic D, Kenngott H, Kranzfelder M, Malpani A, März K, Neumuth T, Padoy N, Pugh C, Schoch N, Stoyanov D, Taylor R, Wagner M, Hager G, Jannin P (2017) Surgical data science for next-generation interventions. Nat Biomed Eng 1(9):691–696
Blum T, Padoy N, Feußner H, Navab N (2008) Workflow mining for visualization and analysis of surgeries. Int J Comput Assist Radiol Surg 3:379–386
Demir KC, Schieber H, Weise T, Roth D, May M, Maier A, Yang SH (2023) Deep learning in surgical workflow analysis: a review of phase and step recognition. IEEE J Biomed Health Inf. https://doi.org/10.1109/JBHI.2023.3311628
Kranzfelder M, Schneider A, Fiolka A, Koller S, Reiser S, Vogel T, Wilhelm D, Feussner H (2014) Reliability of sensor-based real-time workflow recognition in laparoscopic cholecystectomy. Int J Comput Assist Radiol Surg 9:941–948
DiPietro R, Stauder R, Kayis E, Schneider A, Kranzfelder M, Feussner H, Hager GD, Navab N (2015) Automated surgical-phase recognition using rapidly-deployable sensors. In Proc MICCAI Workshop M2CAI
Kasparick M, Schmitz M, Andersen B, Rockstroh M, Franke S, Schlichting S, Golatowski F, Timmermann D (2018) OR.NET a service-oriented architecture for safe and dynamic medical device interoperability. Biomed Eng/Biomedizinische Technik 63(1):11–30
Muragaki Y, Okamoto J, Masamune K, Iseki H (2022) Smart Cyber operating theater (SCOT): strategy for future OR. In: Hashizume M (ed) Multidisciplinary computational anatomy. Springer, Singapore, pp 389–393
Anteby R, Horesh N, Soffer S, Zager Y, Barash Y, Amiel I, Rosin D, Gutman M, Klang E (2021) Deep learning visual analysis in laparoscopic surgery: a systematic review and diagnostic test accuracy meta-analysis. Surg Endosc 35:1521–1533
Özsoy E, Örnek EP, Eck U, Czempiel T, Tombari F, Navab N (2022) 4D-OR: semantic scene graphs for OR domain modeling. In: Medical image computing and computer assisted intervention – MICCAI 2022: 25th international conference, Singapore, September 18–22, 2022, proceedings, part VII. Springer Nature Switzerland, Cham. pp 475–485
Mattingly AS, Chen MM, Divi V, Holsinger FC, Saraswathula A (2023) Minimally invasive surgery in the United States, 2022: understanding its value using new datasets. J Surg Res 281:33–36
Padoy N (2019) Machine and deep learning for workflow recognition during surgery. Minim Invasive Ther Allied Technol 28(2):82–90
Volkov M, Hashimoto DA, Rosman G, Meireles OR, Rus D (2017) Machine learning and coresets for automated real-time video segmentation of laparoscopic and robot-assisted surgery. In 2017 IEEE international conference on robotics and automation (ICRA) pp. 754–759
Blum T, Feußner H, Navab N (2010) Modeling and segmentation of surgical workflow from laparoscopic video. In: Medical image computing and computer-assisted intervention – MICCAI 2010: 13th international conference, Beijing, China, September 20–24, 2010, proceedings, part III. Springer Berlin Heidelberg. pp 400–407
Purwins H, Li B, Virtanen T, Schlüter J, Chang SY, Sainath T (2019) Deep learning for audio signal processing. IEEE J Sel Top Signal Process 13(2):206–219
Stevens SS, Volkmann J (1940) The relation of pitch to frequency: a revised scale. Am J Psychol 53(3):329–353
Lewicki MS (2002) Efficient coding of natural sounds. Nat Neurosci 5(4):356–363
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mane D, Monga R, Moore S, Murray DG, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viegas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. In: Proceedings of the 12th USENIX symposium on operating systems design and implementation, OSDI 16. pp 265–283
Chollet F (2015) Keras. [online] https://github.com/fchollet/keras. Accessed 20 Apr 2024
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE. pp 248–255
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition. pp 2818–2826
Majumdar S (2017) MobileNet v1 models for Keras [online] https://github.com/fchollet/deep-learning-models/blob/master/mobilenet.py. Accessed 20 Apr 2024
Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708
Twinanda AP, Shehata S, Mutter D, Marescaux J, De Mathelin M, Padoy N (2016) Endonet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Trans Med Imaging 36(1):86–97
Tan M, Le Q (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning. PMLR. pp 6105–6114
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 770–778
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd international conference on learning representations (ICLR 2015). Computational and Biological Learning Society 2015
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
Jonas Fuchtmann, Thomas Riedel, Daniel Ostler, Maximilian Berlet, Alissa Jell, Luca Wegener, Lars Wagner, Simone Graf, and Dirk Wilhelm declare no conflict of interest.
Ethical approval
The study, including the human participants, was approved by the ethics board of Klinikum rechts der Isar of the Technical University of Munich (application number 337/21 S-EB).
Informed consent
This article does not contain patient data.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fuchtmann, J., Riedel, T., Berlet, M. et al. Audio-based event detection in the operating room. Int J CARS (2024). https://doi.org/10.1007/s11548-024-03211-1