Introduction

Workflow analysis in the operating room has been a subject of research for many years [1,2,3]. Continuously collecting information about the ongoing surgery is essential to create a digital representation of the current procedural tasks, the surgical phase, and the overall situation. Almost all sophisticated assistive technologies that are foreseen to actively participate in surgical procedures in the future rely on the creation of such a digital representation of the intervention [1, 3]. If, for example, a cognitive OR is to dim the lights in time when a laparoscopic phase starts, or an autonomously acting robotic circulator is to fetch sterile goods when additional material is needed, reliable situation awareness of a digital system is an essential prerequisite.

In order to create a system of this kind, previously proposed concepts mainly relied on combining multiple information sources by integrating signals of various surgical devices for the recognition of patterns, e.g., the activation of energy devices or the movement of an operating table [4, 5]. A major challenge, however, is that today’s OR hardware lacks standardized communication protocols and interfaces. Moreover, data acquisition mostly relies on physical inter-device connections, which is cumbersome in the OR environment, as devices are often shared and moved between several OR suites and additional cables have to be run across the floor. Hence, comprehensive data acquisition setups are mainly found in research projects and are rather a juxtaposition of workarounds than a holistic solution with access to all devices at the push of a button. Efforts are being made by consortia such as OR.net to harmonize the interfaces of upcoming products, and the first demonstrators of connected operating rooms, such as the IVAP2025 in Munich, Germany, or the SCOT in Tokyo, Japan, have been presented [4, 6, 7]. Unfortunately, these setups are individual implementations that come with high initial costs and will not be integrated into a large number of ORs in the foreseeable future. This, however, contrasts with the need for multicentric, large-scale data acquisition, which is required for building robust and reliable cognitive OR systems.

Due to the advances in machine learning in recent years, the focus of workflow analysis research has shifted heavily toward the processing of intraoperative images as an essential component. Instrument detection within laparoscopic videos and a semantic understanding of the surgical scene from the outside perspective using RGB(-Depth) cameras have strongly contributed to the progress of workflow analysis [8, 9]. The main drawbacks of these video-based approaches are the need for high-quality images, the high visual variability of surgical scenes, and privacy concerns. With only a fraction of today’s interventions being performed minimally invasively, the availability of intraabdominal video signals cannot be taken for granted either [10]. External cameras, in turn, require at least a direct line of sight, which is challenging in real-life OR environments, as multiple equipment booms and moving staff obstruct the view [11,12,13]. Especially in larger scenes, smaller objects can be hard to detect, and the highly dynamic lighting conditions pose an additional difficulty.

While all of these concepts have contributed substantially to the field, an optimal solution for a simple yet robust data acquisition setup has not been found so far. In the following, we therefore propose a new way of collecting intraoperative data using the previously neglected dimension of audio. This proof of concept combines both of the aforementioned approaches by gathering information from a diverse set of medical devices via their sounds, in conjunction with well-established deep-learning-based image classification techniques transferred to the domain of audio.

Methodology

As the literature has shown, gathering status information from a larger set of surgical devices can be sufficient for phase detection, even without the use of any video signals [4, 5]. Building on this idea, but avoiding the cumbersome process of establishing an individual connection to each and every device, we base our setup solely on audio signals. During an intervention, the visual and haptic perception of surgeons is already utilized to capacity. That is why the sound design of technical equipment is nowadays an important part of the development process of almost every device. Audio icons are created to provide the OR staff with characteristic feedback on current settings and deployment status. Capturing these signals via microphones mirrors the auditory perception of the surgeon and reduces a complex sensor setup to a low-cost, potentially even wireless, solution that is spatially flexible and does not require any line of sight. Apart from medical devices and their signature sounds, other distinct environmental sounds, such as the opening of wrapped sterile instruments, the mechanical noise of table movements, or the opening and closing of doors, can be used to spot process-related events.
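
As a rough illustration of how such audio could be captured in practice, the following minimal sketch records short mono windows from a single microphone. It assumes the open-source Python package sounddevice; the sample rate and window length are illustrative choices, not values prescribed by our setup.

```python
# Minimal sketch: capture short mono audio windows from an OR microphone.
# Assumes the "sounddevice" package; sample rate and window length are
# illustrative values, not those of the actual recording setup.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 44_100   # Hz (assumed)
WINDOW_SECONDS = 1.0   # length of one audio sample (assumed)

def record_window() -> np.ndarray:
    """Record one mono window and return it as a float32 array."""
    frames = int(SAMPLE_RATE * WINDOW_SECONDS)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()            # block until the window has been fully recorded
    return audio[:, 0]   # drop the channel dimension

if __name__ == "__main__":
    window = record_window()
    print(f"Captured {window.shape[0]} samples, peak amplitude {np.abs(window).max():.3f}")
```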

In the following, we demonstrate these possibilities in two steps. First, we describe the creation of a data set and the recording setup in the operating room. Second, we explain the computational processing of the data.

Data acquisition

Despite extensive research in the field of surgical data science, no broad, publicly available collection of surgical audio recordings exists. Therefore, as an initial step, creating a high-quality audio data set was indispensable.

Recording took place in multiple operating rooms of the University Hospital rechts der Isar in Munich, Germany, during 23 surgeries. The interventions comprised laparoscopic hemicolectomies and sigmoid colectomies performed on a robotic system (Da Vinci **) […] the class phone, which was based to a large extent on augmented samples.

Fig. 2 Balanced and normalized confusion matrix for the MobileNet architecture, trained with 700 samples for each of the 10 classes. Due to rounding errors, the sum of certain rows exceeds 1.0

In its predictions, the network slightly confuses the two coagulation classes with each other, as well as the upward and downward table movements. The audio comments of the DaVinci robot, which form a rather diverse group of samples within their own class, are sometimes confused with the idle class, which also contains spoken language of the medical staff.
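
For reference, a row-normalized ("balanced") confusion matrix of the kind shown in Fig. 2 can be computed, for example, with scikit-learn; the class count and label arrays below are placeholders rather than our actual data.

```python
# Sketch: row-normalized confusion matrix, assuming scikit-learn.
# y_true / y_pred are placeholder label arrays, not our actual predictions.
import numpy as np
from sklearn.metrics import confusion_matrix

N_CLASSES = 10  # e.g., coagulation modes, table movements, robot comments, idle, ...

rng = np.random.default_rng(0)
y_true = rng.integers(0, N_CLASSES, size=1000)  # placeholder ground-truth labels
y_pred = rng.integers(0, N_CLASSES, size=1000)  # placeholder predicted labels

# normalize="true" divides each row by its class support, so each row sums to ~1.0
# (small deviations appear once the entries are rounded for display, as in Fig. 2)
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(np.round(cm, 2))
```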

Discussion

As a proof of concept, this initial proposal is meant to introduce the domain of audio as a valuable source of information for future workflow analysis systems in the OR. Even though the pipeline was deliberately kept general and simple, the results are very promising.

Within the scope of this paper, OR conditions were taken as given for the recordings. Thus, there is a lot of background noise in the data set, ranging from air conditioning to radios playing in the background. Here, digital filtering in the preprocessing step could further improve future results. In an advanced setup, the positioning of the microphones can also be optimized, thereby physically reducing the dominance of individual audio sources located close to the microphone, such as the suction device in Fig. 1a. Hence, microphones mounted on the OR ceiling, possibly even as an array for added spatial resolution, should be evaluated.
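
As one simple example of the digital filtering mentioned above, a high-pass filter could attenuate low-frequency air-conditioning rumble before any spectrogram is computed. The sketch below uses SciPy; the cutoff frequency and filter order are assumptions for illustration.

```python
# Sketch: suppress low-frequency background (e.g., air conditioning) with a
# Butterworth high-pass filter. Cutoff frequency and filter order are assumed.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def highpass(audio: np.ndarray, sample_rate: int, cutoff_hz: float = 120.0) -> np.ndarray:
    """Zero-phase high-pass filtering of a mono audio window."""
    sos = butter(4, cutoff_hz, btype="highpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, audio)

# Example: filter one second of synthetic noise sampled at 44.1 kHz
filtered = highpass(np.random.randn(44_100), sample_rate=44_100)
```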

For this initial proof of concept, we limited the scope to 10 different events that can help to identify the timestamps currently documented manually by the circulator, such as a patient entering the OR or the start of a surgery. For an actual deduction of surgical phases, however, the data set has to be extended to a more comprehensive list of entities. In conjunction with a phase annotation, patterns in the temporal relation of individual audio events can then be analyzed and surgical phases identified via a second classification step.
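
One conceivable form of such a second classification step, sketched below purely for illustration, aggregates per-window event predictions into histogram features over a sliding temporal window and trains a conventional classifier on phase labels. The number of event classes, window size, phase labels, and choice of classifier are all assumptions, not part of the presented work.

```python
# Sketch: derive surgical phases from sequences of already detected audio events.
# The number of event classes, window size, phase labels, and classifier are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_EVENT_CLASSES = 10
WINDOW = 30  # number of consecutive event predictions aggregated per phase estimate

def event_histogram(events: np.ndarray) -> np.ndarray:
    """Turn a window of event class indices into a normalized histogram feature."""
    counts = np.bincount(events, minlength=N_EVENT_CLASSES).astype(float)
    return counts / counts.sum()

# Placeholder training data: event index sequences with (hypothetical) phase labels
rng = np.random.default_rng(0)
X = np.stack([event_histogram(rng.integers(0, N_EVENT_CLASSES, WINDOW)) for _ in range(200)])
y = rng.integers(0, 5, size=200)  # 5 hypothetical surgical phases

phase_classifier = RandomForestClassifier(n_estimators=100).fit(X, y)
print(phase_classifier.predict(X[:3]))
```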

Also, an optimal sample length as a tradeoff between real-time capability and precision has yet to be evaluated. Furthermore, the presented event classification relies only on classic CNNs in order to provide baseline results and prove the feasibility of event recognition with inexpensive computational methods, with a later implementation on edge devices in mind. More advanced machine learning techniques, such as transformer networks, could further improve the results and incorporate temporal information.
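
To explore this tradeoff, recordings could, for instance, be split into fixed-length, overlapping windows of varying size before classification. A minimal framing helper, with purely illustrative parameters, might look as follows.

```python
# Sketch: frame a long recording into fixed-length, overlapping windows so that
# different sample lengths can be compared. Window and hop sizes are illustrative.
import numpy as np

def frame_signal(audio: np.ndarray, sample_rate: int,
                 window_s: float = 1.0, hop_s: float = 0.5) -> np.ndarray:
    """Return an array of shape (n_windows, window_samples)."""
    win = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    n = 1 + max(0, (len(audio) - win) // hop)
    return np.stack([audio[i * hop : i * hop + win] for i in range(n)])

# Example: 10 seconds of synthetic audio at 16 kHz, framed into 2-second windows
windows = frame_signal(np.random.randn(10 * 16_000), sample_rate=16_000, window_s=2.0)
print(windows.shape)  # (n_windows, 32000)
```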

From a medical point of view, our setup has the advantage of not being limited to a defined set of medical devices equipped with sensors or dedicated software interfaces, but of being deployable in every possible OR configuration and device combination. Even though direct data from medical devices would be preferable in the long run, audio recognition can already contribute today to the fast scaling of workflow data libraries in the research community. It even goes beyond the devices themselves and can recognize distinct environmental sounds, such as dropped instruments or the unwrapping of sterile drapes. This advantage is further amplified by the fact that such a data set is not limited to training on one type of intervention only, unlike widespread image-based data sets such as Cholec80 [23] for cholecystectomies. As the audio data itself is universal, it can be reused as training data for all kinds of interventions and even across various surgical disciplines. Furthermore, the sound signatures are completely patient-independent and do not relate to age, sex, or ethnicity. As all predictions can run offline on edge devices, thanks to the lightweight nature of audio data, confidential information does not need to leave the OR and can be processed as a stream, with only identified events being logged.
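
Such stream-based, privacy-preserving logging could in principle be realized by keeping only event labels whose prediction confidence exceeds a threshold and discarding the raw audio immediately. The highly simplified sketch below is only meant to convey the idea; the recording routine, the classifier callback, and the threshold are placeholders.

```python
# Sketch: on-device event logging that never stores raw audio.
# `next_window` and `classify_window` are placeholders for a recording routine
# and any trained audio event classifier; the confidence threshold is assumed.
import time
from typing import Callable, Tuple

import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff for logging an event

def log_events(next_window: Callable[[], np.ndarray],
               classify_window: Callable[[np.ndarray], Tuple[str, float]]) -> None:
    while True:
        audio = next_window()                       # e.g., one recorded window
        label, confidence = classify_window(audio)  # placeholder classifier call
        del audio                                   # raw audio never leaves the device
        if confidence >= CONFIDENCE_THRESHOLD:
            print(f"{time.strftime('%H:%M:%S')}  {label}  ({confidence:.2f})")
```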

While the temporal resolution of phase detection based entirely on audio events has yet to be evaluated, we are convinced that the combination with other modalities and techniques, such as keyword recognition, would elevate current workflow detection pipelines to the next level in terms of prediction accuracy.

Conclusion

With an intelligent OR in mind, we presented a new approach for the future development of workflow recognition systems by incorporating audio signals. Microphones as a low-cost sensor, in conjunction with log-mel-spectrogram-based signal analysis using deep convolutional neural networks, showed promising results when applied to a unique data set of more than 20,000 individually gathered OR audio samples. Using the MobileNet network, we were able to achieve up to 90% accuracy for the recognition of 10 classes, including audio events such as directed table movements or instrumentation with specific energy devices.
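
To make this processing chain concrete, the following sketch converts a single audio window into a log-mel spectrogram and feeds it to a MobileNetV2 with 10 output classes. It assumes the librosa and torchvision packages; all spectrogram and network parameters are illustrative and not necessarily those used to obtain the reported results.

```python
# Sketch: log-mel spectrogram + MobileNet classification of one audio window.
# Assumes librosa and torchvision; n_mels, FFT size, etc. are illustrative.
import librosa
import numpy as np
import torch
from torchvision.models import mobilenet_v2

def logmel(audio: np.ndarray, sample_rate: int = 16_000) -> np.ndarray:
    mel = librosa.feature.melspectrogram(y=audio, sr=sample_rate,
                                         n_fft=1024, hop_length=256, n_mels=64)
    return librosa.power_to_db(mel, ref=np.max)  # log scaling in dB

model = mobilenet_v2(num_classes=10)  # 10 audio event classes
model.eval()

audio = np.random.randn(16_000).astype(np.float32)  # placeholder 1-second window
spec = logmel(audio)                                # shape: (n_mels, time_frames)
x = torch.from_numpy(spec).float()
x = x.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)     # replicate to 3 channels, add batch dim
with torch.no_grad():
    predicted_class = model(x).argmax(dim=1).item()
print(predicted_class)
```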

Next, we plan to expand our data set, including the variety of detectable events, and to connect individual predictions with a temporal context model. Meanwhile, we encourage the community to build upon this new approach and to consider the dimension of audio as a highly informative data source for future workflow recognition systems.