Introduction

Technologies that enable next-generation context-aware systems in the operating room are currently being intensively researched in the domain of surgical workflow recognition [1]. Recent studies that apply machine learning algorithms to this task have shown highly promising results [2, 3]. To further support advances in this area, academic machine learning competitions are hosted regularly [4, 32].

Instrument view

The instrument view targets the analysis of set memberships of data elements. The centered bar charts, which are arranged radially, show the total number of frames in which a surgical instrument was visible in each set (i.e., individual instrument occurrence). Additionally, a bar chart that reflects the number of frames in which no instruments are visible, so-called idle frames, is included in this view. Combinations of instruments (i.e., instrument co-occurrences) are displayed as nodes in the center of the instrument view. The nodes themselves are represented as pie charts, where each segment of the pie chart shows the prevalence of the respective instrument combination in the training, validation, and test set. The positioning of the nodes is determined by a force-directed layout algorithm provided by the D3 library [30].
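As an illustration, the statistics underlying this view could be derived from frame-level annotations along the following lines. This is only a sketch: the pandas DataFrame frames, its set column, and the instrument column names are illustrative assumptions and not part of the proposed application.

    import pandas as pd

    # Illustrative instrument list; the actual labels depend on the dataset at hand.
    INSTRUMENTS = ["Grasper", "Bipolar", "Hook", "Scissors", "Clipper", "Irrigator", "SpecimenBag"]

    def instrument_view_stats(frames: pd.DataFrame) -> dict:
        """Per-set instrument occurrences, idle frames, and co-occurrence counts."""
        per_set = {}
        for split, group in frames.groupby("set"):
            # Individual instrument occurrence: frames in which each instrument is visible.
            occurrence = {inst: int(group[inst].sum()) for inst in INSTRUMENTS}
            # Idle frames: frames in which no instrument is visible.
            idle = int((group[INSTRUMENTS].sum(axis=1) == 0).sum())
            # Instrument co-occurrences: frequency of each combination of visible instruments.
            combos = (
                group[INSTRUMENTS]
                .apply(lambda row: tuple(i for i in INSTRUMENTS if row[i] == 1), axis=1)
                .value_counts()
            )
            per_set[split] = {"occurrence": occurrence, "idle": idle, "co_occurrence": combos}
        return per_set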

Fig. 2

Instrument view of the proposed application with eight proctocolectomy surgeries from the “Surgical Workflow Analysis in the sensorOR 2017” challenge dataset [6] (A) and a selected combination of Grasper and Ligasure (B)

To facilitate the exploration of the surgical instrument data, several interaction techniques are implemented in this view. Selecting an individual instrument highlights all instrument co-occurrence nodes that involve it. In addition, co-occurrence nodes can be selected individually, which reveals the proportion of co-occurrence frames in relation to the frames of the involved instruments (see Fig. 2B). Upon filtering of individual instruments or instrument co-occurrences, the other views of the visual framework are updated accordingly to display only the selected frames.
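As a minimal sketch of the quantity revealed by selecting a co-occurrence node (cf. Fig. 2B), the proportion of co-occurrence frames relative to the frames of each involved instrument can be computed as follows. The frames DataFrame is the same illustrative assumption as above, and the default combination merely mirrors the example in Fig. 2B.

    def co_occurrence_share(frames, combination=("Grasper", "Ligasure")):
        """Share of a combination's frames relative to each involved instrument's frames."""
        # Frames in which every instrument of the selected combination is visible
        # (assumes each involved instrument appears at least once in the data).
        combo_frames = int(frames[list(combination)].all(axis=1).sum())
        return {inst: combo_frames / int(frames[inst].sum()) for inst in combination}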

Supplementary views

The main views are complemented by two supplementary views which provide a general overview of the dataset. The colors red, green, and blue encode the training, validation, and test set, respectively. The first supplementary view is a table that shows the partitioning of surgeries into the training, validation, and test sets. Individual surgeries can be interactively re-assigned to a different set via drag and drop. The second supplementary view encompasses two bar charts that display the total number of surgeries and frames for each set (see Fig. 3A). Additionally, a set of bar charts displaying the number of frames of each individual surgery is arranged on the right side of the view (see Fig. 3B). The average number of frames per surgery in each set is shown as a dashed line in these bar charts (see Fig. 3C).
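Under the same illustrative assumptions, and with an additional hypothetical surgery column identifying the source video of each frame, the quantities shown in the supplementary views could be derived roughly as follows.

    def supplementary_stats(frames):
        """Per-set surgery counts, frame counts, per-surgery durations, and averages."""
        summary = {}
        for split, group in frames.groupby("set"):
            per_surgery = group.groupby("surgery").size()   # frames per individual surgery (Fig. 3B)
            summary[split] = {
                "surgeries": int(per_surgery.shape[0]),     # number of surgeries (Fig. 3A)
                "frames": int(len(group)),                  # total number of frames (Fig. 3A)
                "frames_per_surgery": per_surgery,
                "avg_frames": float(per_surgery.mean()),    # dashed average line (Fig. 3C)
            }
        return summary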

Fig. 3

Supplementary view of the proposed application. Two mirrored bar charts show the number of surgeries and the total number of video frames in the training, validation, and test set (A). A set of three bar charts displays the duration (i.e., number of frames) of each surgery (B). The dashed lines show the average surgery duration per set (C)

Evaluation and results

The proposed visualization framework is evaluated through a user study using the Cholec80 dataset [7]. In addition to the user study, we use our framework to analyze the splits of five popular datasets for surgical phase and instrument recognition tasks, highlight problematic cases, and propose optimized splits.

User study

In total, ten participants with a data science background were recruited for the evaluation study of the proposed visualization framework. After a brief introduction to the domain of surgical phase recognition and the features of the proposed application, the participants were asked to solve ten tasks covering a wide range of possible exploratory analyses that can arise during the preparation of the Cholec80 dataset [7]. Further details on the user study are provided in the supplementary information. The results of this study were measured using the task completion rate: each task was scored 1 if the participant solved it correctly and 0 otherwise. Overall, the majority of the tasks were completed successfully by \(\ge 80\%\) of the participants.

After completing the tasks, the participants were asked to fill out the System Usability Scale (SUS) [33] questionnaire. It consists of ten statements that the study participants rated on a 5-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree). The ratings of the statements are then used to calculate the SUS score, which expresses the usability of the system. The score ranges between 0 and 100, with higher values indicating better usability. The proposed application reached a SUS score of 81.25.
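For reference, the standard SUS scoring rule can be written down in a few lines. This is a generic sketch of the questionnaire's published scoring scheme [33], not code from the proposed application.

    def sus_score(responses):
        """Ten Likert ratings (1-5) in questionnaire order -> SUS score in [0, 100]."""
        assert len(responses) == 10
        contributions = [
            (r - 1) if i % 2 == 0 else (5 - r)   # items 1, 3, ... vs. items 2, 4, ...
            for i, r in enumerate(responses)
        ]
        return 2.5 * sum(contributions)

    # Example: rating every positively worded item with 4 and every negatively
    # worded item with 2 yields 2.5 * (5 * 3 + 5 * 3) = 75.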

Analysis of dataset splits

In order to validate the proposed framework, we perform an analysis of various dataset splits for five popular datasets, including Cholec80 [7] and CATARACTS [4].

40/-/40 split

For the 40/-/40 split of the Cholec80 dataset, the proposed visualization reveals nine surgeries that start directly in the Calot triangle dissection phase (see Fig. 4A). Notably, all nine of these surgeries are assigned to the training set; therefore, the evaluation of the model’s performance on the test set does not include this special workflow. In addition, another unique workflow that only occurs in three surgeries (12, 14, 32) in the training set can be identified using the proposed visualization (see Fig. 4B). After the Gallbladder packaging phase, these three surgeries move on to Gallbladder retraction, thus omitting the Cleaning coagulation phase. Subsequently, the surgeries return to the previously skipped Cleaning coagulation phase, which is also the final phase of these three surgeries. Since this unique sequence of phases only appears in the training set, it is not covered by the evaluation of the machine learning model. Proposed improvement: With this information at hand, the split can be optimized by re-assigning the surgeries 29, 32, 33, and 38 to the test set, as interactively determined in our tool. Accordingly, four randomly selected surgeries (58, 66, 71, 78) from the test set are assigned to the training set to retain the 40/-/40 split. As a result of this re-partitioning, the aforementioned phase transitions now also appear in the test set.
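Such unrepresented phase transitions can also be detected programmatically. The sketch below assumes a hypothetical mapping phase_sequences from surgery id to its ordered list of phase labels and a split dict assigning each surgery to a set; both names are illustrative.

    from collections import defaultdict

    def transition_memberships(phase_sequences, split):
        """For each phase transition, collect the sets in which it occurs."""
        memberships = defaultdict(set)
        for surgery, phases in phase_sequences.items():
            for src, dst in zip(phases, phases[1:]):   # consecutive phase pairs
                if src != dst:
                    memberships[(src, dst)].add(split[surgery])
        return memberships

    def unrepresented_transitions(memberships, all_sets=("train", "test")):
        """Transitions missing from at least one set (here for a 40/-/40 split)."""
        return {t: set(all_sets) - s for t, s in memberships.items() if set(all_sets) - s}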

Regarding instrument use, the proposed visualization shows that all individual instruments are represented in all sets and follow similar distributions. Nevertheless, there are several instrument combinations that do not occur in one of the sets (see Fig. 4C). However, these instrument combinations mostly represent rare cases, as they account for only a small fraction of the dataset and appear in single surgeries.
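Analogously, instrument combinations that are unrepresented in one of the sets can be enumerated from the same hypothetical frame-level DataFrame used above.

    def missing_combinations(frames, instruments):
        """Instrument combinations that occur in the dataset but not in a given set."""
        combos_per_set = {
            split: set(
                group[instruments]
                .apply(lambda row: tuple(i for i in instruments if row[i] == 1), axis=1)
                .unique()
            )
            for split, group in frames.groupby("set")
        }
        all_combos = set().union(*combos_per_set.values())
        return {split: all_combos - present for split, present in combos_per_set.items()}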

Fig. 4

Characteristics and shortcomings of the 40/-/40 split of the Cholec80 dataset [7]. Surgeries starting in the Calot triangle dissection phase are only present in the training set (A). The ending sequence Gallbladder retraction to Cleaning coagulation occurs only in the training set (B). The instruments Bipolar and Scissors co-occur only in the training set (C)

32/8/40 split

To perform model selection or hyperparameter search, some studies [11, 25, 37] use eight surgeries from the training set for validation, resulting in a 32/8/40 split [5]. The proposed visualization shows that the surgeries in the validation set have fewer frames on average than those in the training and test sets (see Fig. 5A). Furthermore, the phase transitions (Gallbladder dissection, Cleaning coagulation) and (Cleaning coagulation, Gallbladder packaging) each occur only once in the training set (see Fig. 5B). This will presumably hinder the generalization of the model. Proposed improvement: This can be solved with our tool by re-assigning surgery 14 to the validation set, surgeries 23, 29, and 32 to the test set, and surgeries 37, 41, 57, and 60 to the training set. Regarding the instruments, co-occurrences of surgical instruments that are missing in one of the sets are more prevalent in this split due to the additional validation set. One notable example is the simultaneous use of Grasper, Bipolar, and Irrigator, which occurs in 503 frames of the training set and 154 frames of the test set, but is not represented in the validation set (see Fig. 5C).
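A proposed re-assignment such as the one above can be checked in a few lines, e.g., by verifying that the 32/8/40 partition sizes are preserved. The split dict is the same illustrative surgery-to-set mapping as before; the surgery ids follow the re-assignment described in the text.

    def reassign(split, moves):
        """Return a new surgery-to-set mapping with the given surgeries moved."""
        new_split = dict(split)
        new_split.update(moves)
        return new_split

    # Re-assignment proposed for the 32/8/40 split:
    moves = {14: "val", 23: "test", 29: "test", 32: "test",
             37: "train", 41: "train", 57: "train", 60: "train"}
    # new_split = reassign(split, moves)
    # sizes = {s: sum(v == s for v in new_split.values()) for s in ("train", "val", "test")}
    # assert sizes == {"train": 32, "val": 8, "test": 40}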

Fig. 5

Characteristics and shortcomings of the 32/8/40 split of the Cholec80 dataset [7]. Surgeries from the validation set have fewer frames on average, compared to the training and test sets (A). The phase transitions (Gallbladder dissection, Cleaning coagulation) and (Cleaning coagulation, Gallbladder packaging) occur only once in the training set (B). The simultaneous occurrence of the instruments Grasper, Bipolar, and Irrigator is not represented in the validation set (C)

40/8/32 split

Instead of setting aside eight surgeries from the training set, some studies [11] take the eight validation surgeries from the test set, resulting in a 40/8/32 split.

Fig. 9

Visualization of phase occurrences and transitions from the M2CAI-workflow dataset [7]

Summary of unrepresented cases

Table 1 shows the dataset splits of the five datasets as well as the number of phase transitions, instrument co-occurrences, and individual instruments that are not represented in one of the sets. The improved dataset splits presented as part of this work are denoted with *.

Table 1 Number of phase transitions, instrument co-occurrences, and individual instruments that are unrepresented in one of the sets and were discovered using the proposed visualization framework. Improved splits proposed as part of this work are indicated with *