Simulation in medical education re-creates components of clinical encounters for the purpose of training or assessment.1 The growing demand for simulation in healthcare is driven by a number of factors, including a lower tolerance for medical errors and a greater emphasis on patient safety, rapid advances in medical technologies for diagnosis and management, the care of increasingly complex patients, evolving models of postgraduate education and healthcare delivery, and recognition of the need for deliberate practice to achieve excellence in patient care.1,2 Several studies, including systematic reviews and meta-analyses, have established that technology-enhanced simulation can be superior to traditional teaching methods for learning new knowledge, acquiring skills, and increasing trainee satisfaction.3,4

Driven by the mounting evidence for technology-enhanced simulation, in conjunction with the need to minimize harm to patients, many academic centres have purchased high-fidelity simulators for their training programs.5 Nevertheless, purchasing and maintaining high-fidelity simulators carries a significant cost, and their cost-to-educational-benefit ratio relative to other teaching modalities remains unclear.6,7

Within the simulation literature, the term “fidelity” describes how closely the appearance and behaviour of the simulator match those of the real environment; thus, a high-fidelity simulator is considered the most realistic.8 As simulator fidelity increases from low to high, the technology becomes more advanced and sophisticated, which invariably results in higher cost. Historically, it was assumed that improving the fidelity of the simulator would result in more effective learning.9 However, studies have shown that greater simulator fidelity does not improve the learning of many basic motor or technical skills.9 This suggests that large financial investments in high-fidelity simulators may not always translate into improved learning for certain skills.

Non-technical skills (NTS) are defined as the cognitive (e.g., decision-making, situation awareness) and interpersonal (e.g., communication, leadership) skills important for reducing medical error and improving patient safety.10,11,12 High-fidelity simulators have been shown to be effective in teaching NTS.13 Some previous investigations have compared high- vs low-fidelity simulators for teaching NTS, but these studies were limited either by a small convenience sample or by comparing high- vs low-fidelity models with no actual difference in cost.14,15 It is often assumed that learning complex skills, such as NTS, demands an equally complex, high-fidelity simulator.9 With limited evidence to support this assumption, the purpose of our study was to compare the effectiveness of a low- vs high-fidelity simulator in teaching NTS to postgraduate medical trainees. If the two types of simulators showed the same teaching effectiveness (i.e., non-inferiority), the lower-cost simulator might have an overall advantage. Accordingly, we hypothesized that a low-fidelity simulator would be non-inferior to a high-fidelity simulator for teaching NTS.

Methods

Recruitment

After institutional research ethics board approval (Capital District Health Authority, Halifax, NS, CDHA-RS/2014-262, March 2014), 36 postgraduate year (PGY) 1-5 residents were recruited from training programs at Dalhousie University—i.e., anesthesiology, emergency medicine, internal medicine, or a surgical specialty. One of the study investigators (Y.G.) conducted recruitment by e-mail and in person during August 2014. Written and verbal informed consent was obtained from all participants, and demographic data were collected. The intention to evaluate only NTS was not initially disclosed to the participants; instead, we informed them that the purpose of our study was to assess whether simulator fidelity affects learning outcomes. We chose this approach to minimize any potential Hawthorne effect, whereby participants might act differently if they knew that only NTS were being assessed. Full disclosure was provided once data collection was complete. Participants received a small honorarium, which was not revealed until their pre-briefing on the day of data collection.

Study design (Fig. 1)

Our hypothesis that simulator fidelity would not affect the learning of NTS was tested in a randomized-controlled non-inferiority trial. The participants were stratified into junior (PGY 1-2) and senior (PGY 3-5) residents and randomly assigned to either the high-fidelity simulator (HFS) or the low-fidelity simulator (LFS) group using a random number generator (Randomness and Integrity Services Ltd., Dublin, Ireland). The resulting assignments were placed in sequentially numbered opaque envelopes according to resident stratification (i.e., junior and senior residents). Following recruitment, one of the study investigators (Y.G.) opened the envelopes to reveal the group allocation. Each participant then chose one of several potential times and dates designated for either the HFS or the LFS group. Within their fidelity group, the participants were assigned to teams of three. The teams were arranged by convenience because of the complexity of scheduling residents and the availability of the simulator facility and research assistants. A research volunteer acted as the third confederate for the sessions where only two participants were present. Before each scenario, one of the investigators (Y.G.) conducted a standardized pre-briefing to allow the participants to familiarize themselves with the simulator environment and equipment, address any limitations, broadly discuss the goals and objectives of the scenarios, and address any questions or concerns.
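For illustration, the stratified allocation described above can be sketched in a few lines of Python. This is a minimal sketch only: the study itself used an online random number generator and sequentially numbered opaque envelopes, and the resident identifiers and helper function below are hypothetical.

```python
# Minimal sketch of stratified 1:1 allocation (hypothetical identifiers).
# The study used random.org and sealed, sequentially numbered envelopes.
import random

def stratified_allocation(residents, seed=None):
    """Assign residents to 'HFS' or 'LFS' within junior/senior strata."""
    rng = random.Random(seed)
    allocation = {}
    for stratum in ("junior", "senior"):  # PGY 1-2 vs PGY 3-5
        ids = [r for r, s in residents.items() if s == stratum]
        rng.shuffle(ids)
        half = len(ids) // 2
        allocation.update({r: "HFS" for r in ids[:half]})
        allocation.update({r: "LFS" for r in ids[half:]})
    return allocation

# Hypothetical example with four residents
print(stratified_allocation(
    {"R1": "junior", "R2": "junior", "R3": "senior", "R4": "senior"}, seed=1))
```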

Fig. 1

Representation of the experimental design on both simulators. HFS = High-fidelity simulator; LFS = Low-fidelity simulator

Pre-test phase

There were three different simulation scenarios (described below in the scenario section), each followed by a structured debriefing. For each scenario, one participant of the team was evaluated (in the “hot-seat”), while the other two participants acted as confederates (e.g., nurse, respiratory technician, or surgeon, depending on the scenario). After completion of the scenario, all team participants were debriefed (described below). Following the debriefing, the participants rotated and one of the other confederates entered the “hot-seat”. The simulation and debriefing process was repeated with a different scenario and then repeated a third time for the final participant. The confederates were provided with pre-defined instructions and scripts on how to respond and behave during the scenarios. To minimize deviations from the pre-defined script, each confederate wore a microphone to facilitate back-and-forth communication with one of the study investigators in the simulation control room. For the groups where only two participants were present, one of the study investigators acted as the additional confederate according to the same script and instructions. Each scenario and debriefing was video recorded for subsequent evaluation.

Debriefings

All three team participants underwent a 20-min structured debriefing after each of their assigned scenarios. The debriefings were conducted by one of the study investigators (T.W.), a simulation instructor in both the Department of Anesthesia and the Department of Critical Care at Dalhousie University with extensive knowledge and experience in simulation teaching and debriefing. The debriefings consisted of a discussion of both the technical and non-technical aspects of performance and were based on the principles of crisis resource management (CRM). The debriefings were conducted in a manner that kept the participants blinded to the focus on NTS assessment. The conversations were video recorded to evaluate the consistency and quality of the debriefings.

Post-test phase

On the same day, following completion of the pre-test phase, all three participants (regardless of their HFS or LFS group) individually underwent an identical simulator scenario on a high-fidelity simulator. The scenario was video recorded for subsequent analysis. One of the study investigators (Y.G.) and a research volunteer (D.H.) were present to act as confederates for all the post-test scenarios. There was no formal debriefing after the post-test scenario; however, any critical medical errors were addressed, and time was given for the participants to ask questions related to the scenario.

Simulators

The study was conducted at the simulation centre located at the Victoria General (VG) Hospital in Halifax, Nova Scotia. The high-fidelity mannequin used in our study was the SimMan® 3G with the accompanying Laerdal software, and the low-fidelity mannequin was the Deluxe Difficult Airway Trainer, a plastic upper torso with no complex functionality used for airway teaching (all from Laerdal Medical Canada Ltd, Toronto, ON, Canada). Pillows and blankets were used for the lower torso. The same Laerdal software was used to control and display the patient’s vital signs on a video monitor. Table 1 further describes the similarities and differences between our low- and high-fidelity setup and configuration.

Table 1 Simulator equipment and room setup
Table 2 Participant demographics

Simulator scenarios

The scenarios were selected from our institution’s preexisting intensive care unit (ICU) simulation program and were chosen because they represent emergency situations that residents may encounter in the ICU, postanesthesia care unit, or emergency room. The pre-test scenarios included anaphylaxis, pulseless electrical activity (PEA) secondary to septic shock, and acute-onset atrial fibrillation secondary to a pulmonary embolus. The pre-test scenarios had a fixed order; thus, the two-person groups participated only in the anaphylaxis and PEA scenarios. The post-test scenario was management of cardiogenic shock secondary to an acute coronary syndrome. The scenarios progressed in a pre-defined sequence regardless of the participant’s action or inaction and were standardized with respect to therapeutic interventions as much as possible. All scenarios began with a clinical stem provided by a nurse confederate, followed by two minutes for patient assessment. In the LFS group, the patient “spoke” via speakers in the room, whereas in the HFS group, the patient “spoke” through speakers in the mannequin itself. After the initial patient assessment, a six-minute Advanced Cardiovascular Life Support event occurred (e.g., PEA, ventricular tachycardia, or unstable narrow-complex tachycardia). Once return of spontaneous circulation was achieved, two minutes were provided for patient resolution and disposition. Each scenario lasted ten minutes.

Evaluation and assessment tools

The Ottawa Global Rating Scale (OGRS) (Appendix 1), a tool developed at the University of Ottawa, Canada, to assess NTS, has shown construct validity.16,17 The OGRS consists of five domains of CRM skills—i.e., situational awareness, leadership, resource utilization, problem solving, and communication—each scored on a seven-point scale.

For our primary outcome, we used the overall OGRS performance score (seven-point scale), which is guided by the scores of the individual domains described above. For our secondary outcome, we used the total OGRS score, the sum of the individual scores in the five domains described above (range, 5-35). We adjusted for pre-test overall and total OGRS scores to account for differences in the baseline performance of the participants. Three raters from the University of Ottawa (P.R., T.V., D.B.), who were unknown to the study participants, reviewed the video recordings of performance and applied the OGRS. Each rater was a physician with expertise in the field of medical simulation who was trained in the use of the OGRS by one of the principal investigators (D.B.). Raters familiarized themselves with the OGRS literature and practised rating on video recordings not from this study. Any large variations in scores were discussed and calibrated prior to assessing the video recordings from this study.
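To make the two outcome measures concrete, the sketch below shows how they could be computed from rater-level data. The data layout and column names are hypothetical; the study analyzed these scores in SPSS.

```python
# Hypothetical layout: one row per (participant, rater), with the rater's
# overall 7-point rating and one column per OGRS domain (each scored 1-7).
import pandas as pd

DOMAINS = ["situational_awareness", "leadership", "resource_utilization",
           "problem_solving", "communication"]

def ogrs_outcomes(scores: pd.DataFrame) -> pd.DataFrame:
    """Per-participant overall and total OGRS, averaged over the three raters."""
    scores = scores.copy()
    scores["total"] = scores[DOMAINS].sum(axis=1)  # total OGRS: range 5-35
    return scores.groupby("participant")[["overall", "total"]].mean()
```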

The Objective Structured Assessment of Debriefing (OSAD) (Appendix 2) is a tool developed at Imperial College London, United Kingdom, to evaluate the consistency and quality of healthcare debriefings.18 There are eight categories in the OSAD scoring system: approach, establishing a learning environment, learner engagement, reaction, descriptive reflection, analysis, diagnosis, and application. Each category is scored from 1 (done very poorly) to 5 (done very well), giving a total score of 8-40. To assess consistency in the quality of debriefing between the HFS and LFS groups, half of the debriefing sessions (18 videos) were chosen by a random number generator (Randomness and Integrity Services Ltd., Dublin, Ireland). Three raters (D.B., T.V., P.R.) scored each debriefing using the OSAD tool.

Statistical analysis

Sample size was calculated based on the mean (standard deviation [SD]) overall OGRS performance scores from a study by Kim et al., in which PGY-3 and PGY-1 residents obtained mean (SD) overall OGRS scores of 5.5 (0.9) and 4.0 (0.9), respectively.16 The difference in OGRS scores between the PGY-3 and PGY-1 residents was 1.5; thus, we considered a difference of greater than 1 in the overall OGRS score between the LFS and HFS groups to be educationally significant. Therefore, the sample size calculation to show non-inferiority between the LFS and HFS groups for teaching NTS was based on a non-inferiority margin of 1. For a power of 0.9 and a type I error probability of 0.05, we calculated a total sample size of 36 participants, with 18 participants in each group, using G*Power (Erdfelder, Faul, & Buchner, 1996).
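This calculation can be approximated with the standard two-group normal-approximation formula, n per group = 2(z₁₋α/₂ + z₁₋β)²σ²/Δ². The sketch below reproduces the reported sample size under the assumption of a two-sided α of 0.05; the exact options used in G*Power are not stated, so this is an approximation rather than the authors' exact computation.

```python
# Approximate reproduction of the sample size calculation (assumed setting:
# two-sided alpha = 0.05; the authors' exact G*Power options are not stated).
import math
from scipy.stats import norm

sigma = 0.9   # SD of overall OGRS score (Kim et al.)
delta = 1.0   # non-inferiority margin
alpha = 0.05  # type I error (assumed two-sided)
power = 0.90

z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96
z_beta = norm.ppf(power)           # ~1.28
n_per_group = math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)
print(n_per_group)  # 18 per group, i.e., 36 participants in total
```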

Data are presented as mean (SD) and/or 95% confidence interval (CI) where appropriate. The Shapiro-Wilk test was used to test for normality, and homogeneity of variance was assessed with Levene’s test. Inter-rater reliability between the three OGRS evaluators was assessed using the intraclass correlation coefficient (ICC), with a value greater than 0.6 indicating moderate agreement between the raters.19 For our primary outcome, a one-way analysis of covariance (ANCOVA) was conducted to examine the difference between the HFS and LFS groups in post-test overall OGRS scores, while controlling for pre-test overall OGRS scores a priori. The ANCOVA produced estimated marginal means (SD) for the two groups at post-test, adjusted for the a priori covariate of the pre-test score, as well as the mean difference and its 95% CI. For our secondary outcomes, a one-way ANCOVA was performed to examine the difference between the HFS and LFS groups in post-test total OGRS scores, while controlling for pre-test total OGRS scores. A paired Student’s t test was used to compare the combined pre-test scores of the HFS and LFS groups with the combined post-test scores for both the overall and total OGRS scores. A Chi square test was used to test for an association between overall post-test OGRS scores and both PGY level and specialty. The OSAD scores between the two groups were compared using the independent Student’s t test. A P < 0.05 was considered statistically significant. All data were analyzed using SPSS® version 21 (IBM Corp., Armonk, NY, USA).
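For readers who prefer code to prose, the primary ANCOVA is equivalent to an ordinary least-squares model of the post-test score on group plus the pre-test covariate. The sketch below shows this in Python with statsmodels; the column names and the demo data are hypothetical (the study used SPSS), and the random numbers are illustrative only, not study data.

```python
# ANCOVA as OLS: post-test overall OGRS ~ group + pre-test overall OGRS.
# Column names ('post', 'pre', 'group') and all demo data are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def ancova_group_effect(df: pd.DataFrame):
    """Adjusted group difference in post-test score, controlling for pre-test."""
    model = smf.ols("post ~ C(group) + pre", data=df).fit()
    # The C(group) coefficient is the adjusted mean difference between groups;
    # its 95% CI is what the non-inferiority margin would be compared against.
    return model.params, model.conf_int(alpha=0.05)

# Illustrative random data only (not study data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"pre": rng.normal(3.3, 0.7, 36),
                     "group": ["HFS"] * 18 + ["LFS"] * 18})
demo["post"] = demo["pre"] + 0.6 + rng.normal(0, 0.5, 36)
params, ci = ancova_group_effect(demo)
print(params, ci, sep="\n")
```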

Results

We enrolled all 36 residents who were invited to participate in the study. Table 2 summarizes the demographic information of the participants. The LFS and HFS groups had 17 and 19 participants, respectively; the unequal group sizes resulted from scheduling difficulties.

Overall and total OGRS scores

Inter-rater reliability was good for both the overall OGRS score (ICC, 0.72; 95% CI, 0.51 to 0.85) and the total OGRS score (ICC, 0.69; 95% CI, 0.46 to 0.82); thus, we elected to use the means of the three raters for further analysis. Figs. 2 and 3 show the change from pre-test to post-test in the overall and total OGRS scores, respectively.

Fig. 2

Box plot with superimposed data points of the mean overall Ottawa Global Rating Scale (OGRS) scores in the low-fidelity simulator (LFS) and high-fidelity simulator (HFS) groups during the pre-test and post-test phases. There was no significant difference in the overall mean post-test OGRS scores between the HFS and LFS groups after controlling for overall pre-test OGRS scores. The box plot line represents the median; the box, the interquartile range; and the whiskers, the range

Fig. 3

Box plot with superimposed data points of the mean total Ottawa Global Rating Scale (OGRS) scores in the low-fidelity simulator (LFS) and high-fidelity simulator (HFS) groups during the pre-test and post-test phases. There was no significant difference in the total mean post-test OGRS scores between the HFS and LFS groups after controlling for total pre-test OGRS scores. The box plot line represents the median; the box, the interquartile range; and the whiskers, the range

For our primary outcome, there was no significant difference in the mean (SD) overall post-test OGRS scores between the HFS and LFS groups after controlling for overall pre-test OGRS scores [3.8 (0.9) vs 4.0 (0.9), respectively; mean difference, 0.2; 95% CI, -0.4 to 0.8; P = 0.48].

For our secondary outcomes, there was no significant difference in the mean (SD) total post-test OGRS scores between the HFS and LFS groups after controlling for total pre-test OGRS scores [19.8 (3.6) vs 21.0 (3.6), respectively; mean difference, 1.2; 95% CI, −1.2 to 3.6; P = 0.33]. We found no significant association between postgraduate training year and overall post-test OGRS score [χ2 (4) = 1.55; P = 0.817]. We also found no significant association between postgraduate program and overall post-test OGRS score [χ2 (4) = 2.37; P = 0.67].

Comparison between pre-test and post-test OGRS scores

The mean (SD) overall OGRS scores for both groups during pre-test and post-test were 3.3 (0.7) and 3.9 (0.9), respectively (mean difference 0.6; 95% CI, 0.2 to 1.0; P = 0.01). The mean (SD) total OGRS scores for both groups during pre-test and post-test were 17.8 (3.5) and 20.4 (3.5), respectively (mean difference, 2.6; 95% CI, 1.0 to 4.3; P = 0.003).

Debriefing consistency

The mean (SD) OSAD scores for the debriefings were 21.3 (2.4) in the HFS group and 22.5 (2.4) in the LFS group (mean difference, −1.1; 95% CI, −3.6 to 1.3; P = 0.34).

Discussion

The results of our study suggest that a low-fidelity simulator is not inferior to a high-fidelity simulator for teaching NTS. This supports the notion that the realism, or fidelity, of the mannequin does not strongly influence learning NTS during simulated crisis situations.

Our results are consistent with the findings of a previous study by Cheng et al., who compared the use of low- vs high-fidelity simulators for assessing knowledge and team leader behavioural performance during pediatric resuscitation and found no difference in NTS between the two groups.15 The authors used a high-fidelity pre-programmed infant simulator with vital signs, audio feedback, breath sounds, chest rise, heart sounds, and palpable pulses. Their low-fidelity simulator was identical to the high-fidelity simulator but with all the innate functions turned off. In contrast to our study, they utilized only one pediatric advanced life support scenario, making their findings difficult to generalize to other types of complex acute care scenarios, such as the ones used in our study. Their methodology may also limit generalizability, as their design crossed the fidelity comparison with a second factor, scripted vs non-scripted debriefing. Finally, their low-fidelity model was not in fact lower in cost.

A study by Finan et al. also compared HFS with LFS for neonatal resuscitation training, and the authors found no significant difference in NTS between the two groups.14 Interestingly, they measured salivary cortisol levels to assess stress in the participants and found no difference between the two groups. These findings suggest that the HFS, despite being more “realistic”, did not elicit a commensurate emotional response from the participants. A major difference between their study and ours was that they had no pre-test on either an HFS or an LFS; instead, they simply required that the participants had completed a course in the Neonatal Resuscitation Program within the previous two months and advanced resuscitation training one month before the study period. Their study also used a small convenience sample, which was likely underpowered to detect a difference between groups.

We also found that OGRS scores did not improve with PGY of training. A possible explanation is that most of the participants had prior exposure to high-fidelity simulation and debriefing on NTS, and a previous study showed that NTS can improve after even one simulator session.13 Another possible reason is a lack of discriminative ability of the OGRS among more senior residents. This may explain why the study by Kim et al. found a significant difference between PGY-1 and PGY-3 residents, whereas the study by Clarke et al. found no significant difference between PGY-2 and PGY-3 residents.16,20 Unfortunately, our study lacked the sample size to investigate this association properly, but the issue certainly warrants future study.

Despite its wide use in the simulation literature, the term “fidelity” has been poorly defined.21 As the field of simulation advanced, it became apparent that there are multiple dimensions along which to describe and define simulator fidelity.8 Miller was the first to separate fidelity into a “physical” and a “psychological” domain, where physical fidelity reflects how closely the training equipment, mannequin, and environment approximate the real situation.8 In contrast, psychological fidelity refers to the emotional connection of the learner to the simulation scenario.22 More recently, Dieckmann et al. proposed—and Rudolph et al. later modified—the idea that fidelity, or realism, can be separated into three areas: 1) physical, 2) conceptual, and 3) emotional.23,24 The difference from Miller’s distinction is the addition of a conceptual domain that deals with theory, meaning, concepts, and relationships; for example, if there is hemorrhage, then there will be hypotension and tachycardia. A common assumption in simulation is that complex skills, such as CRM, require increased physical fidelity. Nevertheless, our results suggest that conceptual and emotional fidelity are perhaps more important for teaching and learning NTS.

The significant cost associated with high-fidelity simulators renders them financially prohibitive for many academic centres, especially those in low- and middle-income countries.6 This study shows that NTS training can be performed effectively with low physical fidelity simulators, which are often a fraction of the cost to purchase and maintain. A recent study showed that a low-cost, low physical fidelity mannequin was effective in teaching NTS to Rwandan anesthesia providers.25 Our study will hopefully provide useful information to help guide the development of future simulation scenarios intended to teach NTS in centres with limited financial resources.

There were several limitations to our study. First, the participants randomized to the HFS group were pre-tested on the identical HFS setup used during the post-test phase. Ideally, the pre-test scenarios would have been conducted on a different high-fidelity mannequin to minimize familiarization with the HFS. Nevertheless, although this familiarity would be expected to give the HFS group higher OGRS scores than the LFS group, our results did not show any added learning benefit from training on an HFS during the pre-test phase. Second, two groups in the HFS arm and one group in the LFS arm had only a two-person team during the pre-test phase; these groups were therefore exposed to one less scenario and debriefing, which may have introduced some bias. Another limitation was that one study investigator (T.W.), who was not blinded to the objective of this study, conducted all of the debriefings; nevertheless, based on the OSAD scores, the consistency and quality of the debriefings were similar in the HFS and LFS groups. During the post-test scenario, the study investigator and research volunteer who acted as team participants were not blinded to the purpose of the study. This may have influenced how they behaved during the scenarios, which could lead to observer bias. In addition, we assessed OGRS scores immediately after simulation training and did not conduct a retention test; therefore, it is unclear how simulator fidelity affects long-term learning of NTS. Finally, most of the participants in this study were recruited from the anesthesia department, which may reduce the generalizability of our findings.

In conclusion, our study suggests that a low physical fidelity simulator is not inferior to a high physical fidelity simulator for teaching NTS, as measured by the overall OGRS score in this context. Our data do not support the assumption that higher fidelity (and higher cost) models result in improved learning of NTS in critical care. Adoption of low-fidelity, low-cost models may have the potential to improve both the value of and access to simulation-based medical education.