Introduction

The fiberoptic endoscopic evaluation of swallowing (FEES) is a standard procedure for studying swallowing. In 1988, the first FEES report was published by two speech and language pathologists (SLPs) and an Otorhinolaryngologist [1]. The FEES procedure is shared by several professions, with differences across countries: in Anglo-American countries it is predominantly performed by SLPs [2, 3], whereas in European countries it is mainly performed by Phoniatricians and Otorhinolaryngologists as a medical procedure [3]. Together with the videofluoroscopic study of swallowing (VFSS), FEES is widely considered the gold standard instrumental exam [4]. The analysis of FEES recordings allows clinicians to identify signs of impairments in swallow safety (penetration and aspiration) and swallow efficacy (pharyngeal residues) [5].

Various measurement scales are currently available to analyze FEES videos [6]. The assessment of FEES recordings is usually based on visuoperceptual measures, which implies that evaluations of FEES videos are fundamentally subjective. Visuoperceptual scales created for interpreting the various signs of dysphagia in FEES are often accompanied by specific training in their use [7,8,9,10,11]. Neubauer et al. developed the Yale Pharyngeal Residue Severity Rating Scale (YPRSRS) [9], an image-based scale to assess the amount of residue in the valleculae and pyriform sinuses. The severity levels on a 5-point scale are defined as “none, trace, mild, moderate, severe” for both the valleculae and the pyriform sinuses. An operational description, an anchor image, and a percentage of residue are provided for each point. In a systematic review comparing pharyngeal residue severity rating scales for FEES [12], the YPRSRS was judged to be the most reliable and valid scale. In the original study, 20 raters (Otorhinolaryngology residents, Otorhinolaryngologists, and SLPs) attended a training that included written definitions, visual depictions, explanations, and clarification of doubts [9]. Similarly, a short training on the use of the scale on FEES frames was provided to raters in the validations of the YPRSRS in German [10], Turkish [11], and Italian [13]. The German training [10] lasted 8 min, while the Italian training lasted 4 min [13].

Training in analyzing FEES is a relevant topic, recently reviewed in a scoping study [14]. Several post-basic trainings on FEES were designed for exam interpretation and execution by SLPs, Phoniatricians, Otorhinolaryngologists, or other professionals [15, 16]. Other training programs targeted different professionals, such as medical students and residents [17], neurologists [18], and nurses [15]. The training duration ranged from a minimum of 30 min [18] to over 10 h [19]. A few studies used video-recorded lectures [15] or online lessons [17], and some programs included self-paced exercises [15, 16, 19] and self-assessments of the skills learned [15, 16].

In the validation studies on the YPRSRS, the effects of the training were evaluated only on clinicians [9,10,11, 13], and no comparison was made between the ratings of clinicians and inexperienced raters such as students. Furthermore, previous papers did not address how to identify and assess pharyngeal residues using the YPRSRS on FEES videos [9,10,11, 13]. Recently, the YPRSRS has been shown to be reliable for evaluating pharyngeal residues in FEES videos as well, although with lower average reliability coefficients compared to FEES frames [20]. Therefore, training on how to apply the YPRSRS to FEES videos is required to improve its scoring in these circumstances.

The purpose of the present study was to: (i) develop a training aimed at acquiring skills for the interpretation and evaluation of pharyngeal residue in FEES videos and frames using the YPRSRS, targeted at medical doctors, SLPs, and students attending a bachelor’s degree in Speech and Language Pathology; and (ii) verify the efficacy of the training in improving construct validity and inter-rater reliability. The hypothesis was that training could support participants in developing specific skills in assessing pharyngeal residues in FEES.

Methods

This project was carried out following the Declaration of Helsinki of the World Medical Association (WMA). Approval of the Ethics Committee of the University of Milan was obtained on 17/11/2020 (number 102/20). Frames and videos used in this work were selected from pre-existing archived material. A randomized controlled trial was conducted among clinicians, whereas a prospective observational pre-post study was performed in students.

The Yale Pharyngeal Residue Severity Rating Scale

The YPRSRS is an ordinal scale that rates the amount of pharyngeal residues in the valleculae and pyriform sinuses [9]. The definitions of severity are distributed on a 5-point scale (1 = none, 2 = trace, 3 = mild, 4 = moderate, 5 = severe). Each level of the scale corresponds to an operational description, an anchor image, and a percentage of residue. A separate score is provided for the valleculae and the pyriform sinus.

Selection of the Materials

Seventy pairs of videos and frames with different consistencies were selected from FEES video-recordings collected for previous studies. All FEES examinations were kept anonymous. The FEES were conducted using a XION EF-N flexible fiberscope (XION GmbH, Berlin, Germany) attached to an EndoSTROBE camera (XION GmbH, Berlin, Germany) and recorded in AVI format. A standard FEES protocol was used, including the sequential administration of boluses of thin liquids (5-10-20 ml of blue-dyed water × 3 trials for each volume; International Dysphagia Diet Standardisation Initiative – IDDSI 0; < 50 mPa·s at 50 s⁻¹ and 300 s⁻¹), pureed food (5-10-20 ml of pudding × 3 trials for each volume; IDDSI 4; 2583.3 ± 10.41 mPa·s at 50 s⁻¹ and 697.87 ± 7.84 mPa·s at 300 s⁻¹), and regular food (half biscuit × 2 trials; IDDSI 7) [21]. For the present study, only the 5 ml trials were selected for thin liquids and pureed food. For all consistencies, a frame was selected after the swallow of the last bolus. As a first step, the 70 pairs of videos and frames were independently assessed by two experienced raters (> 10 years of experience in FEES analysis), a Phoniatrician and an SLP. Only the pairs of videos and frames that were assigned the same YPRSRS score were retained. Thirty pairs of videos and frames (15 for valleculae, 15 for pyriform sinuses) were selected for validity and reliability analysis, and an additional 6 pairs (3 for valleculae, 3 for pyriform sinuses) were chosen by consensus for the training. All YPRSRS scores and consistencies of the recorded swallows were included in the analysis.

Raters

A sample size of 29 clinicians was determined based on previous studies on the YPRSRS. Inclusion criteria were professional activity as Phoniatricians, Otorhinolaryngologists, SLPs, or resident Otorhinolaryngologists, with a minimum clinical experience of 1 year in dysphagia. In addition, a convenience sample of students attending the 2nd year of the bachelor’s degree in Speech and Language Pathology was recruited. To participate in the study, all students had to have already attended the classes on dysphagia assessment and treatment.

Training

Clinicians were randomly allocated to the training or the control group (1:1) based on a random number sequence. Randomization was stratified by profession (medical doctor vs. SLP) and years of experience (< 5 years vs. ≥ 5 years), according to a previous study [10]. All students received the training.
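As an illustration of this allocation scheme, the following is a minimal sketch of a stratified 1:1 randomization in R (the software used for the analyses); the roster, variable names, and stratum sizes are hypothetical, not the study’s actual code.

```r
# Hypothetical roster of clinicians; strata are profession x experience.
set.seed(1)  # arbitrary seed, for reproducibility of the example
clinicians <- data.frame(
  id         = sprintf("C%02d", 1:28),
  profession = rep(c("MD", "SLP"), each = 14),
  experience = rep(c("<5y", ">=5y"), times = 14)
)

strata <- interaction(clinicians$profession, clinicians$experience)
clinicians$group <- NA_character_
for (s in levels(strata)) {
  idx <- which(strata == s)
  # balanced training/control assignment within each stratum, in random order
  clinicians$group[idx] <- sample(rep(c("training", "control"),
                                      length.out = length(idx)))
}
table(clinicians$group, strata)  # check balance across strata
```

Randomizing within each stratum, rather than over the whole sample, guarantees that profession and experience are balanced between arms by construction.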

The training design was based on the characteristics of the trainings for the interpretation of FEES previously described in the literature [9, 10, 15,16,17,18,19]. It consisted of 3 steps, for a total of 160 min. As a first step, the participants viewed a pre-recorded 75-minute video-audio lesson composed of theoretical modules describing dysphagia signs observed during the FEES procedure, pathophysiological mechanisms and consequences of pharyngeal residues, an introduction to the YPRSRS, the clinical application of the scale, and case studies. After the online lesson, each participant independently practiced assigning a YPRSRS score to 6 pairs of frames and videos of FEES (3 for valleculae, 3 for pyriform sinuses), not included in the pre-post training assessment. The results of the individual practice and any questions or uncertainties were discussed among participants during a one-hour synchronous meeting moderated by an expert tutor. The debriefing meetings were held in small groups (4–5 participants) to encourage communication and discussion.

The whole training was delivered using Microsoft Teams (Microsoft Corporation, Redmond, WA).

Data Collection

Both the training and the control group scored the 30 pairs of videos and frames using the YPRSRS twice: (i) the training group and the students were assessed before and after the training, with at least two weeks between assessments; (ii) the control group completed the two assessments at least two weeks apart. The FEES material was submitted to participants for evaluation through the Google Forms platform; the order of the videos and frames was randomized for both the first and second assessments. The material was sent together with the scale and the anchor images. All data were treated in pseudo-anonymized form; each participant was assigned an alphanumeric code. At the end of both assessments, participants completed a self-evaluation questionnaire created ad hoc to investigate their perceived self-efficacy in interpreting FEES with the YPRSRS. The questionnaire consisted of 9 items rated on a 5-point scale (1 = never, 2 = rarely, 3 = sometimes, 4 = often, 5 = always): 2 items on the anatomical identification of the valleculae and pyriform sinuses and 7 items on scoring.

Statistical Analysis

For the analysis, IBM SPSS v26.0 for Windows (SPSS Inc., Chicago, IL) and R v4.2.0 [22] were used. The clinicians’ and the students’ ratings were analyzed separately.

The baseline characteristics of clinicians were analyzed to compare the control and training groups. The Kolmogorov-Smirnov test was used to assess the normality of continuous variables and, as none of them were normally distributed, the Mann-Whitney U test was performed to compare group distributions; frequencies were compared through the chi-square test.
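To make the baseline comparison concrete, the following is a minimal sketch of the three tests in R; the data frame `baseline` and its columns `age`, `group`, and `profession` are illustrative names, not the study’s actual code.

```r
# Hypothetical data frame `baseline`: one row per clinician, with columns
# `age` (continuous), `group` (training/control), `profession` (MD/SLP).

# Normality check for a continuous variable (Kolmogorov-Smirnov)
ks.test(baseline$age, "pnorm", mean(baseline$age), sd(baseline$age))

# Non-parametric comparison of group distributions (Mann-Whitney U)
wilcox.test(age ~ group, data = baseline)

# Comparison of frequencies (chi-square test on the contingency table)
chisq.test(table(baseline$group, baseline$profession))
```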

Construct validity and inter-rater reliability were used as measures of training efficacy. Conversely, intra-rater reliability was not considered a suitable outcome measure due to the manipulation introduced by the training itself. Construct validity was defined as the agreement between each rater and the expert score employed as the “gold standard” [9, 10]. Inter-rater reliability was defined as the degree of agreement among raters scoring the same object on the same assessment [23].

Construct validity was calculated through weighted Cohen’s Kappa (quadratic weighting), separately for each rater. The distribution of the raters’ Cohen’s Kappa values was compared between the first and the second assessment using the paired t-test. Cohen’s Kappa distributions were also compared among the control, training, and student groups, separately for the first and the second assessment, using one-way analysis of variance (ANOVA) with Tukey’s HSD adjustment to correct the significance level for post hoc pairwise comparisons.
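A sketch of these computations in R, assuming the `irr` package for the weighted Cohen’s Kappa; the vector and data frame names are illustrative.

```r
library(irr)  # kappa2() computes weighted Cohen's Kappa for two raters

# Construct validity for one rater: quadratically weighted agreement between
# the rater's 30 YPRSRS scores and the expert "gold standard" scores.
k <- kappa2(data.frame(rater_scores, expert_scores), weight = "squared")$value

# With one kappa per rater at each assessment (kappa_first, kappa_second):
t.test(kappa_first, kappa_second, paired = TRUE)  # within-group change

# Among-group comparison at a single assessment; `kappas` has columns
# `kappa` (one value per rater) and `group` (control/training/students).
fit <- aov(kappa ~ group, data = kappas)
TukeyHSD(fit)  # post hoc pairwise comparisons with Tukey HSD adjustment
```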

As for inter-rater reliability, the level of agreement within each group of raters was calculated with Fleiss Kappa (quadratic weighting) for both the first and the second assessment; for each group, the indices were subsequently compared using paired sample t-tests based on the linearization method for correlated agreement coefficients [24]. ANOVA with Tukey’s HSD method was also employed to check for differences in Fleiss Kappa values among the three groups, separately for the first and the second assessment.
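One way to reproduce this step in R is through the `irrCAC` package, which implements Gwet’s linearized variance estimators for chance-corrected agreement coefficients; the interface below (`fleiss.kappa.raw` with quadratic weights) and the matrix names are assumptions for illustration, and the simple z-contrast omits the covariance term that the full method in [24] uses for correlated coefficients.

```r
library(irrCAC)  # agreement coefficients with linearized standard errors

# `ratings_first` and `ratings_second`: 30 x R matrices (items x raters) of
# YPRSRS scores for one group at the first and second assessment.
fk1 <- fleiss.kappa.raw(ratings_first,  weights = "quadratic")$est
fk2 <- fleiss.kappa.raw(ratings_second, weights = "quadratic")$est

# Normal-approximation contrast of the two coefficients; the full
# linearization method also accounts for their covariance (same raters
# and items at both assessments), omitted here for brevity.
z <- (fk2$coeff.val - fk1$coeff.val) / sqrt(fk1$coeff.se^2 + fk2$coeff.se^2)
2 * pnorm(-abs(z))  # two-sided p-value
```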

The benchmark of Landis and Koch was used to evaluate the levels of agreement for the Kappa [25]: 0.00–0.20 slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, 0.81–1.00 almost perfect agreement. For the Fleiss Kappa, the following benchmark was adopted [26]: < 0.40 poor, 0.40–0.75 intermediate to good, > 0.75 excellent.
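These benchmarks are simple cut points; as a convenience, they can be encoded in R as follows (boundary handling at the exact cut points follows the text approximately).

```r
# Landis and Koch benchmark for Cohen's Kappa [25]
landis_koch <- function(k) {
  cut(k, breaks = c(0, 0.20, 0.40, 0.60, 0.80, 1.00), include.lowest = TRUE,
      labels = c("slight", "fair", "moderate", "substantial", "almost perfect"))
}

# Benchmark for Fleiss Kappa [26]
fleiss_benchmark <- function(k) {
  cut(k, breaks = c(-Inf, 0.40, 0.75, 1.00),
      labels = c("poor", "intermediate to good", "excellent"))
}

landis_koch(0.65)        # "substantial"
fleiss_benchmark(0.80)   # "excellent"
```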

To compare the results of the self-assessment questionnaire between the first and the second evaluation within each group, the non-parametric Wilcoxon signed-rank test was used. Comparisons among groups for the first and second assessments were performed using the Kruskal-Wallis test; pairwise comparisons were adjusted using the Bonferroni correction.
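A corresponding sketch in R; the object names (`score_first`, `score_second`, `selfassess`) are illustrative.

```r
# Within-group change on one questionnaire item (paired ordinal scores)
wilcox.test(score_first, score_second, paired = TRUE)

# Among-group comparison of one item at a single assessment; `selfassess`
# has columns `score` and `group` (control/training/students).
kruskal.test(score ~ group, data = selfassess)
pairwise.wilcox.test(selfassess$score, selfassess$group,
                     p.adjust.method = "bonferroni")  # Bonferroni-adjusted
```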

Results

Raters’ Characteristics

Twenty-nine clinicians were initially recruited for the study. However, 4 clinicians, 2 from the training group and 2 from the control group, dropped out before completing the second assessment. Thus, data from 25 raters were ultimately analyzed. Table 1 shows descriptive statistics on the clinicians’ characteristics, together with a comparison between the training and control groups. No significant differences were observed between the two groups, except for a higher number of FEES procedures attended/performed in the control group (p = .02).

Table 1 Characteristics of clinicians at baseline: age, sex, profession, years of experience, participation in FEES, and execution of FEES are reported

In addition to clinicians, 47 students (22.36 ± 3.54 years, 100% female) also participated in the present study as raters. Overall, 59.57% of the students had never observed a FEES, while 63.83% had completed a university internship with patients with dysphagia.

Training Results

Effect of the Training on Construct Validity

Results of the comparison of construct validity values between the first and the second assessment within each group are reported in Table 2. Concerning FEES frames, an almost perfect agreement was observed in the control, training, and student groups; no statistically significant differences were found between the first and second evaluations for any of the three groups. As for FEES videos, construct validity values significantly improved between baseline (substantial agreement) and post-training (almost perfect agreement) in the training group for the pyriform sinus videos. In students, construct validity significantly improved for both valleculae and pyriform sinus videos: agreement improved from substantial to almost perfect in the valleculae videos, and from moderate to almost perfect in the pyriform sinus videos (Table 2).

Table 2 Construct validity: comparison between baseline and second assessment in the Training, Control, and Student groups for Valleculae frames and videos, Pyriform sinus frames and videos

Pairwise comparisons of construct validity values among groups are reported in Table 3. At baseline, the ANOVA post hoc analysis showed that students exhibited significantly lower values of construct validity compared to the control group for the valleculae (both frames and videos) and compared to both groups of clinicians for the pyriform sinus videos (Table 3). At the second assessment, students had lower values of construct validity compared to the control group only for the valleculae frames (Table 3). No significant differences in construct validity scores were found between the training and control groups.

Table 3 Construct validity: pairwise comparisons among Training, Control, and Student groups for Valleculae frames and videos, Pyriform sinus frames and videos: ANOVA results
Table 4 Inter-rater reliability: comparison between baseline and second assessment in the Training, Control, and Student groups for Valleculae frames and videos, Pyriform sinus frames and videos
Effect of the Training on Inter-Rater Reliability

Table 4 shows inter-rater reliability results. No significant differences were observed between the first and the second assessments. Moreover, no significant differences in the inter-rater reliability values were found among groups at any time point (Table 5).

Table 5 Inter-rater reliability: comparison among Training, Control, and Student groups for Valleculae frames and videos, Pyriform sinus frames and videos: ANOVA results

Self-Assessment

Self-assessment results on perceived self-efficacy in interpreting FEES with the YPRSRS are reported in Tables 6 and 7. The Wilcoxon signed-rank test found no significant difference between the first and the second assessment among clinicians, regardless of their training status. The students’ values significantly improved at the second assessment (Table 6), although their perceived self-efficacy remained lower than the clinicians’ (Table 7).

Table 6 Self-assessment: comparisons between baseline and second assessment in the Training, Control, and Student groups for the 9 items of the self-assessment questionnaire
Table 7 Self-assessment: pairwise comparisons among No training, Training, and Student groups for the 9 items of the self-assessment questionnaire

Discussion

This study examined the efficacy of a training program in improving raters’ performance in assessing pharyngeal residue in FEES videos and frames using the YPRSRS, together with changes in self-assessment between baseline and the second assessment. The results showed an improvement in agreement between participants, particularly students, and the experts in interpreting FEES videos; this improvement cannot be attributed solely to task repetition.

Our training spanned approximately 160 min, encompassing video lessons, independent practice, and a debriefing meeting. Unlike previous trainings for the use of the YPRSRS [9,10,11, 13], our study introduced both practical exercises and a debriefing meeting to complement the theoretical lessons. Notably, the presented training lasted nearly 3 h, a substantial increase compared to previous studies (8 min in the German study [10] and 4 min in the Italian study [13]). Other training programs on visuoperceptual scales typically have a similar or longer duration compared to the current one. In Kaneoka et al.’s study [8], which aimed to demonstrate the reliability and validity of the Boston Residue and Clearance Scale (BRACS), four SLPs participated in a 3-hour session led by an expert clinician and co-creator of the BRACS. The Visual Analysis of Swallowing Efficiency and Safety (VASES) training [7] comprised five parts covering VASES rules, practice with five FEES videos, a pre-recorded 60-minute session, additional practice with another five FEES videos, and a live 60-minute session. The median completion time for this training was 6 h.

Furthermore, previous studies that validated the YPRSRS in different languages [9,10,11, 13] analyzed the effectiveness of the training on frames but not on videos. Videos were included in this study because they better reflect dysphagia assessment in actual clinical practice. In addition, video recording is preferred in the instrumental evaluation of swallowing to ensure the quality of diagnosis and management of dysphagia [27]. The training presented in this study has advantages and disadvantages. The online mode allowed participants to take classes from home at the times most convenient for them, and online and face-to-face training are comparable in terms of effectiveness [17]. However, the asynchronous mode did not allow direct and timely exchange between instructors and participants. This disadvantage was partly offset by the debriefing meeting.

Concerning clinicians, construct validity for pyriform sinus videos significantly improved after training, while no improvement emerged from the construct validity analysis of the control group at the second assessment. No group showed significant differences in inter-rater reliability values between the baseline and second assessments. Despite this, a trend toward improvement can be observed in the training group for the valleculae videos (kappa values rose from intermediate to good agreement to excellent agreement). The limited sample size, based on previously published studies [10], could have led to a lack of statistical power to detect significant differences. The training seems to have improved the rating precision of clinicians in assessing videos; however, similarly to previous studies [9,10,11, 13], such a result was not observed in raters’ performance on frames.

In contrast to previous studies, no difference was found between the training and control groups for inter-rater reliability scores associated with the frames assessment: Neubauer et al. [9] reported a significantly higher inter-rater reliability of trained raters in frames for both locations, and Gerschke et al. [10] found significant differences between trained and untrained raters only for valleculae frames ratings. To date, the YPRSRS is widely employed, although it was newly developed or only recently validated when the cited studies were conducted [9, 10]. Thus, it is possible to assume that participants were still not familiar with the scale in previous studies [10, 11, 13], while the higher level of clinical experience with the scale among participants in the present study could have influenced the efficacy of the training for frames.

In the students’ group, values for construct validity in FEES videos were significantly higher after training; the training reduced the gap between students’ and clinicians’ accuracy observed at baseline, as students achieved values of construct validity similar to clinicians’. For the first time, students were selected to participate in a training on the YPRSRS. The present results seem to confirm that the training, especially for less experienced raters, improves the accuracy of pharyngeal residue assessment in videos.

Analyses of the self-efficacy questionnaire showed that students felt more confident in using the scale after the training. This improvement in self-efficacy aligns with the progress noted in the second-assessment analysis, particularly in the pairwise comparisons with clinicians. No significant differences were observed between the baseline and second assessments in the clinicians’ groups. Notably, students generally exhibited lower confidence in applying the scale compared to clinicians in pairwise comparisons. An interesting finding emerges when comparing the training group to both the control and student groups during the second assessment regarding item 9, which concerns the need to review the video multiple times before assigning a score. The training likely enhanced clinicians’ thoroughness; consequently, they may have found it beneficial to watch the videos several times before finalizing their scores. Self-assessment questionnaires were not present in previous studies [9,10,11]. Moreover, none of the studies mentioned in a recent review on training to analyze functional parameters with FEES [14] included a measure specifically addressing self-confidence [7,8,9,10, 18, 28]. Investigating this aspect is relevant because, especially in inexperienced trainees such as students, good self-assessment skills are considered a key component in developing clinical skills [29].

There are some limitations in this study. All the students participated in the training; thus, having a control group composed of untrained students was impossible. Future studies should also include a control group of students. Because of drop-outs, the number of raters who completed the second assessment was lower than the expected sample size. Moreover, this study differs from previous studies on the YPRSRS in that the “best-of-the-best” criterion was not used in choosing frames and videos. Although this may better reflect what clinicians experience in clinical practice, the inclusion of videos gathered from everyday clinical practice in a University Hospital may have made the evaluation more complex.

Conclusion

The results showed that the training can improve clinicians’ and students’ agreement with experts in assessing pharyngeal residues with the YPRSRS in FEES videos. Promoting evidence-based training courses would allow students and clinicians with different backgrounds to share a standard method of interpreting FEES and of exchanging information in clinical practice and research.