Background

Pruritus is one of the most common and distressing symptoms in patients with chronic kidney disease receiving hemodialysis [1,2,3,4]. Chronic kidney disease-associated pruritus (CKD-aP) does not originate from skin lesions, but rather is a systemic, persistent itch sensation that often leads to considerable mechanical skin damage due to a continuous and uncontrollable urge to scratch [5, 6]. More than 60% of patients undergoing hemodialysis have some degree of pruritus, with 20–40% suffering from moderate-to-severe pruritus [1, 7,8,9]. Patients with CKD-aP suffer severely impaired health-related quality of life (HRQoL), including sleep disturbance, chronic fatigue, agitation, shame, social isolation, and depression [1, 3, 7, 8, 10, 11]. Severe itching is also associated with an increased risk of mortality [7]. Despite its high prevalence and distressing sequelae, CKD-aP remains poorly characterized and has no approved treatment [8]. The pruritus tends not to be adequately controlled by topical emollients, antihistamines, or steroids, or by off-label treatments such as gabapentin, which are not always well tolerated [2, 8].

Since pruritus is a symptom that only patients themselves can report on, a patient-reported outcome (PRO) measure is required to evaluate the efficacy of any new investigational treatment. Numerical Rating Scales (NRS) measuring worst itch intensity are commonly used in clinical trials, but few have had their psychometric properties evaluated in line with best practices and FDA evidentiary standards [12]. Furthermore, the magnitude of the reduction in NRS scores that represents meaningful improvement for patients with CKD-aP has not been extensively studied or established.

The Worst Itching Intensity NRS (WI-NRS) is a simple-to-use, single-item PRO [13, 14]. Patients indicate the intensity of the worst itching they have experienced over the past 24 h by marking one of 11 numbers, from 0 to 10, that best describes the worst itching experienced (“0” is labeled with the anchor phrase “no itching” and “10” is labeled “worst itching imaginable”). The WI-NRS has been validated for dermatologic conditions like psoriasis [15, 16] and atopic dermatitis [14] but not for systemic pruritus like CKD-aP. We previously identified that a reduction of ≥ 3 points on the WI-NRS represented a clinically meaningful response to treatment with the selective kappa opioid receptor agonist difelikefalin in hemodialysis patients with moderate-to-severe pruritus [13]. However, gaps remain in our understanding of the measure’s content validity from patients’ perspectives as well as its other psychometric properties, including test–retest reliability and whether it mirrors other methods of measuring changes in itch (i.e., known-groups validity).

The FDA’s Patient-Focused Drug Development Guidance suggests the use of mixed methods (quantitative and qualitative) to triangulate on defining meaningful within-patient change thresholds for clinical outcome assessments (COA) [17]. While there is guidance on quantitative approaches to determine meaningful within-patient change thresholds (anchor-based methods are preferred) [18, 19], there is no consensus on optimal methods for qualitative or other mixed-methods approaches. An emerging approach for evaluating meaningful within-patient change thresholds for COAs is to survey or interview patients as they exit a clinical trial to ascertain their experience of treatment, whether the change they experienced was meaningful, and to gather further interpretation of score changes on administered COA endpoints [20,21,22].

Thus, the goal of the present study was to evaluate the content validity and psychometric properties of the WI-NRS in hemodialysis patients with CKD-aP based on qualitative interviewing and quantitative methodologies, as well as to confirm our earlier estimated meaningful change threshold [13] using anchor-based analyses and mixed methods exit interviews.

Methods

Content validity methods

Content validity of the WI-NRS (see Additional file 1: Fig. S1) was evaluated through qualitative interviews with hemodialysis patients with CKD-aP of any severity. Interview participants were recruited from four dialysis centers in the US, had to be aged ≥ 18 years, on hemodialysis three times per week for ≥ 3 months before screening, self-reporting pruritus ≤ 1 month before screening, and could not have pruritus unrelated to CKD, pruritus only during dialysis sessions, or a co-morbidity that might compromise the patient, study, or study measures. The content validity interviews included concept elicitation questions to ensure participants’ descriptions of their CKD-aP were consistent with the WI-NRS content and wording, and standardized cognitive interviewing to ensure that the wording, response options, and recall period were appropriate for capturing patients’ experiences. Interviews were conducted in English following a semi-structured interview guide, took approximately 60 min, and were digitally audio-recorded with the consent of the participants. Transcripts were analyzed using ATLAS.ti (version 7.5.12 or higher). After the first five interviews, a high-level qualitative analysis determined that no modifications to the WI-NRS were required.

Psychometric analyses

Psychometric properties of the WI-NRS were assessed using data collected from one phase 2 [23] and two phase 3 (US-based KALM-1 and global KALM-2) [24, 25] randomized placebo-controlled multicenter studies investigating the safety and efficacy of intravenous difelikefalin in patients with moderate-to-severe pruritus undergoing hemodialysis. The phase 2 dataset was used to assess psychometric validity. Pooled phase 3 trial data were used for confirmatory analyses and in an anchor-based analysis to verify the meaningful change threshold previously established with phase 2 data [13]. Eligibility criteria for patients in the phase 2 (N = 174) and phase 3 (N = 848) trials were similar to those for the content validity interviews, although patients were additionally required to self-report baseline pruritus severity of ≥ 4 on the WI-NRS (calculated as the average of the daily WI-NRS scores collected over a 7-day run-in period) [23,24,25]. WI-NRS data were analyzed as weekly mean scores, defined as the average of the daily ratings for each week from baseline to the last week of the treatment period. For a weekly score to be calculated, data had to be available for ≥ 4 of 7 days; otherwise, the weekly score was set to missing. Table 1 details other PRO measures from the phase 2 and phase 3 studies used in the psychometric analyses. Psychometric assessments were evaluated in line with the US Food and Drug Administration guidance on PROs [12]. Statistical analyses were conducted using SAS version 9.4 and used a 2-sided significance level of P < 0.05.
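The weekly scoring rule described above can be illustrated with a minimal Python sketch. It is not the study's SAS code; the function name `weekly_mean_wi_nrs` and the convention of marking a missing day with `None` are assumptions made for illustration only.

```python
from statistics import mean
from typing import Optional, Sequence

# From the text: a weekly score needs ratings on at least 4 of 7 days.
MIN_DAYS_REQUIRED = 4


def weekly_mean_wi_nrs(daily_scores: Sequence[Optional[float]]) -> Optional[float]:
    """Average the available daily WI-NRS ratings for one 7-day week.

    Missing days are passed as None (an assumption of this sketch).
    Returns None, i.e. a missing weekly score, when fewer than
    MIN_DAYS_REQUIRED daily ratings are available.
    """
    if len(daily_scores) != 7:
        raise ValueError("expected 7 daily entries; use None for missing days")
    observed = [s for s in daily_scores if s is not None]
    if len(observed) < MIN_DAYS_REQUIRED:
        return None
    return mean(observed)


# Four ratings available: the weekly score is their average.
week_ok = weekly_mean_wi_nrs([5, 6, None, 7, None, 6, None])
# Only three ratings available: the weekly score is set to missing.
week_missing = weekly_mean_wi_nrs([5, None, None, 7, None, 6, None])
```

Under this rule the average is taken over the available days only; no imputation of the missing days is implied.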

Table 1 Patient-reported outcome measures

Test–retest reliability

For the phase 2 cohort, test–retest reliability was assessed by determining intraclass correlation coefficients (ICCs) between Weeks 1 and 2 and between Weeks 2 and 4, based on the ICC(2,1) method [29]. Patients with the same Patient Global Impression of Worst Itch Severity (PGI-S) response between the test and retest time points were defined as stable and included in the analysis. For the phase 3 cohort, test–retest reliability was assessed using the same time points with all evaluable patients included. As generally accepted [30, 31], test–retest reliability was supported with ICCs > 0.70.
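For readers unfamiliar with the ICC(2,1) form, the following Python sketch computes it from the two-way ANOVA mean squares (two-way random effects, absolute agreement, single measurement). It is an illustration under stated assumptions rather than the study's SAS implementation: the function name `icc_2_1` and the input layout (one row per stable patient, one column per time point) are hypothetical.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) for an n-subjects by k-occasions table of scores.

    ICC(2,1) = (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
    where MSR, MSC and MSE are the between-subject, between-occasion and
    residual mean squares. Raises ZeroDivisionError if all scores are equal.
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 subjects and 2 occasions, no gaps")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Made-up weekly means (test week, retest week) for three stable patients.
identical = icc_2_1([[4.0, 4.0], [6.0, 6.0], [8.0, 8.0]])
shifted = icc_2_1([[4.0, 5.0], [6.0, 7.0], [8.0, 9.0]])
```

Because ICC(2,1) measures absolute agreement, the second example, in which every patient scores one point higher at retest, yields a value below 1 even though the rank order is perfectly preserved.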

Construct validity

The construct validity of the WI-NRS was assessed by examining convergent and divergent validity. Moderate (r ≥ 0.3 to < 0.5) or large (r ≥ 0.5) convergent correlations by Cohen’s standards [32] were hypothesized for the PGI-S (phase 2 only) and for items within the Skindex-10 and the 5-D Itch that measure similar concepts to the WI-NRS. The MOS Sleep Scale domain scores were used for divergent validity tests on the phase 2 data (i.e., to assess the extent to which sleep and itch, which are less related concepts, exhibit low correlations [r < 0.3] with one another).
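The correlation bands used for these hypotheses can be written as a small helper. This is a sketch only: the function name `cohen_band` is hypothetical, and applying the bands to the absolute value of r is an assumption, since the text does not say how negative correlations were handled.

```python
def cohen_band(r: float) -> str:
    """Classify a correlation coefficient using the bands quoted in the text.

    small: |r| < 0.3, moderate: 0.3 <= |r| < 0.5, large: |r| >= 0.5.
    Using the absolute value is an assumption of this sketch.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "moderate"
    return "small"


# Convergent validity is hypothesized as "moderate" or "large";
# divergent validity as "small".
convergent_ok = cohen_band(0.45) in {"moderate", "large"}
divergent_ok = cohen_band(0.20) == "small"
```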

Known-groups validity

To assess the discriminant properties of the WI-NRS, known-groups validity was evaluated by creating groups using the PROs collected from the phase 2 study (PGI-S, Patient Self-categorization of Pruritus Disease Severity, Skindex-10, 5-D Itch, MOS Sleep Problem Index II) and the pooled phase 3 studies (Skindex-10, 5-D Itch). The mean of the screening (i.e., baseline) WI-NRS was computed for each category of each PRO measure. As the data were normally distributed (by Kolmogorov–Smirnov test), a linear model analysis of variance (ANOVA) was conducted with the baseline weekly mean WI-NRS as the dependent variable and the categorical known group as the independent variable (separate models for each individual known group) to evaluate differences in weekly mean WI-NRS scores. Two-sample t-tests were used to compare differences in WI-NRS for known groups with two categories; linear model ANOVA was used for known groups with more than two categories.
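As an illustration of the known-groups comparison, the sketch below computes the one-way ANOVA F statistic for baseline WI-NRS scores split by a categorical grouping. It is not the study's SAS analysis: the function name `one_way_anova_f` is hypothetical, and the P value is deliberately omitted because it requires the F distribution, which the Python standard library does not provide.

```python
from statistics import mean
from typing import Sequence, Tuple


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA.

    Each inner sequence holds the baseline scores of one known group.
    Raises ZeroDivisionError if there is no within-group variation.
    """
    k = len(groups)
    if k < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    if n_total <= k:
        raise ValueError("need more observations than groups")

    grand = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up baseline scores for two severity groups.
f_stat, df_b, df_w = one_way_anova_f([[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]])
```

With exactly two groups, F equals the square of the pooled-variance two-sample t statistic, which is why the two approaches named above agree in that case.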

Meaningful change threshold study and analysis

The anchor-based methods and meaningful change threshold for the phase 2 cohort have been previously published [13]. The same anchor-based approach was used to define the point-change on the WI-NRS (change from baseline to end of treatment) that represented a clinically meaningful improvement to patients in the pooled phase 3 cohort. The Patient Global Impression of Change (PGI-C) was used as the anchor; this FDA-recommended [33] measure specifically asks patients to indicate the improvement of their condition taking into consideration treatment effect and patient expectation. The “minimally improved” PGI-C anchor category was used in the primary anchor approach. The “minimally improved” and “much improved” categories were combined for use as a secondary anchor.
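The anchor-based summary can be sketched as a grouped mean of WI-NRS change scores by PGI-C category. The code below is illustrative only: the function names, the record layout (PGI-C category, change score), and the example values are hypothetical, and only the two category labels quoted in the text are taken from the source.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple

Record = Tuple[str, float]  # (PGI-C category, WI-NRS change from baseline)


def mean_change_by_anchor(records: Iterable[Record]) -> Dict[str, float]:
    """Mean WI-NRS change score within each PGI-C category."""
    by_category: Dict[str, List[float]] = defaultdict(list)
    for category, change in records:
        by_category[category].append(change)
    return {category: mean(values) for category, values in by_category.items()}


def mean_change_combined(records: Iterable[Record],
                         categories: Sequence[str]) -> float:
    """Mean WI-NRS change score pooled over several PGI-C categories."""
    pooled = [change for category, change in records if category in categories]
    if not pooled:
        raise ValueError("no records in the requested categories")
    return mean(pooled)


# Made-up data; negative values are reductions (improvements) in itch.
example = [
    ("minimally improved", -2.0),
    ("minimally improved", -1.0),
    ("much improved", -4.0),
    ("no change", 0.0),  # hypothetical label for a non-improved category
]
primary = mean_change_by_anchor(example)["minimally improved"]
secondary = mean_change_combined(
    example, ["minimally improved", "much improved"]
)
```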

Exit study to further evaluate threshold of meaningful change

To determine what constituted a meaningful change from patients’ perspectives, mixed-method exit interviews were conducted with patients completing the phase 3 trials using methodologies adapted from Koochaki et al. [21] and McCarrier et al. [20]. For the exit interviews, eligible patients had to complete the final visit of the 12-week double-blind treatment period of either phase 3 trial. Enrollment to the exit interviews was stratified to ensure different point change ranges on the WI-NRS were represented: 10–12 patients reporting a one-point improvement and 15–20 reporting a two-, three-, and four-point improvement on the WI-NRS from baseline to Week 8–10. Exit interviews involved one-on-one, telephone-based interviews in either English or Spanish. Interviews lasted 60–90 min, and were conducted using a semi-structured interview guide. Participants were asked to complete the modified Patient Global Impression of Change (M-PGIC) measure (see Table 1) to evaluate whether the change in itch they experienced during the trial was meaningful to them, with a qualitative discussion of why they considered the change meaningful. Patients were then asked to review the WI-NRS and their WI-NRS change score recorded in the clinical trial (end-of-study weekly mean – baseline weekly mean), with discussion of whether that change was or was not meaningful. Distribution of WI-NRS change scores and % changes were analyzed by M-PGIC category and by participant responses on meaningful change.
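The change score reviewed with participants, and the one-point improvement ranges used to group them, can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, the percent change is taken relative to the baseline weekly mean (the text does not state the denominator), and the top range of four or more points mirrors the grouping reported for the exit-interview sample.

```python
def change_score(baseline_mean: float, end_mean: float) -> float:
    """End-of-study weekly mean minus baseline weekly mean (negative = improved)."""
    return end_mean - baseline_mean


def percent_change(baseline_mean: float, end_mean: float) -> float:
    """Signed percent change relative to baseline (denominator is an assumption)."""
    if baseline_mean <= 0:
        raise ValueError("baseline weekly mean must be positive")
    return 100.0 * (end_mean - baseline_mean) / baseline_mean


def improvement_range(baseline_mean: float, end_mean: float) -> str:
    """Label the one-point improvement range that a participant falls into."""
    improvement = baseline_mean - end_mean  # positive = less itch
    if improvement < 0:
        return "< 0"
    if improvement >= 4:
        return ">= 4"
    lower = int(improvement)  # floor, since improvement is non-negative here
    return f">= {lower} to < {lower + 1}"


# Made-up participant: baseline weekly mean 7.0, end-of-study weekly mean 4.5.
delta = change_score(7.0, 4.5)
pct = percent_change(7.0, 4.5)
label = improvement_range(7.0, 4.5)
```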

Results

Content validity

Twenty-three interviews assessing content validity were conducted between June and August 2016 across four US sites: New York (n = 4, 17.4%), Florida (n = 5, 21.7%), California (n = 8, 34.8%), and Tennessee (n = 6, 26.1%). Participants had a mean age of 55.4 ± 17.0 years and most were White (n = 10, 43.5%), male (n = 14, 60.9%), and not Hispanic (n = 15, 65.2%) (Table 2). During concept elicitation, "itch" or "itching" were the terms most commonly used to describe CKD-aP. When asked about itch intensity and severity, many participants (n = 12, 52.2%) spontaneously provided a numerical response on a 0–10 severity scale. Some (n = 6, 26.1%) rated their itching as at least a “6” or “7” on a 1–10 or 0–10 scale. One participant (4.3%) rated their itching severity as “8–10” at night, but “5” during the day. Concept elicitation results were consistent with WI-NRS item wording and supportive of the response scale. Overall, the cognitive interviewing results showed that participants provided positive feedback on the WI-NRS and reported that the questionnaire was straightforward, comprehensive, and relevant to their experiences with CKD-aP. In addition, the instructions, wording, and response options were well understood by participants. They were able to easily select a response option and describe how they arrived at their answers. Based on a detailed review of the data, no changes to the WI-NRS were recommended.

Table 2 Patient characteristics

Psychometric validation

Demographics of the phase 2 and pooled phase 3 cohorts are given in Table 2.

Test–retest reliability

Patients from the phase 2 trial who were stable on the PGI-S had good reproducibility on their weekly mean WI-NRS scores between Week 1 and Week 2 (ICC = 0.76) and between Week 2 and Week 4 (ICC = 0.81) (Additional file 1: Table S1). WI-NRS scores for patients from the pooled phase 3 trials were also reproducible, with ICC = 0.80 between Week 1 and Week 2 and ICC = 0.81 between Week 3 and Week 4. The values were above the generally accepted 0.7 threshold [30], supporting the test–retest reliability of the WI-NRS.

Construct validity

WI-NRS scores significantly correlated with the Skindex-10 and 5-D Itch measures in both phase 2 and phase 3 datasets, especially with the conceptually related Skindex-10 Disease domain (r = 0.7–0.8) and the 5-D Itch Degree domain (r = 0.65–0.67) at the end of treatment (Table 3). Similarly, the weekly mean WI-NRS from the phase 2 trial patients was significantly correlated with the conceptually related PGI-S scale at the end of treatment (r = 0.63). Overall correlations were better at the end of treatment than at baseline, most likely due to higher score variance at this timepoint (to be randomized, subjects had to report WI-NRS ≥ 4 at screening). For the phase 2 trial patients, as hypothesized, correlations with the conceptually unrelated domains of the MOS Sleep measure (Sleep Problem Index I and II, and Sleep Disturbance) were small (r = 0.16–0.26) by Cohen’s standards [32].

Table 3 Construct validity

Known-groups validity

For both the phase 2 and phase 3 cohorts, the baseline WI-NRS scores were significantly different (P ≤ 0.032) between known groups of the conceptually related 5-D Itch total score and Skindex-10 measures (Table 4). Known-groups comparisons of WI-NRS against Patient Self-Categorization of Pruritus Disease Severity (‘Profile B’ versus ‘Profile C’) and PGI-S were also statistically significant and in the anticipated direction in the phase 2 cohort. Overall, higher (worse) mean baseline WI-NRS scores were observed for groups with worse categories defined by these independent variables. WI-NRS scores at baseline did not differ significantly when grouped by quartiles of the conceptually unrelated MOS Sleep Problem Index II (P = 0.1049; phase 2 cohort only).

Table 4 Known-groups validity of WI-NRS vs. other measures at baseline

Threshold of meaningful change

For the pooled phase 3 cohorts, the mean change in WI-NRS associated with a change from baseline to ‘minimally improved’ on the PGI-C was − 1.85 points (26% change; Table 5). Based on the secondary anchor-based approach (representing larger changes), the mean change in WI-NRS associated with a change to a much improved response on the PGI-C was − 3.54 points (51% change). The mean WI-NRS change associated with a change to minimally or much improved on the PGI-C was − 2.72 points (39% change). Mean WI-NRS change values for each PGI-C category are given in Additional file 1: Table S2.

Table 5 Meaningful change thresholds for WI-NRS (phase 3 cohort)

Exit interviews

Participant characteristics

Exit interviews were conducted with 70 patients in the US completing the phase 3 trials. Stratification targets of 10–20 patients by range of point reduction on the WI-NRS were met for all subgroups, except for the ≥ 3 to < 4-point reduction subgroup (n = 9). Forty-seven interviews were conducted in English and 23 in Spanish. Participants were mostly White (n = 42, 60.0%) and male (n = 46, 65.7%), and had a mean age of 55.7 ± 12.1 years (Table 2). Eight (11%) completed the interview after the specified interview window of 1–3 days after the first visit of Week 13 in the trial. One participant only answered questions related to her general itch experience, ended the study before the quantitative questionnaires were completed or debriefed, and could not be reached in follow-up attempts.

Baseline WI-NRS scores recorded in the trial ranged from 4 to 10 (Additional file 1: Table S3). Most participants had experienced baseline to Week 12 WI-NRS improvement scores ≥ 4 points (n = 26, 37.1%), followed by those who had improvement scores of ≥ 2 to < 3 (n = 18, 25.7%), ≥ 1 to < 2 (n = 10, 14.3%), ≥ 3 to < 4 (n = 9, 12.9%), ≥ 0 to < 1 (n = 5, 7.1%), and < 0 (n = 2, 2.9%).

Evaluation and discussion of meaningful change

For the M-PGIC completed during the interview, most participants reported reduced itch and that the amount of improvement was meaningful to them (n = 37/70, 52.9%). All participants with WI-NRS changes < 1 point reported on the M-PGIC that the change experienced in itch was either not meaningful to them, or that there was no change or worsening (n = 7; Fig. 1a). Half of respondents with a WI-NRS change of ≥ 2 to < 3 points (8/16, 50.0%) and most with a change ≥ 3 points (25/35, 71%) indicated the improvement was meaningful on the M-PGIC.

Fig. 1 Evaluation of meaningful within-patient change on the WI-NRS in exit interviews. a Exit interview M-PGIC responses by WI-NRS change score. b WI-NRS scores by participant response on whether change was clinically meaningful. Participants who reported worsening itch over the trial were not asked if change was or was not meaningful. Abbreviations: M-PGIC, modified Patient Global Impression of Change; WI-NRS, Worst Itching Intensity Numerical Rating Scale

When given the opportunity to review their WI-NRS change score over the course of the trial, most participants who responded indicated that their change on the WI-NRS was meaningful (n = 54/59, 92%; Fig. 1b). This included 67% of respondents (n = 6/9) with ≥ 1 to < 2-point WI-NRS changes, 93% (n = 14/15) with ≥ 2 to < 3-point changes, and all respondents (n = 32/32) with WI-NRS changes ≥ 3 points. While reviewing the WI-NRS results, 18 participants who had not reported meaningful change on the M-PGIC changed their responses and said that the change on the WI-NRS was meaningful. Thus, the distribution of participants reporting meaningful improvement differed between the M-PGIC responses and WI-NRS point-change consideration.

Participants described similar reasons for selecting the M-PGIC category of meaningful improvement – most typically reductions in frequency (e.g., “in the first week, I started to notice that the itching was less frequent”), intensity (e.g., “I mean I still itch every day, but it’s not as bad”), and duration of itch, leading to HRQoL improvements such as improved mood, increased focus, and improved sleep (e.g., “I can lay in the bed and I can go to sleep and the itching now does not wake me up in my sleep”). Those who experienced improvement but considered it not meaningful described reduced frequency, severity, or duration of itch but described that the improvements were intermittent, for example, only on dialysis days.

Participants who reported their WI-NRS change score was meaningful indicated noticing their itch improving (n = 39/55, 71%). For example, participants noted reduced itch frequency (n = 25/55, 45%), general itch reduction (n = 12/55, 22%), and decreased severity (n = 7/55, 13%). Some participants also described not feeling as embarrassed or self-conscious in public (n = 7/55, 13%), physical improvements on their skin as it healed (n = 6/55, 11%), and improved quality of life or state of mind (n = 6/55, 11%). Of the five participants who reported their WI-NRS change score was not meaningful, two specified that they were still experiencing itch, two said the change was not great enough for them to consider it meaningful, and one described no change in itch at all.

Discussion

While several PROs have been developed to assess itch, few have been validated for use in clinical trials of patients with CKD-aP [8, 34], and none have had the threshold of meaningful improvement determined in these patients. Here, using a mixed methods approach, we showed the WI-NRS to be a reliable and valid PRO measure for CKD-aP. Moreover, the findings were confirmed across several large patient cohorts that together represent an international population. The content validity interviews indicated patients found the WI-NRS relevant, and that the item wording, response options, and recall period were appropriate for capturing the experiences of patients with CKD-aP. Test–retest reliability over two weeks for the WI-NRS was strong (ICCs > 0.75) [30] in both clinical trial cohorts, and is comparable to that for other PROs used to assess itch intensity in patients with chronic itch [35, 36]. Although no anchor was available to define stable itch in the phase 3 cohort test–retest analyses, ICCs ≥ 0.80 at the discrete test–retest time points indicated enough stability in the sample (which included placebo patients) and good test–retest reliability. The construct validity analysis indicated the measure correlated well with the Skindex-10 and 5-D Itch measures, especially with conceptually related domains within those measures. The anchor-based analyses of the phase 3 cohort support that an improvement from baseline of ≥ 3 points represents an appropriate definition of meaningful within-patient change on the WI-NRS. This is consistent with our previous findings for the phase 2 cohort, where a ≥ 3-point meaningful within-patient change threshold in WI-NRS was likewise identified using quantitative distribution- and anchor-based methods [13].

A key strength of our study was the inclusion of exit interviews to confirm patients’ perspectives of what constituted a meaningful within-patient change on the WI-NRS [22]. These exit interviews used novel qualitative methodology, leveraging the weekly mean WI-NRS data from baseline and Week 12 of the clinical trials and exploring change categories by M-PGIC. Further, we used a second methodology, where we shared with participants their actual WI-NRS score changes and asked them to discuss whether or not this point change represented a meaningful change. This allowed participants to reflect and comment on their actual lived experience, as opposed to being asked to provide feedback on a hypothetical scenario [20]. In the exit interviews, when reviewing actual WI-NRS change scores experienced, all patients with a change ≥ 3 points considered the change meaningful, mentioning reduced intensity, frequency, and duration of itch and improvements in HRQoL. However, meaningful changes were also reported by two-thirds of participants with score changes in the range of 1–1.99 points, suggesting that changes on the WI-NRS do not have to be large to be meaningful in this population. This indicates both that there are individual differences in the magnitude of change considered meaningful by patients and that many patients with CKD-aP will experience meaningful improvements with changes below the 3-point threshold.

In the exit interviews, the distribution of participants reporting meaningful improvement in their itch intensity differed between the M-PGIC responses and WI-NRS point-change consideration. This could be due to differences in the tasks asked of patients: patients could have interpreted the M-PGIC method and question to refer to their global experience related to itch in the clinical trial, whereas reviewing the WI-NRS change score may have been viewed as more specific to improvements in itch intensity. Also, some differences might be expected in patients’ responses between a 4-option categorical scale and an 11-point NRS. The order of administration of the two methods may also have influenced the results.

Although enrollment was stratified by WI-NRS point change to best represent the wider trial population completing the 12-week treatment period, patients in the exit interviews may not fully represent the real-world population since the trials included only patients with moderate-to-severe CKD-aP, whereas many patients have milder itch [1, 7,8,9].

Conclusions

In conclusion, the results from this study add to evidence supporting the reliability, validity, and responsiveness of the WI-NRS for measuring itch intensity in patients with CKD-aP undergoing hemodialysis. The WI-NRS may therefore be used to assess the efficacy of anti-pruritic treatments, and potentially in clinical evaluation and management of pruritus in this population. These results are strengthened through two separate analyses: one conducted in a phase 2 trial cohort and a confirmatory analysis in a larger pooled cohort of phase 3 trial patients. The proposed, conservative ≥ 3-point reduction on the WI-NRS represents a meaningful within-patient change threshold that can be used to interpret results from clinical trials involving patients undergoing hemodialysis with moderate-to-severe pruritus, for example, to identify responders and non-responders to treatment.