Key summary points

Weak measurement, from any cause, risks type-II errors (i.e. false negatives)

Content validity limitations are an important source of PRO measurement weakness

We examined how MS fatigue PROs adhere to content validity standards

Quality has improved over time, but the fatigue PROs examined had poor content validity. One likely outcome is type-II errors in clinical trials

COSMIN content validity criteria need to be more specific and stringent

Digital Features

This article is published with digital features, including a video abstract to facilitate understanding of the article. To view digital features for this article go to https://doi.org/10.6084/m9.figshare.22812440.

Introduction

Patient-reported outcomes (PROs) play an increasingly prominent role in clinical trials, drug approval and reimbursement. Any PRO measurement shortcomings risk type-II errors, threatening treatment development and licensing. The negative implications are pervasive and damaging, not just for trial outcomes and the treatment options available to health care providers but, more importantly, for patient well-being [1].

Over the last decade, scientific and regulatory criteria for PRO development have evolved. There has been a shift from primarily emphasising psychometric properties to incorporating aspects of content validity; i.e. the extent to which PROs adequately reflect a defined measurement construct [2]. The importance of content validity cannot be overstated, but, we believe, it remains underappreciated and misunderstood.

At face value, the notion of content validity is beguilingly simple: it is the extent to which a set of items fairly represents the construct it purports to measure. However, more careful consideration clarifies content validity’s fundamental importance and complexity, and helps to explain why it is cited as the most important PRO measurement property [3] and a pre-requisite for any statistical (“psychometric”) examinations [4].

In PRO measurement, the responses to a set of questions (items) are combined to derive a score. This score is intended to quantify a health concept or variable; for example, fatigue. Therefore, the items link the concept to the score. If the score derived from an item set is to be a valid indicator of the concept, the concept must be clearly defined and broken down into its relevant components and subcomponents. This process is known as construct definition, conceptualisation, and conceptual framework development. When this process is not explicit, validity is compromised to an extent that cannot be determined or quantified; in other words, unquantifiable type-II errors are liable to occur.

There is another critical step on the route to achieving content validity, which comes after conceptual framework development: the articulation of the subcomponents proposed to be scored (for example, the motor, cognitive and psychosocial impacts of fatigue) as sets of items. Again, the subcomponents should be defined and broken down into their parts, so that the link between the items and the subcomponent is explicit. Item wording also requires careful attention, so that it aligns with the subcomponent and the other items in the set, and articulates both the concept and the measurement aspect (e.g. frequency, intensity, severity). Wording should be as unambiguous as possible. When all these steps are carefully attended to, the link from the concept, via its components and subcomponents, through the items to the score, is clear and content validity can be considered achieved.

This description indicates that beguilingly simple content validity comprises multiple aspects, which, when incomplete, compromise PRO validity, causing uncertain measurement, with type-II errors as the ultimate result. There are two additional complexities. First, there is no external method of “proving” content validity. Second, achieving content validity does not guarantee that the scores generated by an item set satisfy requirements for measurement. This is the related but independent domain of psychometric (statistical) measurement performance testing, which is only meaningful when content validity is established.

While content validity cannot be proven, guidance aids its achievement and evaluation. The US Food and Drug Administration (FDA) advises that PRO content validity should be underpinned by well-defined concepts of interest, contexts of use, and conceptualisation, with an item development process involving members of the target population for which the instrument is being designed [4]. The FDA principles are integrated into updated COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) PRO development guidance [3, 5]. When these updated guidelines have been applied to PRO content development in diverse clinical contexts [5,6,7,8,9,10], including upper limb function [5], consistently low-quality content validity evidence has resulted. General standards of PRO content validity have been described as “questionable” [8] and “worrisome” [6, 7].

We aimed to establish the extent to which fatigue PROs used in multiple sclerosis (MS) clinical trials satisfy COSMIN’s content validity recommendations. We chose fatigue as this is one of the most common and burdensome symptoms for people living with MS (PLwMS) [11, 12], and there has been limited progress over time in our understanding of it and of its management. This could reflect a measurement problem.

Methods

Overview

We searched for existing fatigue PRO instruments used in MS studies and assessed these against the COSMIN content validity criteria [3]; specifically, elements related to the conceptual basis of the instruments, namely, the PRO construct, conceptual framework, target population, context of use, development sample, qualitative work, and use of literature reviews in the instrument development process. While our final item (concerning appropriate literature searches) is not explicitly defined within the COSMIN criteria, FDA guidance [4] states that content development can involve both qualitative work and literature reviews. We did not evaluate psychometric criteria, as adequate content validity is a pre-requisite for meaningful psychometric comparisons and interpretations, and poor content development is not negated by a strong psychometric profile.

Literature review

Embase® and Medline® were searched for English language publications up to 20 October 2021. Our search terms were: “multiple sclerosis” AND “fatigue” AND (“instrument” OR “patient reported outcome” OR “patient-reported outcome” OR “questionnaire”). Abstracts were screened to identify fatigue PRO instruments, and relevant PRO development papers retrieved.

Instruments were retained for further assessment if they were (1) MS-specific fatigue PROs or (2) non-disease-specific (i.e. generic) fatigue PROs. We excluded fatigue items embedded within broader instruments and fatigue PROs specific to other diseases. Most PROs had a single associated development paper. When relevant, linked papers were retrieved in line with COSMIN’s recommendations for using ‘indirect evidence’ and ‘other additional information’ when assessing PRO content validity [3].

Quality analysis

Table 1 details the COSMIN standards [3, 13] against which our extracted PROs were assessed. Consistent with previous publications [5,6,7], we focused solely on checklist items associated with PRO content (and which map directly to FDA guidance [4, 5]). Extracted information on each of the seven content development domains from the PRO development papers was rated on a 4-point scale: “good” (3), “adequate” (2), “doubtful” (1), “poor/none” (0). Instruments were independently rated by two reviewers, and discrepancies were discussed and resolved. We also catalogued each PRO’s number of items, item wording, response format, recall period, and administration format.
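The arithmetic of this scoring is simple and can be sketched in a few lines of Python for illustration. The rating mapping and the seven domains follow the description above; the example ratings are invented, not taken from any instrument in this study.

```python
# Minimal sketch of the scoring scheme described above: seven content
# development domains, each rated 0-3, summed to a total (maximum 21).
# The example ratings below are hypothetical.
RATING = {"good": 3, "adequate": 2, "doubtful": 1, "poor/none": 0}

DOMAINS = [
    "construct", "conceptual framework", "target population",
    "context of use", "development sample", "qualitative work",
    "literature review",
]

def total_cosmin_score(ratings: dict) -> int:
    """Sum the 0-3 numeric equivalents of the seven domain ratings."""
    return sum(RATING[ratings[domain]] for domain in DOMAINS)

example = {
    "construct": "good", "conceptual framework": "adequate",
    "target population": "good", "context of use": "good",
    "development sample": "doubtful", "qualitative work": "adequate",
    "literature review": "poor/none",
}
print(total_cosmin_score(example))  # 14 of a possible 21
```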

Table 1 COSMIN checklist items

Changes in quality of instrument development over time

To determine if PRO development quality has changed over time, we rank-ordered the PROs by development year and plotted total COSMIN scores over time. We compared average COSMIN scores of PROs developed before and after the FDA guidance was published. We examined scatterplots and computed the correlation between total COSMIN score and the annual PubMed citation frequency of each development paper (a proxy for instrument use). We also searched clinicaltrials.gov to determine the number of clinical trials using each PRO, and examined the correlation between total COSMIN score and the annual number of studies using each PRO. Statistical analyses were conducted in JASP v.0.16.2 (https://jasp-stats.org/).
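For readers who prefer code to a GUI, the before/after comparison reported in the Results (a Mann-Whitney U test) can be sketched in Python with scipy instead of JASP. The scores below are hypothetical placeholders chosen only to echo the reported group sizes and medians; they are not the study data.

```python
# Minimal sketch of the pre-/post-guidance comparison, using scipy
# rather than JASP. Hypothetical total COSMIN scores (maximum 21).
from scipy.stats import mannwhitneyu

pre_fda = [5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9]  # n = 12, median = 7
post_fda = [10, 12, 12, 13, 14, 15]              # n = 6, median = 12.5

u_stat, p_value = mannwhitneyu(pre_fda, post_fda, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")
```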

Ethics

This article is based on previously conducted studies and does not contain any new studies with human participants or animals. Ethics committee approval was therefore unnecessary.

Results

Literature review

Searches identified 3814 abstracts containing 87 unique PROs. Figure 1 and Supplementary Table S2 show why 69 PROs were excluded.

Fig. 1 PRISMA flow diagram illustrating literature searches, inclusion/exclusion criteria, and identification of 18 fatigue PRO instruments

PRO descriptions

We retained 18 PROs for quality analysis, comprising 10 generic fatigue measures and 8 MS-specific measures. Table 2 presents descriptive information, including the original development references, linked papers, and instrument characteristics.

Table 2 Properties of fatigue PRO instruments

Quality analysis

Table 3 provides the content development quality assessment for each instrument. Table 4 summarises the overall findings across all 18 PROs. Table S1 provides text extracts from the development papers that informed our scoring. Below, we summarise the findings for each assessment criterion.

Table 3 Quality analysis of content development for 18 fatigue PRO instruments
Fig. 2 Association between total COSMIN score and a citations/year and b studies using instrument

Table 4 Summary of content development quality analysis for fatigue PRO instruments

Fatigue construct description

COSMIN guidance states that the construct description ‘should be clear enough to judge whether the items of a PROM are relevant for the construct and whether the construct is comprehensively covered by the items’ [3]. Only one instrument [the Neurological Fatigue Index—Multiple Sclerosis (NFI-MS)] was rated ‘good’; its definition was elaborated in a separate publication [14, 15]. Table S3 provides all fatigue definitions given by the PRO authors, enabling their comparison. Half [9/18 (50%)] were rated ‘doubtful’ and seven (39%) were rated ‘poor/none’. For example, the Fatigue Severity Scale (FSS) authors say the instrument was designed to ‘measure fatigue severity’ without first providing a definition of fatigue, simply stating that fatigue ‘has been notoriously difficult to define’ [16].

Conceptual/theoretical framework

COSMIN guidelines require the origin of the target construct to be ‘based on a theory, conceptual framework, or disease model’. Here, only the NFI-MS and the Fatigue Symptoms and Impact Questionnaire—Relapsing Multiple Sclerosis (FSIQ-RMS) received a ‘good’ rating. The developers produced conceptualisations of fatigue in MS using semi-structured qualitative interviews with PLwMS, developing a framework of themes/subthemes for which they evidenced widespread endorsement [14]. These themes were used as a substrate for PRO development [15, 17].

Five instruments (28%) provided some conceptual basis for item selection and were rated ‘adequate’ rather than ‘good’, often because the link between the conceptual underpinning and item selection was not as comprehensive as COSMIN criteria require. For example, in the development paper for the Fatigue Impact Scale (FIS), the authors state that the ‘measure was designed as a specific health status measure according to the taxonomy of Guyatt et al.’, that they ‘adopted the viewpoint expressed by the Canadian MS Research Group that “measuring the effect of fatigue on activities … is more sensitive than simply asking patients to rate fatigue”’, and that ‘Items for the FIS were selected on the basis of existing fatigue questionnaires’ [18]. Despite the FIS developers clearly giving thought to the conceptual underpinnings of their instrument, it is unclear exactly how that conceptual basis drove item selection.

Six instruments (33%) were rated ‘doubtful’ and five (28%) ‘poor/none’; their development publications either made cursory reference to a conceptual framework, with limited information on how the framework drove item selection, or had no apparent conceptual underpinning whatsoever.

Target population

COSMIN recommends instrument developers provide a clear description of the target population for which the PRO was developed, including details about disease types, characteristics, and demographics. Additionally, if the instrument was developed for use across multiple populations, each should be clearly described. Most development papers provided target population descriptions that were rated ‘good’ [4/18 (22%)] or ‘adequate’ [8/18 (44%)]. Provided the target population is clearly specified, a description can receive a ‘good’ rating even if it covers a potentially broad population. For example, the development paper for the Iowa Fatigue Scale (IFS) (rated ‘good’) simply states that the ‘development and testing of the IFS was performed on general patients in primary care and is designed to be used for that group of patients’ [19]. Three papers contained target population descriptions rated ‘doubtful’. The Chalder Fatigue Questionnaire/Scale (CFQ) development publication, for example, describes the instrument as ‘a short scale which can be used in both hospital and community populations’, with no details given about the types of hospital patients or whether the intended community population is healthy or not. Development publications for three PROs [FSS, Multiple Sclerosis-Specific Fatigue Severity Scale (MFSS), Short Fatigue Questionnaire (SFQ)] did not specify a target population.

Context of use

PRO developers should be clear about the application for which the instrument was developed [3], such as discriminative, evaluative, or predictive applications. Context of use may also refer to a specific setting for which the instrument was developed (e.g. for use in a hospital or at home) or a specific administration mode (e.g. paper or computer-administered). Across all COSMIN criteria, context of use had the highest number of instruments rated ‘good’ [11/18 (61%)]. The Visual Analogue Scale for Fatigue (VAS-F) is an example. The developers suggest ‘potential uses including assessments of fatigue before and after clinical interventions as an indication of the effectiveness of therapy’ [20]. Two instruments received ‘adequate’ ratings, and five were rated ‘doubtful’. For example, the developers of the Unidimensional Fatigue Impact Scale (U-FIS, rated ‘doubtful’) suggest that their instrument may be ‘valuable for future studies interested in MS-related fatigue’. While this gives a general indication of the potential context of use, it is not clear what type of studies are being referred to, nor in which specific population of PLwMS.

Representative development sample

Health concepts can be context-dependent, in degree or nature. Therefore, both COSMIN and the FDA recommend that PRO items are generated from qualitative work using samples representative of those in which the instrument will be used [3, 4], including a diversity of patients with different characteristics to cover the breadth of the concept of interest. Of the instruments assessed, only two (11%) were rated ‘good’ (FSIQ-RMS and U-FIS). The development papers for both PROs provided clear descriptions of their qualitative research samples, which represented the intended target population for the finalised instrument. Two instruments were rated ‘adequate’ (NFI-MS and PROMIS Fatigue MS), and four ‘doubtful’. The remaining ten instruments (56%) lacked a representative development sample, generally because of an absence of any documented qualitative research.

Qualitative work to generate items

COSMIN and the FDA require clarity about the process of item generation, including details about all qualitative work undertaken [3, 4]. Four instruments (22%) were rated ‘good’ (U-FIS, NFI-MS, PROMIS Fatigue MS, and FSIQ-RMS), providing clear descriptions of the research process and of how findings guided item selection. However, over half of the instruments (56%; 10/18) lacked any qualitative work in their development. Three instruments (17%) were rated ‘doubtful’, as their development papers simply stated that interviews were conducted, without describing how interview findings informed item generation or selection.

Literature reviews

The FDA recommends conducting literature reviews as part of the iterative PRO development process, to help identify measurement domains and items [4]. Only two instruments (FSIQ-RMS and PROMIS Fatigue MS) provided some description of how literature searches guided aspects of development. Literature reviews were not conducted during the development process for most PROs [16/18 (89%)].

Summary of COSMIN scoring and change in quality over time

Table 4 shows that ‘target population’ and ‘context of use’ had the highest number of ‘good’ and ‘adequate’ ratings. Conversely, ‘construct definition’, ‘use of a guiding conceptual framework’, ‘qualitative work with an associated well-defined sample’, and ‘literature reviews to inform item selection’ had the highest number of ‘doubtful’ and ‘poor/none’ ratings.

Overall, poorer ratings may reflect the absence of published guidance [3, 4] when many instruments were developed. Figure 3 shows that total COSMIN scores remained relatively stable until the mid-2000s (n = 12, median score = 7, of a maximum of 21), and have increased significantly since (n = 6, median score = 12.5, U = 10.0, p = 0.02), with the exception of the SFQ. This coincides with the FDA PRO guidance publications of 2006 (draft [21]) and 2009 (final [4]). The five highest-scoring instruments (Fatigue Scale for Motor and Cognitive Functions [FSMC], U-FIS, NFI-MS, PROMIS Fatigue MS, and FSIQ-RMS) were all published since 2009.

Despite improvements in development quality, no instrument achieved a ‘good’ rating across all criteria. Furthermore, there was a moderate negative correlation between total COSMIN score and the annual frequency of PubMed citations of the development papers (rho = −0.62), implying that PRO selection is inversely related to development quality. Figure 2a (the associated scatterplot) shows that this moderate negative correlation is driven by heavy use of the FSS/MFSS (when these are excluded, rho = −0.42); we think it is more accurate to say there is no relationship between COSMIN score and use. It is notable that the five most highly scoring fatigue PROs have had little use. Finally, there was a weak association (rho = 0.18) between total COSMIN score and the annual frequency of PRO use in clinical trials, according to clinicaltrials.gov. Figure 2b (the associated scatterplot) shows that three PROs (FSS, FIS, and MFIS) have been heavily deployed, while the remainder have rarely been used. Again, fatigue PRO use in clinical trials has not been driven by development quality, as measured using COSMIN criteria and scoring.
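The sensitivity of such a rank correlation to one heavily used instrument can be illustrated with a short leave-one-out sketch in Python. The instrument names follow the text, but the scores and citation rates below are invented placeholders, not the study data.

```python
# Sketch of the sensitivity check described above: recompute the rank
# correlation between development quality and use after dropping the
# most heavily used instrument. All numbers are hypothetical.
import numpy as np
from scipy.stats import spearmanr

instruments = ["FSS/MFSS", "FIS", "MFIS", "CFQ", "NFI-MS", "U-FIS", "FSIQ-RMS"]
scores = np.array([6, 9, 10, 7, 17, 15, 18])      # total COSMIN (max 21)
citations = np.array([120, 45, 50, 20, 4, 5, 3])  # development paper citations/year

rho_all, _ = spearmanr(scores, citations)

keep = np.array([name != "FSS/MFSS" for name in instruments])
rho_excluded, _ = spearmanr(scores[keep], citations[keep])

print(f"rho (all) = {rho_all:.2f}; rho (excluding FSS/MFSS) = {rho_excluded:.2f}")
```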

Fig. 3 Fatigue PRO instrument development quality over time

Discussion

When PROs are selected for studies, we assume that their measured effects will adequately approximate the actual, but unmeasurable, effects. Three main PRO requirements are needed to satisfy this assumption. First, PROs must be valid indicators of the constructs they intend to measure. Second, PROs must satisfy statistical measurement criteria. Third, PROs must be able to detect change adequately when it occurs. Our study concerns validity, the most fundamental of the three requirements and a prerequisite for the others to be interpreted meaningfully. When PROs lack validity, type-II errors, with potentially pervasive implications, result. We believe this justifies a very critical approach to evaluating PROs.

Historically, content validity was identified as a measurement requirement, but its fundamental importance was not emphasised [22, 23]. This may have been because health measurement borrowed its methods from educational testing and measurement, where the educational curriculum set the framework for testing, thereby enabling content validity to be determined relatively easily. Then, the mainstay of validity testing was psychometric (statistical) examination of PRO scores (convergent and discriminant construct validity, group differences and hypothesis testing). These tests were exposed as providing only circumstantial validity evidence 40 years ago [24, 25]. That message, however, seemed to go largely unnoticed until the FDA published its guidance, emphasising content validity’s central and fundamental importance in health measurement [4, 21, 26].

Our COSMIN-based appraisal of 18 fatigue PROs, selected from a pool of 87 instruments, showed considerable variability in quality. Five PROs achieved relatively high ratings across most categories (FSMC, U-FIS, NFI-MS, PROMIS Fatigue MS, and FSIQ-RMS), but none fully satisfied COSMIN criteria. Consistent weaknesses concerned construct definitions, qualitative work, the use of development samples representative of the target population, and the use of literature reviews. Only the NFI-MS was rated ‘good’ for both a well-defined construct and the use of a conceptual framework. Our findings are concerning, as these are fundamental requirements for achieving valid PRO content development, and they preclude definitive recommendations about which PROs might be most suitable for certain contexts.

PRO development chronology explains some of the variability in development quality. Median COSMIN scores increased significantly after the FDA guidance was published [4, 21], implying an influential intervention. However, the scatterplots of total COSMIN score against annual citations, and against annual frequency of use in clinical trials, imply that PRO quality drives neither use in studies nor selection for clinical trials. This means that studies influencing the care of PLwMS and our research directions have almost certainly been misleading, with potentially damaging implications for the quality of life of PLwMS. This also underscores the value of formalising PRO selection strategies and information dissemination.

While COSMIN criteria provide a useful framework to assess PRO content validity, and represent an important step in PRO quality control, we believe that they require further development to facilitate robust, objective assessment. For example, COSMIN criteria require construct definitions, theories or conceptual frameworks to provide clear origins for the concept of interest, and the use of appropriate qualitative data collection methods. However, improved clarity is needed on exactly what constitutes an adequate construct definition, a sufficiently high-quality conceptual framework, and adequate-quality qualitative work.

Given the increasingly important and influential role PROs play in clinical trials and patient care, we also feel that there are several areas where COSMIN criteria could have increased clarity and set a more rigorous benchmark. In some circumstances, ‘good’ ratings can be achieved easily and content validity therefore overestimated. Examples are the IFS for target population (‘development and testing of the IFS was performed on general patients in primary care, and is designed to be used for that group of patients’ [19]), and the VAS-F for context of use (‘potential uses including assessments of fatigue before and after clinical interventions as an indication of the effectiveness of therapy’ [20]). According to current COSMIN guidance, both situations can be given ‘good’ ratings because reviewers can assume ‘that the PROM is applicable in all patients’ in primary care or in whom an intervention is being studied (see guidance, p. 18 [13]). From our perspective, such guidance risks PRO misuse. We recommend that, under such circumstances, users be encouraged to examine the performance of a PRO empirically in a sample representative of their context of use.

Another important consideration is the relative importance, and ordering, of COSMIN criteria. Logically, clear construct definitions and conceptual frameworks are prerequisites that should be in place before other criteria are assessed, yet we found that these criteria had the highest number of ‘doubtful’ and ‘poor/none’ ratings. This implies that COSMIN total or aggregate scores can give misleading impressions of scale development quality.

We suggest there is one specific area where COSMIN requires significant development in its guidance: evaluating the articulation of the scored concepts and components (subscales) as item sets. This is a critical step, separate from concept definition and conceptual framework development, as the items link the concept/component of interest and the score. We demonstrate the importance of this step by examining the relevant item sets of the two highest COSMIN-scoring fatigue PROs, the NFI-MS and FSIQ-RMS, as an additional evaluation prompted by our concerns about the completeness of COSMIN criteria.

Table 5 shows the item sets that generated the NFI-MS and FSIQ-RMS scores. The relevant question is: to what extent do the item sets represent the constructs measured by the subscales? A careful examination highlights our concerns. There are no definitions for the measured components, which makes it impossible to judge the item sets that generated the scores. Moreover, many of the items are non-specific and, as such, confounded. Often the relationships between item sets and underlying conceptual frameworks are ambiguous. For example, while the initial (57-item) set for the NFI-MS was explicitly derived from the thematic framework, the original items themselves are not provided. This makes it difficult to judge the full item set in terms of its representation of the purported measurement concept(s). The 57-item set was then reduced to the four subscales of the finalised instrument, and a 10-item summary subscale was derived from two of the four subscales. This item reduction process was driven by statistical/psychometric criteria, rather than being predicated on any conceptual grounding, making it challenging to evaluate how well the final items cover the original conceptualisations.

Table 5 Scored item sets for (a) the FSIQ-RMS and (b) the NFI-MS

There are some surprising item groupings. For example, the 4-item NFI-MS cognitive scale contains an item on coordination; conceptually, however, we might expect this item to be in the physical subscale. The FSIQ-RMS’s 5-item physical subscale and 5-item coping subscale both contain the same two items. This is conceptually questionable, as such ‘multidimensional’ items (according to the development publication [17]) cause measurement overlap, thereby reducing subscale validity. If the item sets of all the PROs are examined closely, other concerns are evident. In essence, we think these PROs have weaker content validity than their ratings imply, and that COSMIN scores alone cannot establish the content validity of finalised item sets.

We can identify four explanations for how suboptimal item sets arise. First, there is a general under-recognition of the important stage of articulating subscales as items; it is not emphasised in any guidance we know of. Second, there is an absence of subscale definitions to guide item generation, drafting and selection. Third, during PRO development, statistical methods are commonly used to group items into scales; this groups items by their statistical, rather than conceptual, relationships. Fourth, we think there is a dissociation between conceptual framework development and the qualitative work: it is common for patient statements from qualitative work to be used as items, often verbatim, without careful consideration of their relationship to the concept. All four explanations threaten content validity, risking type-II errors.

Our study had limitations. We identified PROs from abstract screening and may have missed relevant PROs. We did not assess the Quality-of-Life in Neurological Disorders (Neuro-QoL) fatigue scale. This was excluded during screening as a set of fatigue items within a broader instrument [27]. A brief examination of the Neuro-QoL fatigue PRO’s 19-item set highlights many of the content validity issues that we have raised. We did not assess the psychometric properties of the PROs. Our focus was their content validity, a prerequisite for psychometric evaluations [3, 4]. However, we recognise the need for head-to-head comparisons of these PROs to determine the impact of differing development quality on measurement performance.

Conclusions

PRO instruments must be valid to minimise type-II errors and to prevent misleading study results. Item content is the main determinant of PRO validity. Here, we demonstrate that fatigue PROs used in MS research have weak content validity. While the FDA recommendations have been followed by an apparent increase in quality, and have spawned other guidance documents, we think current guidelines are too vague and lenient for the measurement rigour required by today’s clinical trials. An area we feel is particularly underemphasised is the articulation of a concept by the set of items that generate scores. A head-to-head comparison of PRO measurement performance is required.

We recognise our work is critical of the field. We also recognise that the work done by others on fatigue measurement, and on content validity methods and assessment, is important and necessary to underpin developments and progress. However, we believe our level of critique and debate is required to ensure measured effects approximate real effects.