Background

Bipolar disorder (BD) is a severe mental disorder with a chronic-recurring course. Since the first episode of BD is often depressive [1], BD is often misdiagnosed as major depressive disorder (MDD) [2, 3] and treated as such. This may lead to adverse consequences, such as increased suicide risk [4, 5], greater probability of hospitalization [5], poorer response to antidepressants, and antidepressant-induced switch to mania [6]. Because its course is characterized by recurring episodes separated by periods of euthymia with no or scant symptoms of hypomania, BD may persist undiagnosed for a long time unless a frank episode of mania erupts [7, 8]. Minor hypomanic episodes are often overlooked, and, indeed, differentiating clinically elated and irritable mood or increased activity from “normal” variation in the population is often challenging. On average, BD may remain undiagnosed, and hence untreated, for up to 10 years, and there is some evidence that up to one-third of patients with BD are misdiagnosed at least once during their lifetime [9, 10].

Early identification of BD is essential for appropriate treatment [11]. Several self-report tools have been developed to identify people with possible or probable BD [12]. Self-report screening tools are brief and cost-effective and may be preferred in busy clinical settings to standardized interviews, which are more accurate but are time-consuming and require appropriate training for administration and scoring. Nevertheless, caution should be applied in deriving epidemiologic estimates from case-finding based on screening tools [13]. Two of the most widely used and validated instruments for the early detection of BD are the 32-item Hypomania Checklist (HCL-32) [11] and the Mood Disorder Questionnaire (MDQ) [14]. These two instruments have been validated in many countries [11, 15, 16, 17, 18, 19, 41].

Before testing the criterion validity of the MDQ and the HCL-32, confirmatory factor analysis (CFA) was applied to the items of both questionnaires to make sure that a single global score was an appropriate summary measure of the screeners in the total sample. Preliminary analysis with Mardia’s test [42] revealed a violation of multivariate normality in the data for both the MDQ and the HCL-32 (p < 0.0001 for skewness in both analyses). Therefore, the Diagonally Weighted Least Squares (DWLS) estimator was used in CFA. To assess goodness of fit, we used the following indices: the chi-square, the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA), and the Standardized Root Mean Square Residual (SRMR). In the presence of a chi-square with p < 0.001, as expected with large samples (n > 300), RMSEA values of 0.08 or lower, SRMR values of 0.09 or lower, and CFI values of 0.90 or higher were considered an indication of acceptable fit according to conventional rules of thumb [43]. The following models were tested: a unidimensional model, which assumes all core items of the MDQ or the HCL-32 tap into a single dimension of propensity to the manic/hypomanic syndrome; a two-factor model of elated and irritable dimensions, as in Ouali et al., 2020 for the MDQ [35] and in Meyer et al., 2007 for the HCL-32 [15]; and the bifactor implementation of these two-factor models [44], which assumes that most variance in the scores is attributable to a general factor resulting from the loading of all items on a single dimension of propensity to the manic/hypomanic syndrome, with an additional but residual variance purportedly explained by the loading of the items on the “elated” and the “irritable” dimensions, as defined above.
To check for reasonable unidimensionality of the general factor extracted from the bifactor model, the explained common variance (ECV), the percentage of uncontaminated correlations (PUC), and the Omega Hierarchical (ωH) were calculated [45]. We also calculated the construct replicability H index of Hancock and Mueller (2001) [46]. H values of .80 or higher indicate a well-defined latent variable, which is more likely to be stable across studies. The presence of multidimensionality might be discarded when ECV is higher than .60 and ωH > .70 or PUC > .70 [45]. CFA models were tested with the “lavaan” package running in R [47]. The calculation of the bifactor indices was done with the “BifactorIndicesCalculator” package running in R [48].
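As an illustration of how these indices are commonly defined, the following minimal Python sketch computes ECV, PUC, ωH, and H from standardized bifactor loadings. It is not the code used in the study (which relied on R packages). The function name and the loadings are hypothetical, and the sketch assumes orthogonal factors, with each item loading on the general factor and on exactly one group factor.

```python
from math import comb

def bifactor_indices(general, groups):
    """Bifactor indices from standardized loadings (illustrative sketch).

    general: general-factor loadings, one per item.
    groups:  dict mapping group-factor name -> {item index: loading};
             assumes each item belongs to exactly one group factor.
    """
    n_items = len(general)
    gen_ss = sum(g ** 2 for g in general)
    grp_ss = sum(l ** 2 for f in groups.values() for l in f.values())

    # ECV: share of common variance explained by the general factor
    ecv = gen_ss / (gen_ss + grp_ss)

    # PUC: share of item pairs whose correlation involves only the
    # general factor (pairs of items from different group factors)
    within_pairs = sum(comb(len(f), 2) for f in groups.values())
    puc = 1 - within_pairs / comb(n_items, 2)

    # Omega hierarchical: general-factor variance over total score variance
    uniqueness = n_items - (gen_ss + grp_ss)
    gen_sq = sum(general) ** 2
    grp_sq = sum(sum(f.values()) ** 2 for f in groups.values())
    omega_h = gen_sq / (gen_sq + grp_sq + uniqueness)

    # H: construct replicability of the general factor
    h = 1 / (1 + 1 / sum(g ** 2 / (1 - g ** 2) for g in general))
    return {"ECV": ecv, "PUC": puc, "omega_H": omega_h, "H": h}

# Hypothetical loadings for six items (not taken from the study)
general = [0.6, 0.5, 0.7, 0.6, 0.5, 0.6]
groups = {
    "elated": {0: 0.3, 1: 0.4, 2: 0.2},
    "irritable": {3: 0.4, 4: 0.3, 5: 0.3},
}
idx = bifactor_indices(general, groups)
for name, value in idx.items():
    print(f"{name}: {value:.2f}")
```

With two group factors of equal size, PUC depends only on the number of items, which is why it is reported alongside ECV and ωH rather than interpreted on its own.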

The receiver operating characteristic (ROC) curve was used to test the criterion validity of the tools. Criterion validity was defined as the degree to which the scores of the instrument were an adequate reflection of a “gold standard” [49]. For the purposes of this study, we used the diagnosis assigned after the SCID interview as the “gold standard” for reference. Thus, the ROC curve analysis was used to distinguish between diagnostic groups for both the MDQ and the HCL-32. Sensitivity was defined as the probability of a true positive case, i.e., the probability of identifying a patient with BD. Specificity was the probability of a true negative case, i.e., the probability of identifying a patient without BD. We also derived the positive predictive value (PPV), i.e., the probability that a person is a case of BD when a positive test result is observed; the negative predictive value (NPV), i.e., the probability that a person is not a case of BD when a negative test result is observed; and the positive diagnostic likelihood ratio, which is the ratio of the probability that a positive test will be observed among people with BD to the probability that the same result will be observed among people without BD. The accuracy of the prediction was estimated from the area under the curve (AUC; with 95% confidence interval). The agreed thresholds for the AUC were: ≤ .70, poor; between .70 and .80, fair; between .80 and .90, good; above .90, excellent [50].
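The quantities defined above can be illustrated with a minimal Python sketch based on a 2×2 table of screening result against reference diagnosis. The function name and the counts are hypothetical and are not taken from the study. The positive likelihood ratio is computed in its usual form, sensitivity divided by (1 − specificity).

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Screening accuracy from a 2x2 table (illustrative sketch).

    tp: cases screening positive      fn: cases screening negative
    fp: non-cases screening positive  tn: non-cases screening negative
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PPV": tp / (tp + fp),   # P(case | positive screen)
        "NPV": tn / (tn + fn),   # P(non-case | negative screen)
        "LR+": sensitivity / (1 - specificity),
    }

# Hypothetical counts: 50 cases and 100 non-cases
metrics = diagnostic_metrics(tp=40, fp=20, fn=10, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Unlike sensitivity and specificity, PPV and NPV depend on the proportion of cases in the sample, so they do not transfer directly to settings with a different prevalence of BD.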

We used the “pROC” package running in R to perform the ROC analysis [51], while the best cut-off point for the MDQ and the HCL-32 was established according to the Youden (1950) method with the “OptimalCutpoints” package [52]. The two paired ROC curves for the MDQ and the HCL-32 in the same sample were compared with a bootstrap test according to Hanley and McNeil (1983). The test was performed with the “pROC” package.
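The Youden method selects the cut-off that maximizes J = sensitivity + specificity − 1. The following Python sketch illustrates the idea and also computes an empirical AUC as the proportion of case-control pairs in which the case scores higher (ties counted as one half). It is a simplified stand-in for the R packages named above, not their implementation. The scores are hypothetical, and a score at or above the threshold is assumed to count as a positive screen.

```python
def youden_cutoff(cases, controls):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J."""
    best = None
    for t in sorted(set(cases) | set(controls)):
        sens = sum(s >= t for s in cases) / len(cases)
        spec = sum(s < t for s in controls) / len(controls)
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]

def empirical_auc(cases, controls):
    """Probability that a random case outscores a random control."""
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical total scores (not taken from the study)
bd_scores = [9, 8, 10, 7, 6, 11, 5, 9]
mdd_scores = [3, 4, 6, 2, 5, 7, 1, 4, 3, 8]

cutoff, sens, spec = youden_cutoff(bd_scores, mdd_scores)
auc = empirical_auc(bd_scores, mdd_scores)
print(f"cut-off >= {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
print(f"AUC = {auc:.2f}")
```

Youden's J weights sensitivity and specificity equally; a screener intended mainly to rule out BD could reasonably favor a lower cut-off than the one J selects.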

Sample size estimation and power analysis

CFA and ROC analysis impose some requirements for sample size. As for the CFA, with DWLS applied to binary or ordinal data, a sample size between 200 and 500 subjects is enough for model convergence and parameter estimation, according to Monte Carlo simulation studies (Bandalos, 2014). Thus, the global sample size in this study was sufficient to conduct CFA.

As for the ROC analysis, with alpha set at 0.05 and power at 80% (beta = 0.20), with 59 cases of BD and 281 controls, we could detect an AUC as low as 0.612, which is even lower than the minimum fair AUC (0.700). With the same parameters and 59 cases of BD and 86 cases of MDD, we could test the ability of the screeners to discriminate between the two diagnoses, detecting an AUC as low as 0.632. This power analysis was performed with the “pROC” package running in R [53].

Results

The sample included 86 patients diagnosed with MDD, 22 patients diagnosed with BD-I and 37 patients diagnosed with BD-II. There were also 281 putatively healthy controls (Table 1).

Table 1 General characteristics of the participants included in the study

There were no differences by gender or maximum education level among participants; controls were marginally younger than the patients (partial eta-squared = 0.020).

Clinical data were available for patients only. There was no relevant difference in the age of onset of the psychopathology among groups. A family history of depression was observed more often in patients diagnosed with BD-II, while a family history of BD was observed in just 5% of patients diagnosed with MDD and in about 25% of those diagnosed with BD (see Table 1 for details).

Patients diagnosed with BD-I were more likely to have attempted suicide and to have been admitted to a psychiatric service than patients with MDD or BD-II. Most patients received a prescription of an antidepressant, with no differences by diagnosis. A second-generation antipsychotic was prescribed in about 10% of cases, again with no difference by diagnosis. Lithium was rarely prescribed and only in patients diagnosed with BD-I.

Overall, 86 patients with MDD, 58 patients with BD (either BD-I or BD-II), and 265 controls completed the MDQ, while the HCL-32 was completed by 64 patients with MDD, 32 with BD, and 225 controls.

Floor or ceiling effects

There were no floor effects for the MDQ: 25 controls (8.9%) and just 1 with MDD (1%) scored zero on the MDQ (χ2 = 11.85; df = 2; p = 0.003). However, a modest ceiling effect was observed for the MDQ: 4 controls (1.4%) and 11 patients with BD (17.7%) scored 13 on the MDQ (χ2 = 44.38; df = 2; p < 0.0001).

There were no floor or ceiling effects for the HCL-32. Overall, 7 participants in the sample scored zero on the HCL-32: 5 controls, 2 with MDD, none with BD (χ2 = 1.28; df = 2; p = 0.52). No participants scored 32 on the HCL-32.

Reliability of the questionnaires

Cronbach’s alpha for MDQ was 0.79 (95%CI: 0.76–0.83) in controls; 0.78 (0.75–0.82) in patients with MDD; and 0.71 (0.60–0.81) in patients diagnosed with BD. Cronbach’s alpha for HCL-32 was, respectively, 0.85 (0.82–0.87) in controls, 0.80 (0.74–0.85) in MDD, and 0.76 (0.68–0.85) in BD.
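For reference, Cronbach’s alpha is k/(k − 1) × (1 − Σ item variances / variance of the total score), where k is the number of items. The following Python sketch shows the point estimate only (not the confidence interval) on a small matrix of hypothetical yes/no responses. It is an illustration, not the procedure used in the study.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha; rows = respondents, columns = item scores."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical responses of six people to four yes/no items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```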

Confirmatory factor analysis of the factor structure of the MDQ and the HCL-32

For both the MDQ and the HCL-32, the bifactor implementation of the two-factor model had the best fit according to the predefined parameters (Table 2).

Table 2 Confirmatory factor analysis of the MDQ and the HCL-32. Goodness-of-fit indices of the tested models

For the bifactor model of the MDQ, H = 0.79, ECV = 0.54, PUC = 0.60, and ωH = 0.64.

For the bifactor model of the HCL-32, H = 0.80, ECV = 0.33, PUC = 0.48, and ωH = 0.37.

Thus, for both the MDQ and the HCL-32 there is some indication in favor of a single, reproducible latent component. However, the multidimensionality in the data might influence the results that can be derived from a global summary score.
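The comparison of the reported indices with the thresholds stated in the methods can be made explicit with a short Python sketch. The function name is hypothetical. The grouping of the rule as “ECV > .60 and (ωH > .70 or PUC > .70)” is an assumption about how the criterion should be read; with the values reported here, the alternative grouping gives the same outcome.

```python
def multidimensionality_negligible(ecv, puc, omega_h):
    """Assumed reading of the rule: ECV > .60 and (omega_H > .70 or PUC > .70)."""
    return ecv > 0.60 and (omega_h > 0.70 or puc > 0.70)

# Indices reported above for the bifactor models
reported = {
    "MDQ": {"H": 0.79, "ECV": 0.54, "PUC": 0.60, "omega_H": 0.64},
    "HCL-32": {"H": 0.80, "ECV": 0.33, "PUC": 0.48, "omega_H": 0.37},
}

summary = {}
for tool, v in reported.items():
    summary[tool] = {
        "well_defined_general_factor": v["H"] >= 0.80,
        "multidimensionality_negligible": multidimensionality_negligible(
            v["ECV"], v["PUC"], v["omega_H"]
        ),
    }
    print(tool, summary[tool])
```

On this reading, neither tool meets the criteria for treating multidimensionality as negligible, and only the H value of the HCL-32 reaches the .80 threshold, which is consistent with the cautious interpretation given above.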

Discriminant capacity of the MDQ and the HCL-32

Patients diagnosed with BD scored higher than patients diagnosed with MDD and controls on both the MDQ and the HCL-32 (Table 3).

Table 3 Scores of the HCL-32 and the MDQ by subgroup of participants

According to the epsilon-squared effect size (Tomczak and Tomczak, 2014), about 20% of the variance in the sample was attributable to the differences in MDQ by groups, and 10% was attributable to the differences in HCL-32 by groups.

ROC analysis

The MDQ and the HCL-32 were able to distinguish patients diagnosed with BD from putatively healthy controls, with a better AUC for the MDQ (82.7; 95%CI: 75.3–90.2) than for the HCL-32 (73.4; 63.9–83.0) (Fig. 1).

Fig. 1 Receiver operator characteristic (ROC) curve of the predictive capacity of the Tunisian MDQ (on the left) and the Tunisian Arabic HCL-32 (on the right) in differentiating patients with BD from healthy controls. Sensitivity and specificity are reported as percentages, with a cross indicating on the curve the best compromise between them (corresponding to the cut-off). The area under the ROC curve (AUC) is reported alongside its 95% confidence interval

The MDQ (AUC: 88.9; 81.4–96.3) and the HCL-32 (AUC: 83.3; 74.5–92.1) were equally able to distinguish patients diagnosed with BD from patients with MDD (Fig. 2).
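The AUC point estimates can be mapped onto the qualitative labels defined in the methods with a short Python sketch. The function name is hypothetical, and the assignment of values falling exactly on a boundary (.80 or .90) to the lower category is an assumption, since the source does not specify it. The AUCs are reported as percentages in the text and are converted to proportions here.

```python
def auc_label(auc):
    """Qualitative label for an AUC given as a proportion (0 to 1)."""
    if auc <= 0.70:
        return "poor"
    if auc <= 0.80:   # boundary handling is an assumption
        return "fair"
    if auc <= 0.90:
        return "good"
    return "excellent"

# AUC point estimates reported in the text (percentages)
reported_auc = {
    ("MDQ", "BD vs controls"): 82.7,
    ("HCL-32", "BD vs controls"): 73.4,
    ("MDQ", "BD vs MDD"): 88.9,
    ("HCL-32", "BD vs MDD"): 83.3,
}
labels = {key: auc_label(value / 100) for key, value in reported_auc.items()}
for (tool, contrast), label in labels.items():
    print(f"{tool}, {contrast}: {label}")
```

The labels apply to the point estimates only; several of the reported 95% confidence intervals span more than one category.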

Fig. 2 Receiver operator characteristic (ROC) curve of the predictive capacity of the Tunisian MDQ (on the left) and the Tunisian Arabic HCL-32 (on the right) in differentiating patients with BD from patients with MDD. Sensitivity and specificity are reported as percentages, with a cross indicating on the curve the best compromise between them (corresponding to the cut-off). The area under the ROC curve (AUC) is reported alongside its 95% confidence interval

When the two ROC curves were compared with the Hanley and McNeil test, the MDQ was confirmed to be better than the HCL-32 in distinguishing patients with BD from putatively healthy controls, while no difference was found between the two screeners in the differentiation of patients with BD from those with MDD (Fig. 3).

Fig. 3 Comparison with the Hanley and McNeil test between the Tunisian Arabic MDQ and the Tunisian Arabic HCL-32 in distinguishing patients with BD from putatively healthy controls (on the left), or from patients with MDD (on the right)

The best threshold for the differentiation of patients with BD from patients with MDD was 7 for the MDQ (Fig. A1) and 15 for the HCL-32 (Fig. A2).

Sensitivity and specificity at the best threshold were 87 and 77%, respectively, for the MDQ, and 87 and 69% for the HCL-32. Both screeners had a better NPV (92.3 and 91.4%, respectively) than PPV (65.8 and 58.7%). The positive diagnostic likelihood ratio was modestly higher for the MDQ (3.86) than for the HCL-32 (2.84).

In the investigated samples, 109 controls (41.1%), 21 patients with MDD (24.4%), and 52 patients with BD (89.7%) scored at or above the cut-off on the MDQ (χ2 = 63.14; df = 2; p < 0.0001). The corresponding figures for the HCL-32 were 108 (48%) among controls, 21 (32.8%) among patients with MDD, and 28 (87.5%) among patients with BD (χ2 = 25.78; df = 2; p < 0.0001).
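As a check on the arithmetic, the Pearson chi-square statistic can be recomputed from the counts above and the numbers of participants who completed each questionnaire. The Python sketch below is illustrative and is not the software used in the study. For df = 2 the chi-square p-value has the closed form exp(−χ²/2), which is used here.

```python
from math import exp

def pearson_chi_square(table):
    """Pearson chi-square for a table given as rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: controls, MDD, BD; columns: at/above cut-off, below cut-off
mdq = [[109, 265 - 109], [21, 86 - 21], [52, 58 - 52]]
hcl32 = [[108, 225 - 108], [21, 64 - 21], [28, 32 - 28]]

mdq_stat, mdq_df = pearson_chi_square(mdq)
hcl_stat, hcl_df = pearson_chi_square(hcl32)
for name, stat, df in [("MDQ", mdq_stat, mdq_df), ("HCL-32", hcl_stat, hcl_df)]:
    p = exp(-stat / 2)  # exact survival function for df = 2 only
    print(f"{name}: chi2 = {stat:.2f}, df = {df}, p = {p:.2g}")
```

With these inputs the statistics agree with the reported values to within rounding.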

Discussion

In this study, both the MDQ and the HCL-32 were able to distinguish patients diagnosed with BD from patients diagnosed with MDD, with good accuracy (when measured with AUC) and an informative positive diagnostic likelihood ratio (above 2). On the basis of their PPV and NPV, both screeners were better able to exclude the presence of BD than to confirm it. Reliability was good for both the MDQ and the HCL-32. In controls, too, the reliability of the two screeners was good to excellent.

The controls were probably more likely to admit socially acceptable hyperthymic traits, such as being more sociable than their peers or being exuberant in social circumstances. This might explain the higher fraction of controls than of MDD patients scoring at or above the cut-off for screening BD. However, the reporting of hypomanic-like symptoms by controls does not necessarily correspond to true episodes of hypomania. Moreover, the higher reporting of hyperthymic traits and hypomanic-like symptoms by controls was not corroborated by an independent source.

This is the first study to have tested a bifactor structure of the MDQ and the HCL-32. In past investigations, a two-factor structure was repeatedly reported to explain the distribution of the scores of the two screeners, with some items reflecting a propensity to elated behaviors and another set of items reflecting an impulsive/irritable mood [24, 32, 54, 55]. In this study, this two-factor solution did not show a good fit according to the predefined parameters. The bifactor implementation of this two-factor model, instead, showed a good fit to the data. The excessive reliance of past studies on exploratory rather than confirmatory factor analysis might in part explain the difference between this and previous investigations of the topic. Both the MDQ and the HCL-32 are usually applied as single-factor screeners; thus, a bifactor model of a multidimensional structure is the best approximation to the expected factor structure of the tools and to their current use. It should be noted, however, that in this study the indicators of the appropriateness of the general factor of the bifactor model were below the accepted thresholds for full acceptance of the general factor as a single summary score of the tools. This may depend on the application of the model to a sample that included both patients and putatively healthy controls, which might have inflated the impact of the multidimensionality of both tools, since the elated and impulsive/irritable experience of the patients might be qualitatively different from the corresponding experience in people without a mood disorder.

In this sample, the best cut-off for the HCL-32 was close to the one reported in past studies carried out in Western samples, usually about 14 or 15. However, in some non-Western samples, such as in the Arabian study of Fornaro et al. (2015) [34] or the Brazilian sample of patients of Soares et al. (2010) [18], higher cut-offs were reported, around 17/18. Fornaro et al. (2015) included inpatients, while Soares et al. (2010) enrolled outpatients. Both severity and cultural differences in admitting some hypomanic symptoms probably played a role in explaining the higher cut-offs in those studies. In this study, the sensitivity and specificity of the HCL-32 in discriminating patients with BD from those with MDD were, respectively, .87 and .69, somewhat higher than the corresponding figures in the Soares et al. study (.75 and .58), and close to the values observed by Perugi et al. (2012) [56] in their large Italian study (.85 and .78). Fornaro et al. [34] found similar values of sensitivity (.82) and specificity (.77) for their version of the HCL-32 in the discrimination of Arabic patients with MDD from those with BD. Both the Perugi et al. (2012) study and the Fornaro et al. [34] study found a higher specificity of the applied version of the HCL-32, suggesting that sample composition might affect the detection of hypomanic symptoms. Indeed, in the present study, we enrolled a larger fraction of patients with BD-II than with BD-I, while the Fornaro et al. [34] study had a ratio of BD-I to BD-II of 4.7. This might be considered a limitation of the present study, but in community samples, the lifetime prevalence of BD-II tends to be higher (1.57%; 95%CI: 1.15–1.99) than the lifetime prevalence of BD-I (1.06%; 0.81–1.31) [26]. Moreover, in past studies, patients had already received a diagnosis of BD and thus might have been more prone to admit hypomanic symptoms.

Overall, the two screeners proved easy to use, although they require some degree of literacy. The time needed to fill them in was generally minimal for patients with adequate reading skills, although older patients sometimes required more time. Nevertheless, both the MDQ and the HCL-32 might represent valuable help in busy primary care settings, favoring the recognition of cases in need of closer evaluation.

Strengths and limitations

The major strength of the study is its design, which was as close as possible to clinical reality: we included patients who complained only of depressive signs and symptoms and who did not have any pre-established diagnosis of unipolar or bipolar depression when they first presented. This is a major difference from most of the other studies on the MDQ and the HCL-32, which often included patients who had already received a diagnosis of BD [1, 34] and who might have received some clue about the symptoms they were expected to admit [57]. Several limitations have to be taken into account. Some of the questionnaires, either MDQ or HCL-32, were incomplete, especially among patients with BD. This depended mainly on patients leaving some items blank, such as item 6 (about wanting to travel) or item 7 (about risky driving) of the HCL-32, because they do not habitually perform the action in question (they do not travel or drive a car) and thus did not know how to reply. As a consequence, we had to discard some of the cases, and this resulted in a loss of power for the analysis. In particular, we did not have enough cases with BD-II to test the discriminant capacity of the tools with respect to MDD, which is the main use of a screening tool to identify BD. Indeed, while manic episodes are more likely to be recognized by clinicians and to be remembered by the patients, hypomanic episodes are precisely those that complicate the diagnosis of BD in the clinical setting.

Conclusion

Despite its limitations, this study showed the good capacity of both the MDQ and the HCL-32 as screening tools to differentiate patients with BD from patients with MDD. Both screeners work best in excluding the presence of BD in patients with MDD. This is an advantage in deciding whether or not to prescribe an antidepressant, which can have known negative effects in patients with BD [58]. When the screener is positive for the presence of BD, it may prompt a deeper investigation of past manic/hypomanic episodes that might have been overlooked at the first assessment.