Background

Common mental disorders occur at different prevalence rates between sexes [1]. In particular, women are twice as likely to develop anxiety and depression across the lifespan compared to men [1,2,3,4]. This difference arises after puberty [4, 5], suggesting the involvement of sex-specific factors in the development of such disorders [6]. To identify these factors, neuroimaging studies have investigated sex differences in brain structure and function [7,8,9,10,11], mostly using univariate analyses. However, discordant findings have been reported [7, 10, 11]. In a recent meta-analysis, Ritchie et al. [10] found generally larger volumes in males, while other studies reported larger volumes in females for different regions [8, 11]. Possible explanations might be differences in the normalization and segmentation processes [7, 12], as well as the effects of total brain volume [12]. Sex differences in total brain volume has been shown to drive many structural and volumetric differences in the brain [12, 13], leading to discordant results depending on the correction method applied. In addition, variations in structural and functional MRI according to the menstrual cycle and hormonal contraceptive use have been reported in the literature for many structures, such as hippocampus, amygdala, prefrontal cortex, cingulate cortex, and insula [14, 15]. Of note, many of the structures with notable sex differences and hormonal effects are part of the limbic system [16,17,18,19,20]. The limbic system is strongly involved in emotional processing, learning and memory, functions typically altered in mental and neurological disorders [21]. Due to its involvement in such functions, the limbic system has consequently been proposed as a key player in mood disorders such as major depression [22,23,24,25].

Recently, multivariate approaches have been developed to study sex differences in the brain. Machine learning models that classify for sex based on brain structural or functional magnetic resonance images (MRI) yield class probabilities that can be used as an imaging-derived multivariate phenotype to study sex differences on a continuum from female- to male-like brains [26,27,28,29]. Conceptually similar approaches have already been used extensively to predict brain age [30,31,32,33,34,35,36], where machine learning models deliver a continuous phenotype reflecting apparent aging effects. While most brain age studies to date built models based on data from the whole brain, regionally constrained models may identify region-specific associations with mental health, such as frontal brain age alterations in schizophrenia or subcortical alterations in Alzheimer’s disease [34]. In a recent study, Sanford and colleagues [36] investigated sex differences in local brain age gaps (i.e. difference between regionally constrained neuroimaging-predicted age and chronological age) in young adults. Compared to males, females showed significantly lower local brain age gap in the frontal region and insula, while they had significantly higher local brain age gap in the posterior regions. However, on a global scale the authors did not report any differences in brain age gaps, suggesting finer-grained (i.e., regional specific) models having a higher sensitivity to sex differences. Translating this finding into the field of sex classification, leveraging regional constrained models may provide estimates more sensitive to sex-specific phenotypes. Whereas Weis and colleagues [29] have classified sex based on the whole brain connectome and different functional brain networks separately, the potential of regionally constrained models to study brain sex based on structural MRI has yet to be investigated.

Here, we investigate whether a regionally constrained model based on brain volumes of the limbic system can correctly classify sex, and whether the obtained regional class probabilities are sensitive to mental health. We compared limbic brain sex to non-limbic brain sex, aiming to investigate relevant biological differences in brain sex determination as well as the possible clinical association with major depression. Specifically, the present study sought: (i) to compare sex differences at a univariate (i.e., single structure) level between limbic and non-limbic structures, (ii) to validate regionally constrained machine learning models trained either on limbic or non-limbic feature sets as compared to a whole brain model, (iii) to test for an association between obtained class probabilities and major depressive disorder (MDD) diagnosis. We hypothesize that (i) the regionally constrained models, much like whole brain models, are able to correctly predict sex from structural imaging data, (ii) the estimates (i.e., class probabilities) contain biologically meaningful variation (tested via heritability analysis), and that (iii) the limbic estimates have a stronger association with depression than estimates from other models.

Methods

Participant selection

As illustrated in Fig. 1, structural MRI data from the Human Connectome Project (HCP) [37] was used for univariate analysis of brain features, and for training of machine learning models in a healthy sample. For the HCP, the subject selection criteria (see Supplementary Figure S1, Additional File 1) aimed to (I) maintain an equal female-male ratio (based on biological sex as provided by the data), while (II) limiting possible confounding effects such as hormonal fluctuation, and (III) maximizing the sample size for machine learning. To achieve this, after excluding 14 individuals following quality control of the imaging data, female subjects were first selected according to the hormonal information available with the goal to limit the potential effects of irregular cycle and hormonal alterations such as presence of hypo- or hyperthyroidism and Thyroid Stimulating Hormone (TSH) levels out of the normal range (0.4–4.0 mU/L, as define by the HCP-YA data dictionary), yielding n = 391 females. These females were then matched to an equal number of males according to age and Euler number. Finally, in order to maximize the sample size, the remaining males (n = 105) were matched for age and Euler number with an equal number of randomly selected females from the subjects excluded in the first step. This procedure led to a total sample size of N = 992 subjects (50% females), with an age range 22–38 years old (females: mean age = 28.98, sd = 3.63; males: mean age = 27.90, sd = 3.60).

Fig. 1
figure 1

Schematic overview of the study design. Step 1: After combining different segmentations in FreeSurfer, each structure was assigned either to the limbic or to the non-limbic feature set. Step 2: Univariate analysis of sex differences and feature heritability were applied to the limbic and non-limbic feature set. Step 3: A machine learning model for sex classification was trained on each feature set (regionally constrained models). Model performance was assessed via AUC-ROC and the class probabilities for each participant were stored. Step 4: The heritability and the clinical association with major depressive disorder (MDD) diagnosis was assessed for the class probabilities obtained for each model

The Queensland Twin Imaging (QTIM) study was used as an independent healthy validation dataset [38] to replicate univariate analyses and to test the HCP-trained machine learning models. After excluding 20 outliers based on imaging quality control, only subjects in the age range 18–30 years were selected, to ensure a similar age range with the HCP. The final QTIM sample comprised N = 1017 subjects (61.6% females, age range 18–30 years old, females: mean age = 22.49, sd = 2.84; males: mean age = 22.29, sd = 2.86).

To test for clinical associations with the derived brain sex class probabilities, data from the Strategic Research Program for the Promotion of Brain Science (SRPBS) [39] was used. Due to data availability and the differences in prevalence across sexes [1], we focus on MDD. After excluding 10 outliers following quality control, we selected healthy controls (HC) and individuals with MDD, based on the diagnosis variable available in the dataset. To reduce scan site confounds, we only included HC data for sites in which MDD data was available. This yielded a sample of N = 844 subjects (HC: N = 595, 54.5% female, age range 18–80 years old, females: mean age = 41.34, sd = 14.99, males: mean age = 38.24, sd = 16.28; MDD: N = 249, 47.8% female, age range 18–75 years old, females: mean age = 43.39, sd = 12.87, males: mean age = 41.77, sd = 11.04).

Image segmentation, quality control and features selection

Raw T1-weighted MRI scans were preprocessed in FreeSurfer v7 and automated cortical and subcortical reconstruction were performed. To obtain a more precise segmentation of the limbic system, we combined volumes obtained with different segmentation approaches. The cerebral cortex was segmented using a multimodal parcellation described in Glasser et al. [40], which returns 180 features for each hemisphere. For subcortical regions, we combined the classic set of features from FreeSurfer with additional segmentations of subcortical limbic structures [17], hippocampus [41], amygdala [42], and thalamus subfields [43]. For all segmented cortical and subcortical regions, we calculated the corresponding volumes (mm3). We then assigned the structures derived from each segmentation process to a limbic or non-limbic feature set based on the common definition of limbic system as found in the literature [17,18,19,20], carefully avoiding overlap between different segmentation approaches. The final feature set comprises 493 regions (358 features from Glasser multimodal segmentation [40], 17 subcortical features from FreeSurfer classical segmentation, 12 features from the subcortical limbic segmentation [17], 38 features from hippocampus segmentation [41], 18 features from the amygdala segmentation [42] and 50 features from thalami subfields segmentation [43]), of which we assigned 160 regions to the limbic (see Supplementary Table 1, Additional File 1) and 333 regions to the non-limbic feature set.

To correct for differences in head size that could affect the machine learning models’ ability to classify sex, all analysis were repeated accounting for the estimated total intracranial volume (eTIV). For each raw brain imaging feature, we regressed out the eTIV and used the eTIV corrected data as features in the respective eTIV-adjusted machine learning models. As further control, we compared this eTIV residualisation approach to another approach, the power-corrected proportion method (PCP) (D. [44, 45], which assumes an exponential relationship between the raw volume and the eTIV. We correlated the features obtained with the two methods (see Supplementary figure S2, Additional File 1), and since they converged on very similar results, we used the eTIV residualisation approach in all eTIV-accounting analyses.

We used Euler numbers, a proxy of image data quality [46] for quality control of the imaging data. We averaged Euler numbers from the left and right hemisphere and excluded subjects with an average Euler number lower than three standards deviations from the sample mean.

Sex differences and heritability analyses of single brain volumes

We first investigated differences between limbic and non-limbic structures using univariate analyses for each structure of interest. For each region, we tested for sex differences using linear models that accounted for age and Euler number. Subsequently, we assessed the differences in effect size distributions between limbic and non-limbic structures using a t-test. To account for differences in head size, we repeated the same univariate analysis introducing the eTIV as additional covariate together with age and Euler number. Next, we computed the broad sense heritability (see Heritability analysis) of volumes for each region and subsequently tested for heritability differences between limbic and non-limbic structures using a t-test. To look at the association between the two measures, we computed a correlation between the sex differences and the heritability for limbic and non-limbic structures. We tested for significant differences between correlation coefficients using a Fisher’s test for the comparisons of independent correlations.

Sex classification models and clinical associations

For each set of features (limbic, non-limbic and whole brain), two sex classification models were fitted, one using the raw volumes, and one using the residuals from the same features after regressing out the estimated total intracranial volume as follow:

$$lm \left( {Volume\, \sim Estimated\,Total\,Intracranial\,Volume} \right)$$

To avoid confounding factors, our training sample was balanced for sex and subjects were matched for age and image quality. We trained our models on HCP data, using gradient tree boosting as implemented in the xgboost package in R (version 4.2.2). Sex was coded in the training sample as binary variable with 0 assigned to males and 1 assigned to females. The resulting class probability ranges between 0 (male-like) and 1 (female-like). The learning rate was set at η = 0.01 and the initial number of rounds to 1000, and we performed fivefold cross-validation within the training sample. For each iteration the prediction error was assessed and used to determine the optimal iteration number, used to train the final models on the full set of data. We then applied these models to the test samples to classify brain sex. The resulting class probabilities were extracted and used for further analysis. To evaluate each model performance, both the accuracy and the area under the receiving operating characteristic curves (AUC) were calculated. As further control for the training procedure, we trained our models in a sex-balanced subsample of QTIM, matching the participants according to age and Euler number (N = 780, 50% females, females: mean age = 22.47, sd = 2.89, males: mean age = 22.28, sd = 2.86) and evaluated the performances.

To test the statistical significance of the models, we performed permutation testing. For each raw and eTIV-controlled set of features, the classification labels (sex of the participants) were randomly permuted 5000 times, while maintaining the feature sets unchanged, resulting in the randomized association between the feature matrix and the labels. For each permutation, fivefold cross-validation was applied. The accuracies were stored and used to build a null distribution that was then compared against the accuracies of the true models. Both a cut-off of 50% (chance level) for the accuracy and AUC and significant statistical results in the permutation testing were considered to evaluate the model as successful in classifying sex.

For external validation, we applied the HCP-trained model to independent data from the QTIM and SBRPS samples and computed the class probabilities in these unseen datasets. In the SBRPS data, we tested for the association with major depression using a linear model considering the diagnosis as independent variable, class probability obtained with each model as the dependent variable and accounting for sex, age, Euler number and site, as follow

$$lm\left( {Class\;probabilities\, \sim Diagnosis + sex + age + Euler\;number} \right)$$

To overcome potential issues with a continuum of class probabilities originating from two distinct distributions (males, females), we repeated the same analysis within females and males separately, accounting for age and Euler number as covariates as follow:

$$lm\left( {Class\;probabilities\, \sim Diagnosis + age + Euler\;number} \right).$$

As further control, we repeated the same analyses (general and within each sex separately) on a subset of subjects with an equal number of HC and MDD subjects matched by age and sex (HC: N = 249; MDD: N = 249; females = 48%). We consider diagnosis as an independent variable and sex, age and Euler number as covariates for the analysis in the general sample, while accounting only for age and Euler number as covariates in the within sex analyses, following the models stated above. Effect sizes were assessed using Cohen’s d [47].

Heritability analysis

Monozygotic and dizygotic twin couples were selected for the HCP and QTIM data. Only couples of twins with both twins in the data were selected in both HCP and QTIM dataset. Due to the sample, in the HCP all included dizygotic couples were of the same sex. In the QTIM sample, both same sex and discordant sex couples of twins were selected. Siblings were excluded from the analysis in both datasets. The total sample sizes for heritability analysis were N = 378 (MZ = 236, DZ = 142) for HCP and N = 674 (MZ = 302, DZ = 372) for QTIM data. The heritability analyses were run at a single-structure level and on the predicted class probabilities from the machine learning models. An AE model was used for both, returning the variance explained by the additive genetic component (A) and non-shared environment component (E). Sex, age and Euler number were accounted as covariates. For the single feature analyses, we repeated the same analysis on the raw volumes and on the residuals after regressing out the covariates and the total intracranial volume to control for head size. The heritability analyses were carried out with the twinlm function of the mets package in R (version 1.3.1).

Results

Limbic structures showed greater sex differences and heritability than non-limbic structures

The degree of sex differences of a given FreeSurfer-derived feature to its heritability in two independent samples (HCP and QTIM) is depicted in Fig. 2. Limbic structures showed significantly greater sex differences as compared to non-limbic structures in both the HCP (t = 4.89, p < 0.001) and QTIM (t = 3.87, p < 0.001) data. These results were replicated when accounting for the eTIV in both samples (HCP: t = 5.72, p < 0.001; QTIM: t = 5.59, p < 0.001), suggesting overall greater sex differences independent of head size in the limbic system. The limbic volumes were also more heritable than the non-limbic volumes, both with and without controlling for eTIV, in HCP (Raw features: t = 4.85, p < 0.001; eTIV-controlled: t = 4.84, p < 0.001) and QTIM sample (raw features: t = 3.36, p < 0.001, eTIV-controlled: t = 3.75, p < 0.001).

Fig. 2
figure 2

Limbic volumes show stronger sex differences and higher heritability than non-limbic volumes. The analysis was initially conducted in HCP (panel a, c) and thereafter replicated in the independent QTIM sample (panel b, d). The upper row (a, b) presents results from raw volumes whereas the lower row (c, d) shows results after controlling for estimated total intracranial volume. Each data point reflects the effect of one brain region. Regression lines illustrate the direction of effect across all limbic or non-limbic regions, indicating that the most heritable features had also grater sex differences. In both samples, the identified effects from raw volumes were smaller yet still significant when accounting for the total intracranial volume. ***p < 0.001

For both, limbic and non-limbic structures, we observed a significant positive association between sex difference and heritability, indicating that the most heritable features had also greater sex-differences (range across the eight regression lines depicted in Fig. 2: r = 0.238, p = 0.003 to r = 0.633 p < 2.2e−16) (see Supplementary Table 2, Additional File 2, for the complete list of values. Positive t-values for sex differences reflect greater volumes in males.). Slopes were not statistically different between limbic and non-limbic features, except for eTIV-corrected features in the HCP sample, where the slope difference between limbic and non-limbic reached statistical significance (z = − 2.125, p = 0.034). However, caution is warranted interpreting this difference as this finding could not be replicated in QTIM.

Regionally constrained models successfully classify sex

We used the different feature sets (limbic, non-limbic, whole brain) from the HCP sample to train a machine learning model that classified sex from volumetric imaging data. We first assessed model performance within HCP using fivefold cross-validation, indicating that all three models were able to successfully classify sex (please see Fig. 3 for illustration). The limbic model achieved an accuracy of 87% and AUC of 0.935, while the non-limbic achieved an accuracy of 84.4% and AUC of 0.92. Both models performed approximately as well as the whole brain model (accuracy = 86.6%, AUC = 0.941). When controlling for eTIV, the models were still capable to correctly classify sex, although with a lower accuracy and AUC (limbic (eTIV): accuracy = 70.9%, AUC = 0.778; non-limbic (eTIV): accuracy = 74.6%, AUC = 0.819; whole brain (eTIV): accuracy = 77.2%, AUC = 0.861). No significant difference in the performance between the three models was found.

Fig. 3
figure 3

The limbic model achieved performances similar to models based on non-limbic and whole brain data. The ROC curves and the AUC for each model in the HCP sample (cross-validation within the training set) and in the QTIM sample (validation in independent test data) show high performances for all models. Regressing eTIV from the features (dashed lines) reduced performance compared to raw volumes (solid lines) yet all models still performed well above chance

Permutation testing with 5000 permutations per model indicated that no permutation-based accuracy was better than the accuracy obtained with the true models, confirming the validity of our models (see Supplementary Figure S3, Additional File 1). Notably, when looking into the feature importance for the raw and eTIV-corrected whole brain models, the main contributors to both models belong to the limbic system (see Supplementary Figure S4, Additional File 1).

Next, we applied the HCP-trained models to QTIM data for external validation, confirming solid performance in independent data (with an accuracy of 82.3%, 77.6% and 79.7% and an AUC of 0.905, 0.885 and 0.920 for limbic, non-limbic and whole brain model, respectively). For eTIV accounted models we also observed performances similar to those obtained within HCP (accuracy: 68.4%, 66.8%, 71.6%, AUC: 0.748, 0.736, 0.793 AUC for limbic, non-limbic and whole brain model, respectively).

When training the data in QTIM, the performances were in line with those obtained from the training in HCP and the independent testing in QTIM, with accuracies of 85.9%, 82.1% and 87.6% and AUC of 0.933, 0.902 and 0.943 for the limbic, non-limbic and whole brain model respectively. When correcting for eTIV, the models achieved lower but still relevant performances (accuracies: 73.7%, 74.0%, 77.2%, AUC: 0.806, 0.830, 0.865 for limbic (eTIV), non-limbic (eTIV), and whole brain (eTIV)) (see Supplementary Figure S5, Additional File 1). These results support the generalizability of our HCP trained model since there was only a minor improvement in performance metrics when the model was trained and validated in QTIM compared to the performance we achieved by applying our independently trained HCP model to QTIM.

Brain sex class probabilities are heritable

The results of the broad sense heritability estimated from twin data of the class probabilities obtained from each of the machine learning models is shown in Fig. 4. Sex class probabilities were heritable for all models, both in the HCP (limbic: 81.4%, non-limbic: 89.4%, whole brain: 87.4%) and the QTIM data (limbic: 78.9%, non-limbic: 82.1%, whole brain: 74.7%). When accounting for the total intracranial volume, the broad sense heritability decreased in both samples yet was still substantial (HCP: limbic (eTIV): 61.3%, non-limbic (eTIV): 47.6%, whole brain (eTIV): 51.98%; QTIM: limbic (eTIV): 48.5%, non-limbic (eTIV): 56.5%, whole brain (eTIV): 57.8%).

Fig. 4
figure 4

The class probabilities from all models were heritable in both samples, indicating meaningful underlying biological information. Each cell shows the broad sense heritability in percent, with highest values for those derived from models that do not account for eTIV. Similar values are obtained when repeating the analyses within each sex (see Supplementary Figure S6, Additional File 1)

Higher class probabilities are associated with major depression

To investigate the association with MDD in a clinical sample, we applied the models to the SRPBS sample, considering healthy controls and depressed patients. After verifying the accuracy and the AUC of each model in healthy control and MDD patients (see Fig. 5), we extracted the class probabilities and associated them with the diagnosis. Our results indicated that the class probabilities in the clinical sample were overall higher (i.e. in the direction of a female brain phenotype) as compared to the healthy controls. These results were significant in all models, with strongest effect for the limbic model (limbic: t-value = 2.81, p = 0.005, Cohen’s d = 0.21; non-limbic: t = 2.57, p = 0.011, Cohen’s d = 0.2; whole: t = 2.40 p = 0.016, Cohen’s d = 0.18). When accounting for eTIV, these results were no longer significant (please see Fig. 6 for details). Interestingly, when analyzing the association between class probabilities and depression within each sex, no effect was found in males, while only the limbic (t = 3.11, p = 0.002, Cohen’s d = 0.34) and whole brain (t = 2.27, p = 0.023, Cohen’s d = 0.25) models showed significant differences between HC and MDD in females. When accounting for eTIV, none of the associations within sex were significant. However, similar patterns of stronger effects in females compared to males was found for the three models (females: limbic (eTIV): t = 1.94, p = 0.053, Cohen’s d = 0.21; non-limbic (e-TIV): t = 0.15, p = 0.881, Cohen’s d = 0.02; whole brain (eTIV): t = 1.57, p = 0.118, Cohen’s d = 0.17; males: limbic (eTIV): t = 0.13, p = 0.895, Cohen’s d = 0.01; non-limbic (eTIV): t = 1.20, p = 0.231, Cohen’s d = 0.13; whole brain (eTIV): t = 0.27, p = 0.787, Cohen’s d = 0.03). When repeating the analyses in the age-matched HC-MDD subsample, only the limbic model in females showed significant greater class probabilities in MDD (t = 2.14, p = 0.033, Cohen’s d = 0.28), while no significant association was found for the general sample or males or when correcting for eTIV (see Supplementary Figure S7, Additional File 1).

Fig. 5
figure 5

All models achieve high performance in the SRPBS sample. The ROC curves and the AUC show the performance of each model for healthy controls and MDD patients. In healthy controls the accuracies were 75%, 73.3%, and 74.1% for the limbic, non-limbic and whole brain model, respectively. When accounting for eTIV the accuracy and the AUC decreased in all models, yielding 65.2%, 69.2% and 70.3% accuracy for limbic (eTIV), non-limbic (eTIV) and whole brain (eTIV), respectively. For the MDD group the accuracies were slightly lower for all models (raw models: 71.9%, 69.9% and 70.7%; eTIV-corrected: 63.5%, 64.7%, 68.3% for limbic, non-limbic and whole brain, respectively)

Fig. 6
figure 6

MDD diagnosis is associated with a more female-like brain. Associations between class probabilities and diagnosis for major depression disorders. Higher class probabilities are associated with MDD diagnosis for all raw models. When accounting for eTIV these effects are no longer significant. *p < 0.05; **p < 0.01

Discussion

The present study investigated whether regionally constrained machine learning models can correctly classify sex from T1-based volumetric MRI data, and if regional class probabilities are more sensitive to a female-prevalent mental disorder compared to whole brain models, the current standard in the field. Known for its involvement in various emotional and cognitive abilities as well as its central role for mental conditions, we here focused on the limbic system to investigate brain sex as a putative phenotype to study female-prevalent mental disorders. Our results based on univariate analysis indicate that limbic structures show substantially greater sex differences compared to non-limbic structures. Limbic structures were also more heritable, which may indirectly suggest that their sex differences are mediated by genetic components. These univariate findings supported the utility of the limbic system as a target for regional constrained brain sex prediction. Our multivariate machine learning models using limbic, non-limbic and whole brain features, respectively, were able to classify sex, yielding heritable class probabilities that were associated with major depression diagnosis, suggesting meaningful underlying biological variance.

Limbic structures showed greater sex differences and heritability than non-limbic structures

In line with previous studies showing associations of limbic structures with sex and hormonal status [7, 9, 14, 15], our univariate analyses showed that limbic structures were characterized by stronger sex differences compared to non-limbic structures. Likewise, we found stronger heritability for limbic compared to non-limbic structures, which resembles a previous report of high heritability in several limbic structures, including hippocampus, amygdala, and nucleus accumbens [48], and contributes to the global efforts to disentangle genetic contribution to brain structure and function [48,49,50,51,52,53]. The degree to which genetic contributions to the anatomy of the limbic system are influenced by sex is an understudied topic with mixed results thus far. While some studies find no sex differences in brain volumes [48], others report lower heritability in females, suggesting as possible explanation a greater influence of environmental factors such as the hormonal status in females [51, 54]. Supporting the latter hypothesis, many limbic structures express receptors for sex hormones [15, 55], contributing to plasticity mechanisms and structural changes. Here, we however found the same pattern of stronger heritability in limbic structure compared to non-limbic structures in males and females, indicating that the heritability patterns were not driven by one sex. Notably though, our results indicate a strong positive association between sex differences and heritability of brain structure, indicating that the structures showing greater sex differences are also the most genetically determined structures, even after controlling for head size.

Regionally constrained models successfully classify sex

The direct comparison and interpretation of sex effects on specific brain regions using univariate frameworks may be limited by the diversity in terms of protocols and parameters used by different studies [7, 15]. Thus, multivariate analysis in a machine learning framework can provide the advantage of condensing the information from a set of brain features into a single score (the class probability), which could be used as variable for further analysis avoiding difficult comparisons. Here, we demonstrated that sex can be classified from regionally constrained feature sets without performance loss. In particular, our limbic and non-limbic models were both able to classify brain sex with accuracies and AUC similar to those obtained with the whole brain model. It is worth noting that the limbic model achieved descriptively the same level of accuracy and AUC with much less features than the other models (limbic: 160 features, non-limbic: 333, whole brain: 493). It must be noted that the cross-validation procedure implemented in HCP did not account for the presence of family members in the splitting procedure, thus potentially biasing the classification performance via twin similarity. However, the external validation in two independent samples achieved high accuracies for all models, suggesting that the presence of twins in the model training did not induce substantial bias.

The importance of external validation for machine learning approaches to improve generalizability and reproducibility has been largely discussed in the literature [56,57,58,59]. In fact, while internal validation by splitting the data in training and test sets is an important tool to ensure reproducibility in a similar sample, it can also be affected by different forms of data leakage [56], inflating the performance of the model. In this context, external validation is key to ensure generalizability in independent data and in samples with different characteristics and acquired under different conditions [59]. Here, we externally validated the model in two different sample, proving the ability of our models to generalize under different conditions.

Brain sex class probabilities are heritable

Since deviations in class assignment can reflect both methodological error or biological variance we assessed the degree to which the class probabilities returned by each of the models capture biologically meaningful variance, by first assessing their heritability. Although previous work has investigated heritability of different structural and functional brain measures using univariate analyses [49, 51, 53, 54], heritability studies of multivariate estimates of brain sex are scarce. A recent study obtained heritable sex scores from a classification of whole brain data [60], in line with our results. Here, we extend these findings to regional constrained estimates of brain sex. We found that the class probabilities were heritable for all models, including those controlling for total intracranial volume, supporting their interpretability by pointing at underlying genetic factors.

Higher class probabilities are associated with major depression

Building upon the heritability results indicating biological meaningful variance, we furthermore investigated the clinical association with major depression under the hypothesis that class probabilities of limbic features may serve as a putative phenotype for the investigation of sex-prevalent mental health conditions. Based on data-availability, we focused on MDD, known to be more prevalent in females [1]. Previous univariate analyses have shown alterations in neuroimaging phenotypes in MDD [22,23,24,25, 61,62,63], with particular focus on the limbic system. Studies investigating structural changes in brain MRI have displayed reduction in brain volume of several limbic regions in depressed subjects, underling the possible involvement of the limbic system in onset and maintenance of depression [22,23,24,25, 63]. However, many of these studies do not account for possible sex differences in the effects, reporting an overall smaller volume of cortical and subcortical structures in MDD patients compared to HC [22, 24, 25]. Multivariate classification models deliver probabilities at the single subject level, facilitating the investigation of sex specific effects in disorder associations. Comparing clinical data both across sex and within sex allowed us to quantify the degree to which MDD associations with brain imaging data are sex specific. Previous studies attempted to associate brain sex estimates with other common mental disorders and symptoms with sex deviant prevalence [64, 65]. However, these studies only focused on whole brain estimates. We extend this by showing that in all three models (limbic, non-limbic, whole brain) a more female-like brain is associated with MDD diagnosis. Interestingly, the association was strongest for the limbic model, complementing previous univariate findings on the relevance of the limbic system and supporting that regionally constrained estimates might represent sensitive markers to study brain–mental health associations. Moreover, when analyzing the two sexes separately, the effect survived only in females and only in the limbic and whole brain model, highlighting the importance of investigating sex differences in clinical associations with brain sex. Nevertheless, it must be noticed that when controlling for eTIV none of the models were significantly associated with depression diagnosis. Thus, from the data at hand we cannot rule out that the observed clinical associations were driven by sex differences in total brain size.

Methodological considerations and future directions

Potential limitations may stem from the fact that hormonal status, by acting on brain plasticity, might affect brain structure. Thus, the class probabilities obtained with our model might change according to the phase of the menstrual cycle, the intake of hormonal contraceptives, and the age-related hormonal status of the participants. As noted, many limbic structures are particularly sensitive to these changes [14, 15, 50]. We attempted to mitigate the hormonal effects by matching the subjects in the training data based on the available menstrual cycle information. However, the lack of precise data on the hormonal status of the participants hinders the effort to control for the variability due to hormonal effects. Women’s health factors such as menstrual cycle, hormonal contraceptives use, pregnancy, and menopause are still largely overlooked in neuroimaging research [66]. Thus, further neuroimaging data and research considering hormonal levels across the menstrual cycle and different life stages is needed to provide more insights on brain sex classification models and to move the field forward. Moreover, while we placed strong emphasis on replicating our results in independent data, lending credibility in both the univariate analyses (heritability, sex differences) as well as in the multivariate analysis (model performance, heritability of class probability), data availability limited us in the replication of clinical associations. We thus deem it important to validate our clinical associations in another dedicated study, which would also test our results generalizability to different technical and clinical characteristics, such as differences in scan protocols or confounds due to the known impact of antidepressant treatment on brain plasticity [22, 63]. Finally, although the effects in clinical association analyses with MDD were in the same direction in all models (see Fig. 6), those controlling for eTIV did not reach statistical significance. Sex differences in total brain volume have been previously associated with structural differences in regional volume [12, 13, 45, 67]. Here, we corrected our analysis by regressing out the eTIV from each feature. For additional validation, we applied the power-corrected proportion approach and correlated the resulting features with features from the eTIV regression. The substantial correlation between both approaches (r > 0.98) corroborates that our eTIV correction is robust (see Supplementary Figure S2, Additional File 1). However, other methods beyond those tested could still yield different results [45, 67]. Indeed, the discussion on how to account for such differences is still ongoing. Recent studies showed how applying different correction methods for total intracranial volume based on features transformation might lead to different results in both univariate and multivariate analyses for sex differences [45, 67]. Other approaches acting at a subject selection level (e.g. matching participants of different sexes based on the total intracranial volume to ensure a balanced sample) might be more successful in limiting the effect of brain size in multivariate analyses [68]. Studies exploiting future developments in the area of eTIV correction might therefore add further insight into the impact of eTIV differences.

Conclusion

In conclusion, we here show in two independent data sets that sex can be classified from T1-based MRI volumes using regionally constrained models, by integrating prior knowledge into the selection of machine learning model features. Our limbic model achieved as high accuracy as the whole brain model using only one third of the features of the latter and the respective class probabilities displayed the strongest associations with major depression diagnosis. Heritability analysis further supports that these probabilities capture biologically meaningful information.

Perspectives and significance

Previous studies have deployed machine learning models to classify sex from brain imaging data and derived male–female class probabilities at the individual level. However, the degree to which these probabilities vary across subsets of brain regions remains largely unexplored. Here, we study sex differences in limbic vs. non-limbic brain features and found strongest association of limbic sex probabilities with clinical data. These findings highlight the potential utility of regionally constrained models to investigate the link between brain sex and mental disorders and call for future investigations into other mental disorders with sex differences in prevalence and symptom profiles.