Introduction

Neurofibrillary tangles (NFT), inclusions composed of toxic hyperphosphorylated forms of the microtubule-associated protein tau (p-tau), are the defining neuropathological feature of a category of neurodegenerative diseases termed tauopathies [1, 2]. This large group of diseases includes primary age-related tauopathy (PART) [3], Alzheimer’s disease (AD) [1], argyrophilic grain disease (AGD) [4], frontotemporal lobar degeneration (FTLD) [5], and chronic traumatic encephalopathy (CTE) [6]. PART describes a neuropathologic continuum observed in the brains of elderly individuals containing p-tau pathology in the absence of or with mild amounts of amyloid-beta peptide (Aβ). Subjects with a Consortium to Establish a Registry for Alzheimer's Disease (CERAD) neuritic plaque severity score of zero are considered PART definite while those with a score of one are considered PART probable. Clinically, those with PART may or may not have cognitive impairment [3, 7], raising the possibility that other factors (e.g. cerebrovascular disease) play a role. For these reasons, studying PART provides an opportunity to assess age-related neurodegenerative processes that contribute to cognitive impairment. The relationship between cognitive impairment in PART and NFT burden is currently not well understood [7]. For example, non-impaired individuals can have a significant NFT burden, complicating our understanding of the contribution of such brain changes to symptomatology [3, 7]. Conversely, it is well understood that NFTs accumulate with age and that individuals who are older are more likely to have cognitive decline [8]. Thus, the age-independent relationship between NFT burden and cognitive impairment in PART remains unclear. One approach to improving our understanding of the complex relationship between NFT burden, aging, and clinical presentation is by leveraging more precise quantification of histologic features.

Prior to the introduction of computational-based approaches to neuropathology, the Braak tau staging system was the most prevalent method of measuring pathological p-tau burden in research and remains so in the clinical setting [9]. While this method has its strengths, it is inherently semi-quantitative, modestly reproducible, and subject to rater bias, leading to inconsistencies between evaluators and institutions [10,11,12,13,14]. Further, the Braak staging system was developed for assessment of p-tau pathology in the context of AD and has not been sufficiently validated specifically in Aβ-negative subjects. The Braak staging system is based on hierarchical neuroanatomical spread and not the degree of p-tau burden in specific brain regions [9, 12]. Although it reflects the topographic distribution of p-tau, it is often used as a proxy for the magnitude of neurofibrillary degeneration owing to a lack of convenient alternatives [15,16,17,18]. Consequently, in PART, which minimally advances outside of the medial temporal lobe, two cases with large differences in NFT burden can have the same Braak stage. We have found that Braak staging has suboptimal clinicopathologic predictive power in Aβ-negative individuals [19]. Thus, there is a need for better quantitative approaches to assessing p-tau burden [20,21,22,23].

Recent developments in whole slide digitization allow the use of computational approaches to precisely assess and quantify neuropathological features. This includes measuring histological staining intensity (e.g., positive pixels), which we have previously deployed in the context of hippocampal tissue sections immunohistochemically stained for p-tau [19]. However, this approach fails to distinguish between critical structural and morphological features that could assist in our understanding of the relationship between neuropathology and antemortem clinical symptomatology. Furthermore, this method relies on human-defined pixel color ranges and intensities, and is thus vulnerable to biases arising from the variable effects of formalin fixation on tinctorial properties [24]. An alternative approach is to utilize deep learning-based models such as convolutional neural networks (CNNs). CNNs can be trained to generate meaningful histologic metrics on whole slide images (WSIs) to assist in feature quantification [25], classification [26], or segmentation [

NFT burden calculation and slide level annotation

Neurofibrillary tangles (NFTs) were semantically segmented from whole slide images (WSIs) (Fig. 1a–c) using a SegNet model architecture, detailed in Signaevsky et al. 2019, which was trained on annotations performed by expert neuropathologists on 2221 NFTs from 14 different WSIs. For each slide, the model calculated NFT number, size, and location. WSIs were neuroanatomically segmented into the hippocampus proper (i.e., dentate gyrus, cornu ammonis, and subiculum) and the adjacent entorhinal cortex region, which variably includes posterior portions of the parahippocampal gyrus and the (trans-)entorhinal region or lingual gyrus (Fig. 1a), using Aperio ImageScope software. NFT counts were calculated for each region as the number of NFTs divided by the area of the region. AI-derived NFT positive pixel density was calculated as the sum of the area of all NFTs in a region divided by the area of the region. For standard positive pixel calculations, staining was measured in the hippocampus proper and entorhinal cortex separately and together using a modified version of the Aperio positive pixel count (Version 9), with the area of immunoreactivity determined from the intensities of the positive control sample in each batch. Positive pixel counts were normalized by dividing the number of positive pixels by the total area, creating a 0–1 p-tau burden scale.
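The two AI-derived region-level metrics described above reduce to simple ratios. The following is a minimal sketch of that arithmetic, not the pipeline used in this study; the `Region` container, the function names, and the toy pixel values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """Hypothetical container for one annotated region of a WSI."""
    area_px: float              # annotated region area, in pixels
    nft_areas_px: List[float]   # area of each segmented NFT, in pixels


def nft_count_density(region: Region) -> float:
    """Number of NFTs divided by the area of the region."""
    return len(region.nft_areas_px) / region.area_px


def nft_pixel_density(region: Region) -> float:
    """Summed NFT area divided by the area of the region (a 0-1 fraction)."""
    return sum(region.nft_areas_px) / region.area_px


# Toy values only; these are not measurements from the cohort.
hippocampus = Region(area_px=1_000_000, nft_areas_px=[400.0, 250.0, 350.0])
entorhinal = Region(area_px=500_000, nft_areas_px=[300.0, 200.0])

# The combined metric pools both regions before dividing.
combined = Region(
    area_px=hippocampus.area_px + entorhinal.area_px,
    nft_areas_px=hippocampus.nft_areas_px + entorhinal.nft_areas_px,
)
```

Pooling the regions before dividing, rather than averaging the two regional densities, is an assumption of this sketch; the text states only that the regions were measured separately and together.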

Fig. 1

Detection of neurofibrillary tangles (NFT) in phospho-tau (AT8) immunohistochemically stained whole slide images (WSI). a Example of a hippocampal WSI immunohistochemically stained for phosphorylated-tau (AT8). The hippocampus proper (blue) and entorhinal region (red) were manually segmented. b High-power (20×) representative image of the hippocampal CA2 subfield showing p-tau positive neurofibrillary tangles. c Corresponding output of the above image after it was passed through the semantic segmentation model that identifies NFTs. Each pixel value corresponds to the probability that it represents an NFT

Mean clustering coefficient calculation

To estimate the degree of NFT clustering for a given WSI, we represented the spatial distribution of NFTs as a network and calculated the mean clustering coefficient. The center coordinates of the NFTs were represented as a two-dimensional point cloud and loaded into a kd-tree, which was queried for all points within a given radius, r, of each NFT. Thus, the spatial distribution of NFTs for a given WSI is represented as a graph in which each NFT is a node and its neighbors are the other NFTs within a distance of r (Fig. 5a). Because there is no standard metric of inter-NFT distance, we created graphs over multiple values of r from 100 pixels (50.66 microns) to 5000 pixels (2533 microns) in 100-pixel intervals. To correct for the whole slide NFT burden in this calculation, all statistics for this metric included the total number of NFTs as a nuisance variable.
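To make the graph construction concrete, the following is a minimal sketch of a mean clustering coefficient computed on a radius graph. It is not the code used in this study: a brute-force pairwise distance search stands in for the kd-tree query, the function names are hypothetical, and two conventions are assumptions of the sketch (an edge requires a distance strictly less than r, and nodes with fewer than two neighbors contribute a coefficient of zero to the mean).

```python
import math
from itertools import combinations


def radius_graph(points, r):
    """Adjacency sets: i and j are neighbors if their distance is < r."""
    adj = [set() for _ in points]
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) < r:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def mean_clustering_coefficient(points, r):
    """Average local clustering coefficient over all nodes.

    For a node with k neighbors, the local coefficient is the number of
    edges among those neighbors divided by k * (k - 1) / 2.
    Nodes with k < 2 contribute 0 (an assumed convention).
    """
    adj = radius_graph(points, r)
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj:
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Radii from 100 to 5000 pixels in 100-pixel steps, as in the text.
RADII_PX = range(100, 5001, 100)


def clustering_profile(points):
    """Mean clustering coefficient of one slide at every radius."""
    return {r: mean_clustering_coefficient(points, r) for r in RADII_PX}
```

The pairwise search is quadratic in the number of NFTs, so for slides with many tangles a kd-tree radius query, as described above, would be the natural replacement for `radius_graph`.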

Statistical analysis

All statistics were carried out via the statsmodels library in Python [37]. Data were visualized using the ggplot2 package in R [38]. Descriptive statistics were used to identify differences between the cognitively normal and cognitively impaired PART groups for clinical, pathological, and genetic variables. Differences were detected using chi-square tests. A t-test was performed to determine whether age differed significantly between the normal and cognitively impaired groups. A multivariable model was created to determine to what extent measures of NFT burden (Braak NFT stage, positive pixel count, and AI-based) predict cognitive impairment in PART. Analyses evaluating associations between NFT burden and individual sub-measures of cognitive impairment used a t-test for clinical diagnosis, Spearman rank-order correlation for CDR, and Pearson correlation for MMSE. Age-adjusted models included age as a parameter. All statistical analyses using measures of NFT burden were corrected for multiple comparisons via false discovery rate.
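As an illustration of the multiple-comparison step, the following is a minimal sketch of a false discovery rate adjustment. The text does not name the specific procedure, so the Benjamini-Hochberg method used here is an assumption, and the function name is hypothetical. In practice, the `multipletests` function in statsmodels (with `method="fdr_bh"`) performs the same adjustment.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p), then a
    running minimum from the largest rank downward enforces monotonicity.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values only; these are not results from the study.
example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

One consequence of the running minimum is that several tests can share the same adjusted p-value.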

Results

Dataset demographics, neuropathologic findings, and genetics

A total of 706 subjects were included in this study (Table 1). The overall mean age was 85.15 with a range of 55 to 110 years. Of these, 362 subjects (mean age 82.96, 168 male, 194 female) had no cognitive impairment (NCI) and 344 subjects (mean age 87.45, 161 male, 183 female) had some degree of cognitive impairment (CI). The CI group was significantly older than the NCI group (p < 0.0001). In our genetic analysis, we found no significant interaction between cognitive impairment and presence of ε2 APOE allele, ε4 APOE allele, or MAPT haplotype distribution.

Table 1 Summary of cohort data

Neuropathologic case review found 166 subjects (26.9%) exhibited hippocampal age-related tau astrogliopathy (ARTAG). Comparing between the groups, we found that CI had significantly higher rates of ARTAG than NCI (31.27% vs 22.58%, p = 0.019). Because both ARTAG and CI are more prevalent in the elderly, we adjusted for age via the Cochran–Mantel–Haenszel method with two-level stratification, after which there was no longer a significant association between ARTAG and CI (pooled OR: 1.42, p = 0.058). There was no statistically significant difference in Braak NFT stage scores between the two groups (NCI: mean 2.35, stdev 1.30; CI: mean 2.46, stdev 1.31; two-tailed t-test, p = 0.27; chi-square test, p = 0.43). There were no significant differences in the distribution of CERAD score between the groups (NCI: mean 0.15, stdev 0.37; CI: mean 0.19, stdev 0.40; chi-square test, p = 0.48).
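For readers unfamiliar with the pooled odds ratio reported above, the following is a minimal sketch of the Mantel-Haenszel estimator across strata. The function name and the 2×2 counts are hypothetical and are not data from this cohort; the sketch covers only the pooled estimate, not the accompanying significance test.

```python
def mantel_haenszel_or(strata):
    """Pooled odds ratio across 2x2 tables, one table per stratum.

    Each table is (a, b, c, d):
      a = exposed and impaired,    b = exposed and not impaired,
      c = unexposed and impaired,  d = unexposed and not impaired.
    Pooled OR = sum(a*d/n) / sum(b*c/n), with n the stratum total.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical counts for two age strata (illustrative only).
younger = (10, 20, 5, 40)
older = (20, 10, 10, 20)
pooled = mantel_haenszel_or([younger, older])
```

Here exposure would correspond to the presence of ARTAG and the strata to the two age levels; how the age levels were defined is not stated in the text.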

Tau burden

In our main unadjusted analysis of tau burden as a predictor of cognitive status (Table 2), we found that the Braak NFT stage was not a significant predictor of cognitive impairment (OR 1.09, p = 0.2769). However, both AI-detected NFT counts and AI-detected NFT positive pixel density were significant predictors of cognitive impairment in the entorhinal cortex (counts, OR 1.38, p = 0.0001; pixels, OR 1.32, p < 0.0001), hippocampus (counts, OR 1.40, p = 0.0001; pixels, OR 1.35, p < 0.0001), and combined regions (counts, OR 1.45, p < 0.0001; pixels, OR 1.40, p < 0.0001) (Fig. 2). Standard p-tau immunoreactivity positive pixel count was also a significant predictor of cognitive impairment in the entorhinal cortex (OR 1.29, p = 0.0039), hippocampus (OR 1.42, p = 0.0002), and combined regions (OR 1.39, p = 0.0002).

Table 2 Odds of being cognitively impaired at death based on p-tau metric
Fig. 2

Increased neurofibrillary tangle (NFT) counts in cognitively impaired subjects. NFT densities are shown split by cognitive status, cognitively impaired (CI) and no cognitive impairment (NCI). NFT counts of the entorhinal cortex, hippocampus, and both regions combined are presented. Triple asterisks (***) denote p < 0.0001 based on a two-sample t-test between groups. Two-way analysis of variance yielded an F-statistic of 58.99

Similarly, in our age-adjusted analysis of tau burden as a predictor of cognitive status (Table 2), we found that the Braak NFT stage was not a significant predictor of cognitive impairment (OR 0.89, p = 0.1603). Age-corrected AI-detected NFT counts were a significant predictor of cognitive impairment in the entorhinal cortex (OR 1.15, p = 0.0373) and combined regions (OR 1.28, p = 0.0373), but not the hippocampus (OR 1.22, p = 0.0595) (Fig. 3d). In contrast, age-corrected AI-detected NFT positive pixel density and age-corrected standard positive pixel count were not significant predictors of cognitive impairment in the entorhinal cortex (AI-pixel, OR 1.19, p = 0.0666; standard pixel, OR 1.15, p = 0.1467), hippocampus (AI-pixel, OR 1.17, p = 0.0847; standard pixel, OR 1.01, p = 0.0666), or combined regions (AI-pixel, OR 1.20, p = 0.0598; standard pixel, OR 1.21, p = 0.0678). When comparing AI-detected NFT counts with age (Fig. 3a–c), we found a significant correlation between NFT counts and age in the entorhinal cortex (r = 0.28, p < 0.0001), hippocampus (r = 0.33, p < 0.0001), and combined regions (r = 0.34, p < 0.0001).

Fig. 3

AI-detected NFT counts by region with respect to age and cognitive status. a–c Relationship between NFT counts and age, grouped by cognitive status in the entorhinal cortex (a), hippocampus (b), and combined regions (c). Pearson correlation values between age and each region’s NFT density are shown with the associated p value. d Age-adjusted NFT density group difference by region. Asterisks denote p < 0.05 based on a two-sample t-test between groups. Two-way analysis of variance yielded an F-statistic of 4.23

A detailed breakdown of associations between regional AI-detected NFT counts and each individual clinical variable can be found in Fig. 4. NFT counts were significantly higher (p < 0.001) in cases with a positive clinical diagnosis of cognitive impairment than in those without, in each region and in the combined regions. There was a modest yet statistically significant positive correlation between NFT counts and CDR score in the hippocampus (ρ = 0.13, p = 0.02) and combined regions (ρ = 0.12, p = 0.04), but the correlation was not significant in the entorhinal cortex (ρ = 0.09, p = 0.14). There was a significant negative correlation between NFT counts and MMSE score in the entorhinal cortex (r = −0.16, p = 0.01), hippocampus (r = −0.17, p = 0.01), and combined regions (r = −0.18, p = 0.003).

Fig. 4

Relationship between NFT counts by region and each individual cognitive variable. In this analysis we used a loose label of cognitive impairment as a composite metric based on MMSE, CDR, or documented clinical history of cognitive impairment. This figure shows the relationships between AI-detected NFT counts by region and each individual clinical variable. (Left column) Two-sample t-tests were performed for documented clinical history of cognitive impairment. (Middle column) Spearman rho correlation was performed between NFT count and CDR score. (Right column) Pearson r correlation was performed between NFT count and MMSE

NFT spatial clustering analysis

In our analysis of NFT clustering, we found that the degree of NFT clustering significantly predicted cognitive impairment over a range of distance threshold values (r) (Fig. 5b), with a maximum odds ratio (OR 1.27, p = 0.0039) at r = 800 px (405.28 microns) (Table 2). Specifically, NFT clustering significantly predicted cognitive impairment for values of r between 300 and 1200 pixels (151.98 to 607.92 microns) (Fig. 6). With age adjustment, mean clustering coefficient did not significantly predict cognitive impairment (OR 1.16, p = 0.1162) (Table 2).
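The pixel and micron values quoted for r are related by a single scale factor, which can be checked with a short sketch. The factor below is inferred from the stated equivalence of 100 pixels and 50.66 microns; the constant and function names are hypothetical.

```python
# Scale inferred from the text: 100 px corresponds to 50.66 microns.
MICRONS_PER_PIXEL = 50.66 / 100


def px_to_microns(pixels):
    """Convert a distance threshold from pixels to microns."""
    return pixels * MICRONS_PER_PIXEL


# Thresholds quoted in the text, converted and rounded to two decimals.
thresholds_um = {px: round(px_to_microns(px), 2) for px in (300, 800, 1200, 5000)}
```

Under this scale factor the converted thresholds agree with the micron values reported in the text for 300, 800, 1200, and 5000 pixels.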

Fig. 5

NFT position as a geometric network and subsequent graph metrics. a A representation of NFT position as a geometric network. Each NFT is represented as a node in a unidirectional binary graph, where an edge exists between two nodes if the Euclidean distance between them is less than some value r. In this figure r = 250 μm. b Group comparison of non-cognitively impaired (NCI) vs cognitively impaired (CI) mean clustering coefficient. Asterisk denotes p < 0.05. Two-sample t-test between groups yielded a t-statistic of −2.97 and p = 0.0031. c Example of hippocampal whole slide image with high mean clustering coefficient (0.75). d Example of hippocampal whole slide image with low mean clustering coefficient (0.47)

Fig. 6

Odds ratio of cognitive impairment on mean NFT clustering coefficient for a range of given distance thresholds. Since the cutoff of r in our mean NFT clustering coefficient metric has no ground truth, we tested it across a large range of values. Red lines bounding shaded areas demarcate the upper and lower bounds of the 95% confidence interval. Mean NFT clustering coefficient significantly predicts cognitive impairment for distance thresholds between 300 px (151.98 microns) and 1200 px (607.92 microns)

Discussion

Machine learning has emerged as a rigorous and reproducible quantitative approach for assessing neurodegenerative lesions in human autopsy brain tissues, including neurofibrillary tangles and Aβ plaques, key components of AD, aging, and related diseases. It is unclear, however, whether these AI-derived traits are clinically relevant. Improving our ability to assess clinical correlates of neuropathological features, which remain modest even with widely deployed approaches [39], is an important priority. Here we show, in an autopsy cohort of 706 subjects meeting the neuropathological criteria for PART, that AI-derived measurement of NFT burden, obtained from digitized WSIs of the hippocampus immunohistochemically stained for p-tau in the medial temporal lobe, significantly predicts antemortem cognitive impairment. This AI classifier greatly outperformed Braak staging, the gold standard approach of NFT burden measurement, which did not predict cognitive impairment in this selected cohort. This supports our previous findings that widely deployed approaches may not fully capture clinically relevant disease burden in brains with PART [19].

While previous digital pathology studies have found correlations between p-tau burden and cognitive impairment [8, 18, 19, 40, 41], this is the first study, to our knowledge, to perform clinicopathologic correlations using AI-assisted NFT counts in a population of non-AD or related disease patients. Previous work using positive pixel counts in p-tau immunohistochemically stained digitized sections has provided a reliable estimate of p-tau burden [19, 42,43,44,45]; however, NFT segmentation via convolutional neural networks (CNNs) gives highly sensitive and specific measurements of NFT burden that are unbiased by neuropil threads or other tau-based pathologic structures [29]. In addition, AI-based CNNs generate novel metrics describing the size, morphology, and spatial distribution of NFTs. Notably, of the computational measures of p-tau burden, we found that AI-derived metrics of NFT counts were the only measures to detect an age-independent relationship between NFT burden and cognitive impairment. Thus, we conclude that AI-derived measures of NFT burden are a valuable and precise histologic tool that can be implemented at scale to assess subtle relationships that may underlie clinically relevant signals, without requiring the labor of manually counting NFTs on hundreds of WSIs. In summary, studies like this one, which leverage AI-derived histomics, help demonstrate the feasibility of deploying such metrics in clinicopathologic correlation studies in neuropathology.

In addition to rapidly quantifying tangle burden on a large dataset of donors, we also introduced a novel metric of NFT mean clustering coefficient which was able to quantify the spatial density of NFTs in a given sample. We found that NFT mean clustering coefficient reliably predicted cognitive impairment in our population of PART patients. This metric provides a novel insight into the distribution of p-tau in a given section, a measure which so far has only been indirectly approximated [46]. We hypothesize that this metric can assist in predicting cognitive impairment in tauopathies which are more focally distributed, such as CTE [44, 46]. This approach to measuring disease burden has the theoretical potential to capture mechanisms of p-tau spread through a given region, which is currently under investigation by several other groups [47,48,49,50,51,52,53,54]. Previous work has shown the extent to which graph-based spatial measures can estimate disease burden in histopathology [55]. Of note, Signaevsky et al. 2022 found that graph-based metrics of spatial distribution of α-synuclein lesions had the highest predictive value in diagnosing Parkinson’s disease over all other measures of α-synuclein burden [33]. Future studies will seek to leverage several more AI-generated features of neurodegeneration, including but not limited to tangle shape and morphology, white matter involvement, and other pathological classifiers.

While our study demonstrated a strong correlation between NFT burden and cognitive impairment, there are notable limitations. We designated cognitive status using a weak threshold based on limited available clinical information, including three different measures of cognitive impairment [30]. Telyan et al. 2020 found longitudinal decline within specific cognitive domains in a population of PART patients [56]; however, it remains unknown what histopathologic features underlie deficits in each domain. Correlative studies within prospective cohorts with antemortem neuropsychological assessments would allow for the potential to analyze differential relationships between anatomic subregional vulnerability and specific cognitive domain deficits. Further, the timeframe under which patients' clinical data were obtained before death was variable, and some may have progressed in this time window. Additionally, the cohort was not population based. For all these reasons, our clinical classification is inherently noisy. While this approach has modest sensitivity for cognitive impairment, we nevertheless found that our measures of NFT burden significantly correlated with each individual cognitive measure independently, demonstrating the utility of this AI-derived metric to detect a signal despite a high degree of noise. Another limitation is the use of coarse neuroanatomical annotations which did not follow subregion boundaries with known selective vulnerability profiles in PART [57, 58]. Follow-up studies are ongoing to establish protocols for detailed hippocampal subregion annotations for future analysis, as well as to leverage subregion-specific p-tau burden metrics in clinicopathologic, genomic, and transcriptomic correlative studies.
Further, this study did not account for the contributions of certain pathologic features (e.g., TDP-43, cerebrovascular disease, degree of neuronal loss) relevant to both cognitive impairment and the degree of neurofibrillary degeneration [19, 30, 59, 60]. Thus, future studies are necessary to measure the extent to which our observed associations would remain after accounting for their confounding influence. While this study establishes clinicopathologic correlations between AI-derived measures of NFT burden in a population of PART patients, further studies are required to validate these findings in other populations and tauopathies such as AD, FTLD, and CTE.

In conclusion, here we demonstrate that our AI-derived measures of neurofibrillary degeneration offer a rapid, robust, and reproducible approach to identifying histopathological features which predict antemortem cognitive impairment independently of age. These results support our prior work showing a strong correlation between cognitive impairment and the degree of NFT pathology using positive-pixel counts in the medial temporal lobe in PART. Further, this study demonstrates that AI-derived metrics have the potential to provide novel histologic signatures for clinicopathologic correlation in future studies.