Background

Alzheimer’s Disease (AD) is a degenerative neurological condition that impairs memory and thinking skills, and it has become a public health crisis. The cost of AD to society is substantial, as AD patients require significant expenditures for health care, intensive long-term services, and support. In the US, the total health care cost of AD treatment in 2020 was estimated at $305 billion [1].

The US National Alzheimer’s Project Act (NAPA) established a National Plan for Alzheimer’s disease and related dementias (AD/ADRD) that mobilizes public and private resources with the goal of preventing and effectively treating AD and ADRD by 2025. The AD+ADRD Research Implementation Milestones detail specific steps and success criteria toward achieving the goal of NAPA, organized into eight focus areas, including Enabling Infrastructure. Data Sharing and Reproducibility is one of the subareas under Enabling Infrastructure and contains seven specific milestones (Milestones 3.A – 3.G). Milestone 3.A aims to provide resources that make datasets from high-value, publicly funded clinical research/cohort studies widely accessible, (re)usable, and interoperable.

Major strides in data sharing for AD research in the US include the National Alzheimer’s Coordinating Center (NACC) [2] and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [3], which provide valuable resources for discoveries regarding AD pathophysiology, diagnosis, and treatment. In addition, the US National Institutes of Health (NIH) has launched the NIH Common Data Elements (CDE) Repository to provide open access to structured definitions of data elements recommended or required by NIH Institutes and Centers or other organizations, in order to facilitate interoperability among data sets in various disease research areas, including neurological conditions. However, the extent to which these three resources (NACC, ADNI, and NIH CDE) are interoperable with each other with regard to AD-related data elements is unclear.

The goal of this work is to map data elements among NACC, ADNI, and NIH CDEs in order to better understand their interoperability. We explore and compare bag-of-words and word embedding models to perform the data element mappings between the different resources.

National Alzheimer’s Coordinating Center

NACC has developed and maintains a large database with information on more than 156,000 participants from Alzheimer’s Disease Research Centers (ADRCs) funded by the National Institute on Aging [4]. The NACC database includes longitudinal clinical data, neuropathological data, imaging and fluid biomarker data, and genetic data. The Uniform Data Set (UDS) is the primary data set for researchers analyzing clinical and demographic data.

Alzheimer’s Disease Neuroimaging Initiative

ADNI is a collaborative effort supported by the NIH and the private sector to develop clinical, imaging, genetic, and biochemical biomarkers for the early detection and tracking of AD [3]. ADNI data have been widely used by researchers around the world, resulting in over 2,100 publications. The study data shared by ADNI include subject demographics, family history, diagnosis and neuropsychological assessments, biospecimen data, genetic results, MRI and PET images, medical history, and neuropathology results.

NIH Common Data Elements

CDEs are standardized, precisely defined questions (or variables) paired with a set of specific allowable responses (or permissible values), used across different sites or studies [5]. Use of CDEs can facilitate data sharing and standardization to improve data quality and enhance research reproducibility, as well as enable data harmonization and integration from multiple sources, including electronic health records.

NIH has encouraged the use and development of CDEs in patient registries, clinical studies, and other human subjects research to improve accuracy, consistency, and interoperability among data sets within various areas of health and disease research. The NIH CDE Repository, hosted by the National Library of Medicine, has integrated a total of more than 27,000 CDEs, including over 18,000 elements from the National Institute of Neurological Disorders and Stroke (NINDS) and over 1,500 elements from the National Cancer Institute (NCI).

Data element mapping

Mapping data elements across different sources has been an active research area in biomedical data integration. For instance, Mougin et al. performed data element mapping from a collection of gene-related, protein-related, and disease-related sources to the National Cancer Institute (NCI) Cancer Data Standards Registry (caDSR) and the Unified Medical Language System (UMLS) [6]. Pathak et al. mapped phenotype data elements from five eMERGE (Electronic Medical Records and Genomics) Network sites to the caDSR and SNOMED CT [7]. Liu et al. mapped data elements in the Dental Information Model (DIM) to the caDSR common data elements [8]. To the best of our knowledge, this paper is the first study to map AD-related data elements.

Methods

In this work, bag-of-words and word embedding models are leveraged to map AD-related data elements among NACC, ADNI, and the NIH CDE Repository. The mapping results are manually reviewed by human experts. For valid mappings, the consistency of permissible values is further examined.

Materials

We downloaded structured data dictionaries (in CSV and PDF) from NACC, ADNI, and the NIH CDE Repository on November 23, 2021. The CSV data dictionaries provided by NACC contain data elements in the UDS, the Neuropathology (NP) data set, and the genetic data. The PDF data dictionaries provided by NACC contain data elements in the imaging and biomarker data sets. Figure 1 shows an example data element in NACC’s imaging data dictionary. We used the open-source pdftotext utility (part of the Xpdf software suite [9]) to convert the PDF data dictionaries to plain text files, which were further parsed to extract the attributes of data elements and store them in CSV. Figure 2 shows two examples of data elements in the NIH CDE Repository.

Fig. 1

An example data element from NACC’s imaging data dictionary in PDF. The form name of this data element is “Imaging”. The short descriptor of this data element is “Left insula gray matter volume (cc)”

Fig. 2

Two examples of common data elements in NIH CDE Repository

For data element mapping, we leverage the “Form” and “Short descriptor” of NACC data elements, the “CRF NAME” and “TEXT” of ADNI data elements, as well as the “Name” and “Question Texts” of data elements in the NIH CDE Repository.

Data element preprocessing

For NACC and ADNI, since the same data element may be collected in different study phases or visits (i.e., the same information being captured multiple times), we keep one instance and remove the duplicates for mapping. Table 1 shows an example of such a data element in ADNI, which was captured in four phases (i.e., ADNI1, ADNIGO, ADNI2, and ADNI3). We further filter out ambiguous data elements containing “other” and “specify”, because they are always associated with another data element and are likely to cause incorrect mappings. For instance, the data element “Other, specify” in the NIH CDE Repository may relate to “Tremor type”, “Employment current status code”, or “Primary caregiver relation patient type”, among others. We also disregarded a few ADNI data elements that have a “CRF NAME” but lack a “TEXT”.
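A minimal sketch of this preprocessing, using a hypothetical `preprocess` helper over (name, text) pairs; the filtering rule for “other”/“specify” elements below is our reading of the description above, not the exact implementation:

```python
def preprocess(elements):
    """Deduplicate and filter data elements given as (name, text) pairs.

    Drops elements with an empty text (e.g., ADNI elements with a CRF NAME
    but no TEXT), ambiguous "other"/"specify" elements, and exact
    duplicates (the same element captured in multiple phases or visits).
    """
    seen, kept = set(), []
    for name, text in elements:
        if not text:                      # missing "TEXT" attribute
            continue
        lowered = text.lower()
        if "other" in lowered and "specify" in lowered:
            continue                      # ambiguous "Other, specify" element
        if (name, text) in seen:          # duplicate across phases/visits
            continue
        seen.add((name, text))
        kept.append((name, text))
    return kept
```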

Table 1 Example of duplicated data elements in ADNI

Furthermore, in NACC’s CSV data dictionaries, forms are identified by short names (e.g., “a1”, “a2”, “b6”). We found the full names of the forms on NACC’s website and converted the short names to full names in NACC’s data dictionaries. For example, form “b6” has the full name “Behavioral Assessment - Geriatric Depression Scale”. In this paper, we use the full names of the forms when describing examples. In ADNI’s data dictionary, for some imaging-related data elements, the words in phrases describing brain regions are concatenated without spaces. We pre-process such cases and add spaces between the concatenated words. For example, after pre-processing “Cortical Thickness Average of RightTemporalPole” we obtain “Cortical Thickness Average of Right Temporal Pole”.
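The concatenated-word fix can be done with a simple regular expression; this is a sketch, as the paper does not specify its exact implementation:

```python
import re

def split_concatenated(text):
    # Insert a space wherever a lowercase letter is immediately followed
    # by an uppercase letter, e.g. "RightTemporalPole" -> "Right Temporal
    # Pole". Already-spaced text is left unchanged.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
```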

In addition, we normalize the text describing data elements before performing the mapping. The normalization consists of the following three steps: (1) convert the text to lowercase, and remove punctuation (e.g., ! ” # $ %) as well as extra whitespace in the text; (2) filter out stop words (such as “a”, “the”, “on”, and “is”) from the text using the open-source Natural Language Toolkit (NLTK) in Python [10]; and (3) perform lemmatization using the WordNet Lemmatizer in NLTK. For instance, both the short descriptor of the NACC data element “Body bradykinesia and hypokinesia” and the name of the NIH CDE “Body bradykinesia - hypokinesia” are normalized to “body bradykinesia hypokinesia”.
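The three normalization steps can be sketched as follows; this is a simplified version, in which the small stop-word list stands in for NLTK’s full English list and the WordNet lemmatization step is omitted:

```python
import string

# Small illustrative stop-word list; the actual pipeline uses NLTK's
# full English stop-word list and its WordNetLemmatizer.
STOP_WORDS = {"a", "an", "the", "on", "is", "and", "or", "of", "to"}

def normalize(text):
    # Step 1: lowercase, strip punctuation, collapse extra whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Step 2: filter out stop words.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Step 3 (omitted here): lemmatize each token with NLTK's WordNetLemmatizer.
    return " ".join(tokens)
```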

Bag-of-words based mapping

For bag-of-words based mapping, we model each data element as a bag-of-words. Given two data elements from different sources, we calculate the cosine similarity score of the bags-of-words of these two data elements. For instance, the cosine similarity between the NACC data element with form name of “Behavioral Assessment - Geriatric Depression Scale” and short descriptor of “Are you basically satisfied with your life?” and the ADNI data element with CRF name of “Geriatric Depression Scale” and text of “1. Are you basically satisfied with your life?” is 0.9. Note that the form name and short descriptor of NACC are combined as “Behavioral Assessment - Geriatric Depression Scale Are you basically satisfied with your life?”, and the CRF name and text of ADNI are combined as “Geriatric Depression Scale 1. Are you basically satisfied with your life?” when conducting the mapping process.
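The bag-of-words cosine similarity can be computed directly from term-frequency counts; a minimal sketch over already-normalized text:

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    # Term-frequency vectors over the words of the two texts.
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```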

We consider data elements with a cosine similarity score above 0.6 as mapped ones for further review and evaluation. We perform the data element mappings between different sources as follows: NACC to ADNI, NACC to NIH CDE, and ADNI to NIH CDE. If a data element from source A has multiple mapped data elements in source B, we only keep the one with the highest similarity score.
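This selection rule — keep only pairs scoring above 0.6, and only the best-scoring target per source element — can be sketched with a hypothetical `map_elements` helper into which any similarity function can be plugged:

```python
def map_elements(source, target, similarity, threshold=0.6):
    """For each source element, keep the single highest-scoring target
    element, and only if its similarity score exceeds the threshold."""
    mappings = {}
    for s in source:
        best_score, best_t = max((similarity(s, t), t) for t in target)
        if best_score > threshold:
            mappings[s] = (best_t, best_score)
    return mappings
```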

Word embeddings based mapping

For word embeddings based mapping, we explore word embeddings obtained from two widely used pre-trained models: Word2Vec, pre-trained on the Google News corpus, and BioWordVec, pre-trained on biomedical text from PubMed and PMC.

Evaluation and value mapping

To compare the effectiveness of the bag-of-words and word embedding based models, we manually review and evaluate the mapping results between data elements in different sources. For each pair of data elements evaluated as a correct mapping, we further examine their value types (e.g., numerical, categorical, date, and text) and permissible values for further mapping. For mapped data elements with a numerical type, we check whether they share the same unit; and for mapped data elements with a categorical type, we check whether the permissible values are completely identical, partially identical, or different.

Results

The data dictionaries that we downloaded contain 1,195 NACC data elements, 13,918 ADNI data elements, and 27,213 NIH CDEs. After the data element preprocessing, the numbers of NACC and ADNI data elements were reduced to 1,099 and 7,584 respectively.

Data element mapping

For the bag-of-words based approach (see Table 2), 156 pairs of mapped data elements from NACC to ADNI were identified, and 77 of them were evaluated as valid (or correct) mappings; 228 pairs from NACC to NIH CDE were identified and 73 of them were valid; and 382 pairs from ADNI to NIH CDE were identified and 90 of them were valid.

Table 2 Mapping results of the bag-of-words approach

For the Word2Vec based approach (see Table 3), 287 pairs of mapped data elements from NACC to ADNI were obtained, among which 120 were evaluated as valid mappings; 223 pairs from NACC to NIH CDE were obtained and 63 of them were valid; and 545 pairs from ADNI to NIH CDE were obtained and 80 of them were valid.

Table 3 Mapping results of the Word2Vec approach

For the BioWordVec based approach (see Table 4), 633 pairs of mapped data elements from NACC to ADNI were identified and 164 of them were evaluated as valid; 765 pairs from NACC to NIH CDE were identified and 89 of them were valid; and 2,448 pairs from ADNI to NIH CDE were identified and 150 of them were valid.

Table 4 Mapping results of the BioWordVec approach

Figure 3 displays a bar graph comparing the number of valid mappings obtained by each approach among the three resources. Overall, the BioWordVec approach identified more valid mappings than the bag-of-words and Word2Vec approaches.

Fig. 3

Comparison of the numbers of valid mappings identified among the three resources by the different approaches

Table 5 lists nine examples of valid mappings identified by the different approaches. For instance, the bag-of-words approach identified a valid mapping from the NACC data element with form name of “Unified Parkinson’s Disease Rating Scale (UPDRS)” and short descriptor of “Finger taps - right hand” to the NIH CDE with name of “Movement Disorder Society - Unified Parkinson’s Disease Rating Scale (MDS UPDRS) - finger tap right hand score” and question texts of “FINGER TAPPING”.

Table 5 Examples of valid mappings identified by the three approaches

We further calculated the precision and recall of each approach on each pair of resources. Since different approaches may identify the same mapping, we aggregated the mapping results obtained by the three approaches and removed duplicated mappings. From NACC to ADNI, there were a total of 720 mappings identified by the three approaches, among which 175 were evaluated as valid. From NACC to NIH CDE, there were a total of 867 mappings identified by the three approaches, among which 107 were evaluated as valid. From ADNI to NIH CDE, a total of 2,772 mappings were identified by the three approaches, and 171 of them were evaluated as valid. Figure 4 displays the Venn diagrams of the aggregated valid mappings identified by the three approaches from NACC to ADNI, NACC to NIH CDE, and ADNI to NIH CDE, respectively.

Fig. 4

Venn diagrams of the aggregated mappings among the three resources. A total of 175 valid mappings were identified from NACC to ADNI, 107 from NACC to NIH CDE, and 171 from ADNI to NIH CDE

Based on the aggregated valid mappings, we computed the precision and recall of the different approaches on each pair of resources (see Figs. 5 and 6). The bag-of-words approach achieved the best precision, while the BioWordVec approach attained the best recall.
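Under this evaluation scheme, precision and recall per approach reduce to simple ratios over the manual judgments and the aggregated valid set. A sketch, using the bag-of-words numbers for NACC to ADNI reported above (156 identified, 77 judged valid, 175 aggregated valid):

```python
def precision_recall(num_identified, num_valid_identified, num_total_valid):
    """precision: share of an approach's identified mappings judged valid;
    recall: share of all aggregated valid mappings that the approach found."""
    precision = num_valid_identified / num_identified
    recall = num_valid_identified / num_total_valid
    return precision, recall

# Bag-of-words, NACC to ADNI:
p, r = precision_recall(156, 77, 175)
print(round(p * 100, 2), round(r * 100, 2))  # 49.36 44.0
```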

Fig. 5

Precision of different approaches

Fig. 6

Recall of different approaches

Permissible value mapping

For the mappings evaluated to be valid, we examined whether the two data elements involved in each mapping share the same value type (see Table 6). From NACC to ADNI, 173 out of 175 pairs of mapped data elements have the identical value type. From NACC to NIH CDE, 105 out of 107 pairs of mapped data elements share the same value type. From ADNI to NIH CDE, 164 out of 171 pairs of mapped data elements have the identical value type. For example, both the NACC data element with form name of “Unified Parkinson’s Disease Rating Scale (UPDRS)” and short descriptor of “Finger taps - right hand” and the NIH CDE with name of “Movement Disorder Society - Unified Parkinson’s Disease Rating Scale (MDS UPDRS) - finger tap right hand score” and question texts of “FINGER TAPPING” have a numerical value type. An example of a disparate value type is the mapping between the NACC data element with form name of “Subject Health History” and short descriptor of “Average number of packs smoked per day” and the ADNI data element with CRF name of “Medical History” and text of “16a. During periods of smoking, the average number of packs/day”, where the former data element has a categorical value type while the latter has a numerical value type.

Table 6 Results of value type mapping

For mapped data elements sharing the same value type (numerical or categorical), we further checked the consistency of their permissible values. More specifically, we compared units for numerical data elements if applicable, and value lists for categorical data elements.

For the mapped data elements of numerical type, we classified their unit consistency checking results into three categories: identical, disparate, and not available (see Table 7). For example, the NACC data element with form name of “Imaging” and short descriptor of “Segmented right hippocampus volume (cc)” and its mapped ADNI data element with CRF name of “UCSF SNT Hippocampal Volumes” and text of “Right Hippocampus Volume” have disparate measurement units: the former is “cc”, while the latter is “mm3”. If at least one of the two data elements does not provide unit information, we categorized their unit consistency result as not available.

Table 7 Result of unit consistency checking for mapped numerical data elements

For the mapped data elements of categorical type, we classified their value list consistency results into three categories: identical, partially identical, and disparate (see Table 8). For instance, the ADNI data element with CRF name of “Geriatric Depression Scale” and text of “8. Do you often feel helpless?” and its mapped NIH CDE with name of “Geriatric Depression Scale (GDS) - feel helpless indicator” and question texts of “Do you often feel helpless?” in Table 5 have partially identical value lists. Specifically, the permissible values of the ADNI data element form a proper subset of the permissible values of the NIH CDE: the former has two permissible values, “1=Yes” and “0=No”, while the latter has three, “Yes”, “No”, and “Unknown”.
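The three-way classification of value-list consistency can be sketched as a simple set comparison over the response labels (our reading of the categories; the paper does not give a formal definition, and numeric codes such as “1=” are assumed to be stripped beforehand):

```python
def value_list_consistency(values_a, values_b):
    # Compare the sets of permissible-value labels of two mapped elements.
    a, b = set(values_a), set(values_b)
    if a == b:
        return "identical"
    if a & b:
        return "partially identical"  # some overlap, e.g. one is a subset
    return "disparate"
```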

Table 8 Result of consistency checking of permissible values for mapped categorical data elements

Discussion

In this work, we have explored bag-of-words, Word2Vec, and BioWordVec based approaches to map data elements among NACC, ADNI, and the NIH CDE Repository. In total, the three approaches mapped 175 out of 1,099 (15.92%) NACC data elements to ADNI; 107 out of 1,099 (9.74%) NACC data elements to NIH CDE; and 171 out of 7,584 (2.25%) ADNI data elements to NIH CDE. This indicates a critical need to develop standardized CDEs for AD research to facilitate data sharing and interoperability among the various AD-related data sets. Given the wide usage of ADNI and NACC, they may serve as valuable references for creating such standardized AD-related CDEs.

Comparison of three approaches

The bag-of-words approach represents each text as a term-frequency vector over its words. The similarity between two texts is calculated as the cosine similarity between their vectors. For our mapping task, the bag-of-words approach has the advantage of being simple and not requiring complicated processing. In our study, using the bag-of-words approach to map data elements took much less time than the Word2Vec and BioWordVec approaches. However, the bag-of-words approach cannot capture the similarity between related words and may not perform well when a text has repeated words. For example, the bag-of-words approach failed to identify the correct mapping between “Subject Demographics - Marital status” in NACC and “Participant Demographics - 4. Participant Marital Status” in ADNI, while the Word2Vec approach successfully identified it. Here, “Participant” appears twice in the ADNI data element and brings down the similarity score of the bag-of-words approach.

For Word2Vec, each word vector is a low-dimensional real-valued vector obtained by training, which addresses the bag-of-words approach’s semantic deficiency caused by treating words as independent. Since Word2Vec takes context into consideration, it can identify certain correct mappings that the bag-of-words approach cannot. However, one disadvantage of Word2Vec is that, because there is a one-to-one relationship between words and vectors, it cannot handle polysemy.

As for BioWordVec, it has shown effectiveness and utility in multiple NLP tasks in the biomedical domain. In our task, word embeddings trained on the PubMed and PMC corpora significantly outperformed those trained on Google News: BioWordVec improves the quality of biomedical word representations and better captures their semantics. For example, the bag-of-words and Word2Vec approaches failed to identify the correct mapping between “Imaging - Left pars triangularis mean cortical thickness (mm)” in NACC and “Longitudinal Free Surfer - Cortical Thickness Average of Left Pars Triangularis” in ADNI, while BioWordVec successfully identified it. After further examination, we found that Word2Vec’s Google News word vector set does not contain a vector for “Triangularis”, but BioWordVec’s biomedical word vector set does. Furthermore, the similarity of “mean” and “average” in BioWordVec is 0.8524, while their similarity in Word2Vec is 0.1956. Compared to the other two methods, BioWordVec identified the greatest number of correct mappings in this work.
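The word-level similarities above are simply cosine similarities between pre-trained vectors; for whole data elements, one common way to combine them (the exact combination scheme is not spelled out in the methods) is to average the word vectors of each text and compare the averages. A toy sketch with made-up 2-dimensional vectors, not the real embeddings:

```python
import math

# Made-up 2-d vectors for illustration only; real runs use pre-trained
# Word2Vec (Google News) or BioWordVec embeddings with hundreds of dimensions.
WORD_VECTORS = {
    "mean":    [1.0, 0.0],
    "average": [0.9, 0.1],
    "volume":  [0.0, 1.0],
}

def embed(text):
    # Average the vectors of in-vocabulary words; out-of-vocabulary words
    # (like "Triangularis" in Google News Word2Vec) are simply skipped.
    vecs = [WORD_VECTORS[w] for w in text.split() if w in WORD_VECTORS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def embedding_similarity(a, b):
    va, vb = embed(a), embed(b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    return dot / norm if norm else 0.0
```

With this scheme, texts that differ only in near-synonymous words (e.g., “mean” versus “average”) still score highly, which is exactly what the bag-of-words approach misses.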

Limitations and future work

With regard to the performance of the different approaches, since there was no gold standard available for the mapping results, the reported performance metrics (precision and recall) were based on the manual review and evaluation of the candidate mappings identified by the three approaches. It is possible that certain valid mappings were not obtained by any of the three approaches (i.e., missed by our approaches), meaning that the actual recall may be lower than the reported recall. Furthermore, it can be seen from Fig. 5 that the three approaches showed limited precision (ranging from 6.13% to 49.36%). This indicates that additional work is still needed to improve both the precision and recall of the mapping approaches.

Analysis of false positive cases

For potential improvement of precision, we examined some of the false positive cases (i.e., invalid mappings) and analyzed potential reasons. One reason is that the mapping approaches were not capable of distinguishing nuances between data elements. For example, the NACC data element with form name of “Co-participant Demographics” and short descriptor of “Co-participant’s month of birth” was mapped to the ADNI data element with CRF name of “Participant Demographics” and text of “2a. Participant Month of Birth” with a similarity score of 0.74, but they target different subjects (co-participant versus participant). Another example is that the NACC data element with form name of “Neuropsychological Battery Scores” and short descriptor of “Rey Auditory Verbal Learning (Immediate) Trial 3 Total recall” was mapped to the ADNI data element with CRF name of “Neuropsychological Battery” and text of “Rey Auditory Verbal Learning Test Trial 1 Total” with a similarity score of 0.73. However, NACC refers to Trial 3, while ADNI refers to Trial 1. A potential way to avoid such false positives is to assign importance weights to the words that differ between the two data elements and assess whether their difference is significant.

Analysis of false negative cases

For potential improvement of recall, we manually examined some data elements in NACC and ADNI and identified a few causes of false negative cases (i.e., mappings missed by the three approaches). One is that our approaches did not leverage acronyms. For example, NACC uses “APOE genotype” while ADNI uses “Apolipoprotein-E”; thus this mapping was missed by our approaches. In future work, we plan to leverage the Unified Medical Language System (UMLS) to perform synonym substitution and explore to what extent this may help identify additional matching data elements between different resources.

Another scenario is that ADNI reuses some of NACC’s neuropathology data elements, but it sometimes provides a longer description of the original NACC data elements. For example, ADNI’s data element with CRF name of “NACC Neuropathology Data Form” and text of “Primary pathological diagnosis judged to be responsible for subject’s cognitive status - Prion-associated disease” reuses NACC’s data element with form name of “Neuropathology” and short descriptor of “Prion - associated disease - primary”, but it has a much longer text description. Our approaches failed to identify such cases.

Additionally, some missed mappings were due to the different design or organization of forms in NACC and ADNI. For example, consider NACC’s data element with form name of “Neuropsychological Battery Scores” and short descriptor of “MoCA: Orientation - Place” and ADNI’s data element with CRF name of “MoCA” and text of “Place”. While ADNI has a dedicated form for MoCA, NACC’s Neuropsychological Battery Scores form covers a wider range of tests, including MoCA. Since our approaches combine the form name with the description, “Neuropsychological Battery Scores MoCA: Orientation - Place” for NACC is much longer than “MoCA Place” for ADNI, leading to a low similarity score. Assigning importance weights, as mentioned above, may also help recover such missed mappings.

Additional work is still needed to identify the remaining scenarios of valid mappings between different sources that were missed by the three approaches. Since manual review is a time-consuming and labor-intensive process, a more feasible solution is to randomly select a subset of data elements from one source and invite domain experts to manually map them to the other source. Such subsets would serve as partial gold standards to uncover valid mappings missed by our approaches and provide insight into enhancing the approaches.

Conclusions

In this paper, we have explored three models (bag-of-words, Word2Vec, and BioWordVec) for representing and mapping data elements among NACC, ADNI, and the NIH CDE Repository, in order to understand how interoperable the AD-related data elements in these resources are with each other. Our results showed that the bag-of-words based approach attained the best precision, while the BioWordVec based approach achieved the best recall. Although the mapping approaches need further improvement, our results indicate a vital need to create standardized AD-related CDEs, leveraging the rich data elements in NACC and ADNI, to enhance the interoperability of various datasets for AD research.