Assessing clusters of comorbidities in rheumatoid arthritis: a machine learning approach

Solomon, Daniel H.; Guan, Hongshu; Johansson, Fredrik D.; Santacroce, Leah; Malley, Wendi; Guo, Lin; Litman, Heather

doi:10.1186/s13075-023-03191-8

Research
Open access
Published: 22 November 2023

Volume 25, article number 224, (2023)

Arthritis Research & Therapy

Daniel H. Solomon¹,², Hongshu Guan¹, Fredrik D. Johansson³, Leah Santacroce¹, Wendi Malley⁴, Lin Guo⁴ & Heather Litman⁴

Abstract

Background

Comorbid conditions are very common in rheumatoid arthritis (RA) and several prior studies have clustered them using machine learning (ML). We applied various ML algorithms to compare the clusters of comorbidities derived and to assess the value of the clusters for predicting future clinical outcomes.

Methods

A large US-based RA registry, CorEvitas, was used to identify patients for the analysis. We assessed the presence of 24 comorbidities, and ML was used to derive clusters of patients with given comorbidities. K-mode, K-mean, regression-based, and hierarchical clustering were used. To assess the value of these clusters, we compared clusters across different ML algorithms in clinical outcome models predicting clinical disease activity index (CDAI) and health assessment questionnaire (HAQ-DI). We used data from the first 3 years of the 6-year study period to derive clusters and assess time-averaged values for CDAI and HAQ-DI during the latter 3 years. Model fit was assessed via adjusted R² and root mean square error for a series of models that included clusters from ML clustering and each of the 24 comorbidities separately.

Results

A total of 11,883 patients with RA who had longitudinal data over 6 years were included. At baseline, patients were on average 59 (SD 12) years of age, 77% were women, CDAI was 11.3 (SD 11.9, moderate disease activity), HAQ-DI was 0.32 (SD 0.42), and disease duration was 10.8 (SD 9.9) years. During the 6 years of follow-up, the percentage of patients with various comorbidities increased. Using five clusters produced by each of the ML algorithms, multivariable regression models with time-averaged CDAI as an outcome found that the ML-derived comorbidity clusters produced models of similar strength to models with each of the 24 separate comorbidities entered individually. The same patterns were observed for HAQ-DI.

Conclusions

Clustering comorbidities using ML algorithms is not computationally complex but often results in clusters that are difficult to interpret from a clinical standpoint. While ML clustering is useful for modeling multi-omics, using clusters to predict clinical outcomes produces models with a similar fit as those with individual comorbidities.

Introduction

Most, but not all, patients with rheumatoid arthritis (RA) have multimorbidity. Prior studies found that between 50 and 84% of patients with RA had some comorbidity with a mean of 2 comorbidities based on the Charlson Index [1, 2]. In addition, patients with RA develop comorbidities at an increased rate after the diagnosis of RA compared with matched controls [3]. Some of the excess morbidities may be directly related to RA (e.g., interstitial lung disease) and others are likely part of the systemic inflammatory milieu caused by RA. Comorbidities are important in RA as they strongly associate with disease activity, response to treatment, and overall mortality [4, 5].

Several prior studies have used machine learning (ML) to identify clusters of comorbidities in RA [6, 7], and similar clustering analyses have been pursued across other rheumatic diseases [8, 9]. Developing clusters of comorbidities through the use of ML is relatively easy to achieve, but it is important to consider the purpose of the clustering: do clusters of comorbidities (versus individual comorbidities) provide new insights, or act to predict or possibly explain clinical outcomes? While prior comorbidity cluster studies have created clusters, it has not always been clear what motivated the clustering. In addition, prior comorbidity cluster studies have been derived from single academic medical centers with unclear generalizability. They also have defined comorbidity clusters using data at one point in time without respect to the longitudinal accumulation of comorbidities.

Relatively little work has focused on determining the value of comorbidity clusters in the longitudinal modeling of clinical outcomes. We compared results for different ML algorithms employed to cluster patients based on comorbidities among RA patients in CorEvitas, a large US-based registry. We assessed how clusters of patients with given comorbidities predict future outcomes, including physical function and RA disease activity, and compared the prediction of outcomes using comorbidity clusters versus individual comorbidities. We hypothesized that the clusters derived by the ML algorithms would predict outcomes as well as individual comorbidities.

Methods

Study population and design

We used the CorEvitas RA registry to identify a cohort of potentially eligible patients. From this group, patients were required to have at least 6 years of follow-up in the registry, between 2011 and 2021, but patients could have entered CorEvitas before 2011. The first visit in the CorEvitas RA registry was considered baseline with follow-up through the last visit in the registry. The full longitudinal dataset was used to identify patients in comorbidity clusters during the first phase of these analyses. For the second phase, the comorbidity clusters were assessed using the first 3 years of consecutive available data with the next 3 consecutive years used to determine clinical outcomes.

Comorbidities of interest

The comorbidities of interest are collected at baseline and then updated in CorEvitas. These include conditions summarized in Supplemental Table 1. The list of comorbidities included is quite similar to what has been reported in prior papers examining frequent comorbidities in RA [2, 10]; this grouping of comorbidities has been found to be associated with relevant clinical outcomes in RA.

Comorbidities are recorded at the time of enrollment in the registry and updated by patients and clinicians at subsequent visits that typically occur twice per year. Since we focused on chronic comorbidities, comorbidities were treated as accumulating over time. In other words, if one of these chronic comorbidities (e.g., diabetes or coronary artery disease) was reported, then it was assumed to be ongoing at subsequent visits.
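
The carry-forward rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the registry's actual data processing: the function name `carry_forward`, the dictionary layout, and the example visits are all hypothetical.

```python
from typing import Dict, List


def carry_forward(visits: List[Dict[str, bool]]) -> List[Dict[str, bool]]:
    """Mark a chronic comorbidity as present at every visit after it is first reported.

    `visits` must be in chronological order; each visit maps a comorbidity
    name (hypothetical labels here) to whether it was reported at that visit.
    """
    seen = set()   # comorbidities reported at this visit or any earlier one
    result = []
    for visit in visits:
        seen.update(name for name, present in visit.items() if present)
        names = set(visit) | seen
        result.append({name: name in seen for name in sorted(names)})
    return result


# Hypothetical patient: diabetes reported at visit 2, coronary artery disease at visit 3.
visits = [
    {"diabetes": False, "coronary_artery_disease": False},
    {"diabetes": True, "coronary_artery_disease": False},
    {"diabetes": False, "coronary_artery_disease": True},
]
accumulated = carry_forward(visits)
print(accumulated[2])
```

In this toy example diabetes stays flagged at the third visit even though it was not re-reported there, which is the behavior the paragraph describes.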

Specific questions on comorbidities in CorEvitas changed in 2011. To determine the impact of changes in the collection of comorbidities, a secondary analysis was conducted only using participants who entered in 2011 or after (see Supplemental Table 2). The reporting of comorbidities appeared similar to the total cohort. Thus, this sub-analysis was not pursued further.

Outcomes

The first phase of analyses focused on deriving comorbidity clusters using ML algorithms; therefore, the clusters were the outcomes. The second phase focused on whether comorbidity clusters associated with future clinical outcomes. The clinical outcomes of interest in phase two were the clinical disease activity index (CDAI) and function as measured by the Health Assessment Questionnaire—Disability Index (HAQ-DI) [11, 12].

CDAI and HAQ-DI are measured at almost all visits in CorEvitas. CDAI is a continuous scale from 0 to 76 with well-accepted thresholds for different levels of disease activity [11]. CDAI includes four components: patient global arthritis activity (0–10), physician (assessor) global arthritis activity (0–10), tender joint count (0–28), and swollen joint count (0–28). Since we assessed the outcomes during the final 3 years of the study period, the time-averaged CDAI from those years was used as the primary disease activity outcome. The time-averaged CDAI was calculated as a weighted average of the CDAI, using the number of months between visits as the weighting factor. In other words, the CDAI at a given visit was multiplied by the number of months following that visit (until the next visit); each segment (CDAI × months) was added together and the sum was divided by 36 months. A secondary outcome was the change in time-averaged CDAI between the first 3 years and the second 3 years of the study period.
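
To make the arithmetic concrete, the following Python sketch computes a CDAI from its four components and then a time-averaged value over a 36-month window. It is a sketch under stated assumptions rather than the authors' code: the function names and visit values are hypothetical, the first visit is assumed to fall at month 0 of the window, and the last visit's score is assumed to carry through to the end of the window.

```python
from typing import List, Tuple


def cdai(patient_global: float, assessor_global: float,
         tender_joints: int, swollen_joints: int) -> float:
    """Sum the four CDAI components (total range 0-76)."""
    if not (0 <= patient_global <= 10 and 0 <= assessor_global <= 10):
        raise ValueError("global assessments must be between 0 and 10")
    if not (0 <= tender_joints <= 28 and 0 <= swollen_joints <= 28):
        raise ValueError("joint counts must be between 0 and 28")
    return patient_global + assessor_global + tender_joints + swollen_joints


def time_averaged(visits: List[Tuple[float, float]],
                  window_months: float = 36.0) -> float:
    """Weight each score by the months until the next visit, then divide by the window.

    `visits` holds (month_of_visit, score) pairs sorted by month, with months
    counted from the start of the window. Assumption: the final score applies
    until the end of the window.
    """
    total = 0.0
    for i, (month, score) in enumerate(visits):
        next_month = visits[i + 1][0] if i + 1 < len(visits) else window_months
        total += score * (next_month - month)  # one (score x months) segment
    return total / window_months


# Hypothetical visits at months 0, 6, and 18 of the 36-month outcome window.
example = [(0, cdai(3.0, 2.0, 4, 3)), (6, 8.0), (18, 4.0)]
print(time_averaged(example))  # (12*6 + 8*12 + 4*18) / 36
```

The same `time_averaged` helper would apply unchanged to HAQ-DI scores, since only the per-visit score differs.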

The HAQ-DI encompasses 20 items across eight domains (dressing and grooming, arising, eating, walking, hygiene, reach, grip, and activities), with each item scored 0–3 based on how much help is required to complete a given task [12]. The average score for each domain is calculated, and then the average across the eight domains is used as a summary. The same method was used for the HAQ-DI to assess outcomes during the final 3 years of the study period, using a time-averaged HAQ-DI. Just as with the CDAI, change in time-averaged HAQ-DI was considered a secondary outcome.
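
The two-step averaging described above (mean within each domain, then mean across the eight domains) can be illustrated as follows. This is a minimal sketch assuming that scoring rule as stated in the text; the function name `haq_di`, the per-domain item counts, and the item responses are hypothetical illustration values.

```python
from statistics import mean
from typing import Dict, List


def haq_di(domain_items: Dict[str, List[int]]) -> float:
    """Average item scores within each domain, then average the domain scores."""
    for domain, items in domain_items.items():
        if any(not 0 <= score <= 3 for score in items):
            raise ValueError(f"item scores in {domain!r} must be between 0 and 3")
    domain_scores = [mean(items) for items in domain_items.values()]
    return mean(domain_scores)


# Hypothetical responses for one visit: 20 items spread over the eight domains.
responses = {
    "dressing_and_grooming": [0, 1],
    "arising": [1, 1],
    "eating": [0, 0, 0],
    "walking": [0, 0],
    "hygiene": [1, 1, 1],
    "reach": [1, 2],
    "grip": [0, 0, 0],
    "activities": [0, 0, 0],
}
score = haq_di(responses)
print(score)  # domain means 0.5, 1, 0, 0, 1, 1.5, 0, 0 average to 0.5
```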

Statistical analyses

We assessed patient characteristics at baseline and at year 3 of follow-up and then examined the comorbidity distribution across the population throughout 6 years of longitudinal follow-up. During the first phase of this work, the results of five different ML algorithms for clustering the patients’ comorbidities over the 6-year period were examined; the ML algorithms included K-modes, K-means, divisive analysis hierarchical clustering (DIANA), agglomerative nesting hierarchical clustering (AGNES), and model-based clustering (VarSelLCM) [13, 14]. Three, four, five, and six clusters were each assessed. We chose 5 as the number of clusters for all clustering algorithms based on the “elbow” method from the K-modes clustering [15]. The data were clustered by patient. (For K-means, center = 5; for K-modes, modes = 5; for AGNES and DIANA, the tree was cut at k = 5. For VarSelLCM, we selected all comorbidity variables and assigned each patient to the group with the highest probability among the 5 groups.)
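
As an illustration of how one of these algorithms groups patients from binary comorbidity indicators, here is a small pure-Python K-modes sketch. It is not the authors' implementation (the analyses were run in R and SAS): the deterministic initialization, the function names, and the toy data with four indicators and two clusters are assumptions made for readability, whereas the study used 24 comorbidities and five clusters.

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[int, ...]


def hamming(a: Record, b: Record) -> int:
    """Count the positions at which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(records: List[Record], k: int, max_iter: int = 100):
    """Cluster categorical records around k modes using Hamming distance.

    Assumption: modes are initialized to the first k distinct records,
    which keeps this sketch deterministic.
    """
    modes = list(dict.fromkeys(records))[:k]
    if len(modes) < k:
        raise ValueError("need at least k distinct records")
    labels = None
    for _ in range(max_iter):
        # Assign each patient to the nearest mode.
        new_labels = [min(range(k), key=lambda j: hamming(r, modes[j]))
                      for r in records]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update each mode to the most frequent value of every indicator.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    cost = sum(hamming(r, modes[lab]) for r, lab in zip(records, labels))
    return labels, modes, cost


# Hypothetical patients with 4 binary comorbidity indicators each.
patients = [(1, 1, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0),
            (0, 0, 1, 1), (0, 0, 1, 1), (0, 1, 1, 1)]
labels, modes, cost = k_modes(patients, k=2)
print(labels, modes, cost)
```

The returned `cost` (total mismatches between patients and their cluster mode) is the kind of within-cluster quantity that can be compared across candidate values of k.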

In the second phase of this work, we compared the performance of the different clustering algorithms, with respect to their association with clinical outcomes. For all ML algorithms, the five-cluster solution was chosen based on statistical methods that look for an inflection point in the sum of squares [13] (see Supplemental Fig. 1). The two clinical outcomes selected were the time-averaged CDAI and time-averaged HAQ-DI. The clusters were defined using data from the first 3 years of follow-up and the clinical outcomes defined in the next 3 years.
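
One simple way to locate such an inflection point numerically is to look for the number of clusters at which the drop in within-cluster cost slows the most. The sketch below uses that second-difference heuristic as an assumption; the study judged the inflection from a plot (Supplemental Fig. 1), and the cost values here are hypothetical, not taken from the paper.

```python
from typing import Dict


def elbow(costs: Dict[int, float]) -> int:
    """Return the k where the decrease in cost flattens the most.

    For each interior k, compare the drop going into k with the drop
    going out of k; the largest difference marks the bend in the curve.
    """
    ks = sorted(costs)
    if len(ks) < 3:
        raise ValueError("need costs for at least three values of k")
    best_k, best_bend = ks[1], float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (costs[prev_k] - costs[k]) - (costs[k] - costs[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical within-cluster costs for 3 to 6 clusters.
hypothetical_costs = {3: 800.0, 4: 640.0, 5: 500.0, 6: 480.0}
chosen_k = elbow(hypothetical_costs)
print(chosen_k)
```

With only four candidate values of k, as in this study, only the two interior values can be flagged by this heuristic, which is one reason visual inspection of the curve remains useful.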

To understand the value of the different clusters, we compared the model fit for three sets of models. These included the following as independent variables: (a) only demographics and RA variables; (b) demographics, RA variables, and each comorbidity; (c) demographics, RA variables, and the clusters. This was repeated for each of the ML clustering algorithms. Sensitivity analyses considered sex-stratified models, models with only comorbidities recorded since baseline, and the secondary outcome (change in CDAI or HAQ-DI).
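
The fit statistics used to compare these model sets can be written out directly. The sketch below assumes the textbook definitions of adjusted R² and RMSE (with RMSE dividing by the number of observations; some software divides by the residual degrees of freedom instead), and the observed and predicted values are hypothetical.

```python
from math import sqrt
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error, dividing by the number of observations."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(observed))


def adjusted_r2(observed: Sequence[float], predicted: Sequence[float],
                n_predictors: int) -> float:
    """R-squared penalized for the number of predictors in the model."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical outcome values and model predictions.
observed = [2.0, 4.0, 6.0, 8.0, 10.0]
predicted = [3.0, 4.0, 6.0, 8.0, 9.0]
print(rmse(observed, predicted))            # sqrt(2 / 5)
print(adjusted_r2(observed, predicted, 1))  # R2 = 0.95 before adjustment
print(adjusted_r2(observed, predicted, 2))  # same fit, larger penalty
```

Because the adjustment penalizes each added predictor, a model that summarizes comorbidities with a single five-level cluster variable is charged for fewer terms than one that enters all 24 comorbidities separately.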

R (version 4.3.0) and SAS (version 9.4) statistical computing packages were used for all analyses.

Results

Among all RA patients in the CorEvitas RA registry, nearly 12,000 patients had accumulated the minimum 6 years of longitudinal data. Characteristics of the study cohort at baseline and year 3 are shown in Table 1. At baseline, patients were on average 59 (SD 12) years of age, 77% were women, CDAI was 11.3 (SD 11.9, moderate disease activity), HAQ-DI was 0.32 (SD 0.42), and disease duration was 10.8 (SD 9.9) years. Almost all reported current use of a DMARD at both baseline and year 3. In addition, the use of medications for common comorbidities was frequent. Table 2 shows the percentage of patients that reported a comorbidity over the 6-year study period. The median number of comorbidities at baseline was 2 (IQR 1, 3). As anticipated, cardiovascular comorbidities (i.e., coronary artery disease, hypertension, diabetes, and hyperlipidemia) were all common. Osteoporosis was reported in almost one-quarter of patients, acute kidney injury in one-sixth, and mental health issues in over half.

Table 1 Characteristics of patients with rheumatoid arthritis from the CorEvitas registry included in the analyses, at baseline and after 3 years of follow-up

Table 2 Comorbidities of patients with rheumatoid arthritis from the CorEvitas registry included in the analyses, at baseline and during follow-up

Clusters of patients were generated based on their comorbidities using five different ML algorithms; the five-cluster results were the focus as they appeared to best describe the data (Supplemental Fig. 1 and Supplemental Table 3a-e) [13]. Using 5 clusters, 24 comorbidities, and 11,883 subjects, each ML algorithm provided different clusters. K-modes and K-means gave similar results: both generated one cluster that had few patients who had few comorbidities; both generated one cluster with many patients having cardiovascular comorbidities; and both generated a cluster with mental health issues and fibromyalgia. The model-based clustering algorithm generated clusters with a broad distribution of comorbidities. Lymphoma and skin cancers were relatively more frequent in one cluster and fibromyalgia and mental health issues in another cluster. The two hierarchical methods gave similar answers to each other that were different from those of the first three methods. The DIANA and AGNES algorithms each created one cluster with much higher frequencies of all comorbidities.

To better understand the potential role of the different ML clustering algorithms in clinical research, we examined their relationship with two different outcomes: CDAI and HAQ-DI. The distributions of CDAI and HAQ-DI during the first 3 years and the subsequent 3 years were assessed (Fig. 1). At baseline, approximately 50% of CDAI scores were in the remission or low disease activity range and the remainder were evenly split between moderate and high disease activity. At follow-up, these proportions remained stable. For the HAQ-DI, at baseline, approximately 90% had scores of 1 or below (little or no assistance with typical activities). This remained stable at follow-up.

The regression models for time-averaged CDAI are shown in Table 3. The first model (far left) includes only demographics and RA variables and no comorbidities; the R² was 0.30 and RMSE 7.07. Adding all comorbidities individually yielded a slightly higher R² (0.33) and a slightly lower RMSE (6.95). When the five clusters from K-modes were substituted for the individual comorbidities, the model fit hardly changed. The same was observed for clusters generated by regression-based clustering and the DIANA hierarchical algorithm. The same pattern was observed for the time-averaged HAQ-DI endpoint (Table 4). The sensitivity analyses (sex-stratified analyses, baseline versus post-baseline comorbidities, and change in time-averaged CDAI or HAQ-DI) found very similar results (see Supplemental Tables 4, 5, and 6). However, the models with a change in time-averaged outcome had slightly better model fit than the models with the primary outcome.

Table 3 Multivariable regression models comparing models for time-averaged CDAI outcome, individual comorbidities versus comorbidity clusters

Table 4 Multivariable regression models comparing models for time-averaged HAQ-DI outcome, individual comorbidities versus comorbidity clusters

Discussion

It has become popular to attempt to examine how patients cluster based on their comorbidities in rheumatic diseases [6,7,8]. This is often achieved with some form of ML clustering algorithm. While clustering of patients based on comorbidities is intended to provide a deeper understanding of the heterogeneity of diseases, such as RA, the clusters are not always very interpretable; further, it is hard to gauge whether the clusters have provided more information than the individual comorbidities. To examine this issue, we used a very large longitudinal RA registry to characterize comorbidity clusters using ML algorithms. The clusters varied based on the ML algorithm used. To examine whether the different algorithms provided new information about the individual comorbidities, clinical outcomes models were assessed for CDAI and HAQ-DI. Model results demonstrated that the clusters performed similarly to each other and similar to models with individual comorbidities; this was true across outcomes.

Two recent studies, both using ML, have examined whether informative patient clusters based on comorbidities among patients with RA could be identified. The first one used data from a single-center registry and developed principal components which were then clustered using K-means clustering [7]. From 1443 patients, 5 clusters were determined that differed in disease activity, comorbidity scores, and outcomes such as infection. This study was limited in several important ways. In addition to it comprising only one academic rheumatology practice, cluster analyses were applied to baseline comorbidities only, without accounting for the development over time of additional comorbid conditions. Further, it was not clear what the clusters of comorbidities added incrementally over considering each comorbid condition individually. The second study, from the Mayo Clinic, included 1409 patients with RA [16]. Several ML algorithms were used, including hierarchical clustering, network analysis, and latent class analysis. Different methods yielded different numbers of clusters.

Machine learning algorithms permit relatively easy methods to cluster many variables across hundreds or thousands of patients. These methods have become popular in various types of high dimensional data, such as proteomics, transcriptomics, and genomics. It is natural for clinical researchers to import such methods into analyses of clinical variables; however, it is not clear whether these methods add much over more traditional analyses. While the current analyses did not suggest that comorbidity clusters explained much of the variation in CDAI or HAQ-DI, these clusters may be more useful in explaining other clinical outcomes, e.g., treatment response. Our findings suggest that ML clustering algorithms can be used on comorbidity data to define groups of patients based on the many varied conditions patients have other than RA. However, the resulting clusters can be difficult to interpret. Not surprisingly, the clusters derived from these ML algorithms are not better predictors than individual comorbidities, but they do produce clinical models with similar overall fit. Similar fit with the clusters is impressive and, since the clusters collapse 24 variables into five, this approach is statistically more efficient. However, the clusters cannot be produced without knowing the individual comorbidities. Thus, the value of clustering comorbidities in clinical analyses is not perfectly clear.

The fact that the clusters have similar value to the individual comorbidities suggests that the 24 different comorbidities may not need to be collected if the clusters are known. Since the clusters do not have clear face validity, it is not apparent that clinicians can recognize patients that occupy one cluster or another. Clustering algorithms can be used to describe phenotypes of heterogeneous disease, like RA; however, comorbidity data may not be that helpful for defining these sub-phenotypes. Further, the different ML algorithms defined different clusters, suggesting that the clustering did not describe a “biologic” truth; rather, it likely represented a statistical phenomenon.

This study has several important strengths, including a very large sample size with longitudinal data. In addition, the cohort is derived from many practices across the USA, including both community-based and academic rheumatology practices. Limitations include the fact that comorbidity reporting may be incomplete and not consistently defined across clinicians. Also, some of the comorbid conditions are self-reported and thus there is likely misclassification.

Conclusions

In conclusion, we defined clusters of RA patients based on comorbidities, using several ML algorithms. Different algorithms produced different clusters, many of which were hard to understand clinically. However, in clinical outcomes models, the clusters performed similarly to each other and to the individual comorbidities. Comorbidity clusters seem to be useful in clinical outcomes models due to their statistical efficiency. However, it is not clear that they provide new insights beyond the individual variables. While ML clustering algorithms have a clear role in multi-dimensional biologic data, their role in clinical research in rheumatology needs continued assessment. We recommend that future comorbidity clustering studies be designed with a clear purpose in mind for the clustering, such as identifying a small number of clusters that best predict future outcomes.

Availability of data and materials

Data are available from CorEvitas, LLC through a commercial subscription agreement and are not publicly available. No additional data are available from the authors. Qualified investigators can approach CorEvitas for permission to use the data.

Abbreviations

AGNES: Agglomerative nesting clustering
CDAI: Clinical Disease Activity Index
DIANA: Divisive analysis clustering
DMARD: Disease-modifying anti-rheumatic drug
HAQ-DI: Health Assessment Questionnaire Disability Index
ML: Machine learning
RA: Rheumatoid arthritis
RMSE: Root mean square error

References

1. Yoshida K, Lin TC, Wei MY, Malspeis S, Chu SH, Camargo CA Jr, et al. Roles of postdiagnosis accumulation of morbidities and lifestyle changes in excess total and cause-specific mortality risk in rheumatoid arthritis. Arthritis Care Res (Hoboken). 2021;73(2):188–98.
2. Dougados M, Soubrier M, Antunez A, Balint P, Balsa A, Buch MH, et al. Prevalence of comorbidities in rheumatoid arthritis and evaluation of their monitoring: results of an international, cross-sectional study (COMORA). Ann Rheum Dis. 2014;73(1):62–8.
3. Luque Ramos A, Redeker I, Hoffmann F, Callhoff J, Zink A, Albrecht K. Comorbidities in patients with rheumatoid arthritis and their association with patient-reported outcomes: results of claims data linked to questionnaire survey. J Rheumatol. 2019;46:564–71. https://doi.org/10.3899/jrheum.180668.
4. England BR, Yun H, Chen L, Vanderbleek J, Michaud K, Mikuls TR, et al. Influence of multimorbidity on new treatment initiation and achieving target disease activity thresholds in active rheumatoid arthritis: a cohort study using the Rheumatology Informatics System for Effectiveness registry. Arthritis Care Res. 2023;75(2):231–9.
5. Radner H, Yoshida K, Frits M, Iannaccone C, Shadick NA, Weinblatt M, et al. The impact of multimorbidity status on treatment response in rheumatoid arthritis patients initiating disease-modifying anti-rheumatic drugs. Rheumatology. 2015;54(11):2076–84.
6. Crowson CS, Gunderson TM, Davis JM 3rd, Myasoedova E, Kronzer VL, Coffey CM, et al. Using unsupervised machine learning methods to cluster comorbidities in a population-based cohort of patients with rheumatoid arthritis. Arthritis Care Res (Hoboken). 2023;75(2):210–9.
7. Curtis JR, Weinblatt M, Saag K, Bykerk VP, Furst DE, Fiore S, et al. Data-driven patient clustering and differential clinical outcomes in the Brigham and Women’s rheumatoid arthritis sequential study registry. Arthritis Care Res. 2021;73(4):471–80.
8. Demanse D, Saxer F, Lustenberger P, Tankó LB, Nikolaus P, Rasin I, et al. Unsupervised machine-learning algorithms for the identification of clinical phenotypes in the osteoarthritis initiative database. Semin Arthritis Rheum. 2023;58:152140.
9. Richette P, Clerson P, Périssin L, Flipo R-M, Bardin T. Revisiting comorbidities in gout: a cluster analysis. Ann Rheum Dis. 2015;74(1):142–7.
10. Aslam F, Khan NA. Tools for the assessment of comorbidity burden in rheumatoid arthritis. Front Med. 2018;5:39.
11. Aletaha D, Smolen J. The Simplified Disease Activity Index (SDAI) and the Clinical Disease Activity Index (CDAI): a review of their usefulness and validity in rheumatoid arthritis. Clin Exp Rheumatol. 2005;23(5 Suppl 39):S100–8.
12. Pincus T, Summey JA, Soraci SA Jr, Wallston KA, Hummon NP. Assessment of patient satisfaction in activities of daily living using a modified Stanford Health Assessment Questionnaire. Arthritis Rheum. 1983;26(11):1346–53.
13. Kodinariya TM, Makwana PR. Review on determining number of cluster in K-means clustering. Int J. 2013;1(6):90–5.
14. Marbac M, Sedki M. Variable selection for model-based clustering using the integrated complete-data likelihood. Stat Comput. 2017;27(4):1049–63.
15. Syakur MA, Khotimah BK, Rochman EMS, Satoto BD. Integration K-means clustering method and elbow method for identification of the best customer profile cluster. IOP Conf Ser Mater Sci Eng. 2018;336(1):012017.
16. Crowson CS, Gunderson TM, Davis JM 3rd, Myasoedova E, Kronzer VL, Coffey CM, et al. Using unsupervised machine learning methods to cluster comorbidities in a population-based cohort of patients with rheumatoid arthritis. Arthritis Care Res (Hoboken). 2023;75(2):210–9.

Acknowledgements

The authors would like to thank all the participating providers and patients in the CorEvitas Rheumatoid Arthritis Registry who contributed data to this study.

Funding

This study is sponsored by CorEvitas, LLC. DHS also receives support from NIH-NIAMS (P30 AR072577) which supports his time. CorEvitas has been supported through contracted subscriptions in the last 2 years by AbbVie, Amgen, Inc., Arena, Boehringer Ingelheim, Bristol Myers Squibb, Celgene, Chugai, Eli Lilly and Company, Genentech, Gilead Sciences, Inc., GlaxoSmithKline, Janssen Pharmaceuticals, Inc., LEO Pharma, Novartis, Ortho Dermatologics, Pfizer, Inc., Regeneron Pharmaceuticals, Inc., Sanofi, Sun Pharmaceutical Industries Ltd., and UCB S.A.

Author information

Authors and Affiliations

Division of Rheumatology, Brigham and Women’s Hospital, 60 Fenwood Road, Boston, MA, 02115, USA
Daniel H. Solomon, Hongshu Guan & Leah Santacroce
Harvard Medical School, Boston, MA, USA
Daniel H. Solomon
Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
Fredrik D. Johansson
CorEvitas, Waltham, USA
Wendi Malley, Lin Guo & Heather Litman

Contributions

DHS wrote the main manuscript text and HG prepared figures and tables. All authors reviewed and revised the manuscript.

Corresponding author

Correspondence to Daniel H. Solomon.

Ethics declarations

Ethics approval and consent to participate

CorEvitas registry is reviewed and approved by the Western IRB and all subjects have consented to have their de-identified data used for these types of analyses.

All participating investigators were required to obtain full board approval for conducting research involving human subjects. Sponsor approval and continuing review were obtained through a central IRB (New England Independent Review Board, NEIRB No. 120160610). For academic investigative sites that did not receive a waiver to use the central IRB, approval was obtained from the respective governing IRBs and documentation of approval was submitted to the Sponsor prior to initiating any study procedures. All registry subjects were required to provide written informed consent prior to participating.

Consent for publication

All subjects have given consent for publication.

Competing interests

DHS receives research support through Brigham and Women’s Hospital from CorEvitas, Janssen, Moderna, and Novartis. He also receives royalties for unrelated chapters in UpToDate. HG, LS, and FJ have no conflicts to report. HJL is an employee and shareholder of CorEvitas, LLC. WM and LG are employees of CorEvitas, LLC.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Supplemental Table 1. Comorbid Conditions.
Supplemental Table 2. Comorbidities of Patients with Rheumatoid Arthritis from the CorEvitas Registry Included in the Analyses, at Baseline and During Follow-Up, Restricting to Patients who Entered the Cohort after 2011.
Supplemental Figure 1. Assessing Inflection Points in Sum of Squares Relative to One Cluster.
Supplemental Table 3a. K Modes Clustering Results.
Supplemental Table 3b. K Means Clustering.
Supplemental Table 3c. Regression Based Clustering Algorithm.
Supplemental Table 3d. DIANA Agglomerative Hierarchical Clustering.
Supplemental Table 3e. AGNES Agglomerative Hierarchical Clustering.
Supplemental Table 4. Multivariable regression models comparing models for time averaged HAQ-DI outcome, sex-stratified.
Supplemental Table 5. Multivariable regression models comparing models for time averaged HAQ-DI outcome, baseline versus post-baseline comorbidities.
Supplemental Table 6. Multivariable regression models for change in time-averaged CDAI and change in time-averaged HAQ-DI as outcomes.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Solomon, D.H., Guan, H., Johansson, F.D. et al. Assessing clusters of comorbidities in rheumatoid arthritis: a machine learning approach. Arthritis Res Ther 25, 224 (2023). https://doi.org/10.1186/s13075-023-03191-8

Received: 31 July 2023
Accepted: 11 October 2023
Published: 22 November 2023
DOI: https://doi.org/10.1186/s13075-023-03191-8
