Background

Cluster analysis

Cluster analysis (CA) is a statistical technique that helps reveal hidden structures by grouping entities or objects (e.g., individuals, products, locations) with similar characteristics into homogeneous groups while maximizing heterogeneity across groups [1, 2]. Entities or objects of interest are grouped together based on attributes that make them similar, with the final goal being to distinguish these entities or objects by clustering them into comparable groups and to separate them from differing groups. Conceptually, CA aims to identify cluster solutions that are relatively homogeneous within each group, leading to clusters that have high intra-class similarity, while maximizing heterogeneity between the groups, leading to low inter-class similarity across clusters. Geometrically, the objects within a cluster are close together, while the clusters themselves are far apart. CA is useful to identify groups when it is not clear which entity belongs to which group, and how many groups may best be used to cluster the entities; thus, CA helps to identify a latent structure within a dataset [1–3].

CA has been widely used in varied applications including finding a true typology, prediction based on groups, hypothesis generation, data exploration, and data reduction, that is, grouping similar entities into homogeneous classes, thereby organizing large quantities of information and enabling labels that facilitate communication [1, 4, 5]. Numerous specific examples of the use of CA have been reported in the literature, such as characterizing psychiatric patients on the basis of clusters of symptoms [6]; finding a group of genes that have similar biological functions [7]; or identifying medical patient groups most in need of targeted interventions [4, 5].

Less well investigated is the utility of CA in identifying macro-structures associated with changes in treatment outcomes documented in large healthcare claims databases. A particular challenge for the use of CA in healthcare claims datasets is that the distribution of healthcare expenditure data is commonly severely skewed, which complicates analyses [8, 9]. In spite of this challenge, CA may aid in identifying clusters of patients who experienced similar changes in costs of care before and after treatment, and particular interest may lie in focusing attention on consistently high-cost groups or groups for whom healthcare costs dramatically increase after a change in treatment. This study applied CA to patients with end-stage renal disease (ESRD) who were initiated on hemodialysis (HD), in order to characterize patterns of change in their healthcare costs before and after HD and to explore the feasibility of applying CA to highly skewed claims data.

Affecting an estimated 600,000–900,000 patients in the United States, chronic kidney disease (CKD) is a complicated clinical issue increasingly recognized as both a pressing public health concern and a growing worldwide epidemic [10–15]. Kidney function progressively declines in a proportion of patients with CKD, particularly without adequate therapy. However, often, even with adequate therapy, CKD eventually progresses to devastating ESRD [16]. Two types of dialysis are widely used: hemodialysis (HD) and peritoneal dialysis (PD). The most common and costly of the two, HD, uses a dialysis machine and a special filter called a dialyzer to clean blood outside of the body [17, 18]. The less common type is PD, a procedure in which blood is cleaned inside the body via the introduction of dialysate into the abdominal cavity [18].

Even though HD is the most expensive treatment for patients with ESRD [16, 17], little has been reported beyond the aggregate level on the economic impact of the transition to HD among ESRD patients who had not previously received dialysis [19]. Hence, examining healthcare cost patterns of patients with ESRD who initiated HD and classifying these patients into groups may provide useful information to healthcare decision-makers in relation to the cost burden of HD therapy. The objectives of this analysis were: 1) to apply CA techniques to an evaluation of change in all-cause healthcare costs in patients with ESRD before and after initiating HD; 2) to explore the feasibility of applying this method to an administrative claims database with highly skewed cost information; 3) to present clusters that show meaningful patterns of change of costs before and after initiating HD; and 4) to further examine these clusters to identify differences in comorbidities and other variables in the pre- and post-HD period, to see if different clinical or demographic patterns may explain the variations in overall costs across clusters.

Methods

Study design and data

This retrospective, cross-sectional, observational study with 2007 to 2011 data was conducted using the Truven Health Analytics’ MarketScan® Commercial Claims and Encounter and Medicare Supplemental Databases [20]. The MarketScan database, one of the most commonly used for health economics outcomes research (HEOR), is one of the largest administrative claims databases and provides information on healthcare costs and resource utilization in real-world settings. The databases reflect inpatient, outpatient, and outpatient prescription drug information for approximately 53 million employees and their dependents covered under commercial health insurance plans sponsored by more than 300 employers in the United States. This database provides detailed cost (payment) and healthcare utilization information for services performed in both inpatient and outpatient settings, in addition to standard demographic variables (i.e., age, sex, employment status, and geographic location). Medical claims are linked to outpatient prescription drug claims and person-level enrollment data through the use of unique enrollee identifiers [20]. The study did not require informed consent or institutional review board approval because all study data were accessed using techniques compliant with the Health Insurance Portability and Accountability Act of 1996. Thus, no identifiable protected health information was extracted during the course of the study.

Sample selection and patient population

Patients aged ≥18 years were included in the analyses if 1) the patient had at least one confirmed diagnosis of ESRD and 2) initiated at least 2 HD sessions between 2008 and 2010. An “index date” was defined as the first HD claim within that time span. Patients were excluded if they did not have continuous enrollment for the 12 months prior to (the “pre-” HD period) or 12 months following (the “post-” HD period) the index date (pre- and post-HD periods thus may have included data from 2007 or 2011 as relevant based on index date). Patients who had a transplant or underwent PD were not excluded, because of sample size and generalizability considerations. Therefore, some patients may have had PD or a transplant before their index HD, or may have switched to PD or had a transplant after their index HD. Diagnoses were based on International Classification of Disease, Ninth Revision, Clinical Modification (ICD-9-CM) codes. Codes considered to indicate ESRD included ICD-9-CM codes 404.02, 404.12, 404.92, 404.03, 404.13, and 404.93 (hypertensive heart and CKD without heart failure and with CKD Stage V or ESRD), as well as ICD-9-CM codes 585.5 (CKD Stage 5/ESRD) and 585.6 (ESRD) (Appendix 1 includes a full set of patient medical codes that qualified a patient for inclusion in this study). Persons receiving HD were identified using Healthcare Common Procedure Coding System, Current Procedural Terminology, and ICD-9 codes, which are listed in Appendix 1 [21–23].

Variables for clustering

The variables used for clustering were “all-cause medical costs”, or direct costs for each patient reported in the pre- and post-HD periods. All-cause medical costs included hospitalization, office, and emergency department visit costs for all purposes, including dialysis costs. Healthcare costs included payments from insurers as well as out-of-pocket costs paid by patients, including deductibles, copays, and coinsurance.

Variables for describing clusters

The variables for describing patients in clusters included gender (male or female), geographic region (Northeast, North central, South, or West), insurance type (Health Maintenance Organization [HMO] or Point-of-Service [POS] capitation, or Fee-for-Service [FFS]), age (stratified as 18–24, 25–34, 35–44, 45–54, 55–64, and ≥ 65 years), and the comorbidity measures: the Charlson Comorbidity Index (CCI), the Elixhauser Comorbidity Index (ECI), and the Agency for Healthcare Research and Quality’s (AHRQ) top 10 Clinical Classifications Software (CCS) categories. The CCI composite comorbidity score was calculated from medical records as a weighted sum of the presence of 19 documented health conditions including diabetes, peripheral vascular disease, or congestive heart failure. Weighting was accomplished by assigning a value of 1, 2, 3, or 6 to each appropriate comorbidity condition and summing these values; thus, higher values reflect greater comorbidity [24–26]. The ECI score was used to measure the burden of comorbid conditions not directly related to HD. ECI distinguishes 30 comorbid conditions identified using ICD-9-CM codes from complications by considering only secondary diagnoses unrelated to the primary diagnosis [27]. The mean ECI score for each cluster was determined; like the CCI, higher scores reflect greater comorbidity burden. The AHRQ CCS for the ICD-9-CM provides a system for classifying ICD-9-CM diagnoses or procedures into a manageable number of clinically meaningful categories. One use of the CCS method is to identify the most frequent types of conditions present in study populations. The single-level diagnosis CCS approach combines illnesses and conditions into 285 mutually exclusive categories [22, 28]. The same individual might receive a flag for as many CCS categories as the recorded diagnoses support. The CCS uses a broad definition for each disease and, unlike Charlson instruments, the CCS is reported to make little distinction regarding disease severity.
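The weighted-sum scoring described for the CCI can be illustrated with a short Python sketch. The weight table below is only an illustrative subset of conditions with weights of 1, 2, 3, or 6; the condition names, the dictionary, and the function name are assumptions made for this example and are not the coding algorithm used in the study.

```python
# Illustrative subset of condition weights (hypothetical names; NOT the
# full 19-condition mapping or the ICD-9-CM coding used in the study).
ILLUSTRATIVE_WEIGHTS = {
    "congestive_heart_failure": 1,
    "peripheral_vascular_disease": 1,
    "diabetes": 1,
    "hemiplegia": 2,
    "moderate_severe_liver_disease": 3,
    "metastatic_solid_tumor": 6,
}


def comorbidity_score(conditions, weights=ILLUSTRATIVE_WEIGHTS):
    """Sum the weights of the distinct conditions documented for one patient.

    Conditions missing from the weight table contribute 0, and a condition
    recorded more than once is counted only once.
    """
    return sum(weights.get(condition, 0) for condition in set(conditions))


# Hypothetical patient with three documented conditions.
patient = ["diabetes", "congestive_heart_failure", "metastatic_solid_tumor"]
print(comorbidity_score(patient))
```

Higher totals correspond to greater comorbidity burden, which is how the cluster-level mean scores are read in this analysis.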

Statistical analysis

The goal of these analyses was to cluster patients in terms of all-cause costs in the “pre” period and “post” period. Values for all-cause costs were normalized by subtracting the minimum from each value and dividing that difference by the range of all values. CA was conducted on normalized all-cause costs. Patients with similar cost patterns were “grouped” together into a set of clusters based on their costs in the pre- and post-HD period using different CA methods. Patterns of demographic information and comorbidities within each cluster were reviewed and compared/contrasted across clusters. Two major CA methods, K-means (non-hierarchical) and hierarchical CA with various linkage methods, were applied to normalized costs within the pre- and post-HD periods to identify clusters. PROC FASTCLUS and PROC CLUSTER procedures in SAS, Version 9.3, were used to conduct the cluster analyses. All other analyses were also performed using SAS, Version 9.3 [29, 30].
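The normalization step described above (subtract the minimum, divide by the range) is standard min-max scaling. A minimal Python sketch follows; the function name and the cost values are hypothetical, and the sketch assumes that pre- and post-HD costs are each rescaled separately.

```python
def min_max_normalize(values):
    """Rescale values to the interval [0, 1] as (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    if value_range == 0:
        # All values are identical, so the range is zero; return zeros
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(value - lowest) / value_range for value in values]


# Hypothetical annual costs in USD (not study data).
pre_hd_costs = [0.0, 15000.0, 60000.0, 900000.0]
print(min_max_normalize(pre_hd_costs))
```

Because the scaling is linear, it does not remove skewness: with one very large cost, most normalized values are compressed toward 0, which is part of what makes clustering skewed cost data challenging.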

Several important questions must be addressed when conducting CA [1], including: What measures of similarity should be chosen to compare the entities under consideration? How should clusters be formed? And what is the optimal number of clusters? Similarity between objects is most often assessed by a distance measure, with higher values (i.e., greater distances between cases) representing greater dissimilarity between entities. Various measures are available to express similarity or dissimilarity between pairs of objects. In these analyses, we used Euclidean distance, or straight-line distance between individuals in the database; this is the most commonly used type of similarity measure when analyzing ratio or interval-scaled data [31]. Mathematically, the Euclidean distance between any 2 entities, such as B and C, with regard to 2 variables, x and y, can be expressed by the following formula [31]:

$$ {d}_{Euclidean}\left(B,C\right) = \sqrt{{\left({x}_B-{x}_C\right)}^2 + {\left({y}_B-{y}_C\right)}^2} $$

The values obtained from comparing all entities on both x and y (in this case, pre- and post-HD costs) form a distance matrix capturing the distances between all pairs of entities.
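As an illustration of the formula and the resulting distance matrix, the following Python sketch computes pairwise Euclidean distances for a few points. The function names and coordinates are hypothetical; each point stands for one patient's (pre-HD, post-HD) normalized costs.

```python
import math


def euclidean(b, c):
    """Straight-line distance between two entities measured on (x, y)."""
    return math.sqrt((b[0] - c[0]) ** 2 + (b[1] - c[1]) ** 2)


def distance_matrix(points):
    """Symmetric matrix of distances between all pairs of entities."""
    n = len(points)
    return [[euclidean(points[i], points[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical (x, y) values for three entities (not study data).
entities = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
for row in distance_matrix(entities):
    print(row)
```

The matrix has zeros on the diagonal and is symmetric, so only n(n − 1)/2 distinct distances need to be stored for n entities.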

Clusters can be formed using either hierarchical or non-hierarchical methods. Hierarchical CA attempts to identify relatively homogeneous groups of cases based on selected characteristics using an algorithm that either agglomerates or divides entities to form clusters [32]. Agglomerative algorithms begin with each entity in a separate cluster; in each subsequent step, the two clusters that are most similar are combined to build an aggregate cluster. This process is repeated until all objects are finally combined into a single cluster. Once formed, clusters cannot be split, and similarity decreases during each step. A variety of “linkage” methods may be chosen to facilitate an agglomerative algorithm and define how similar or dissimilar any two clusters may be, including single-, complete-, or average-linkage methods, the flexible-beta method, McQuitty’s method, the centroid method, and Ward’s method (Table 1).

Table 1 Common agglomerative algorithms for forming clusters

In a divisive algorithm, analyses start with a single cluster containing all entities, which is then divided at each subsequent step into two additional clusters that contain the most dissimilar objects. Splitting continues until all observations are in a single-member cluster. The end product of either an agglomerative or divisive hierarchical clustering method is the construction of a hierarchy or structure depicting the formation of clusters.
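As a concrete illustration of the agglomerative approach, the following Python sketch merges the two closest clusters repeatedly, using single linkage (distance between the closest pair of members) as one example of a linkage rule. It is a naive teaching sketch with hypothetical names and data, not the algorithm implemented by PROC CLUSTER, and it is far too slow for a dataset of the size analyzed here.

```python
import math


def single_linkage(points, n_clusters):
    """Naive agglomerative clustering; returns lists of point indices."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Start with every entity in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two most similar clusters; merges are never undone.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical points forming two well-separated pairs (not study data).
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.1)]
print(single_linkage(points, 2))
```

Replacing the `min` in the linkage step with `max` would give complete linkage; other rules such as Ward's method use a different merge criterion altogether.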

The K-means method is the primary example of non-hierarchical CA. In contrast to hierarchical analyses, non-hierarchical approaches do not involve the construction of groups via iterative division or clustering; instead, they assign objects into clusters once the number of clusters is specified. To accomplish this, starting points (or cluster seeds) for each cluster must be identified, and each observation is assigned to one of the cluster seeds via some process or algorithm. In K-means CA, “k” points are entered into the space represented by the entities being clustered; these points represent initial group centroids [33]. The n observations are then partitioned into k clusters in which each observation belongs to the cluster with the nearest mean. Once all objects have been assigned, the positions of the k centroids are recalculated. These steps are repeated until the centroids no longer move, yielding a separation of the objects into groups from which the metric to be minimized can be calculated. Both hierarchical and K-means CA methods have their strengths and weaknesses (Table 2), and they are sometimes used in complementary fashion to converge upon an optimal cluster solution.

Table 2 Strengths and weaknesses of hierarchical and K-means CA methods
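The K-means steps described above (seed, assign to the nearest centroid, recompute centroids, repeat) can be sketched in plain Python as follows. This is a minimal illustration with hypothetical names, seeds, and data; it is not the PROC FASTCLUS implementation, whose seed selection and stopping rules differ.

```python
import math


def k_means(points, seeds, max_iter=100):
    """Basic K-means; returns (labels, centroids)."""
    centroids = [tuple(seed) for seed in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins the nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # If no assignment changed, the centroids would not move either.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


# Hypothetical observations and seeds (not study data).
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = k_means(points, seeds=[(0, 0), (10, 10)])
print(labels, centroids)
```

The result depends on the seeds supplied, which is one reason K-means is sometimes run alongside hierarchical methods when choosing a final solution.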

The process of conducting CA leads to a set of decisions related to the CAs performed: which method is best, and what is a reasonable number of clusters to form? In this regard, there is no right or wrong approach; ultimate consideration is given to developing a model that not only represents the data appropriately, but can be easily interpreted and understood in the context of the entities investigated. Thus, successful CA requires experience and perspective to inform the selection of meaningful clusters. In this study, a final model was chosen based on the following criteria: 1) to yield a meaningful number of clusters, the solution could not have too few observations (<10) in the smallest cluster or too many small clusters; 2) the clustering pattern had to be interpretable; and 3) the number of clusters had to be reasonable for further analysis. Selecting the number of clusters can be aided by maximizing key statistical elements of the CA: larger values of the Pseudo-F Statistic (PsF) [34] and the Cubic Clustering Criterion (CCC) [35] suggest better model fit in terms of number of clusters [29, 30, 36].
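To make the role of the PsF concrete, the sketch below computes it in its usual form, the between-cluster sum of squares divided by (k − 1) over the within-cluster sum of squares divided by (n − k). The function name and data are hypothetical, and the sketch is meant only to show why larger values indicate better-separated clusters; it does not reproduce SAS output, and the CCC is not covered.

```python
def pseudo_f(points, labels):
    """Pseudo-F statistic: [BSS / (k - 1)] / [WSS / (n - k)]."""
    n = len(points)
    cluster_ids = sorted(set(labels))
    k = len(cluster_ids)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")
    dims = range(len(points[0]))
    grand_mean = [sum(p[d] for p in points) / n for d in dims]
    between = 0.0  # between-cluster sum of squares (BSS)
    within = 0.0   # within-cluster sum of squares (WSS)
    for cluster_id in cluster_ids:
        members = [p for p, lab in zip(points, labels) if lab == cluster_id]
        centroid = [sum(p[d] for p in members) / len(members) for d in dims]
        between += len(members) * sum(
            (centroid[d] - grand_mean[d]) ** 2 for d in dims)
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in dims)
    if within == 0:
        return float("inf")  # every point sits exactly on its centroid
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical data: two tight, well-separated clusters (not study data).
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(pseudo_f(points, [0, 0, 1, 1]))
```

Comparing this value across candidate solutions (for example, 3, 4, and 5 clusters) is one input into the choice, alongside cluster sizes and interpretability.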

Results

Patients

After the entry criteria for this study were applied, a total of 18,380 individuals were identified from an initial 140,720 individuals in the MarketScan Database (Fig. 1). The average age was 63.2 years (standard deviation [SD] = 14.1); 46 % were aged ≥65 years, and 29 % were aged 55 to 64 years. Of the total individuals, 58 % were males, 84 % had FFS insurance plans, and 14 % had HMO or POS capitation plans. At baseline, average ECI scores were 5.8 (SD = 2.6) in the full sample and CCI scores were 4.6 (SD = 2.3); at follow-up, ECI scores had increased to 7.1 (SD = 3.0) while CCI scores had increased to 5.3 (SD = 2.4).

Fig. 1 Patient selection diagram. Abbreviations: ESRD, end-stage renal disease; HD, hemodialysis

Overall costs, pre- and post-HD periods

Medical costs for all patients during the pre- and post-HD periods are summarized in Table 3. We defined annual medical costs ≤ $50,000 as “average”, $50,001 to ≤ $500,000 as “high”, and > $500,000 as “very high”.
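The cost categories defined above amount to a simple threshold rule, sketched here in Python. The function name is hypothetical; the thresholds are those stated in the text, and the example values are median costs reported for the clusters in this study.

```python
def cost_category(annual_cost_usd):
    """Label an annual all-cause medical cost using the study's thresholds."""
    if annual_cost_usd <= 50_000:
        return "average"
    if annual_cost_usd <= 500_000:
        return "high"
    return "very high"


# Median costs reported in the text for three of the clusters.
for cost in (15_168, 57_909, 910_930):
    print(cost, cost_category(cost))
```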

Table 3 All-cause medical costs in the 12-month baseline and follow-up periods

Clustering techniques

Hierarchical CA with the average, centroid, single-linkage, complete-linkage, and McQuitty’s similarity methods led to cluster solutions that included clusters with unreasonable sample sizes (i.e., these methods were prone to creating very small clusters with <10 observations; Table 4). Both K-means CA and hierarchical CA with either the flexible-beta method or Ward’s method yielded reasonable solutions. However, the K-means solutions were more meaningful and more easily interpreted, particularly for solutions with fewer than 5 clusters, for which both Ward’s method and the flexible-beta method generated at least one cluster with large variation, which is not helpful in practice (Appendix 2, Appendix 3, and Appendix 4, respectively).

Table 4 Summary of results from clustering analysis methods applied

Upon inspection, the best K-means solution included 4 clusters (Fig. 2). More formal criteria associated with each of the K-means solutions suggested that 4 clusters yielded maximum separation between clusters (4-cluster solution: PsF = 13,979.98; CCC = −63.928 compared with PsF = 10,502.25 and CCC = −99.702 for a 3-cluster solution, and PsF = 13,109.62 and CCC = −70.634 for a 5-cluster solution). Empirically, the 4-cluster solution was judged to be more appropriate and more easily interpretable than either the 3- or 5-cluster solution. Thus, a 4-cluster K-means solution was chosen for further investigation (Fig. 3). The 4 clusters in this model included a cluster with average costs pre-HD and high costs post-HD (Cluster 1: Average to High); a cluster (the smallest) with very high costs in the pre-HD period and high costs in the post-HD period, along with a substantial decrease in average cost from pre- to post-HD (Cluster 2: Very High to High); a group (the largest) exhibiting average costs in both the pre- and post-HD periods, with a small decrease in average costs from baseline to follow-up (Cluster 3: Average to Average); and finally, the second largest group, exhibiting “high” costs in both the pre- and post-HD periods, along with relatively sizeable cost increases from baseline to follow-up (Cluster 4: Increasing Costs, High at Both Points). Figure 3 and its corresponding table summarize the cost changes between the 12-month pre- and post-HD periods. In Cluster 1 (Average to High), median costs increased from $185,070 to $884,605; in Cluster 2 (Very High to High), median costs decreased from $910,930 to $157,997; in Cluster 3 (Average to Average), median costs remained low and relatively stable, moving from $15,168 to $13,026; and in Cluster 4 (Increasing Costs, High at Both Points), median costs increased from $57,909 to $193,140.

Fig. 2 Scatter plot by cluster of all-cause medical costs in pre- and post-HD periods by K-means CA with a four-cluster solution. Footnote: Pseudo-F Statistic = 13,979.98; Approximate Expected Over-All R² = 0.79; Cubic Clustering Criterion = −63.93. Each cluster is labeled by its corresponding number

Fig. 3 All-cause medical costs in pre- and post-HD periods by cluster

Basic demographic information and clinical characteristics of the sample divided into the four clusters suggested by K-means analysis are summarized in Table 5; the top 10 CCS disease categories in the baseline and follow-up period for each cluster are reflected in Appendix 5. Patients in Cluster 3 (Average to Average) (i.e., those with stable average costs before and after initiating HD) tended to be older, with an average age of 63.9 years compared with an average age of 55.5 through 57.6 years in the other three clusters. Otherwise, there was little to no meaningful difference across clusters in terms of gender, region of residence, or health insurance type (Table 5). Economically, Clusters 1 (Average to High) and 4 (Increasing Costs, High at Both Points) were both associated with increasing costs from pre- to post-HD. Clinically, substantial increases in comorbidity scores, including both the ECI and the CCI, were observed from baseline to the follow-up period in both these groups. In contrast, Cluster 2 (Very High to High) experienced a reduction in costs after starting HD, from very high to high costs, and both ECI and CCI scores were relatively stable after initiating HD. In addition, relatively stable ECI and CCI scores were reported in Cluster 3, where stable average costs before and after HD were identified. Cluster 3 (Average to Average) exhibited notably low comorbidity scores during the post-HD period when compared with the three other clusters (Table 5).

Table 5 Demographic and clinical characteristics of patients grouped into 4 proposed clusters using K-means CA

Discussion

In this retrospective observational analysis of claims data from commercially insured ESRD patients initiating HD, CA successfully revealed a latent structure underlying all-cause cost data before and after the start of HD. Several clustering techniques were applied, including both K-means CA and a set of hierarchical clustering analyses with multiple agglomerative algorithms that included average, centroid, single- and complete-linkage methods; McQuitty’s similarity method; and both the flexible-beta and Ward’s methods. Models generated by both K-means CA and hierarchical CA with the flexible-beta and Ward’s methods produced clusters of reasonable sample size. K-means CA yielded the most informative categorization of patients, generating more reasonable clusters from a practical perspective than did the other statistical methods. In addition, the K-means solutions were the most easily interpreted. In contrast, Ward’s and the flexible-beta methods led to solutions with at least one cluster with large variability (or spread), which can be difficult to interpret. Among the models suggested by K-means CA, a 4-cluster solution appeared to be the most appropriate for these data: associated criteria suggested that a 4-cluster solution offers maximum separation of clusters compared with either a 3- or 5-cluster solution. In addition, the 4-cluster solution was more interpretable, and thus more appropriate to apply, than the other solutions.

Mean all-cause medical costs in this sample of privately insured patients ranged from approximately $45,000 (USD) prior to the initiation of HD to $49,000 (USD) after; median costs ranged from $17,000 in the 12 months before HD initiation to $16,000 in the 12 months following HD initiation. Interestingly, these reported costs are generally lower than those found in other analyses in other populations. In 2004, the average annual Medicare expenditure for an ESRD patient started on HD was reported to be $72,000 (USD) [37], increasing to $77,500 (USD) in 2012 [11]. Other estimates suggest annual all-cause costs for HD patients to be as high as $174,000 (USD) in a privately insured population [17]. It is worth noting that the current results reflect payment from insurance claims made in the “real-world setting”. Importantly, a switch to HD from no dialysis in the present data set was only associated with a modest increase in average and median annual costs for ESRD patients on the whole, suggesting that the transition to HD does not generally add substantial costs to average annual care for a patient and may be associated with quite similar costs for the majority of late-stage patients with renal disease in comparison to their cost of care immediately before initiating HD. It is interesting to note that in both the pre- and post-HD assessment periods, 75 % of patients had costs below the average of $45,000 and $49,000 (USD), respectively. Thus, it appears that a relatively small fraction of patients is driving the overall increase in costs after initiating HD, a contention supported by CA.

More specifically, CA demonstrated that the data could be reasonably represented by 4 clusters of patients: those with average costs before and after initiating HD (90 % of the full sample); those with high costs before and high/increased costs after (8 %); those with average costs who incur high costs after initiating HD (0.6 %); and a cluster with very high costs prior to initiating HD who see their annual costs reduced to a high level (0.5 %). Thus, overall costs stay stable for most ESRD patients initiating HD, suggesting transition to HD per se is not an important driver of cost for the majority of patients. A minority of patients drive an increase in overall costs after HD initiation.

Because of the different cost patterns in each group, it is worthwhile to better understand patients in each cluster to help predict and contain the costs of HD. Comorbidities seem to be particularly relevant to costs, with increasing comorbidity scores from baseline to follow-up periods in those clusters associated with an increase in costs during follow-up, and more stable comorbidity scores associated with more stable costs (or even declining costs). This is consistent with other research: one study demonstrated that an increased level of comorbidity was associated with higher cost in the 2 years prior to starting HD [13], while another demonstrated a clear relationship between CCI scores and costs [38]. These data suggest timely management of comorbidities or the prevention of comorbidities may be critical for containing costs in patients starting HD. Interestingly, the older age of the patients in the most stable cost cluster (i.e., Cluster 3) suggests that there may be a difference in expression of ESRD in these patients compared with the other clusters, perhaps a factor that manifests itself as both a later-in-life need for HD as well as better overall health (e.g., fewer comorbidities).

In aggregate, costs are high at an absolute level, both before and after the initiation of HD, suggesting that the healthcare costs of the majority of ESRD patients not treated with HD are not substantially lower than the costs of care for these patients immediately after starting HD. Thus, HD does not add substantial costs for most patients and seems like an economically feasible option in most patients with CKD, given the overall high cost of care for these patients prior to initiating HD. True cost containment for patients with ESRD likely requires more aggressive or widespread intervention before patients reach this advanced stage of disease, where costs are high before and after HD. One overall strategy that may reduce costs includes early referral to a nephrologist in the period before starting HD [16]. HD is not an important cost driver for the majority of patients, so limiting HD may not contain costs for these patients. There is a need to better understand the fraction of the population that is driving higher post-HD costs, and consider ways to mitigate the costs associated with their transition to HD.

Limitations

Interpretation of these results must be informed by limitations of these analyses. First, these analyses were conducted only in those employed individuals with commercial insurance coverage and some individuals with Medicare coverage; thus, these results from a relatively healthy population may not be fully generalizable to individuals with Medicare, Medicaid, other insurance, or no insurance. Second, administrative claims data cannot capture deaths and changes of employment; therefore, costs not captured because of loss to follow-up may lead to selection bias. In addition, administrative claims data are not collected for research purposes, and measurement error may have been introduced by coding that was in error or driven by reimbursement needs more so than research needs. Further, administrative claims data do not include clinical information that would have been a valuable addition to these analyses, such as laboratory test results or vital signs. Access to patients’ claims prior to their enrollment in MarketScan databases is not available. Retrospective analysis limits the study to those who are clinically diagnosed and incur health care resource utilization through claims; resource utilization not identified by claims would not be included in these analyses. Finally, future studies should examine which cost drivers may have influenced increases or decreases in costs for each cluster.

Conclusions

CA was a useful statistical technique for evaluating a claims data set that included skewed healthcare cost data. One implication of these analyses is that costs for most patients with ESRD stay relatively stable after starting HD; a minority of patients drive overall increasing annual costs after initiation of dialysis. These increasing costs may be driven, in part, by a greater comorbidity burden among these patients.