Introduction

Assessment of treatment response in soft tissue sarcomas (STS) by conventional radiologic imaging has long posed unique set of challenges for clinicians [1,2,3,4]. Owing to their highly variable internal compositions, tumors undergoing a biologic response to chemotherapy may not actually diminish in size due to factors such as cystic degeneration, hyalinization, fibrosis, centralized necrosis, and intratumoral hemorrhage, all of which have the potential to affect estimations of whole-tumor volume [2, 3, 5,6,7,8,9,10]. Thus, appraisals of treatment response that depend on evaluations of tumor size—including the World Health Organization (WHO) response evaluation criteria and the oft-cited Response Evaluation Criteria In Solid Tumors (RECIST)—may fail to appreciate satisfactory biologic response to chemotherapy in tumors that do not demonstrate macroscopic shrinkage on radiologic imaging [2, 3, 7, 8, 11,12,13,14,15,16]. The Choi criteria and modified Choi criteria, which were later proposed in an effort to incorporate additional features such as changes in attenuation or signal intensity on CT or MRI, were shown to better correlate with pathologic response [7, 8, 14, 17,18,19]. However, the Choi criteria were notably not originally designed for STS and still rely heavily on size-based estimations, thus calling into question in their ability to accurately resolve complex architectural changes in STS, particularly in cases of synovial sarcoma [4, 7, 20]. In the age of targeted molecular therapies, there exists a growing need for modernized response criteria that more accurately reflect the scope of phenotypic heterogeneity [6, 13, 21, 22].

Radiomics is defined as the conversion of medical imaging into multi-dimensional mineable data for clinical decision support to bolster accurate diagnosis, prognostication, and prediction of treatment response [4, 23,24,25,26,27,28]. In comparison with standard biopsy techniques, radiomics analysis offers the advantage of being able to non-invasively quantify heterogeneity of entire tumor volumes at given time points of interest, which in theory should allow for better characterization of chemotherapeutic response than use of size-based criteria alone [6, 20,21,22, 26, 27, 29,30,31]. Radiomics has already been successfully applied to a variety of clinical applications related to STS, including stratification of benign from malignant soft tissue neoplasms, prediction of histologic grade, and assessment of metastatic risk [27, 31,32,33,34], though lack of standardized protocols has hindered widespread adoption of radiomics workflows in clinical practice [16, 24, 25, 35].

Standard-of-care typically encourages the use of anthracycline-based regimens as first-line chemotherapy in patients with newly diagnosed STS, which have demonstrated improved overall and metastasis-free survival in phase 3 clinical trials [2, 3, 29, 30]. Yet, ongoing research in sarcoma care remains limited in part due to the previously detailed shortcomings of traditional size-based response criteria, which calls into question their appropriateness for use as endpoints in clinical trials [2, 3, 20]. Thus, ongoing collaborations between leading agencies including the US Food and Drug Administration and the US National Cancer Institute have since called for the validation of quantitative imaging techniques to serve as surrogate biomarkers, as these may in fact more accurately reflect early biological changes in tumor physiology [6, 13, 20, 22]. In a previous pilot study [20], we were able to demonstrate that quantitative-MRI (q-MRI) evaluation of enhancing tumor volume was able to accurately stratify responders from non-responders in a small cohort of patients with histopathologically diagnosed STS treated with standard-of-care neoadjuvant chemotherapy (NAC). Therefore, based on studies correlating intratumoral heterogeneity on radiologic imaging with higher histologic grade and poorer patient outcomes [5, 23, 24, 29, 32, 33, 36], we hypothesized that change from baseline of radiomics metrics taken pre- and post-NAC (i.e., delta-radiomics) might be able to better predict response to NAC in STS. While a small body of evidence does suggest a role for radiomics-based predictive modeling in stratifying response to neoadjuvant therapy [30, 34, 37, 40] (Fig. 1). Subsequently, 1708 radiomics features were extracted from the 3D-ROIs using MATLAB® (MathWorks) software running our comprehensive institutional radiomics pipeline, which has been rigorously benchmarked against an Image Biomarkers Standardization Initiative (IBSI) phantom and reference values [41] (Fig. 2). Delta values were then calculated from the extracted radiomics features. Generally speaking, delta-radiomics capture either the change or the percent change in radiomics features across different points in time [4, 30, 34, 37, 21, 22], few studies have thus far investigated the utility of MRI-based radiomics features to serve as surrogate predictors of neoadjuvant response in STS [4] (Table 3). To the best of our knowledge, only one previously published study by Crombé et al. similarly utilized an MRI-based delta-radiomics approach for predicting treatment response specifically to NAC. In their procedure, the authors calculated the absolute change in 33 radiomics features in 65 patients with STS following anthracycline-based NAC, from which only a subset of pre-selected delta features was used to train 4 decision classifiers [30]. Likewise, though Peeken et al., Gao et al., and Miao et al. all suggest an ability for delta-radiomics–based decision classifiers to predict STS response to radiotherapy [34, 37, 38], these studies also employed feature reduction or recalling techniques prior to model training. While data filtering has become an unfortunately common practice to address high dimensionality in radiomics datasets, these approaches have the potential to induce information leakage. Information leakage further leads to disruption of test data independency, thereby resulting in problems of overfitting [28, 39]. We demonstrate these phenomena explicitly through the results of our filtered analyses, whereby restricting our machine learning inputs to only variables which were significant at the p ≤ 0.05 and p ≤ 0.01 levels in our univariate analyses yielded comparable AUCs to those reported by Crombé et al., Peeken et al., Gao et al., and Miao et al. [30, 34, 37, 38].

Table 3 Summary of characteristics of comparator studies

Publication bias has emerged as a growing area of concern among radiomics studies. As recently as 2018, Buvat et al. reported that a mere 6% of all PET radiomics studies in the published literature explicitly reported negative results [51]. Moreover, in a systematic review of 52 sarcoma-specific radiomics studies, Crombé et al. found that no studies specifically described negative findings [36], further highlighting the need for more balanced publication practices within the field. As discussed above, our result was not able to reproduce separation of neoadjuvant responders from non-responders using machine learning augmented MRI-based radiomics analyses [30, 34, 37, 38]. We believe this is in large part due to our more rigorous approach to our machine learning methodologies without reliance on data filtering and feature selection techniques featured in related works [27, 28, 39]. In particular, Crombé et al. even further report that they constructed their models by first selecting one feature per category and then increasing the number of included features in a “forward stepwise fashion” as determined by univariate p-values [30]. Such steps are not only unnecessary but actually bias and invalidate the results of modern machine learning approaches such as RF—which was notably their top performing classifier—as these algorithms are designed to work with high dimensionality datasets without pre-selection of so-called candidate features [27, 47, 52].

One other notable aspect of our study’s methodology was our inclusion of scans from multiple image acquisition centers. Issues with reproducibility in radiomics studies has garnered progressively more attention in recent years, as it has become increasingly clear that radiomics-based machine learning procedures based on single-center, single-vendor datasets generalize poorly to multicentric data pools [36, 48, 53, 54]. Moreover, as we have discussed in our prior work [27], databases derived from single-center cohorts are poorly reflective of modern clinical practice models [29, 48]. Thus, our study is in line with literature supporting the use of multicentric datasets in radiomics studies [16, 26, 27, 36, 53, 55], which theoretically would help mitigate confounding effects of signal noise introduced as a result of heterogeneity in acquisition parameters.

Finally, though the results of our machine learning process failed to reach overall statistical significance, we do note an increased representation of LTE-derived metrics in the univariate analyses, with 46.04% of all metrics reaching statistical significance at the p ≤ 0.05 level deriving from LTE-based computations. LTE-based measures belong to a group of spatial filtering techniques that reflect the properties of n x n-sized “convolution kernels” [56,57,58]. Using this method, spatial domain filters are generated from the vector products of one-dimensional convolution masks, each representing a different texture feature [58]. In the case of our institutional radiomics pipeline, LTE-based metrics accounted for 1472 individual radiomics features out of a total of 5585 features extracted from 9 separate texture families during the course of this study. This subset of our findings do support previously published data suggesting that spatial filtering techniques are well-suited to detect features indicative of tumor heterogeneity [26, 27], possibly as a consequence of more completely capturing voxel-to-voxel variation through the creation of neighborhood-based matrices [56, 58].

Our study was limited by several factors. First, while our study population was similar in size and composition to the cohort reported on by Crombé et al. [30], it is possible that our study was underpowered to detect a significant result, whereby 100 subjects is often regarded as the threshold sample size for radiomics studies [23]. Although feature selection can theoretically lower the cohort threshold size, we feel that routine use of these procedures should generally be avoided in radiomics studies for reasons as discussed thoroughly above. Thus, given the relative rarity of STS in the general population, multi-institutional collaborations may be necessary in future studies to accrue adequate sample sizes [4, 5, 10, 20, 27, 32, 37, 48]. Second, the retrospective nature of our data collection poses a risk for selection bias given that our subjects were screened for enrollment eligibility from a larger pool of cases discussed at our institution’s Orthopedic and Sarcoma Tumor Boards [59]. Third, though efforts are currently being made to standardize post-acquisition harmonization techniques [10, 24, 31, 60, 61], such applications lack general consensus regarding proper implementation and execution [26, 27]. Furthermore, while post-processing data harmonization techniques such as ComBat have shown some ability to ameliorate scanner and protocol variabilities in multicentric studies, such batch adjustment methods have limitations when used in small sample sizes and rely on stringent data distribution assumptions. [62]. Thus, these methods were of limited applicability to our dataset given concerns for adverse effects due to outliers as well as missing and skewed data distributions. Future efforts to validate post-processing methods aimed at mitigating signal instability across heterogeneous acquisition parameters will undoubtedly aid in the construction of large, multicentric datasets for future research. Additional future directions may also include focused studies correlating delta-radiomics changes with histologic subtype and histopathologic findings of percent necrosis, as well as those specifically focused on stratifying post-treatment changes related to specific chemotherapeutic regimens.

In conclusion, though our machine learning analyses did not show statistically significant separation of NAC responders from non-responders, we were able to identify increased representation of LTE-derived metrics in univariate analyses. These and other spatial filtering metrics may pose a promising area for future radiomics research due to their ability to more accurately reflect subtle variations in the imaging grayscale [26, 27, 56, 58]. Larger sample sizes in future cohorts are warranted so as to obviate the need for data reductive techniques, which carry with them an inherent risk of introducing information leakage and thus biasing the decision classifiers [28, 39].