Introduction

The novel coronavirus SARS-CoV-2 spread worldwide within a few months of the beginning of 2020, becoming a global pandemic1. By the end of 2020, nearly 100 million people had confirmed infections, leading to more than 2 million deaths, exceeding the yearly deaths from lung cancer worldwide. This rapid outbreak has placed intense pressure on healthcare systems, creating an urgent demand for effective diagnostic, prognostic, and therapeutic procedures for COVID-19. Scientists and clinicians are addressing this call with remarkable effort and energy by collecting data and information across diverse domains, as evidenced by the thousands of articles published on the topic since the beginning of the outbreak38.

For large datasets, JADBio may decide to use a simple hold-out protocol. Notice that configurations are cross-validated as atoms, i.e., all the combined algorithmic steps are cross-validated simultaneously as one unit (a minimal code sketch of this principle appears at the end of this passage). This avoids serious methodological errors, such as performing feature selection on the whole dataset first and then estimating performance by cross-validating only the modeling algorithm (see [39, page 245] for an eye-opening experiment on the severity of this type of error). Once all decisions are made, JADBio searches the space of admissible configurations to identify the one leading to optimal performance20. The final model and the final selection of features are produced by applying the winning configuration to the full dataset; on average, this leads to the optimal model out of all tries. Thus, JADBio does not lose samples to estimation.

To estimate the performance of the final model, JADBio uses the Bootstrap Bias Correction estimation method, or BBC38. BBC is conceptually equivalent to adjusting p-values in hypothesis testing for the fact that many hypotheses have been tested; similarly, BBC adjusts prediction performance for the fact that many configurations (combinations of algorithms) have been tried. JADBio's performance estimation has been validated in large computational experiments20: on average, it is conservative. Hence, JADBio removes the need for an external validation set, provided it is applied to populations with the same distribution as the training data (it does not, however, remove the need to externally validate against other factors that may affect performance, such as systematic biases, batch effects, and data distribution drifts).

JADBio provides an option for an "aggressive feature selection" preference, which tries feature selection algorithms that on average return fewer selected features, at a possible expense in predictive performance. The feature selection algorithms may also return multiple selected feature subsets (signatures) that lead to equally predictive models, up to statistical equivalence based on the training set. Regarding the specific algorithms tried, for classification tasks JADBio employs Lasso40 and Statistically Equivalent Signatures (SES)41 for feature selection. Such algorithms not only remove irrelevant features, as differential expression analysis does, but also features that are redundant given the selected ones, i.e., features that carry no added informational value for prediction. Hence, feature selection considers features in combination, while differential expression analysis does not.
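To make the "configurations as atoms" principle concrete, the following is a minimal sketch using scikit-learn on synthetic data; it illustrates the general technique only, not JADBio's proprietary implementation. Wrapping a Lasso-style selector and a classifier in a single Pipeline ensures that feature selection is re-fit inside every training fold:

```python
# Cross-validating a configuration "as one atom": feature selection and
# modeling are wrapped in a single Pipeline, so the selector only ever
# sees the training portion of each fold.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=10, random_state=0)

config = Pipeline([
    # L1-penalized logistic regression as a Lasso-style feature selector
    ("select", SelectFromModel(LogisticRegression(
        penalty="l1", C=0.1, solver="liblinear"))),
    ("model", RandomForestClassifier(n_estimators=200, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(config, X, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {aucs.mean():.3f}")

# The error warned about above would instead be:
#   X_sel = SelectFromModel(...).fit_transform(X, y)    # uses ALL samples
#   cross_val_score(RandomForestClassifier(), X_sel, y) # optimistic estimate
```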
In addition, unlike Lasso, SES performs multiple rather than single feature selection: as its name suggests, it reports multiple feature subsets that lead to equally predictive models (up to statistical equivalence). This is important because it offers the designers of diagnostic assays and laboratory tests a choice among markers, allowing them to measure those that are cost-effective to measure reliably. For modeling, JADBio tries Decision Trees, Ridge Logistic Regression, Random Forests, and (linear, polynomial, and RBF) Support Vector Machines, as well as the baseline algorithm that always predicts the most prevalent class. We note, however, that this list is only indicative, as we constantly enrich JADBio's algorithmic arsenal. All of the above are transparent to the user, who is not required to make any analysis decisions. JADBio outputs (i) the (bio)signature, i.e., the minimal subset of features that ensures maximal predictive power; (ii) the optimal predictive model associated with the selected biosignature; (iii) estimates of the predictive performance of the final model along with its confidence interval; and (iv) numerous other visuals for interpreting the results. These include graphs that explain the role of the features in the optimal model (called ICE plots42), estimates of the added value of each feature in the model, residual plots, samples identified as "hard to predict", and others.
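For readers unfamiliar with ICE plots, the following is a minimal sketch using scikit-learn on synthetic data; it demonstrates the general technique, not JADBio's own rendering:

```python
# An ICE plot draws one curve per sample, showing how the predicted
# probability changes as one feature is varied while the rest are held fixed.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# kind="individual" requests ICE curves (one per sample) for feature 0
PartialDependenceDisplay.from_estimator(model, X, features=[0],
                                        kind="individual")
plt.show()
```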

Threshold selection and optimization for clinical application is an important issue. Using a default 0.5 threshold on the predicted probability is problematic for two reasons: (i) the probabilities output by Random Forests and similar machine learning models are not trustworthy; they indicate the relative risk of the patient but are not calibrated43. (ii) A 0.5 threshold assumes that the cost of a false negative prediction equals the cost of a false positive one. This is obviously not the case for severe COVID-19: falsely predicting patients as "non-severe" critically affects their survival, while patients falsely considered "severe" may merely receive stronger treatments and/or make unnecessary use of medical resources. This means that, for clinical applications, the classification threshold needs to be optimized. JADBio facilitates threshold optimization as follows: the circles on the model's ROC curve correspond to different classification thresholds, each reporting a different trade-off between the false positive rate (FPR) and the true positive rate (TPR). The user can click on a circle to select the threshold that optimizes this trade-off for the clinical application at hand. We note that the FPR, TPR, and all other metrics (along with confidence intervals) reported in each circle are also adjusted for multiple tries and the "winner's curse" using the BBC protocol, to avoid overestimation.
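The sketch below mirrors this workflow programmatically, assuming scikit-learn and synthetic data: it computes the ROC operating points and selects the threshold that minimizes an expected misclassification cost. The 5:1 false-negative-to-false-positive cost ratio is an illustrative assumption, not a value from the study:

```python
# Cost-sensitive threshold selection on a ROC curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# Each (FPR, TPR) point corresponds to one classification threshold,
# like the circles on JADBio's ROC curve.
fpr, tpr, thresholds = roc_curve(y_te, probs)

# Expected cost per sample: a missed severe case (false negative) is
# assumed to cost 5x a false alarm (false positive).
c_fn, c_fp = 5.0, 1.0
prev = y_te.mean()
cost = c_fn * (1 - tpr) * prev + c_fp * fpr * (1 - prev)
best = np.argmin(cost)
print(f"threshold={thresholds[best]:.2f}  "
      f"TPR={tpr[best]:.2f}  FPR={fpr[best]:.2f}")
```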

To avoid comparing predictive performance between training and test sets with different class balance, we employ only performance metrics that are invariant to the class distribution (class balance), i.e., the Area Under the ROC Curve (AUC) and the Average Precision (equivalent to the area under the precision-recall curve). For the purposes of the current analysis, all comparisons employ the AUC metric.
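As a minimal illustration of why AUC suits such comparisons, the sketch below (synthetic scores, assuming scikit-learn) checks that the AUC is essentially unchanged when the class mix of the evaluation set changes, since it estimates the probability that a randomly chosen positive sample scores above a randomly chosen negative one:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 1000)   # model scores of positive samples
neg = rng.normal(0.0, 1.0, 1000)   # model scores of negative samples

def auc(n_neg):
    """AUC when only the first n_neg negatives are kept."""
    y = np.r_[np.ones(len(pos)), np.zeros(n_neg)]
    s = np.r_[pos, neg[:n_neg]]
    return roc_auc_score(y, s)

print(auc(1000), auc(100))  # nearly identical despite a 10:1 imbalance

# Average Precision, the second metric used, is the area under the
# precision-recall curve.
y_all = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
print(average_precision_score(y_all, np.r_[pos, neg]))
```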

Datasets

Principal Case: Proteomic and metabolomic profiles from sera of COVID-19 patients with severe and non-severe disease were retrieved from Shen et al.'s publication9. Three datasets were downloaded: a training dataset (C1 Training) of 13 severe and 18 non-severe COVID-19 patients with 1638 features, a validation dataset (C2 Validation) of four severe and six non-severe COVID-19 patients with 1589 features, and another validation dataset (C3 Validation) of 12 severe and seven non-severe COVID-19 patients with 29 targeted features. Although the authors kindly agreed to provide all data and information, we could not validate our models on C3, as its values were obtained with a different technology (targeted metabolomics). In any case, the features measured in C3 only partially overlapped with those selected through AutoML, so such a validation would not have been relevant.

Case Study 1: Gene expression profiles from host/viral metagenomic sequencing of upper airway samples of COVID-19 patients were compared with those of patients with other viral and non-viral acute respiratory illnesses (ARIs). The samples of 93 COVID-19 patients, 100 patients with other viral ARIs, and 41 patients with non-viral ARIs were retrieved from Mick et al.'s publication29 and the GSE156063 dataset in the GEO database. Specifically, three datasets were used: (A) a training dataset of 93 COVID-19 patients and 141 patients with ARIs (viral and non-viral), with 15,981 features; (B) a training dataset of 93 COVID-19 patients and 100 patients with other viral ARIs, with 15,981 features; and (C) another training dataset of 93 COVID-19 patients and 41 patients with non-viral ARIs, with 15,981 features.

Case Study 2: RNA-sequencing profiles of nasopharyngeal swab samples from COVID-19 patients were compared with those of non-COVID-19 patients. The dataset, retrieved from Lieberman et al.'s publication30 and the GSE152075 dataset in the GEO database, contained 35,787 features from nasopharyngeal swab samples of 430 individuals with PCR-confirmed SARS-CoV-2 presence (COVID-19 patients) and 54 non-COVID-19 patients (negative controls).

Data preprocessing

The C1 and C2 cohort data of Shen et al. were used as publicly deposited. The raw Mick et al. data were preprocessed as in the original publication (variance stabilization), using the code provided by the authors. For the Lieberman et al. dataset, we followed the preprocessing performed in the original publication and filtered out of all analyses those genes whose average counts were 1 or 0.
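A minimal sketch of this low-count filter on a hypothetical genes-by-samples count matrix (the synthetic data and the reading of "average counts were 1 or 0" as a mean of at most 1 are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical genes-by-samples count matrix with illustrative values
rng = np.random.default_rng(0)
counts = pd.DataFrame(rng.poisson(2, size=(1000, 50)),
                      index=[f"gene_{i}" for i in range(1000)])

# Keep only genes whose average count across samples exceeds 1
kept = counts[counts.mean(axis=1) > 1]
print(f"kept {len(kept)} of {len(counts)} genes")
```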

Pathway analysis

The biological involvement and related pathways of the identified features were searched using GeneCards, the human gene database (https://www.genecards.org/).

Results and Software availability

The complete analysis results for each dataset, such as the configurations tried, the configurations selected to produce the final model, the estimation protocol, and the out-of-sample predictions of the cross-validations, are available via unique links in Supplementary Table 1. These results can be employed to compare JADBio against any other methodology, architecture, or algorithm on the same datasets. JADBio is available as a SaaS platform at www.jadbio.com, where a free trial version is offered. In addition, a free research license is granted to researchers who wish to reproduce results or compare against the current version of JADBio (restrictions apply); researchers can apply for this license by sending a request on the platform's webpage. JADBio mainly uses in-house implementations of individual, widely accepted algorithms (e.g., Decision Trees, Random Forests) and of unique algorithms, namely the SES multiple feature selection algorithm and BBC-CV for adjusting cross-validation performance for multiple tries. Open-source implementations of the latter two are available in the MXM R package41. Supplementary Table 6 reports all individual algorithms employed by JADBio.