1 Introduction

Infectious pulmonary disease (PD) is one of the primary causes of morbidity and mortality [1, 2]. It degenerates lung tissue and obstructs airflow, leading to breathing difficulty and, in many cases, death. According to the World Health Organization (WHO), PDs such as coronavirus disease (COVID-19), tuberculosis (TB), pneumonia, and lung cancer took millions of lives worldwide during 2019–2020 [3, 4]. Early and accurate detection of these diseases is crucial to reducing the high mortality rate. In clinical practice, the chest X-ray (CXR) image is one of the most popular diagnostic tools for making radiological decisions before medication and invasive surgery. However, the current interpretation procedure for CXR images is manual, time-consuming, and demands high perceptual and cognitive effort [5]. It also suffers from high inter-observer and intra-observer variability and from missed pathologies due to subtle anatomical structures [5,6,7,8]. To overcome these shortcomings, several automatic computer-aided diagnosis (CAD) systems have been proposed. These systems use machine learning (ML) and deep learning (DL) approaches to perform quantitative analysis and deeper visualization of such progressive lung diseases. However, in many cases, the overlapping of radiographic findings with other seasonal abnormalities and normal anatomical structures leads to missed diagnosis [9, 10], resulting in a lack of clinical acceptability and trust in automation. An important reason for this is the lack of integration of patient symptomatology, clinical pathologies, and radiologist feedback with the CAD system [5, 11,12,13].

To improve the diagnostic performance and robustness of the CAD system, appropriate integration of the patient’s metadata and the expert’s feedback with the CAD system is required [11, 12, 14]. In this direction, Ai et al. [15] performed a correlation study between radiological findings and RT-PCR (reverse transcription-polymerase chain reaction) tests to detect COVID-19. The results demonstrate that 97% of COVID-19-positive patients exhibit chest CT findings. Similarly, Nair et al. [16] suggested different imaging factors to be considered while designing a COVID-19 detection system using radiographic images. Chen et al. [17] tabulated the clinical and epidemiological characteristics of 99 COVID-19 patients. In addition to the studies discussed above, the clinical importance of incorporating metadata and expert feedback into a CAD system is justified by the fact that different PDs exhibit different symptomatology, pathology, and imaging findings depending on the underlying cause of inflammation, which provides valuable information for differential diagnosis [9, 18]. However, to the best of the authors’ knowledge, only a few studies reported in the literature have focused on integrating patient metadata and expert feedback with CAD systems. This may be due to several imaging-side and integration-side challenges that need to be addressed beforehand to achieve the desired integration goal.

The identified imaging-side challenges include complex and subtle radiographic responses [19, 20]; inherent quantum noise [21]; varying density of bones, soft tissues, and other organs [22]; different X-ray projections; and algorithmic limitations such as the optimal selection of learning parameters and kernel functions. The integration-side challenges, on the other hand, include paper-based patient records, the categorical nature of the patient’s metadata (like pain, cough, headache), collecting histopathological information from physically challenged patients, the unstructured nature of the collected data, and the selection of an appropriate weight for each decision criterion. The effect of the imaging-side challenges can be reduced by selecting optimal texture-preserving denoising filters; adjusting the X-ray energy, beam filter, or current appropriately; and training supervised models with larger datasets and tuned parameters. The integration-side challenges can be minimized through channelized information exchange between the various departments of the healthcare system. Moreover, the literature has not considered the integration of patients’ symptoms, clinical findings, and expert feedback, which play an essential role in discriminating abnormalities that exhibit similar imaging responses.

This paper aims to establish a channelized information exchange between patients, the CAD system, and experts in different healthcare departments. To streamline this exchange, we propose an inverted pyramid approach. Subsequently, three integration frameworks, (i) direct integration, (ii) rule-based integration, and (iii) weight-based integration, are proposed to integrate the patient’s symptoms, clinical pathologies, and expert’s feedback with the CAD system. The proposed system enables a comprehensive analysis of patient health data and provides an aggregated diagnostic decision. Furthermore, it helps grade patient severity to prioritize treatment and allocate scarce medical resources. The rest of the paper is outlined as follows. Section 2 presents the recent and relevant literature; Sect. 3 describes the various materials and methods used in this study and the proposed integration frameworks. The detailed experimental results and discussion are presented in Sect. 4, followed by the conclusion in Sect. 5.

2 Literature review

This section presents an overview of relevant studies related to automatic CAD systems, channelized information exchange, and the feasibility of integrating patients’ metadata and expert feedback with the CAD system.

In recent years, several automatic CAD systems have been reported to address the shortcomings of manual interpretation procedures [23,24,25,26,27]. Oh et al. [23] proposed a patch-based convolutional method for detecting COVID-19 using a limited set of training images. Li et al. [24] introduced COVNet to extract visual features from CT (computed tomography) images to classify COVID-19 and community-acquired pneumonia. A noteworthy contribution by Ke et al. [27] used heuristic algorithms to detect inflammatory lung disease in CXR images. Chandra et al. [6] presented a hierarchical feature extraction method for detecting TB. Jaiswal et al. [28] discussed a deep learning method for pneumonia detection and localization using CXR images. Li and Shen [29] proposed a technique based on solitary features to detect lung nodules in CXR images. However, in the above literature, the CAD systems were designed to identify a single abnormality, overlooking other co-existing pathologies. Moreover, some studies have reported radiologist-level performance with their proposed CAD systems, yet these systems still lack clinical acceptability and trust in automation [6].

To overcome the above shortcomings, digitization, channelized communication, and appropriate integration of patient health records are crucial [5, 11,12,13]. Such integration helps pulmonologists correlate histopathology with radiological findings and allows oncologists to perform a multidisciplinary investigation of molecular signatures to provide specific therapy [34]. Unfortunately, the different departments in the healthcare industry are not well formalized and lack an appropriate protocol for coordination and information exchange [12, 35, 36]. Several studies reported recently in the literature reveal the need for channelized information exchange between the different departments involved in the diagnostic procedure [6, 9, 14, 37]. The literature summarized in Table 1 also reveals that the CAD system exhibits comparatively limited performance (in terms of area under the curve (AUC)) when used in isolation. Integrating patients’ metadata, clinical findings, and radiologist feedback could contribute to a more accurate prognosis and reduce the chances of misclassification. The improved performance is due to supplementary information that enables a multidisciplinary analysis of different lung diseases.

Table 1 Literature related to the integration of patient symptoms, clinical findings, radiologist feedback, and CAD system. (Abbreviations: CC clinical data-CAD, PC patient symptoms-CAD, CR CAD-radiologist feedback, PCR patient-CAD-radiologist, CT computed tomography, ACC accuracy, AUC area under curve)

In this direction, Jorritsma et al. [5] performed an investigatory study on radiologist-CAD interaction and suggested different ways to improve CAD performance and trust calibration. Welter et al. [12] worked on the generic integration of content-based image retrieval (CBIR) with clinical information to fill the integration gap and integrate CBIR into CAD. Another noteworthy contribution by Zhang et al. [11] presented an AI (artificial intelligence) system for the automatic detection of COVID-19. The authors integrated clinical metadata with the AI system to develop a clinical prognosis model that assigns severity scores to patients. However, this method uses complex clinical parameters that require additional tests, resulting in a significant delay in diagnosis. Singh et al. [13] described the integration of radiologist feedback with the CAD system to improve the system’s performance and clinical usability; however, this method does not consider patient symptoms. Sverzellati et al. [10] proposed an integrated radiological algorithm to triage the massive load of the COVID-19 pandemic using CT images. However, this method is designed for a specific scenario and has minimal or no usability outside it. Recently, Yanase et al. [44] reported seven key challenges that modern CAD systems face and suggested various techniques to deal with them.

After a comprehensive analysis of the studies reported above, it is observed that the CAD system, used individually, exhibits limited detection performance due to subtle disease manifestations, which raises trust issues in automation. Furthermore, the existing CAD-based studies do not integrate patient metadata and radiologist feedback, and such integration is the primary goal of this study.

3 Materials and methods

This section elaborates on the proposed inverted pyramid approach for information exchange, three integration frameworks, and various materials and methods used to evaluate the proposed approaches.

3.1 Data collection and demographics

This study used a private COVID-19 dataset collected from Pt. Jawaharlal Nehru Memorial Medical College, Raipur (C.G.), India. The dataset primarily consists of patients’ symptomatology, clinical pathologies, expert feedback, and CXR images. The patient’s metadata and the expert feedback were collected by a junior medical attender in a structured manner (with predefined ranges, as shown in Table 2), where missing values and the absence of any data are represented by the value 1. The dataset is divided into two parts, a “training–testing” set and a “validation” set, to evaluate the internal and external (generalization capability) accuracy/error of the proposed integration frameworks (discussed in Sect. 3.4), respectively [45]. The detailed demographics of the collected X-ray images and associated metadata are summarized in Table 3. It is important to mention that all the data were collected after obtaining ethical permission from the institute’s ethical committee (IEC).
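For illustration, the partitioning into “training–testing” and “validation” sets could be implemented as in the minimal Python sketch below. The stratified split, the 15-instance validation size (mirroring the figure mentioned in Sect. 4), and the label column name are our assumptions, not the authors’ exact protocol.

```python
# Hedged sketch of the data partitioning: a stratified split into a
# "training-testing" set (later used with tenfold cross-validation) and a
# held-out external "validation" set. Column name and split size are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(records: pd.DataFrame, label_col: str = "diagnosis"):
    """Return (training-testing, validation) partitions with class stratification."""
    train_test, validation = train_test_split(
        records, test_size=15, stratify=records[label_col], random_state=42)
    return train_test, validation
```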

Table 2 Numerical range for scoring the severity of categorical data collected from different healthcare departments
Table 3 Detailed demographics of the collected dataset

3.2 Feature extraction and classification

To compute the CAD decision to be integrated with the patient’s metadata and the expert’s feedback, three groups of texture features are extracted from the input CXR images: 8 first-order statistical features (FOSF) [46], 88 grey-level co-occurrence matrix (GLCM) features [47,48,49] (a detailed description is presented in the Appendix), and 8100 histogram of oriented gradients (HOG) features [50]. These radiomic features were selected for their ability to encode natural texture patterns, which can be efficiently applied in medical applications [9, 21, 47, 51]. Having extracted 8213 features in total, including 5 symptomatological, 2 pathological, and 10 radiological features (described in Table 2), we observed that not all features are informative for classification; therefore, to select the optimal features and mitigate the curse of dimensionality, we employed binary gray wolf optimization (BGWO) [52]. The method mimics the efficient encircling and hunting strategy of gray wolves and selects 1239 optimal features.
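For illustration, a minimal Python sketch of the radiomic feature extraction is given below, using scikit-image and SciPy. The 128×128 resize, the particular first-order statistics, the GLCM distances/angles and properties, and the HOG cell/block settings are our assumptions (the chosen HOG configuration happens to yield 8100 values, while the GLCM block here produces fewer than 88 features), and the BGWO feature-selection step is omitted.

```python
# Hedged sketch of the radiomic feature extraction step (Sect. 3.2); parameter
# choices are illustrative, not the paper's exact configuration.
import numpy as np
from scipy.stats import skew, kurtosis
from skimage.feature import hog, graycomatrix, graycoprops
from skimage.transform import resize

def extract_radiomic_features(cxr: np.ndarray) -> np.ndarray:
    """Return FOSF + GLCM + HOG features for a single grayscale CXR image."""
    img = resize(cxr.astype(float), (128, 128), anti_aliasing=True)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # rescale to [0, 1]
    img_u8 = (img * 255).astype(np.uint8)

    # First-order statistical features (an assumed set of 8 descriptors)
    fosf = np.array([img.mean(), img.std(), skew(img.ravel()), kurtosis(img.ravel()),
                     img.min(), img.max(), np.median(img),
                     np.percentile(img, 75) - np.percentile(img, 25)])

    # GLCM texture statistics over several distances and angles
    glcm = graycomatrix(img_u8, distances=[1, 2],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "dissimilarity", "homogeneity", "energy", "correlation", "ASM"]
    glcm_feats = np.concatenate([graycoprops(glcm, p).ravel() for p in props])

    # HOG: 9 bins, 8x8-pixel cells, 2x2-cell blocks on 128x128 -> 8100 values
    hog_feats = hog(img, orientations=9, pixels_per_cell=(8, 8),
                    cells_per_block=(2, 2), feature_vector=True)

    return np.concatenate([fosf, glcm_feats, hog_feats])
```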

The selected features are passed to different benchmark classifiers. In this study, the support vector machine (SVM) [53], decision tree (DT) [54], naive Bayes (NB) [55], and k-nearest neighbor (KNN) [56] are employed. The models are trained using a tenfold cross-validation setup, and in each fold, the hyperparameters are optimized automatically using Bayesian optimization [57]. The method selects the optimal internal parameters of each classifier by iterative tuning (30 iterations in this study), which would be difficult and time-consuming to do manually. Moreover, this technique significantly reduces the cross-validation loss and boosts the classifier’s performance. These classification techniques were chosen because they are efficient and pervasively used in the literature for the classification of thoracic diseases [6, 9, 50]. Furthermore, in the RBI and WBI approaches, a total of 18 features (5 symptomatological, 2 pathological, 10 radiological, and 1 CAD decision) are used for evaluating the performance.
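A minimal sketch of the classification stage is shown below, assuming a scikit-learn pipeline with a quadratic (degree-2 polynomial) SVM and tenfold cross-validation. The paper tunes hyperparameters with Bayesian optimization; here, GridSearchCV over a small assumed grid is used as a simple stand-in for that optimizer.

```python
# Hedged sketch of classifier training with tenfold CV; GridSearchCV stands in
# for the Bayesian hyperparameter optimization used in the paper.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def train_cad_classifier(X: np.ndarray, y: np.ndarray):
    """Tune and evaluate a quadratic-kernel SVM with tenfold cross-validation."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    model = make_pipeline(MinMaxScaler(), SVC(kernel="poly", degree=2, probability=True))
    grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1]}  # assumed grid
    search = GridSearchCV(model, grid, cv=cv, scoring="accuracy", n_jobs=-1)
    search.fit(X, y)  # probability=True yields the posterior later reused by RBI
    return search.best_estimator_, search.best_score_
```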

3.3 Proposed inverted pyramid approach for information exchange

Having reviewed the literature presented in Sect. 2, it is observed that a multidisciplinary analysis of lung disease may reduce missed diagnoses. However, making decisions by considering several factors is a highly cognitive task and thus requires an appropriate integration framework. To realize such integration, a structured information exchange and archiving system between the different healthcare departments is needed.

To streamline the information exchange, we propose an inverted pyramid approach, as shown in Fig. 1. The figure depicts that the diagnostic procedure starts with broad aspects and a high degree of uncertainty; here, the patient’s disease-specific symptoms are collected. As the diagnosis proceeds, the potential findings are refined with an increasing degree of certainty. At each stage, the diagnostic findings and feedback are shared with the other departments and the patient to refine the findings further. Finally, the collected patient metadata and expert feedback are integrated to obtain a robust, aggregated decision. This type of channelized information exchange helps reduce the chances of missed diagnosis in CAD systems. The usefulness of the proposed communication approach was also verified by the collaborating physicians from Pt. Jawaharlal Nehru Memorial Medical College, Raipur, Chhattisgarh. The physicians suggested that streamlined communication would simplify the diagnostic process and increase patients’ survival expectancy.

Fig. 1 Conceptual model for the structured exchange of patient’s health data (like symptoms, clinical and radiological pathologies, and expert’s feedback) among various healthcare departments, patients, and physicians

3.4 Proposed integration framework

Patients suffering from inflammatory lung disease may experience one or more symptoms (like shortness of breath, dry cough, fatigue, chest pain), which can also be observed through clinical examination or by interpreting chest radiographs. However, in many situations, other comorbidities or seasonal flu mimic the actual disease patterns/symptoms, leading to delayed diagnosis or, in some cases, death [9]. Therefore, to develop a robust end-to-end system by integrating the findings from different healthcare departments, we propose three integration frameworks:

Direct integration (DI).

Let \({X}_{pi}^{d}\) be the information collected (shown in Eq. 1) for patient \(p\) (\(p=1, 2,\dots\)) at department \(d\) (as shown in Fig. 2), where \(i=1, 2, \dots, n\) indexes the individual symptoms, clinical readings, or radiological findings collected from that department:

Fig. 2 Prototype for direct integration of patient symptoms, clinical findings, radiologist feedback, and CAD system

$${X}_{pi}^{d}=\begin{bmatrix}{x}_{11}& {x}_{12}& \cdots & {x}_{1n}\\ {x}_{21}& {x}_{22}& \cdots & {x}_{2n}\\ \vdots & \vdots & \ddots & \vdots \\ {x}_{p1}& {x}_{p2}& \cdots & {x}_{pn}\end{bmatrix},\qquad \mathrm{where}\; d=1, 2, 3, \dots, k$$
(1)

Initially, the categorical data (for example, dry cough, running nose, chest pain, feeling unwell) collected from the patient at different departments are converted into numerical form. One option is a binary conversion, where ‘1’ represents the presence of a specific symptom/abnormality and ‘0’ represents its absence. However, to gain the maximum benefit from the collected information, such categorical data should be scored against predefined ranges, as shown in Table 2. Subsequently, the numerical data are normalized to a common scale, and the final data matrix is created by concatenating the collected data and the extracted imaging features, as shown in Eq. 2. For normalization, we used min–max normalization, which is pervasively used in the literature to scale input data [56]. Finally, the combined features are passed to different machine learning algorithms (discussed in Sect. 3.2) for model training:

$${X}_{\mathrm{norm}}=\mathrm{norm}\left[\begin{array}{ccc|ccc|c|ccc}{x}_{11}^{1}& \cdots & {x}_{1l}^{1}& {x}_{11}^{2}& \cdots & {x}_{1m}^{2}& \cdots & {x}_{11}^{k}& \cdots & {x}_{1n}^{k}\\ {x}_{21}^{1}& \cdots & {x}_{2l}^{1}& {x}_{21}^{2}& \cdots & {x}_{2m}^{2}& \cdots & {x}_{21}^{k}& \cdots & {x}_{2n}^{k}\\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\ {x}_{p1}^{1}& \cdots & {x}_{pl}^{1}& {x}_{p1}^{2}& \cdots & {x}_{pm}^{2}& \cdots & {x}_{p1}^{k}& \cdots & {x}_{pn}^{k}\end{array}\right]$$
(2)
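A minimal sketch of this DI preprocessing, assuming a pandas/scikit-learn workflow, is given below; the severity mapping is a hypothetical stand-in for the predefined ranges of Table 2.

```python
# Hedged sketch of direct integration (DI): ordinal encoding of categorical
# metadata, concatenation with imaging features, and min-max normalization (Eq. 2).
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical category-to-score mapping standing in for Table 2
SEVERITY_MAP = {"absent": 1, "mild": 2, "moderate": 3, "severe": 4}

def build_di_matrix(metadata: pd.DataFrame, imaging_features: np.ndarray) -> np.ndarray:
    """Encode categorical metadata, append radiomic features, and normalize."""
    encoded = metadata.replace(SEVERITY_MAP).fillna(1).astype(float)  # missing data -> 1
    combined = np.hstack([encoded.to_numpy(), imaging_features])
    # In practice the scaler should be fitted on the training partition only.
    return MinMaxScaler().fit_transform(combined)
```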

Although the DI approach is simple, it has certain limitations. First, converting categorical data into numerical form is a complicated and error-prone task subject to inter-observer and intra-observer variability. Second, it gives equal weightage to all the input data from the different departments, which may not be justifiable in all cases. For example, the symptoms of many seasonal diseases like flu mimic those of chronic lung diseases like COVID-19 and tuberculosis, which can mislead the classifiers and result in many false positives. To overcome these shortcomings, rule-based integration may be an alternative solution.

Rule-based integration (RBI).

Unlike DI, the rule-based approach can efficiently map non-linear relationships. Instead of converting the collected categorical data, one can directly define a set of decision rules that considers several interdisciplinary factors to make the diagnostic decision. The prototype of the proposed RBI approach is shown in Fig. 3.

Fig. 3 Prototype for rule-based integration of patient symptoms, clinical findings, radiologist feedback, and CAD system

In this approach, the posterior probability for the input CXR image is computed using the extracted imaging features (discussed in Sect. 3.2) and an SVM with a quadratic kernel. The obtained prediction probability is then combined with the collected symptomatological, pathological, and radiological features to form an integrated feature set. Finally, a well-defined set of decision rules is applied to draw an aggregated decision. Here, each feature \({x}_{pi}^{d}\) acts as a decision variable (antecedent or precondition) for branching at a node or assigning a class label, as shown in Eq. 3. A decision condition may consist of more than one decision variable, numerical or categorical.

$$Rule:\;if\;\left(condition\right)\rightarrow split\;node\;\mathrm{or}\;assign\;class\;label\;y$$
(3)

The selection of the decision variables (for branching) when moving from the root node to a leaf node is based on statistical measures such as entropy, gain ratio, Gini index, and information gain [56].

A key strength of this approach is that it can handle both categorical and numerical data. Moreover, it offers a multidisciplinary formulation of decision rules, efficiently dealing with complicated real-time medical situations. However, this method is still susceptible to erroneous prognosis due to subtle seasonal comorbidities.
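A minimal sketch of the RBI aggregation is shown below, assuming scikit-learn’s CART implementation as the rule learner; the tree depth and splitting criterion are illustrative choices rather than the paper’s settings.

```python
# Hedged sketch of rule-based integration (RBI): the CAD posterior probability
# is appended to the symptomatological, pathological, and radiological features,
# and a shallow decision tree learns the if-then aggregation rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def fit_rbi_rules(metadata_features: np.ndarray, cad_probability: np.ndarray, y: np.ndarray):
    """Learn interpretable aggregation rules over metadata + CAD decision."""
    X = np.column_stack([metadata_features, cad_probability])  # 17 + 1 = 18 decision variables
    tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
    tree.fit(X, y)
    print(export_text(tree))  # textual form of the learned rule set (cf. Fig. 6)
    return tree
```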

Weight-based integration (WBI).

The WBI approach, shown in Fig. 4, presents a method for the weighted aggregation of the patient’s metadata and the knowledge and judgment of domain experts with the CAD system to make an aggregated diagnostic decision. The core idea and mathematical background of the WBI method are described in [58, 59]. In this approach, the collected data \({x}_{pi}^{d}\) are first preprocessed, which includes the conversion of categorical data into numerical form (using the predefined ranges shown in Table 2) and normalization (as described in Eq. 2). Subsequently, the experts of the corresponding departments are asked to assign an appropriate weight \({W}_{i}^{d}\) to each decision criterion such that all the weights in the weight matrix sum to 100% (or 1), as described in Eq. 4. Note that decision criteria with a high prevalence in specific lung diseases are assigned higher weights than those that are less prevalent.

Fig. 4 Prototype for weight-based integration of patient symptoms, clinical findings, radiologist feedback, and CAD system

$${W}_{i}^{d}=\left[\,{w}_{1}^{1}, {w}_{2}^{1}, \dots, {w}_{l}^{1},\;\; {w}_{1}^{2}, {w}_{2}^{2}, \dots, {w}_{m}^{2},\;\; \dots,\;\; {w}_{1}^{k}, {w}_{2}^{k}, \dots, {w}_{n}^{k}\,\right]$$
$$\mathrm{where}\;\; 0<w<1\;\; \mathrm{and}\;\; \sum_{d=1}^{k}\sum_{i=1}^{l, m, n}{w}_{i}^{d}=1$$
(4)

Subsequently, the weight \({w}_{i}^{d}\) for each decision criterion is multiplied with, or applied as an exponent to, the corresponding criterion value in the normalized decision matrix \({X}_{\mathrm{norm}}\) to compute the weighted decision matrices \({WDM1}_{p}\) and \({WDM2}_{p}\), as described in Eqs. 5 and 6, respectively:

$${WDM1}_{p}=\begin{bmatrix}{w}_{1}\left({x}_{11}\right)& {w}_{2}\left({x}_{12}\right)& \cdots & {w}_{n}\left({x}_{1n}\right)\\ {w}_{1}\left({x}_{21}\right)& {w}_{2}\left({x}_{22}\right)& \cdots & {w}_{n}\left({x}_{2n}\right)\\ \vdots & \vdots & \ddots & \vdots \\ {w}_{1}\left({x}_{p1}\right)& {w}_{2}\left({x}_{p2}\right)& \cdots & {w}_{n}\left({x}_{pn}\right)\end{bmatrix}$$
(5)
$${WDM2}_{p}=\begin{bmatrix}{\left({x}_{11}\right)}^{{w}_{1}}& {\left({x}_{12}\right)}^{{w}_{2}}& \cdots & {\left({x}_{1n}\right)}^{{w}_{n}}\\ {\left({x}_{21}\right)}^{{w}_{1}}& {\left({x}_{22}\right)}^{{w}_{2}}& \cdots & {\left({x}_{2n}\right)}^{{w}_{n}}\\ \vdots & \vdots & \ddots & \vdots \\ {\left({x}_{p1}\right)}^{{w}_{1}}& {\left({x}_{p2}\right)}^{{w}_{2}}& \cdots & {\left({x}_{pn}\right)}^{{w}_{n}}\end{bmatrix}$$
(6)

From the weighted decision matrices, the weighted sum model (WSM) [58], weighted product model (WPM) [58], and weighted aggregated sum-product model (WASPM) [59] scores are calculated using Eqs. 7, 8, and 9, respectively. In Eq. 9, the term \(\lambda\) represents the joint generalization criterion, which controls the relative weight assigned to the weighted sum and weighted product scores. For example, if \(\lambda =1\), the overall decision is based on the weighted sum score alone; when \(\lambda =0\), only the weighted product score contributes to the final decision. The optimal value of \(\lambda\) is chosen by a medical expert based on the problem domain. Finally, the computed scores are used to make decisions based on certain threshold values. The domain experts decide the threshold values by analyzing the prevalence of different factors such as symptoms, clinical or radiological findings, and the assigned weights:

$${WSM}_{p}^{\mathrm{score}}=\sum_{i=1}^{n}{w}_{i}{x}_{pi}$$
(7)
$${WPM}_{p}^{\mathrm{score}}=\prod_{i=1}^{n}{\left({x}_{pi}\right)}^{{w}_{i}}$$
(8)
$${WASPM}_{p}^{\mathrm{score}}=\left(\lambda \right){WSM}_{p}^{\mathrm{score}}+\left(1-\lambda \right){WPM}_{p}^{\mathrm{score}}$$
(9)

Moreover, these scores can also be used to assign severity scores to patients in order to prioritize treatment and the allocation of scarce medical resources.
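The scoring of Eqs. 7–9 can be expressed compactly as in the sketch below; the weights, \(\lambda\), and class thresholds are placeholders that would be set by domain experts (cf. Table 7).

```python
# Hedged sketch of weight-based integration (WBI): expert weights (Eq. 4) are
# applied to the normalized decision matrix to obtain WSM, WPM, and WASPM scores.
import numpy as np

def wbi_scores(X_norm: np.ndarray, weights: np.ndarray, lam: float = 0.5):
    """Return per-patient WSM, WPM, and WASPM scores (Eqs. 7-9)."""
    assert np.isclose(weights.sum(), 1.0), "expert weights must sum to 1 (Eq. 4)"
    # Entries are assumed strictly positive; add a small epsilon if min-max
    # scaling produces zeros, otherwise the product score collapses to zero.
    wsm = X_norm @ weights                              # Eq. 7: weighted sum
    wpm = np.prod(np.power(X_norm, weights), axis=1)    # Eq. 8: weighted product
    waspm = lam * wsm + (1.0 - lam) * wpm               # Eq. 9: joint aggregation
    return wsm, wpm, waspm

def assign_class(waspm_score: float, thresholds=(0.3, 0.6)):
    """Map a score to a class label; the cut-offs here are placeholders for Table 7."""
    if waspm_score < thresholds[0]:
        return "normal"
    return "other abnormality" if waspm_score < thresholds[1] else "COVID-19"
```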

3.5 Experimental setup

To experimentally validate the proposed integration frameworks, three experiments were designed, as shown in Fig. 5. The input dataset has three classes (i.e., COVID-19, other abnormalities, and normal) for training the different machine learning algorithms and the proposed integration frameworks. The dataset is divided into two parts: a “training–testing” set and a “validation” set. Initially, in the DI approach, the different machine learning models (DT, NB, KNN, and SVM) are trained using the selected optimal features. Subsequently, in the RBI and WBI approaches, only the symptomatological, pathological, and radiological features, along with the CAD decision, are used for training. Finally, the performance metrics on both the “training–testing” and “validation” sets are evaluated as described in the following three experiments:

  1. Experiment 1:

    Classification performance of DI framework (described in Sect. 3.4(a)) is assessed using different benchmark classifiers.

  2. Experiment 2:

    Classification performance of RBI framework (described in Sect. 3.4(b)) is evaluated by creating a set of aggregation rules using a decision tree.

  3. Experiment 3:

    Classification performance of the WBI framework (described in Sect. 3.4(c)) is assessed using the weighted sum, weighted product, and weighted aggregated sum-product scores.

Fig. 5 Experimental setup for evaluation of the proposed integration frameworks

3.6 Performance evaluation metrics

The performance of the proposed integration frameworks is assessed using eight evaluation metrics: accuracy, sensitivity, specificity, precision, F1 score, Matthews correlation coefficient (MCC), Kappa, and error. These metrics are defined in Eqs. 10, 11, 12, 13, 14, 15, 16, and 17 [9, 21], where true positives (tp) are the positive instances that are also predicted as positive by the proposed system; true negatives (tn) are the negative/normal instances that are also predicted as negative; false negatives (fn) and false positives (fp) denote the wrong predictions made by the proposed system; and p = tp + fn and n = tn + fp. Furthermore, the statistical significance of the obtained results is validated using non-parametric Friedman’s average ranking and the Shaffer [60] and Holm [61] post hoc multiple comparison methods [6, 9]:

$$\text{Accuracy}=\frac{tp+tn}{p+n}\times 100$$
(10)
$$\text{Specificity}=\frac{tn}{n}\times 100$$
(11)
$$\text{Precision}=\frac{tp}{tp+fp}\times 100$$
(12)
$$\text{Sensitivity}=\frac{tp}{p}\times 100$$
(13)
$$\text{F1 score}=\frac{2\times \text{precision}\times \text{sensitivity}}{\text{precision}+\text{sensitivity}}\times 100$$
(14)
$$\text{Area under curve (AUC)}=\frac{1}{2}\left(\frac{tp}{p}+\frac{tn}{n}\right)$$
(15)
$$\text{Matthews correlation coefficient (MCC)}=\frac{tp\times tn-fp\times fn}{\sqrt{\left(tp+fp\right)\times p\times n\times \left(tn+fn\right)}}$$
(16)
$$\text{Cohen's Kappa}=\frac{\text{Accuracy}-\text{random accuracy}}{1-\text{random accuracy}}$$
(17)

where,

$$\text{random accuracy}=\frac{n\times \left(tn+fn\right)+p\times \left(fp+tp\right)}{p+n}$$
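For reference, the metrics of Eqs. 10–17 can be computed directly from the confusion-matrix counts, as in the sketch below. Values are returned as fractions (multiply by 100 for percentages), and the expected-agreement term of Cohen’s Kappa is normalized to a proportion, its standard form.

```python
# Hedged sketch of the evaluation metrics (Eqs. 10-17), binary / one-vs-rest view.
import math

def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    p, n = tp + fn, tn + fp
    accuracy = (tp + tn) / (p + n)
    sensitivity = tp / p
    specificity = tn / n
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    auc = 0.5 * (sensitivity + specificity)                               # Eq. 15
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * p * n * (tn + fn))  # Eq. 16
    # Expected chance agreement, divided by (p + n)^2 so it is a proportion.
    random_acc = (n * (tn + fn) + p * (fp + tp)) / (p + n) ** 2
    kappa = (accuracy - random_acc) / (1 - random_acc)                    # Eq. 17
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1,
            "auc": auc, "mcc": mcc, "kappa": kappa}
```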

4 Result and discussion

This section presents the obtained experimental results and a detailed discussion of the proposed integration frameworks. Initially, the performance of the different benchmark classifiers and the proposed integration frameworks is evaluated using the training–testing set and a tenfold cross-validation setup. The extracted imaging features (discussed in Sect. 3.2) are classified using the four benchmark classifiers, and the results are shown in Table 4. It can be observed that the SVM classifier with a quadratic kernel outperformed the other classifiers and achieved an average accuracy (ACC) of 87.27% and an F1 score of 80.85% in the tenfold cross-validation setup. However, the method exhibits a low sensitivity of 80%, indicating a poor detection rate. Furthermore, to validate the performance of the DI framework (Experiment 1), the supervised models are re-trained using the combined feature set (discussed in Sect. 3.4(a)). The results in Table 5 show that the combined feature set significantly improves the classification performance of all the benchmark classifiers used in this study. Specifically, the SVM (quadratic kernel) achieved an ACC of 92.73% and an F1 score of 88.64%, an improvement of 5.46% and 7.79% in ACC and F1 score, respectively, which demonstrates the promise of the DI approach and validates Experiment 1.

Table 4 Classification performance of different benchmark classifiers using radiomic features extracted from “training–testing” set. (Abbreviations: DT Decision Tree, NB naïve Bayes, KNN k-nearest neighbor, SVM support vector machine, FPR false positive rate, MCC Matthews correlation coefficient)
Table 5 Classification performance of different benchmark classifiers using integrated features (symptomatological, pathological, radiological, and radiomic features) extracted from training–testing set

Beyond the promising performance of the SVM classifier with the quadratic kernel, the other classifiers achieved significantly lower performance. The low performance of the DT classifier (ACC = 67.27% using radiomic features and ACC = 72.73% using integrated features) is due to its greedy approach, which easily overfits when the dataset is small. Similarly, the low performance of KNN (ACC = 81.82% using radiomic features and ACC = 85.45% using integrated features) is due to the class imbalance in our study. The NB classifier is known for its real-time multiclass prediction and scalability; however, it demonstrated a marginally lower accuracy of 81.82% (radiomic features) and 85.45% (integrated features) because of its assumption that each feature contributes equally, which does not hold true in most cases. In contrast, the SVM classifier with a kernel function performs better than the others because it handles complex and non-linear data well, is less prone to overfitting on a small dataset, and, with an appropriate kernel function, fits an optimal hyperplane for accurate classification.

Although the DI approach significantly improves performance, the overall performance is still not up to the mark. The sensitivity of 86.67% is not clinically acceptable because of the significant number of missed detections. The limited performance stems from the fact that all features are assigned equal weights irrespective of their prevalence, and the large imaging feature vector supersedes the collected metadata features. In addition, the conversion of categorical data to numerical form in the DI framework incurs additional overhead. These limitations are addressed in the RBI framework.

Experiment 2 aims to validate the performance of the RBI method using a set of decision rules. The results (using the training–testing set) shown in Table 6 reveal that RBI exhibits significantly better performance (ACC = 96.36% and F1 score = 95.16%) than the DI framework (Table 5). The resulting decision tree is shown in Fig. 6. From the figure, it is found that multilobular involvement in COVID-19 patients is a highly prominent feature, which is also confirmed by the collaborating radiologist. In addition, the CAD-based diagnostic decision (in the form of a posterior probability) plays a significant role in the accurate identification of pulmonary diseases. Moreover, the higher sensitivity and specificity scores of 93.94% and 97.96% demonstrate improved detection performance and a reduced false-negative rate. The substantial improvement is due to the fact that the large radiomic feature set is handled by the CAD system and therefore does not supersede the patient’s symptomatological, pathological, and radiological features. However, one limitation of the RBI approach is that it still assumes equal weight for all input features, limiting its performance. This shortcoming is addressed in the WBI approach.

Table 6 Classification performance of rule-based integration approach using features extracted from training–testing set. (Abbreviations: FPR false positive rate, MCC Matthews correlation coefficient)
Fig. 6 Decision tree–based rules to aggregate the patient’s symptomatological, pathological, and radiological features and the CAD system decision (posterior probability)

In Experiment 3, the WSM, WPM, and WASPM scores are computed using different threshold values (shown in Table 7). The choice of a specific threshold value is based on empirical evaluation of how accurately the computed score separates the normal, COVID-19, and other classes. From the results obtained on the training–testing set, shown in Table 8, it is observed that WPM (ACC = 98.18% and F1 score = 97.43%) and WASPM (ACC = 98.18% and F1 score = 97.73%) achieved better performance than the RBI and DI frameworks. Also, the high sensitivity and specificity scores of 98.67% and 99.26%, respectively, for WASPM demonstrate the robust performance of the proposed method. The higher performance of WPM and WASPM is due to the fact that each collected parameter and the CAD decision are assigned optimal weights, which emphasizes the important features and leads to improved performance.

Table 7 Empirically selected threshold value for different class labels and value of λ for WASPM. (Abbreviations: WSM weighted sum model; WPM weighted product model; WASPM weighted aggregated sum product model)
Table 8 Classification performance of weight-based integration approach using features extracted from training–testing set. (Abbreviations: FPR false positive rate, MCC Matthews correlation coefficient) Note: bold text indicates the better performance

While evaluation on the training–testing set demonstrates the proposed frameworks’ internal accuracy/error, evaluation on the external “validation” set reveals how the methods generalize to unknown data. Therefore, the models trained on the training–testing set are validated using a separate validation set, and the results are shown in Tables 9, 10, and 11 for DI, RBI, and WBI, respectively. The results of the benchmark classifiers presented in Table 9 for the DI approach show that the performance remains promising on the external validation data. Similarly, the validation results of RBI and WBI confirm the generalization capability of the proposed integration frameworks. A comparative analysis of the validation results also shows that the WBI approach (WPM and WASPM) performs better than DI and RBI, which justifies the robustness of the proposed method. However, it is worth mentioning that the validation performance is assessed on a small dataset (15 instances), which should be expanded in future work.

Table 9 Classification performance of different benchmark classifiers using integrated features (symptomatological, pathological, radiological, and radiomic features) extracted from “validation” set
Table 10 Classification performance of rule-based integration approach using features extracted from validation set. (Abbreviations: FPR false positive rate, MCC Matthews correlation coefficient)
Table 11 Classification performance of weight-based integration approach using features extracted from validation set. (Abbreviations: FPR false positive rate, MCC Matthews correlation coefficient)

The accuracy measure presented above can be misleading because it does not account for all four parameters of the confusion matrix; for example, it can yield a relatively high classification accuracy even when TP < FP and TN < FN. Therefore, we used the MCC and Kappa measures to perform a comprehensive evaluation of the obtained results. From the MCC and Kappa analysis of the conventional CAD system (Table 4) and the proposed integration frameworks (Tables 5, 6, and 8) on the training–testing set, it is observed that the proposed DI (MCC = 0.867, Kappa = 0.836), RBI (MCC = 0.937, Kappa = 0.918), and WBI methods (MCC = 0.969, Kappa = 0.959) achieved better performance than the conventional CAD system (MCC = 0.755, Kappa = 0.714). Also, the MCC and Kappa analysis of the proposed frameworks on the validation set (MCC = 0.898 and Kappa = 0.850 for DI, MCC = 0.902 and Kappa = 0.850 for RBI, MCC = 1.00 and Kappa = 1.00 for WPM and WASPM) verifies their robustness and generalization capability on unknown data. In addition, the promising performance of the proposed methods is also reflected in the improved values of sensitivity, specificity, and precision.

Furthermore, it is worth mentioning that a one-to-one comparison with the state-of-the-art methods is not feasible due to differences in the integrated parameters, datasets, and simulation environments. However, the existing state-of-the-art methods shown in Table 1 employ only partial integration of the parameters (i.e., clinical data-CAD and CAD-radiologist feedback), limiting their performance and clinical acceptability. Moreover, to the best of the authors’ knowledge, none of the existing studies has considered the patient’s symptoms, clinical pathologies, and expert feedback together in an integration framework, even though these parameters play a vital role in identifying diseases that exhibit similar radiographic patterns [9, 18]. This paper addresses this shortcoming by enabling a comprehensive, multifactor analysis of the different decision criteria (or patient health data) collected through the various departments’ diagnostic procedures and providing an aggregated diagnostic decision. Moreover, these decisions are based on confidence scores (WBI), predefined rules (RBI), or machine learning algorithms (DI), which help build an appropriate level of trust among patients and experts in CAD-based automation.

4.1 Statistical evaluation

The statistical significance of the obtained classification results is evaluated using Friedman’s average ranking and post hoc multiple comparison procedures (Shaffer [60] and Holm [61]) for both the training–testing and validation sets. The test assumes the null hypothesis that the performance of all the integration frameworks (DI, RBI, WBI) is equal; the alternative is that the WBI framework exhibits significantly higher performance than the others. Table 12 shows Friedman’s average ranks of the classifiers’ performance for both the training–testing and validation sets. The procedure rejects the null hypothesis and accepts the alternative (at \(\alpha =0.05\), i.e., the 95% confidence level), revealing significant differences in the performance of the proposed frameworks. Moreover, the lowest average rank, obtained for WASPM (\(\mathrm{rank}=1.4286\)), indicates better performance compared to the other methods. Furthermore, to locate where the significant differences exist, Shaffer- and Holm-based pairwise post hoc multiple comparisons (10 pairs denoted by ‘i’) are performed, and the results are shown in Tables 13 (training–testing set) and 14 (validation set). From the results in Table 13, it is found that the Shaffer and Holm tests reject the hypotheses with an unadjusted p-value ≤ 0.005 and ≤ 0.00625, respectively, which presents sufficient evidence for the better performance of the WBI framework and justifies the validity of Experiment 3. Similarly, for the validation set shown in Table 14, the Shaffer and Holm tests reject the hypotheses with an unadjusted p-value ≤ 0.0050 and ≤ 0.0083, respectively, which also confirms the promising performance.

Table 12 Friedman’s average ranking of different proposed integration frameworks based on classification performance metrics (with 4 degrees of freedom)
Table 13 Shaffer and Holm post hoc multiple comparisons at \(\alpha =0.05\) of different integration frameworks classification performance using training–testing set. (Abbreviations: DI direct integration, WBI weight-based integration, RBI rule-based integration, SVM support vector machine, WSM weighted sum model, WPM weighted product model, WASPM weighted aggregated sum-product model)
Table 14 Shaffer and Holm post hoc multiple comparisons at \(\alpha =0.05\) of different integration frameworks classification performance using validation set

In addition, we also performed a Nemenyi critical distance [62] analysis on the obtained results at significance level α = 0.05. According to this test, two methods whose mean ranks differ by more than the critical distance of 2.3054 (for both the training–testing and validation sets) differ significantly in performance. The critical distance diagram with mean ranks shown in Fig. 7 depicts that WASPM (WBI) achieved the highest mean rank of 4.57 (training–testing set) and 4.50 (validation set), outperforming DI and RBI; here, a higher rank denotes better performance. Furthermore, according to the Nemenyi procedure, WSM, WPM, and WASPM do not differ significantly in their performance.
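The statistical workflow above can be reproduced with standard tools, as in the sketch below: Friedman’s test via SciPy followed by the Nemenyi critical distance \(CD=q_{\alpha }\sqrt{k(k+1)/(6N)}\). The score table is placeholder data; with \(k=5\) methods, \(N=7\) metric blocks, and \(q_{\alpha }=2.728\) (α = 0.05), the formula reproduces a critical distance of approximately 2.305, consistent with the value quoted above.

```python
# Hedged sketch of the Friedman test and Nemenyi critical distance computation.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Placeholder performance table: rows = evaluation metrics (blocks),
# columns = the five compared pipelines (e.g., DI-SVM, RBI, WSM, WPM, WASPM).
scores = np.random.default_rng(0).random((7, 5))

stat, p_value = friedmanchisquare(*scores.T)   # one sample per method across the blocks
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")

k, N = scores.shape[1], scores.shape[0]        # number of methods, number of blocks
q_alpha = 2.728                                # Studentized-range constant for k = 5, alpha = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))  # with k = 5, N = 7 this gives ~2.305
mean_ranks = np.mean([rankdata(row) for row in scores], axis=0)
print(f"Nemenyi critical distance = {cd:.3f}, mean ranks = {mean_ranks}")
```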

Fig. 7 Nemenyi critical distance diagram with mean rank (at significance level α = 0.05) for different integration frameworks using a “training–testing” set, b “validation” set. (Note: higher rank represents better performance)

Despite these strengths, the major limitations of the proposed work are the inter-observer and intra-observer variability during the quantization of categorical data, the weight assignment, and the explicit parameter tuning required to detect different abnormalities. Moreover, the healthcare departments are not well coordinated and lack an appropriate electronic patient health record (EPHR) system or picture archiving and communication system (PACS), limiting patient data availability.

4.2 Suggested directions for future research

Future work on this study should emphasize the experimental evaluation of the proposed approaches for various inflammatory lung diseases. The obtained experimental results can also be validated and statistically verified against subjective analysis by expert physicians. Besides the above, other areas that require more focused research include:

  • Development of easy-to-use interface, secure communication protocol, and patient record archiving system.

  • A more extensive study on a larger dataset is required to assess whether the proposed approaches improve trust calibration in automation and their clinical acceptability.

  • Development of machine learning–based weight assignment algorithms to automatically tune the weight of each decision criterion based on its prevalence in historical medical records.

  • The dependencies between the various healthcare departments involved in the diagnostic procedure should be balanced, as strong coupling introduces significant diagnostic delay, whereas no coupling may degrade diagnostic performance.

5 Conclusion

The comprehensive analysis of the various associated symptoms and the clinical and radiological findings of infectious PD is critical for its early and accurate identification. This paper first investigated the feasibility of integrating the patient’s metadata, clinical findings, and expert feedback with the CAD system. Subsequently, we proposed three integration frameworks (direct integration, rule-based integration, and weight-based integration) that enable a comprehensive analysis of patient health records collected through different diagnostic procedures across various healthcare departments. The experimental results show that the proposed methods (WBI (ACC = 98.18%, F1 score = 97.73%, and MCC = 0.969), RBI (ACC = 96.36%, F1 score = 95.16%, and MCC = 0.937), and DI (ACC = 92.73%, F1 score = 88.64%, and MCC = 0.867)) outperform the conventional CAD system (ACC = 87.27%, F1 score = 80.85%, and MCC = 0.755), which is also verified on an external validation dataset (WBI (ACC = 100.00%, F1 score = 100.00%, and MCC = 1.0), RBI (ACC = 93.33%, F1 score = 92.593%, and MCC = 0.902), and DI (ACC = 93.33%, F1 score = 92.208%, and MCC = 0.898)). Furthermore, the minimum average rank in the Friedman test and the significant p-values in the Shaffer- and Holm-based post hoc procedures reveal the statistical significance of the obtained results. Beyond the parameters used here, the proposed methods also support easy expansion to accommodate new requirements. This integrated diagnosis approach would be of paramount importance in combating pandemic situations like COVID-19.

In the future, this study can be extended through an extensive experimental evaluation to determine the appropriate number of features for a given sample size so that the proposed model does not overfit, the development of a secure communication protocol for structured data exchange, and the development of strategies to assess trust calibration in automation and its clinical acceptability.