1 Introduction

In this era of rapid technological advancement, ML/DL systems have transformed industries such as governance, manufacturing, and transportation. Over the past few years, the use of intelligent systems has increased manifold across domains, including our routine lives. One such realm is healthcare [1, 2], which had earlier been impervious to large-scale technological disruption. The healthcare industry across the globe has evolved extensively with the advent of machine intelligence. Nasr et al. [3] explore current state-of-the-art smart healthcare systems, highlighting significant topics such as wearable and smartphone devices for fitness monitoring, ML for illness prediction, and assistive frameworks, including social robots designed for assisted-living environments. Bharadwaj et al. [4] discuss applications of ML algorithms integrated with the Healthcare Internet of Things (H-IoT) in terms of their benefits, selection, and potential future aspects. The adoption of ML/DL techniques has sustained exceptional results in versatile tasks such as brain tumor segmentation [5], saliva sample classification of COPD patients [6], chronic neurological disorder assistance [7], anomaly recognition in the artificial pancreas [8], clinical image reconstruction [9], and cancerous cell classification, to name a few. It is expected that in the coming years intelligent software systems will take over much of the labor put in by radiologists and physicians in examining medical documents. ML will transform conventional medical practice and research. Healthcare has emerged as an active application area for ML/DL models in achieving human-level performance in various pathological tasks [10]. Some investigations have reported that intelligent models outperformed clinical experts in certain respects. Esteva et al. [11] illustrate the categorization of skin lesions with a single CNN evaluated against 21 board-certified dermatologists on biopsy-proven clinical diagnoses of the deadliest skin cancers. The findings show that AI can classify skin cancer with a degree of accuracy equivalent to dermatologists. Similarly, Rajpurkar et al. [16] reported a deep CNN that detects pneumonia from chest radiographs at a level comparable to practicing radiologists.

Dhief et al. [17] presented an extensive review of IoT frameworks and state-of-the-art techniques used in healthcare and voice pathology surveillance systems, whereas Alhussein et al. [18] investigated a voice abnormality detection system using DL on mobile healthcare frameworks. Researchers and physicians are reviewing numerous approaches to harness DL methods for Intensive Care Units (ICUs) and other critical-care concerns [19,20,21]. Similarly, Ganainy et al. [22] proposed a real-time bedside consultation system for the clinical context that forecasts the current status of Mean Arterial Pressure (MAP) values using new ML structures. Many intelligent applications built on patient records have at some point produced disappointing results owing to an excessive focus on performance metrics [23,24,25]. The privacy concerns that arise while transmitting or analyzing data to build a predictive system often force a compromise [26,27,28]. This paper attempts to survey the diverse techniques of ML and their application in the healthcare ecosystem [29,30,31,32]. The main contributions are summarized next.

  • This paper shares a concise statistical background of ML algorithms, discussing multiple ML models, their clinical applications, certain hindrances, and possible solutions to tackle those shortcomings.

  • This paper outlines various challenges related to medical analysis using ML and DL techniques.

  • This paper analyses and lists the different heterogeneous sources contributing to healthcare data and their associated flaws.

  • This paper describes the applications of ML in healthcare for medical prognosis, computer-aided detection, diagnosis, and treatment. Further, the associated drawbacks are outlined as well.

  • This paper lists different types of vulnerabilities in the ML pipeline and their sources. Further, the work highlights various techniques to avoid information breaches and preserve the privacy of data for clinical users.

The remainder of the paper is organized as follows. Section 2 presents the various ML algorithms, their applications, and their mathematical background. Section 3 outlines the problems facing the healthcare sector, and Section 4 presents the different applications of ML in healthcare systems, illustrating the present scenario in which intelligent systems automate routine tasks. Section 5 examines the vulnerabilities that may be encountered while preparing ML models in the healthcare pipeline. Section 6 presents a study of the privacy challenges concerning the involvement of AI systems and various approaches to preserving privacy. Finally, Sect. 7 presents imminent prospects and areas that require further research, followed by the chapter conclusion in Sect. 8.

2 Background of ML Algorithms

The majority of developing countries have invested their time and money in advanced technical prospects that in some way or other prove to be cost-effective in the long run. Development is often associated with the advent of automated machinery and mechanical systems as we move toward a data-centric world. Managing and effectively using data at an industrial scale is an irksome task when run by humans alone; this is where ML/DL-based intelligent systems gain their importance. ML algorithms are developed specifically to support models that solve problems in different domains (e.g., healthcare, fintech, industry) [33]. Okay et al. [34] demonstrate that applying Interpretable Machine Learning (IML) models to sophisticated and difficult-to-interpret ML approaches provides thorough interpretability while preserving accuracy, which is essential when crucial medical choices are at stake. Ileberi et al. [35] implement an ML-based framework using the Synthetic Minority Over-sampling Technique (SMOTE) for credit card fraud detection, as it outperforms other prevailing methodologies. Ahsan et al. [36] propose a unique prognostics framework based on statistics-driven ML modeling for forecasting qualification test results of electronic components, allowing a decrease in qualification test cost and time. Hari et al. [37] offer a supervised ML method built by modeling the behavior of Gallium Nitride (GaN) power electronic devices for reliably forecasting the current waveforms and switching voltages of these innovative devices. Seng et al. [38] concentrate on how computer vision (CV) and ML practices may be applied to existing vinification operations and vineyard organizations to obtain industry-relevant outcomes. Rehman et al. [39] provide an ML technique for localizing brain tumor cells utilizing the textonmap image on FLAIR scans of Magnetic Resonance Imaging (MRI). Singh et al. [40] offer a unique ensemble-based classification technique that combines AI, fog computing, and smart health to create a reliable platform for the early identification of COVID-19 infection. Comparatively, Vyas et al. [41] offer an ML model powered by a multimodal method for assessing a patient's readiness, suggesting that the hospital plays an important part in action design based on patient choice. Some ML algorithms and their purposes are discussed in the forthcoming sections. A summary of the different ML algorithms discussed in this chapter is depicted in Fig. 1.

Fig. 1 Illustration of various ML algorithms and their categories

2.1 Regression Models

Regression analysis is a statistical modeling method that aims to define the relationship between a dependent and an independent variable (linear or polynomial) [42]. This predictive modeling technique can be utilized for forecasting, time-series modeling, predictive analysis, etc. Common regression methodologies include Linear, Polynomial, Logistic, Multivariate, Ridge, and Bayesian Linear Regression. Some of these are discussed next.

2.1.1 Linear Regression

Linear regression models have transformed the statistical view of supervised learning for predicting a quantitative response by linking an independent variable (input vector) to a dependent variable (output). The relationship is represented by a linear function. In the ML arena, linear regression models stand out for their simplicity and ease of interpretability. Velez et al. [43] presented a straightforward definition of interpretability as "the capacity to explain or show human eccentricities in understandable terms". Linear regression aims to estimate a function f that captures the relationship between an input vector x of dimension d and a real-valued output y (i.e., y = f(x)) as

$$y= {\alpha }_{0}+ {x}^{\top }\alpha$$
(1)

where \({\alpha }_{0} \in R\) is the intercept of the function and \(\alpha \in {R}^{d}\) is the coefficient vector corresponding to the individual input variables. To calculate the regression coefficients \({\alpha }_{0}\) and \(\alpha\), a training set \((A, B)\) is required, where \(A \in {R}^{k \times d}\) denotes the k training inputs \({a}_{1}, {a}_{2}, \dots , {a}_{k}\) and \(B\) denotes the k training outputs, each \({a}_{i }\in {R}^{d}\) being paired with the real-valued output \({B}_{i}\). The prime objective is to reduce the empirical risk, quantifying via \({\alpha }_{j}\) the relation between predictor \({A}_{j}\) and the response, for each \(j=1,2,3,\dots , d\). Loss functions measure the deviation of the model's predictions from the actual outputs. The least-squares estimate is one of the most widely used loss functions for regression models and has minimal variance among all unbiased linear estimates. Fitting a regression model by reducing the Residual Sum of Squares (RSS) between the predicted outputs and the labels is expressed as [44]

$$RSS \left(\alpha \right)= \sum_{i=1}^{k}{\left({B}_{i}-{\alpha }_{0}-\sum_{j=1}^{d}{x}_{ij}{\alpha }_{j}\right)}^{2}$$
(2)
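As a concrete illustration, the following Python sketch (using synthetic data, so all numbers are purely illustrative) estimates \({\alpha }_{0}\) and \(\alpha\) by minimizing the RSS of Eq. (2) through an ordinary least-squares solve:

```python
import numpy as np

# Synthetic regression data: k samples, d features (illustrative only).
rng = np.random.default_rng(0)
k, d = 100, 3
A = rng.normal(size=(k, d))                     # training inputs
true_alpha = np.array([1.5, -2.0, 0.7])
B = 0.5 + A @ true_alpha + rng.normal(scale=0.1, size=k)  # noisy outputs

# Append a column of ones so the intercept alpha_0 is estimated jointly.
A1 = np.hstack([np.ones((k, 1)), A])
coef, *_ = np.linalg.lstsq(A1, B, rcond=None)   # minimizes the RSS of Eq. (2)
alpha_0, alpha = coef[0], coef[1:]

rss = np.sum((B - alpha_0 - A @ alpha) ** 2)
print(f"intercept={alpha_0:.3f}, coefficients={alpha.round(3)}, RSS={rss:.4f}")
```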

Certain downsides include high variance: a model may fit the training set closely yet overfit to noisy or otherwise unrepresentative training data, reducing prediction accuracy. Alternative approaches exist, however. Linear Dimension Reduction (LDR) generates a low-dimensional linear mapping of the original high-dimensional or noisy data that maintains characteristics of interest, denoises or compresses the data, and extracts important feature spaces; forward or backward elimination likewise helps to avoid overfitting and improve robustness. The processing and manipulation of data are often affected by noise, which diminishes the model's performance [45]. The link between regularization and robustness to noise is represented as:

$$\underset{{\alpha }_{0},\alpha }{min}\; \underset{\Delta \in U}{max}\; g\left(B-{\alpha }_{0}-\left(A+\Delta \right)\alpha \right)$$
(3)

Here, the noise is assumed to vary over an uncertainty set \(U\subseteq {R}^{k\times d}\), and the learner inherits the robust behavior, where \(g\) is a convex function that measures the residual [46]. Regression models can sometimes lose interpretability when there is a significant number of features relative to the available data; to overcome this shortcoming and multicollinearity, various feature selection strategies are applied.

2.1.2 Shrinkage Models

To produce a more generalizable model, the values of the regression coefficients are shrunk with the help of regularization methods, also known as shrinkage methods, at the cost of introducing some bias into the model estimation. The principal intention behind shrinkage methods is to penalize the regression coefficients in the loss function toward a central point, such as the mean. One common shrinkage method is Ridge Regression, which penalizes the norm-2 of the regression coefficients:

$${\mathcal{L}}_{ridge}\left({\alpha }_{0}, \alpha \right)=\sum_{i=1}^{k}{\left({B}_{i}-{\alpha }_{0}-\sum_{j=1}^{d}{x}_{ij}{\alpha }_{j}\right)}^{2}+\lambda \sum_{j=1}^{d}{\alpha }_{j}^{2}$$
(4)

where \(\lambda\) controls the shrinkage magnitude. Lasso regression instead penalizes the norm-1 and tries to minimize

$${\mathcal{L}}_{lasso}\left({\alpha }_{0}, \alpha \right)=\sum_{i=1}^{k}{\left({B}_{i}-{\alpha }_{0}-\sum_{j=1}^{d}{x}_{ij}{\alpha }_{j}\right)}^{2}+\lambda \sum_{j=1}^{d}|{\alpha }_{j}|$$
(5)

Least Absolute Shrinkage and Selection Operator (Lasso) Regression is an extension of linear regression supplemented by shrinkage. The lasso approach favors models with fewer parameters and is well suited to settings with high degrees of multicollinearity, or for developing automation of some rudiments of model selection. Lasso models are more interpretable than ridge regression because a large \(\lambda\) compels some of the estimated coefficients to be exactly zero. The estimation accuracy of subset selection is driven by the disturbance present in the input dataset; to reduce the effect of outliers and to avoid numerical issues, a Tikhonov regularization term \(\frac{1}{2\gamma } \parallel \alpha {\parallel }_{2}^{2}\) with weight \(\gamma > 0\) is introduced along with the cutting-plane approach [47].
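The contrast between the two penalties can be seen in a short scikit-learn sketch; here the library's `alpha` parameter plays the role of the shrinkage weight \(\lambda\) above, and the collinear data is synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data with an almost-duplicated feature to induce multicollinearity.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
A[:, 4] = A[:, 0] + 0.01 * rng.normal(size=200)
B = 3 * A[:, 0] - 2 * A[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(A, B)   # norm-2 penalty (Eq. 4)
lasso = Lasso(alpha=0.5).fit(A, B)    # norm-1 penalty (Eq. 5)

print("ridge coefficients:", ridge.coef_.round(3))  # all shrunk, none exactly zero
print("lasso coefficients:", lasso.coef_.round(3))  # some driven exactly to zero
```

As expected, ridge spreads weight across the collinear pair, while lasso tends to zero out redundant coefficients.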

2.1.3 Regression Models Beyond Linearity

The linear model is naturally extended with complex non-linear terms, which may capture composite relationships between the predictors and the response. Non-linear regression models extend the family to include step functions, exponentials, local regression, smoothing, regression splines, and polynomial regression. Alternatively, Generalized Additive Models (GAMs) [48] maintain the additivity of the original predictors \({x}_{1}, \dots , {x}_{n}\), and the relation between every feature and the response y is expressed using nonlinear functions \({g}_{j}\left({x}_{j}\right)\) such that

$$y= {\alpha }_{0}+\sum_{j=1}^{d} {g}_{j}\left({x}_{j}\right)$$
(6)

While preserving a certain level of the predictor interpretability of linear models, GAMs escalate the flexibility and accuracy of prediction with the aid of non-parametric models such as boosting and random forests. Interaction terms are expressed in the form \({x}_{i}\times {x}_{j}\). The efficacy of GAMs is diminished in scenarios where the number of predictors exceeds the number of observations. Piecewise affine forms appear as suitable models when the underlying function is separable, discontinuous, or too fuzzy for complex nonlinear expressions [49, 50].
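As a simple illustration of moving beyond linearity, the sketch below fits a polynomial basis expansion with scikit-learn; the polynomial stands in for the smooth per-feature functions \({g}_{j}\) of a full GAM, and the sine-shaped data is synthetic:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic nonlinear ground truth: y = sin(x) + noise.
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=300)

# Degree-5 polynomial features feed an otherwise ordinary linear model.
model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
model.fit(x, y)
print("training R^2:", round(model.score(x, y), 3))  # close to 1 despite a linear fitter
```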

2.2 Classification

Classification refers to the segregation or mapping of unlabelled data items (entities α) based on a training dataset \((A, B)\) in which every \({\alpha }_{i}\) has a predefined class label \({B}_{i}\) in a specific category. Classification admits binary and multiclass approaches, including logistic regression, Linear Discriminant Analysis (LDA), Support Vector Machines (SVMs), and decision tree mechanisms [51].

2.2.1 Logistic Regression

In critical domains, a deterministic functional relationship \(y= g(x)\) between \(y\) and \(x\) may be absent. In this situation, the relation between \(y\) and \(x\) has to be described more generally by a conditional probability function \(E\left(y|x\right)\), assuming the training data contains independent samples from \(E\). Here the label \(y\) is assumed to be binary, i.e., \(y\in \left\{0,1\right\}\); the finest class-membership decision is to choose the label \(y\) that maximizes the distribution \(E\left(y|x\right)\). Logistic regression estimates the probability of belonging to one of the two categories of the dataset by [52]

$$E\left(y=1|x,{\alpha }_{0},\alpha \right)=H\left(x,{\alpha }_{0},\alpha \right)=\frac{1}{1+{e}^{-\left({\alpha }_{0}+{\alpha }^{\top }x\right)}}$$
(7)

The decision boundary between the binary classes is a hyperplane described by \({\alpha }_{0}+{\alpha }^{\top }x=0\). The parameters \({\alpha }_{0}\) and \(\alpha\) are obtained by the maximum-likelihood estimation method, minimizing the negative log-likelihood

$$- {\sum }_{i=1}^{k} \left({y}_{i}\,log\,H \left({x}_{i},{\alpha }_{0},\alpha \right)+ \left(1- {y}_{i}\right)log\left(1-H\left({x}_{i},{\alpha }_{0}, \alpha \right)\right)\right)$$
(8)

To converge to a globally optimal solution, first-order methods such as gradient descent (which locates a minimum of a differentiable function by taking repeated steps opposite the function's gradient at the current point, i.e., in the steepest-descent direction) and second-order methods such as Newton's method (where each iteration entails fitting a parabola to the graph of the function at a trial value and then moving to that parabola's stationary point) come into play. Further tuning of logistic regression models can be achieved by variable selection to avoid overfitting, with forward selection to add variables or backward elimination to withdraw variables based on the statistical relevance of the coefficients.
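A minimal gradient-descent sketch for Eqs. (7)-(8), assuming synthetic, linearly separable data, is given below; since the negative log-likelihood is convex, the first-order steps converge toward the global optimum:

```python
import numpy as np

# Synthetic binary classification data (illustrative only).
rng = np.random.default_rng(3)
k, d = 500, 2
X = rng.normal(size=(k, d))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha_0, alpha, lr = 0.0, np.zeros(d), 0.1
for _ in range(1000):
    p = sigmoid(alpha_0 + X @ alpha)   # E(y=1 | x), Eq. (7)
    alpha_0 -= lr * np.mean(p - y)     # gradient of the negative log-likelihood, Eq. (8)
    alpha -= lr * X.T @ (p - y) / k

accuracy = np.mean((sigmoid(alpha_0 + X @ alpha) > 0.5) == y)
print(f"training accuracy: {accuracy:.3f}")
```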

2.2.2 Decision Trees

Classification is often associated with a non-parametric model, the Decision Tree (DT), which reaches a conclusive decision on any hypothetical or real-world instance using decision rules expressed as a tree data structure. Statistical indicators (such as mean, median, or mode) underlie the model's intuitive predictions on the segmented training data. DTs are good for large datasets of low dimension and can handle both numerical and categorical values. Entropy, the average weighted probability, is calculated for each candidate node as \(H\left(s\right)= - {\sum }_{i=1}^{j} {p}_{i}log{p}_{i}\), where ‘H’ represents the entropy of the sample ‘s’ and ‘\({p}_{i}\)’ is the relative frequency of class ‘i’ in the data. Similarly, the Gini impurity, given as \(Gini=1- {\sum }_{i=0}^{j} {\left({p}_{i}\right)}^{2}\), evaluates the impurity of each candidate node, so the root with the least impurity can be picked easily. The Information Gain (IG), which quantifies the quality of a split, is represented as

$$IG\left({D}_{p},f\right)=I\left({D}_{p}\right)-\frac{{N}_{left}}{N}I\left({D}_{left}\right)-\frac{{N}_{right}}{N}I\left({D}_{right}\right)$$
(9)

which simplifies to \(IG\left(s, a\right)\), estimated as

$$IG\left(s, a\right)=H\left(s\right)-H(s |a)$$
(10)

where ‘H(s)’ is the entropy of the data and ‘H(s|a)’ is its conditional entropy given the variable ‘a’. To avoid overfitting, pruning along with other regularization techniques is taken into consideration. Pruning a tree is an essential measure to ensure unbiased decisions, represented as

$${R}_{\alpha }\left(T\right)=R\left(T\right)+ \alpha \left|\tilde{T }\right|$$
(11)

where ‘R(T)’ is the total misclassification rate of the terminal nodes, ‘\(\left|\tilde{T }\right|\)’ is the number of terminal nodes, and ‘\({R}_{\alpha }\left(T\right)\)’ is the cost-complexity measure. Various recursive procedures help split the training dataset into segments. Since recursive procedures are greedy by nature, they can fail to settle at the global optimum, opening the door to alternatives such as heuristic approaches based on mathematical programming paradigms (i.e., linear optimization) and dynamic programming. Consider the example of a simple classification tree that determines the health status and exercising needs of elderly people based on their activities; Figure 2 represents the decision process. Okaty et al. [53] propose a fresh stratum-based DT model for precise localization of anatomical landmarks in clinical image scrutiny. Liang et al. [54] provide an effective and privacy-preserving DT classification strategy for health monitoring systems (PPDT); they turn a DT classifier into a boolean trajectory, then encrypt it with symmetric-key encryption. Zhu et al. [55] present a novel Multi-ringed (MR) Forest framework based on DTs for the reduction of false positives in pulmonary nodule detection. Algorithms that generate decision trees from fed data include Classification and Regression Trees (CART), Iterative Dichotomiser 3 (ID3), C4.5, etc.
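The node-impurity measures above are easy to compute directly; the following sketch evaluates entropy, Gini impurity, and information gain for a hypothetical candidate split:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # IG(D_p, f) = I(D_p) - (N_left/N) I(D_left) - (N_right/N) I(D_right), Eq. (9)
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# Hypothetical binary labels at a node and a candidate split after sample 4.
parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]
print(f"H={entropy(parent):.3f}, Gini={gini(parent):.3f}, "
      f"IG={information_gain(parent, left, right):.3f}")
```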

Fig. 2 Decision tree to predict the need for exercising for elderly people based on their activities

2.2.3 SVM

Under the umbrella of supervised machine learning algorithms in the statistical learning category, SVMs receive vital attention for their optimization approach. SVMs intend to identify a hyperplane with a maximum margin separating two classes. Given a training set \(\left(A, B\right)\) with \(k\) training inputs, where \(A\in {R}^{k\times d}\) and \(B\in {\left\{-1, 1\right\}}^{k}\) is the binary response variable, SVM identifies the separating hyperplane as \({w}^{\top }a+\gamma =0\), where \(w\) represents the vector of coefficients for the input variables and \(\gamma\) is the intercept of the distinguishing hyperplane [56].

2.2.3.1 Hard margin SVM

Hard-margin SVM is the simplest version of SVMs and proceeds with the assumption that a hyperplane exists which perfectly separates the data into two classes without misclassification. This optimization technique is categorized as a linearly constrained convex quadratic problem. Training this model identifies a hyperplane that separates the data while keeping the distance from the margin of separation to the closest data point maximal. The distance of a data point \({a}_{i}\) to the hyperplane is given by

$$\frac{{B}_{i}\left({w}^{\top }{a}_{i}+\gamma \right)}{{\| w\| }_{2}}$$
(12)

where \({\| w\| }_{2}\) expresses the norm-2. Therefore, the data points with labels \(B= -1\) lie on one side of the hyperplane such that \({w}^{\top }{a}_{i}+\gamma \le -1\), while the data points with labels \(B= 1\) lie on the other side, \({w}^{\top }{a}_{i}+\gamma \ge 1\). To find the hyperplane, an optimization problem has to be solved:

$$\underset{w,\gamma }{min}\;\frac{1}{2}{\parallel w\parallel }_{2}^{2}$$
(13)

s.t. \({B}_{i}\left({w}^{\top }{a}_{i}+\gamma \right)\ge 1\ {\forall }_{i=1,\dots , k}\), \(w \in {R}^{d}\), \(\gamma \in R\), which is recognized as a convex quadratic problem. Forcing the separability of data on a linear hyperplane often trades off the accuracy of the optimization, which rules out the practicability of this version of SVM; this is where soft-margin SVMs outperform hard-margin SVMs.

2.2.3.2 Soft margin SVM

The convex quadratic problem becomes infeasible when the data is not linearly separable. An alternative exists: minimize the average error. To control the data points falling on the unfavorable side of the hyperplane, a slack variable \({\xi }_{i}\ge 0\) is introduced in the constraints of the objective function and penalized as a proxy. The soft-margin optimization problem is stated as

$${\parallel w\parallel }_{2}^{2} +P{\sum }_{i=1}^{k} {\xi }_{i}$$
(14)

s.t. \({B}_{i}\left({w}^{\top }{a}_{i}+\gamma \right)\ge 1-{\xi }_{i}\), \(w\in {R}^{d}\), \(\gamma \in R\), \({\xi }_{i}\ge 0\). Another alternative is to penalize the error term \({\xi }_{i}\) in the objective function using the squared hinge loss \({\sum }_{i}^{k}{\xi }_{i}^{2}\) instead of the hinge loss \({\sum }_{i}^{k}{\xi }_{i}\) to attain specificity of the soft-margin SVM. Replacing the norm-2 with the norm-1 turns the problem into a linear optimization problem.
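The effect of the penalty weight (scikit-learn's `C`, corresponding to \(P\) in Eq. (14)) can be demonstrated on synthetic overlapping classes; this is only a sketch, not a clinical model:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs: not linearly separable, so slack is required.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy={clf.score(X, y):.2f}")
```

A small `C` tolerates more slack (a wider margin with many support vectors), while a large `C` approaches hard-margin behavior.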

2.2.3.3 Sparse SVM

Various approaches have been proposed to deal with sparsity (feature selection) in SVM classification models, among which the 1-norm penalty and the elastic net (both 1-norm and 2-norm) are common. A hyperparameter tunes the bias toward one of the norms [57]. The number of selected features can be modeled in the soft-margin optimization problem using binary variables \({\rm Z}\in {\left\{0, 1\right\}}^{d}\), where \({\rm Z}_{j}=1\) indicates that feature \(j\) is selected and \({\rm Z}_{j}=0\) otherwise. A constraint restricting the number of features to a desired value \(r\) results in a mixed-integer quadratic problem:

$${\parallel w\parallel }_{2}^{2} +P{\sum }_{i=1}^{k} {\xi }_{i}$$
(15)

s.t. \({B}_{i}\left({w}^{\top }{a}_{i}+\gamma \right)\ge 1-{\xi }_{i}\ {\forall }_{i=1,\dots , k}\), \(w\in {R}^{d}\), \(\gamma \in R\), \({\xi }_{i}\ge 0\), \({\sum }_{j=1}^{d} {\rm Z}_{j}=r\).

2.2.4 SVR

Support Vector Regression (SVR) is a supervised machine learning technique designed to handle regression problems. Regression analysis comes in handy when observing the relationship between one or more predictor variables and a dependent variable, since it can balance model complexity and prediction error [58]. SVR is an extension of the classic SVM introduced for binary classification, buttressing the core idea by recognizing a linear function \(f\left(x\right)={w}^{\top }a+\gamma\) approximated with a tolerance \(\varepsilon\) on a training set \((A, B)\) where \(B\in R\) [59]. SVR has shown optimal performance in handling high-dimensional data in regression problems. SVR uses an approach similar to SVM, performing its task via hyperplanes defined by a few support vectors, and can competently handle non-linear regression [60]. However, a linear function might not always be achievable; thus slack variables \({\xi }_{i}^{-}\ge 0\) and \({\xi }_{i}^{+}\ge 0\), expressing deviations beyond the expected tolerance, are introduced and minimized in the manner of soft-margin SVMs. The optimization problem is stated as

$${\parallel w\parallel }_{2}^{2}\; +P{\sum }_{i=1}^{k} \left\{{\xi }_{i}^{-}+ {\xi }_{i}^{+}\right\}$$
(16)

Tuning the hyperparameter \(P\) adjusts the weight on deviations beyond the tolerance \(\varepsilon\). The deviation is measured by the \(\varepsilon\)-insensitive loss function \({\left|\xi \right|}_{\varepsilon }\), given by

$${\left|\xi \right|}_{\varepsilon }= \left\{\begin{array}{ll} 0, & \left|\xi \right|\le \varepsilon \\ \left|\xi \right|-\varepsilon , & otherwise\end{array}\right.$$
(17)
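A brief scikit-learn sketch of \(\varepsilon\)-insensitive SVR on synthetic data follows; the library's `C` corresponds to \(P\) in Eq. (16) and `epsilon` to the tolerance \(\varepsilon\):

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic 1-D regression data: y = sin(x) + noise.
rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

# Deviations inside the epsilon tube are not penalized (Eq. 17).
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors used:", len(svr.support_))
print("predictions at x=1,2,3:", svr.predict([[1.0], [2.0], [3.0]]).round(3))
```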

2.3 Clustering

Clustering is a widely used class of unsupervised learning that focuses mainly on grouping a set of objects into smaller clusters of similar genera. This common statistical data-analysis technique finds application in pattern recognition, bioinformatics, data compression, image analysis, and information retrieval. Healthcare sectors collect massive amounts of data from various healthcare service providers; this data may include patient information, medical tests, and treatment specifics. Because of the intricacy of the data obtained, analyzing it for decision-making on a patient's health state is tough, so healthcare practitioners currently use numerous strategies, such as clustering, to determine a patient's health state. Clustering divides huge datasets into smaller groups based on related properties [61] and is usually used to find commonalities between data points. Given an input \(A\in {R}^{k\times d}\), which includes k unlabelled observations \({a}_{1}, {a}_{2}, \dots , {a}_{k}\) with \({a}_{i}\in {R}^{d}\), clustering aims to procure \(K\) subsets of \(A\), i.e., individual clusters, which are homogeneous as well as separated. The cluster count acts as a tuning parameter that needs to be fixed before examining the clusters. The degree of separation and homogeneity can be modeled based on different criteria, giving rise to several types of clustering algorithms such as K-means clustering, capacitated clustering, hierarchical clustering, etc.

2.3.1 K-Means

K-means clustering, or minimum sum-of-squares clustering, is a vector quantization method that aims to partition the \(k\) data observations into \({\rm K}\) disjoint clusters, each sample affiliated with the cluster of the nearest mean. The decision on the number of clusters is made by close examination of the elbow curve or of similarity indicators such as the Calinski-Harabasz index and silhouette values, or via statistical programming approaches [62]. With binary variables \({\varphi }_{ij}\in \left\{0,1\right\}\), where \({\varphi }_{ij}=1\) if observation \(i\) belongs to cluster \(j\) and 0 otherwise, and the centroid \({\varphi }_{j} \in {R}^{d }\) of each cluster \(j\), the problem of minimizing the within-cluster variance is stated as the nonlinear program [63]

$${\sum }_{i=1}^{k} {\sum }_{j=1}^{\rm K} {\varphi }_{ij} {\| {a}_{i}-{\varphi }_{j}\| }_{2}^{2}$$
(18)

\(s.t.\, {\sum }_{j=1}^{\rm K} {\varphi }_{ij}=1,\forall i=1,2,\dots ,k\), \(\forall j=1,2,\dots ,{\rm K} , {\varphi }_{j} \in {R}^{d}\). Introducing the variable \({\delta }_{ij}\), which denotes the distance of observation \(i\) from centroid \(j\), the following linear formulation is obtained:

$${\sum }_{i=1}^{k} {\sum }_{j=1}^{\rm K} {\delta }_{ij}$$
(19)

\(s.t. \;{\sum }_{j=1}^{\rm K}{\varphi }_{ij}=1, \forall i=1,2,\dots ,k\) and \(\forall j=1,2,\dots , {\rm K}.\) Apart from the above-mentioned methods, several other alternatives, such as heuristic approaches based on the gradient method, bundle methods, and column generation, are in practice. Figure 3 represents the clusters, each classified distinctly around its K-means centroid.
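The choice of \({\rm K}\) via the elbow curve or silhouette values mentioned above can be sketched as follows (three synthetic blobs, so K = 3 should score best):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three synthetic 2-D blobs centred at -3, 0, and 3.
rng = np.random.default_rng(6)
A = np.vstack([rng.normal(c, 0.5, size=(60, 2)) for c in (-3, 0, 3)])

for K in (2, 3, 4, 5):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(A)
    print(f"K={K}: within-cluster SSE={km.inertia_:.1f}, "
          f"silhouette={silhouette_score(A, km.labels_):.3f}")
```

The within-cluster sum of squares (Eq. 18) always decreases as K grows, which is why the elbow or the silhouette peak, rather than the raw objective, guides the choice.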

Fig. 3 Clusters with K-means, classified

2.3.2 Capacitated Clustering

The Capacitated Centred Clustering Problem (CCCP) aims to form a set of clusters with limited capacity and maximal similarity, indicated by the similarity index of each cluster's mean. Considering a group of expected clusters \(1, 2, \dots , {\rm K}\), CCCP can be mathematically represented as

$${\sum }_{i=1}^{k} {\sum }_{j=1}^{\rm K} {\beta }_{ij}{\varphi }_{ij}$$
(20)

\(s.t.\,\) \({\sum }_{j=1}^{\rm K} {\varphi }_{ij}=1, {\sum }_{j=1}^{\rm K} {\upsilon }_{j}\le {\rm K},\) \({\varphi }_{ij}\le {\upsilon }_{j}, {\sum }_{i=1}^{k} {q}_{i}{\varphi }_{ij}\le {Q}_{j}\), where \({\rm K}\) is the upper bound on the number of clusters, \({\beta }_{ij}\) represents the measure of dissimilarity between cluster \(j\) and observation \(i\), \({Q}_{j}\) is the capacity of cluster \(j\), and \({q}_{i}\) is the weight of observation \(i\). Variable \({\varphi }_{ij}\) denotes the assignment of \(i\) to \(j\), and variable \({\upsilon }_{j}\) equals 1 when cluster \(j\) is used. If \({\beta }_{ij}\) is a distance and the clusters are homogeneous, then the formulation also models the well-known facility location problem [64].
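Since Eq. (20) is a mixed-integer program, it can be handed to an off-the-shelf solver. The sketch below uses the PuLP library with made-up dissimilarities \({\beta }_{ij}\), weights \({q}_{i}\), and capacities \({Q}_{j}\); it illustrates the formulation, not any benchmark instance:

```python
import pulp

# Toy CCCP instance: 6 observations, 2 candidate clusters (all data invented).
k, K = 6, 2
beta = [[1, 4], [2, 3], [4, 1], [3, 2], [1, 5], [5, 1]]   # dissimilarity beta_ij
q = [1, 1, 2, 1, 2, 1]                                    # observation weights
Q = [4, 5]                                                # cluster capacities

prob = pulp.LpProblem("CCCP", pulp.LpMinimize)
phi = pulp.LpVariable.dicts("phi", [(i, j) for i in range(k) for j in range(K)], cat="Binary")
v = pulp.LpVariable.dicts("v", range(K), cat="Binary")

prob += pulp.lpSum(beta[i][j] * phi[i, j] for i in range(k) for j in range(K))
for i in range(k):
    prob += pulp.lpSum(phi[i, j] for j in range(K)) == 1  # each point in one cluster
for j in range(K):
    for i in range(k):
        prob += phi[i, j] <= v[j]                         # only open clusters receive points
    prob += pulp.lpSum(q[i] * phi[i, j] for i in range(k)) <= Q[j]  # capacity

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([(i, j) for i in range(k) for j in range(K) if phi[i, j].value() == 1])
```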

2.4 Linear Dimension Reduction

Linear dimensionality reduction or shrinkage methods have been developed extensively over decades in statistics and applied fields, becoming an indispensable tool for analysing high-dimensional and noisy data. These methods improve a model's interpretability by producing a low-dimensional linear mapping of the original high-dimensional data that preserves features of interest in the output sample [65].

2.4.1 Principal Components

Principal component analysis (PCA) targets pruning the sum of squared residual errors between the original high-dimensional data and the projected data points. PCA is assessed in terms of explained variance, which refers to the amount of information retained from the original feature set \({a}_{1}, {a}_{2}, \dots , {a}_{d}\). PCA was formulated originally as

$$\frac{1}{k}{\sum }_{i=1}^{k} {\left({\sum }_{j=1}^{d} {\phi }_{j}^{1}{a}_{ij}\right)}^{2}$$
(21)

\(s.t.\, {\sum }_{j=1}^{d} {\left({\phi }_{j}^{1}\right)}^{2}=1,\) where \({\phi }^{1}\in {R}^{d}\) is a unit vector defining the first principal component. This "maximizing variance" derivation extends to the \(h\)-th component as

$$\frac{1}{k}{\sum }_{i=1}^{k} {\left({\sum }_{j=1}^{d} {\phi }_{j}^{h}{a}_{ij}\right)}^{2}$$
(22)

\(s.t.\, {\sum }_{j=1}^{d} {\left({\phi }_{j}^{h}\right)}^{2}=1,\) where \({\phi }^{{h}^{\top }}S{\phi }^{\iota }=0\) for all \(\iota =1, 2, \dots , h-1\), giving the \({h}^{th}\) principal component. PCA finds application in various data-analytics problems that benefit from dimensionality reduction. For linear regression models there exists Principal Component Regression (PCR), a two-stage procedure that inherits the properties of PCA together with the advantages of including fewer predictors and reduced prediction time on the same dataset. Amid all the resolute outcomes of PCA, its best-known drawback is interpretability.
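A short sketch of PCA on synthetic data with a known low-dimensional structure shows how the explained variance ratios expose that structure:

```python
import numpy as np
from sklearn.decomposition import PCA

# 10-D observations generated from a true 2-D latent structure plus noise.
rng = np.random.default_rng(7)
latent = rng.normal(size=(200, 2))
A = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=3).fit(A)
print("explained variance ratios:", pca.explained_variance_ratio_.round(3))
A_low = pca.transform(A)              # projection onto the leading components
print("reduced shape:", A_low.shape)  # (200, 3)
```

The first two ratios dominate, reflecting the two underlying latent directions.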

3 Problems in the Healthcare Sector

A change toward a data-driven socioeconomic health slant is taking place, owing to the increased volume, velocity, and diversity of data attained from the public and private sectors in healthcare and the natural sciences. Over the last five years, there has been remarkable advancement in informatics technologies and computational intelligence for use in the health and biomedical sciences. However, the full potential of data to address the breadth and extent of human health problems has yet to be realized. The properties of health data present intrinsic limitations to the effective implementation of typical data mining and ML technologies. Aside from the volume of data ('Big Data'), they are difficult to manage because of their complexity, heterogeneity, dynamic nature, and unpredictability. Finally, practical obstacles in applying new and current standards across different health providers and research organizations have hindered data management and the interpretability of the results (Oliveira et al. [125]).

5.1.2 Environmental and Instrumental Noise

The process of digital data collection and regulation is seldom free of environmental and instrumental disturbances. A little agitation in certain diagnostic procedures, such as multishot MRI where extensive supervision is required, can introduce undesirable noise into the solicited data, thereupon increasing the risk of misdiagnosis.

5.2 Vulnerabilities Due to Data Annotation

ML/DL applications require extensive model training for strong predictive performance. For medical applications, most models are trained extensively on clinically produced images, requiring every sample to be annotated. This tedious task of assigning labels should ideally be performed by clinical experts who can prepare domain-enriched datasets, or by automated algorithms [126]. Professionals are reluctant to take on labeling as a secondary task because it consumes a lot of their crucial time, so trainee staff (who have little domain expertise) are often employed instead. As a result, problems such as noisy labels, misclassification, and label imbalance arise. Several vulnerabilities due to data annotation are noted next.

5.2.1 Ambiguous Ground Truth

In medical datasets, even the ground truth can be ambiguous: Finlayson et al. [127] presented a study expressing the ambiguity in the ground truth of diagnostic results. Even well-defined diagnostic tasks are disputed among therapeutic experts; further mishandling and malicious attacks by perplexed users make diagnosis, and hence the treatment process, difficult even under expert supervision.

5.2.2 Improper Annotation

Proper annotation of data samples is critical for certain life-saving healthcare applications. ML/DL mechanisms deployed for automated image-labeling tasks can lead to coarse-grained labels and mislabelling [128]. These problems may challenge the predictive capabilities of healthcare systems, as discussed next.

5.2.3 Efficiency Challenges

Efficacy becomes the prime factor in monitoring an ML/DL-based system's performance. Particular challenges that influence data quality, and performance thereafter, are limited and imbalanced datasets, class imbalance and bias, and sparsity. Newly identified diseases do not have much available history; this limitation degrades a model's performance in predicting their outcomes. Class imbalance is a common problem in supervised ML/DL models that arises from a mismatched or non-uniform data distribution among the respective classes. Data sparsity refers to missing values in the input data that arise from skipped or unreported samples. All these problems significantly affect the functioning of ML/DL techniques. A brief sketch of common mitigations follows.
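A minimal sketch of two routine mitigations, median imputation for sparsity and class reweighting for imbalance, is shown below on synthetic data (the rates and thresholds are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Synthetic data: a rare positive class (~5%) and ~10% missing entries.
rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + rng.normal(scale=2, size=1000) > 3.5).astype(int)
X[rng.uniform(size=X.shape) < 0.1] = np.nan

X_filled = SimpleImputer(strategy="median").fit_transform(X)        # handle sparsity
clf = LogisticRegression(class_weight="balanced").fit(X_filled, y)  # handle imbalance
print("positive-class rate:", y.mean().round(3))
print("model coefficients:", clf.coef_.round(3))
```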

5.3 Vulnerabilities in Model Training

Vulnerabilities concerning ML/DL model training comprise partial training, model poisoning, privacy infringement, and incomplete data rendering. Improper training means feeding inappropriate parameters (such as epochs or the test/training ratio) to the model, as a result of which it is led to infer from a corrupt proposition. ML/DL models are also exposed to cyber-attacks such as adversarial attacks, Trojan attacks, and backdoor attacks, breaching the integrity of the underlying system [134].

6.4 Availability of Quality Data

Another shortcoming in the healthcare ecosystem is the limited availability of diverse, good-quality data. Daily, an extensive amount of heterogeneous patient-related information is generated across medical institutions, yet an inadequate amount of useful data reaches researchers and the scientific community to work on. Producing high-quality practical data requires resources and services with good maintenance and management. Ample quality data would enable professionals to develop systems for illness prediction and treatment. Data collected during practice can carry issues such as bias and redundancy that surface as adverse outcomes in the algorithms. Intelligent systems cannot differentiate racial bias from fair subjectivity, as they learn the behavior humans exhibit; for example, research has shown that an AI system could reproduce racial bias, as when a person with no health provision is denied medical services [135]. The training data thus contributes its own modeling challenges [136,137,138].

6.5 Causality is Challenging

Causality can be challenging from a medical perspective. Understanding the importance of causal reasoning, i.e., asking "What if?" while taking decisions on crucial healthcare problems, is imminent [139]. Consider a circumstance where we need to analyse how the outcome would be influenced if the doctor prescribed treatment 1 rather than treatment 2. Queries of this kind cannot be answered from observational data analysis alone but only through causal reasoning. In healthcare applications, learning and inferencing from observational data is the norm, but forming causal rationalizations from it is challenging and requires building causal models. ML/DL models lack fundamental reasoning under the hood and produce output based on correlations and patterns without considering the causal loop in between. In practical applications, this limitation of causal analysis may raise concerns about the predictions of AI systems. Acknowledging the causal effect of certain variables on target yields is paramount for fair predictive behaviour.

6.6 Updating Hospital Infrastructure is Inflexible

Healthcare organizations favor independent operations and mostly avoid sharing information. Frictionless knowledge exchange requires fixing and updating antiquated software, which can be time-consuming and is often not cost-effective. Finlayson et al. [127] reported that even in the late 2010s most infirmaries were operating on the ninth version of the International Classification of Diseases (ICD) system, even though the updated ICD-10 had been released in the early 1990s. The difficulties in upgrading hospital infrastructure and internal management systems raise concerns about the applicability of recent DL/ML practices.

7 Future Research Directions

In this section, various issues related to the security, privacy, and robustness of ML in the healthcare ecosystem that require active research attention are discussed.

7.1 Machine Learning on the Edge

The purposes of ML in healthcare applications have undergone revolutionary change and exponential growth in recent years. Research in ML has revolutionized traditional methods, opting for smart and energy-efficient utilization of wearable devices, IoT sensors, etc. With the development of smart cities and transportable medical devices such as portable ventilators, oxygen concentrators, and MRI machines, there is constant demand for refined ML models trained on edge devices. This imposes a few limitations, including a lack of available hardware support and of high computational processing capability. ML on edge devices is still at a nascent stage and requires attention from the research fraternity. Growth in this domain will lead to faster care in risky situations and continuous monitoring of a patient's health from a remote location, thereby improving healthcare facilities for a better lifestyle and timely medical assistance.

7.2 Handling Dataset Annotation

The output of AI systems is highly dependent on labeled datasets for training and inference. This requires medical experts and physiologists to annotate medical data (such as images, clinical reports, and signals) manually, spending a lot of valuable time on this tedious work. A variety of practical medical data glossed with accurate labels would improve the execution of ML/DL models and expose hindrances that might otherwise go unnoticed. Manual labeling of data into respective classes is, however, tedious and energy-draining. Automated approaches like active learning should be adopted and developed to address this impediment.

7.3 Distributed Data Management and ML

In healthcare systems, data generation is distributed: data is processed by various departments within a hospital and extends across other hospitals geographically. This puts pressure on efficient data sharing and management for clinical analysis, particularly with ML models. ML/DL models are developed on the general assumption that all analytical information is easily accessible and centrally available. The shortcomings caused by improper management of information exchange need the attention of developers and researchers, who could collaboratively tackle the administration of distributed data and ML.

7.4 Fair and Accountable ML

Qayyum et al. [140], analyzing the robustness and security of ML/DL techniques, reasoned that model results can be biased and lack accountability. Ensuring fairness and precision of predictions is of cardinal importance for life-critical applications in healthcare systems. Trading away the accuracy and accountability of these models could produce cynical outcomes and impose risk to patients' health. Fair predictions by ML/DL models are strained by the variety of cases with little available data. Taking into account the importance of fair judgment and interpretability, tuning models accordingly will make them robust and help them desist from the misjudgements found in past clinical records. Further study is needed in this area to develop dynamic methods that ensure safety and lessen imperfections.

7.5 Model-Driven ML

The practice of ML and AI for predictive analysis in healthcare applications comes with privileges as well as liabilities. Latif et al. [141] discussed the associated caveats in utilizing these tools; failing to note their lapses can prove critical in clinical terms. The perks of these models can convince one that data, once available in abundance, can drive hypothesis generation without any medical expert validation and interpretation, which attracts unavoidable problems. To avoid these quandaries, it is important to pursue a combined data-driven method, including hypothesis- and model-based approaches, to bring controlled precision to these studies. Building robust, secure, and accountable ML deliverables that are technically precise requires further research.

8 Conclusion

ML is driven by statistically informed algorithms distributed over categories such as regression, classification, and clustering. All of these algorithms assist in building intelligent solutions for automating clinical tasks and flagging suspected diseases. The traditional practice of services provided by healthcare systems has seen a vast change with the advent of ML- and DL-based approaches. However, to ensure secure, bias-free, and sound utilization of these models, the challenges must be addressed. This chapter provides a brief introduction to several ML algorithms, discusses their extent of reinforcement and controls, and marks reliable standards for bypassing shortcomings in model building. It also provides a synopsis of the challenges arising in the ML deployment pipeline for healthcare infrastructure by classifying the different origins of jeopardy in it. Finally, this work discusses possible solutions to provide users as well as clinical experts in a healthcare ecosystem with secure, robust, and privacy-protected ML solutions for privacy-demanding applications. The chapter closes with the potential pursuits of ML techniques in the healthcare sector and the privacy considerations linked with them.