1 Introduction

Medical data are now widely available in healthcare, and researchers can use them to diagnose and classify a variety of ailments [1]. Thyroid disorders have emerged as one of the most critical conditions to identify, predict, and treat early for a speedy recovery. The major goal of this work is to develop an accurate prediction system that identifies risk factors in order to select the relevant patient attributes. Numerous researchers have employed data mining algorithms [2,3,4,5,6,7] to create effective prediction systems. Various specialists have also incorporated tools that manage digitally stored clinical information according to the needs of the researcher. For the diagnosis of thyroid-binding protein disorders using an Artificial Neural Network (ANN), a straightforward data preparation technique has been introduced. The key improvements to the data handled by the ANN in that work were filling in the missing values and encoding the various attributes and classes. Another researcher conducted a comparative analysis, based on minimizing the number of selected features, between two class prediction systems (hyperthyroidism and hypothyroidism). In that research, the author reduced the feature set using a genetic algorithm and classified the data using a support vector machine (SVM).

Other researchers presented an algorithm for classifying thermographic images captured by thermal cameras. Such a model is advised as an alternative to X-rays or ultrasounds, whose hazards prevent them from being used frequently. To obtain better results, a Bayesian Classifier (BC) was employed in that work to predict the thyroid class, and its results were compared with those of a doctor. The author of another thesis built a classifier on the entire data set and used it as a supervised learning method. A highly successful effort was undertaken in Denmark, involving a large volunteer population. The impact of the condition on a person’s life, and even on their capacity to work, was examined and demonstrated, along with the significance of early detection and prediction. Some studies employed mathematical approaches such as multigene genetic programming to improve the prediction of thyroid disorders. The quality of the suggested model was assessed using scatter and histogram diagrams and the Root Mean Square Error. The author recommended the Back Propagation Neural Network (BPNN) for this kind of illness prediction. However, some researchers found Linear Discriminant Analysis (LDA) better suited to handling the enormous volume and complexity of healthcare data. That work was designed for single-class detection (hypothyroid) with a modified cross-validation. Researchers continue to study these kinds of disorders in pursuit of better results. The best-known contribution in this area was thyroid prediction using a variety of data mining methods [8]. It demonstrated the benefits, the application of certain methods, and the rationale for using these algorithms. While random forests make dealing with thyroid disorders and diseases easier, identifying these disorders can still be challenging, particularly when dealing with large amounts of data.
The author asserts that there are significant differences between the types of electronic health record (EHR) data and that the LDA approach employed in that work demonstrated greater accuracy than earlier LDA techniques. A second survey was conducted to predict thyroid disease using two classifications, and it was analyzed using four different techniques: Naïve Bayes, Decision Tree, Multilayer Perceptron, and Radial Basis Function Network.

The findings of this study provide insightful information about the use of machine learning algorithms in the classification of thyroid diseases, potentially enhancing the precision and effectiveness of diagnosis. Such implementations could help medical personnel make wise choices, lessen the number of incorrect diagnoses, and ultimately improve patient care and results. The results of this study add to the body of knowledge in the area of medical informatics and could have wider ramifications for the creation of automated diagnostic tools for several disorders.

If detected early, thyroid cancer is one of the most treatable types of cancer. This study aims to examine and forecast thyroid cancer based on the year, gender, and age group using machine learning algorithms. The suggested study is a descriptive cross-sectional investigation that makes use of data on thyroid cancer incidence from the World Bank for Cancer as support. Understanding the global impact of thyroid cancer by gender and age is the goal of this study. Finally, this article employs a machine learning method and reports its predictive accuracy on a limited set of thyroid cancer features. These days, the prevalence of thyroid disease is rapidly increasing, and it affects more women than men. The abnormal production of thyroid hormones is the most frequent cause of thyroid diseases. An excessive amount of hormones results in hyperthyroidism, whereas inadequate hormone production results in hypothyroidism. Thyroid dysfunction is characterized by symptoms such as fatigue, dry skin, cold intolerance, facial swelling, irregular menstrual cycles, and hair loss. Because the majority of treatments involve taking long-term medication, prevention is significantly more important than treatment for such conditions. To take preventative measures against thyroid illness, it is crucial to examine the thyroid dataset for early disease identification. Because they enable clinicians to make better decisions, machine learning techniques are crucial in the medical field. Figure 1 shows a diagram of the thyroid gland.

Fig. 1 Thyroid gland

There are two further conditions to think about: thyroiditis and Hashimoto’s thyroiditis.

1. The synthesis and decrease of thyroid hormones

Iodine deficiency inhibits the production of T3 and T4 hormones, which causes hypertrophy of the thyroid tissue and the emergence of goiter, a different illness [8].

The thyroid gland is regulated by the brain’s hypothalamus and pituitary gland. The hypothalamus releases thyrotropin-releasing hormone (TRH), which stimulates the pituitary gland to release thyroid-stimulating hormone (TSH). When they are functioning normally, the hypothalamus and pituitary gland detect low thyroid hormone levels and secrete more TSH; they likewise detect high thyroid hormone levels and secrete less TSH.

2. The signs and causes of thyroid disorders

Having introduced hyperthyroidism and hypothyroidism, we now examine the causes of the imbalance in the thyroid gland’s hormone production, which are listed in Table 1. In addition, there are a number of symptoms associated with thyroid illness. For example, a person with hyperthyroidism may experience muscle weakness, sensitivity to heat, an enlarged thyroid gland or goitre, anxiety, difficulty sleeping, vision issues, irregular menstruation, and so on. In contrast, a person with hypothyroidism may experience fatigue, weight gain, forgetfulness, heavy menstrual flow, dry scalp, hoarse voice, coarse hair, and sensitivity to cold temperatures.

Table 1 Causes of hyperthyroidism and hypothyroidism

1.1 Role of machine learning to predict thyroid disease

Thyroid illness diagnosis and prognosis are greatly aided by machine learning (ML). Thyroid disease refers to disorders of the thyroid gland that impact the production and regulation of hormones. These disorders include hypothyroidism, hyperthyroidism, and thyroid cancer. ML approaches can be used for a number of prediction and management aspects related to thyroid disease:

1.1.1 Early identification and assessment

Risk evaluation To determine a patient’s risk of developing thyroid diseases, machine learning algorithms can evaluate patient data, including medical history, test results, and demographic details.

Diagnostic models To create diagnostic models, machine learning models can be trained on datasets that include details about patients with thyroid conditions. These models can examine clinical factors, imaging data, and symptoms to assist in a precise and timely diagnosis.

1.1.2 Tailored care programmes

Treatment response prediction Medical professionals can better customize treatment regimens for the best results by using machine learning (ML) to forecast a patient’s potential response to various treatment alternatives.

Medicine adjustment By analyzing patient responses to medicine over time, algorithms can help medical personnel modify treatment plans or dosages to better meet the needs of specific patients.

1.1.3 Image analysis

Radiological imaging Radiologists can get help from machine learning (ML) approaches like computer-aided diagnosis (CAD) when analyzing CT, MRI, or thyroid ultrasound images. This can help identify any anomalies, such as tumors or nodules.

Thyroid nodule biopsy ML algorithms can be used to evaluate the findings of a fine-needle aspiration (FNA) biopsy and help classify thyroid nodules as benign or cancerous.

1.1.4 Data integration and fusion

Combining information from various sources Machine learning models can integrate data from several sources, such as genetic information, electronic health records, and lifestyle characteristics. This allows for a more comprehensive picture and more precise predictions.

Feature selection Machine learning algorithms are capable of automatically spotting pertinent patterns and features in a variety of datasets, which enhances the precision and effectiveness of thyroid illness prediction models.

1.1.5 Observation and investigation

Disease progression monitoring ML models are able to continuously evaluate patient data in order to track the advancement of a disease and offer early alerts regarding possible problems.

Long-term outcome prediction ML can assist in making decisions about continued care and management by predicting long-term results by evaluating past patient data.

2 Literature review

The goal of the literature review is to identify research gaps by concentrating on the kind of attribute, the type of classification algorithm, the size of the dataset, and the repository used. The data is shown in tabular form below to give an overview of the main work:

The author of paper [16] presented an interactive method based on machine learning for predicting thyroid illness. Before the ML approach is applied, data cleaning is performed on the dataset taken from the UCI machine learning repository.

The study implements the ANN, KNN, SVM, and DT algorithms and uses the mean absolute error and accuracy as evaluation parameters. According to the results, SVM performs better than all the other algorithms.

Another study considers fifteen attributes and applies the SVM, KNN, and Naïve Bayes algorithms. It does not, however, provide the assessment parameters or the outcome analysis [8]. Furthermore, using an excessive number of parameters would only add complexity and cost while decreasing the accuracy of the predictions.

In [17], the author discussed selected thyroid disorders (four hyperthyroid and four hypothyroid subtypes) while taking into account five parameters, TSH, T3, TT4, T4U, and FTI, with their accepted ranges. The dataset has 29 attributes. The suggested algorithms are validated in terms of selectivity, sensitivity, accuracy, and precision. The paper also takes computation time and speed into account. The results demonstrate that the suggested NSGA-II algorithm performs better and is computationally more efficient than the other methods (SVM and CNN).

In [18], thyroid illness is classified using feature selection and meta-classifiers. The study performs data cleansing and preprocessing before applying the boosting and bagging methods. It uses the UCI ML dataset and takes into account 30 attributes. The outcomes are not compared with other state-of-the-art works.

The use of machine learning techniques for early diagnosis is the topic of paper [19]. It classifies data using predictive modeling, which is followed by the NB algorithm, decision tree ID3, and decision tree. It makes a prediction based on eight attributes. The assessment parameters and result analysis aren’t covered in full in the study, though.

The prediction of a particular thyroid condition, carcinoma, is addressed in [20]. Thyroid cancer is curable if diagnosed in its early stages, and the author therefore suggests using a machine learning algorithm to predict thyroid cancer based on factors such as age, gender, and year. The cross-sectional study based on these attributes is presented at a global level.

Another paper is proposed in [21] that deals with the use of deep neural networks for the identification of thyroid cancer. The goal of the paper is to increase thyroid cancer diagnosis accuracy. The study emphasizes the difficulty of applying deep learning to minimize the number of factors influencing thyroid cancer detection.

The work employs a 4-layer neural network with three hidden layers and a single output. While the sigmoid function is applied to the output layer, the ReLU activation function is utilized for hidden layers. The model’s accuracy, according to the author, is 98%. In [22], a comparative analysis of methods based on machine learning for the diagnosis of thyroid illness is covered. The UCI repository appears to be the most frequently utilized dataset by researchers, according to the report.

The use of levothyroxine sodium (LT4), a thyroid hormone, in treating thyroid disease is examined in [23] along with a machine-learning approach to the problem. The study considers patients with hypothyroidism and attempts to forecast the course of their LT4 treatment. It builds its own data collection, “AOU Federico II”. It makes use of ten distinct classifiers and, with the extra-tree classifier, reports a maximum accuracy of 84%. It makes use of the thyroid parameter and 27 characteristics. The paper uses AdaBoost, Gradient Boosting, XGBC, and CatBoost as its primary classifiers. Metrics including accuracy, precision, recall, and F-score are taken into account.

Paper [24] uses the UCI ML dataset to test two machine learning techniques: random forest and SVM. The study concludes that SVM is a superior machine learning technique, utilizing f-score, accuracy, precision, and recall as evaluation measures. The comparable methods can also be applied to other medical datasets containing distinct disorders.

Paper [25] provides comprehensive information on thyroid illness. It tackles the problems of small datasets, unvalidated results, and restriction to binary classification. The study examines ML and DL models together with the feature engineering techniques FFS, BFS, BiDFE, and extra tree classifier-based feature selection.

It makes use of a random forest classifier, yielding results with 99% accuracy in predicting thyroid illness. The study addresses the prediction of primary hypothyroidism, binding protein hypothyroidism, compensated hypothyroidism, and concurrent non-thyroidal illness. It considers 31 attributes across 9172 samples. RF involves the least computation. While conventional classifiers yield limited results, the extra tree classifier achieves 99% accuracy.

The goal of the paper is to expand the study’s class size in subsequent work.

The authors of [26] use the open-source KEEL repository to address thyroid diseases. Using ML-based attribute estimators, the study considers three feature selection techniques: select from model (SFM), recursive feature elimination (RFE), and select k-best (SKB). It reports 99.27% accuracy, 97% precision, and 98% recall.

The authors of [27] examine ML and DL methods for thyroid illness prediction. It finds that the UCI machine learning repository is being used in important studies. It tackles the problem of incorrect diagnosis, which can cause suffering for the patient.

An ensembling technique is used in [28], which addresses the class imbalance issue by down-sampling the UCI dataset to balance it. It mainly addresses hypothyroidism using an ensemble voting classifier. The disadvantage of this paper is that the down-sampling reduces the data to 230 instances, which the author aims to address in the future.

Paper [29] studies various radiomics approaches using ML techniques to identify thyroid disease. It considers data sources from various modalities. The study’s key finding indicates a multicentric approach to handle clinically accurate mechanisms.

Researchers in [30] presented predictive models for thyroid cancer. They developed and optimized SVM, XGBoost, RF, and NN models through fivefold cross-validation and Bayesian optimization. However, the paper has limitations concerning selection bias and domain-specific data that may not be relevant in generic terms.

Predictive analysis using Industry 4.0 is carried out in [31] to study cardiovascular disease. It introduces the concepts of hyperparameter tuning and ensemble techniques and implements ML techniques including SVM, logistic regression, random forest, KNN, and decision tree. Hyperparameter optimization mitigates the effects of overfitting and underfitting, thereby providing an accurate predictive model.

An IoT-based machine learning predictive model is introduced in [32]. It also utilizes standard ML techniques, viz. random forest, decision tree, KNN, and SVM, for cardiovascular disease prediction. It indicates that Random Forest performs best with hyperparameter tuning and Naïve Bayes performs best without tuning.

3 Description of dataset

3.1 Selection of dataset

The UCI Repository is the source of the thyroid dataset (Sick dataset). The primary information contained in the database is the patient’s name, personal contact information, and any prior medical history. These details are kept in the database and used as the patient record for any additional clinical assessment. The dataset attributes are prioritized: the characteristics that are more likely to indicate thyroid disease are taken into account, while the others are disregarded. The attribute values are either Boolean (True/False) or continuous. Age, gender, hyperthyroidism, hypothyroidism, pregnancy, and T3, T4, and TSH levels are the primary factors taken into account. This work uses the UCI dataset with 27 attributes and a sample size of 2800: 1830 female, 860 male, and 110 samples with missing gender information. The attributes and their values are shown in Table 2. Owing to the physiological and emotional demands on the patient, the attributes contribute differently depending on gender, age, and hormone levels.

Table 2 Attributes types and values

4 Architecture of thyroid prediction

The architecture comprises the following components, as shown in Fig. 2.

Fig. 2 Flowchart of thyroid prediction

Data source The UCI Machine Learning Repository is the source of the Thyroid Dataset, which the system loads at startup. This dataset contains thyroid function-related medical data. During data preparation, preprocessing removes unwanted outliers, missing values, null values, and noisy data from the raw UCI dataset to make it suitable for the next step.
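The preprocessing step can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work: it assumes that missing values are marked with "?", that duplicate records should be dropped, and that missing numeric values are filled with the column median instead of being removed. The helper `clean_rows` and the sample records are hypothetical.

```python
from statistics import median

MISSING = "?"  # assumed marker for a missing value in the raw records


def clean_rows(rows, numeric_cols):
    """Drop exact duplicate records, then fill missing numeric values
    with the median of the observed values in the same column."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input is not mutated

    for col in numeric_cols:
        observed = [float(r[col]) for r in unique if r[col] != MISSING]
        fill = median(observed)  # raises if a column has no observed value
        for r in unique:
            r[col] = fill if r[col] == MISSING else float(r[col])
    return unique


# Hypothetical records, not taken from the dataset
raw = [
    {"age": "41", "TSH": "1.3", "T3": "2.5"},
    {"age": "23", "TSH": "4.1", "T3": "?"},
    {"age": "23", "TSH": "4.1", "T3": "?"},  # exact duplicate
    {"age": "46", "TSH": "?", "T3": "1.9"},
]
cleaned = clean_rows(raw, ["age", "TSH", "T3"])
```

Median imputation is only one option; dropping incomplete records, as the description above suggests, is equally valid when enough samples remain.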

Selection and extraction of features:

Feature extraction From the raw data, the system first extracts pertinent features. These characteristics are probably numerical depictions of important thyroid-related characteristics.

Feature selection The system chooses the most crucial features after extraction to lower dimensionality and maybe enhance model performance. The feature selection includes identifying 6 relevant attributes out of the 28 attributes. It therefore reduces the complexity and highlights the significant contributors for the prediction of thyroid disease.

Data division There are two sections to the dataset:

Training data These are used to train the classification model.

Testing data These are used to assess how well the model performs.
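The data division can be illustrated with a stratified split, which keeps the share of each class roughly equal in both parts. This is a sketch under stated assumptions: the 80/20 ratio, the fixed seed, the class labels, and the helper `stratified_split` are illustrative choices, not values reported in this work.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=42):
    """Split samples into training and testing parts while keeping the
    proportion of each class approximately the same in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(samples[i], labels[i]) for i in sorted(train_idx)]
    test = [(samples[i], labels[i]) for i in sorted(test_idx)]
    return train, test


# Hypothetical imbalanced data: 10 "sick" and 90 "negative" samples
samples = list(range(100))
labels = ["sick"] * 10 + ["negative"] * 90
train, test = stratified_split(samples, labels)
```

Stratification matters for imbalanced medical data, because a purely random split can leave the testing part with very few positive cases.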

Classification algorithm The training data is used to train a classification algorithm. Finding patterns that can differentiate between various thyroid conditions—such as normal, hypothyroid, and hyperthyroid—is the aim of this research.

Prediction algorithm Using the testing data, the trained model is then applied to generate predictions.

Ensemble methods The architecture alludes to the possible application of ensemble methods, which merge several models to improve forecast accuracy.

Knowledge base It is integrated and probably contains material unique to the thyroid condition domain.

Evaluation of Performance: The model’s effectiveness is assessed by looking at the predictions it makes using the test data. This indicates areas for improvement and helps assess the efficacy of the model.

5 Methodology

Supervised learning is the data mining task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object, usually a vector, and a desired output value, also known as the supervisory signal. A supervised learning algorithm analyzes the training data and produces an inferred function that can be used for mapping new examples. In the ideal case, the algorithm correctly determines the class labels for instances that have not yet been seen. This requires the algorithm to generalize from the training data to unseen situations in a “reasonable” way.
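The idea of inferring a function from labeled pairs and applying it to unseen inputs can be shown with a deliberately simple learner. The sketch below uses a nearest-centroid rule, which is not one of the classifiers used in this work; the feature values, labels, and function names are hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(X, y):
    """Learn one mean vector (centroid) per class from labeled pairs."""
    grouped = defaultdict(list)
    for features, label in zip(X, y):
        grouped[label].append(features)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Map a new input to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical training pairs: input vector [T3, TT4] and supervisory signal
X_train = [[0.4, 60.0], [0.6, 70.0], [2.0, 110.0], [2.2, 120.0]]
y_train = ["hypothyroid", "hypothyroid", "negative", "negative"]

model = fit_centroids(X_train, y_train)
label_a = predict(model, [0.5, 65.0])   # an input not seen during training
label_b = predict(model, [2.1, 115.0])  # another unseen input
```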

Figure 3 presents the flowchart of the proposed work. First, the UCI repository provides the dataset required for the research work. In the next step, any errors or inconsistencies are eliminated from the data using data cleaning tools. The pertinent elements of the data that could be helpful in thyroid prediction are then extracted, a step known as feature extraction.

Fig. 3 Flowchart of the proposed work

Three classifiers are then implemented, namely Support Vector Machine, Random Forest, and XGBoost (proposed), which provide distinct machine learning models for thyroid prediction with varied results. The ensembling technique is then introduced, in which a single, more precise prediction is produced by combining the predictions made by the three models. Finally, the model’s performance is assessed to determine its accuracy in predicting thyroid disease. The results are saved for subsequent use if the model performs satisfactorily. If the model’s performance is not up to par, the procedure is repeated from step 3 to increase the model’s accuracy.

5.1 Features to identify thyroid disorder

Table 3 elaborates on the significant attributes used for the identification of thyroid disease.

Table 3 Significant attributes

Figure 4 shows the importance of various features in the detection of thyroid disease. It is evident from the figure that attribute T3 contributes around 50% to the identification of thyroid disease. The other attributes, viz. T4U, TT4, FTI, and TSH, contribute around 10% each, with a collective contribution of approximately 35% to 40%. Hence, these five attributes together account for 85–90% of the contribution to the detection of thyroid disease. The figure illustrates the relative significance of many factors in forecasting an entity known as “T3”. T4U, TT4, FTI, TSH, age, source, sex, illness, thyroxine, and hypothyroid are among the contributing factors. The thyroid gland, which is in charge of generating hormones that control growth, development, and metabolism, is connected to all of the characteristics. T4U is the type of thyroid hormone that is most crucial for predicting T3. Important characteristics also include age, TT4, FTI, and TSH. The least significant characteristics are hypothyroid, thyroxine, ill, source, and sex.

Fig. 4 Feature selection of thyroid disease
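Selecting features by their importance can be sketched as ranking the attributes and keeping the top ones until a chosen share of the total importance is covered. The scores below are illustrative placeholders loosely following the proportions described for Fig. 4 (T3 near 50%, four attributes near 10% each); they are not values measured in this work, and the 0.85 threshold and the helper name are assumptions.

```python
def select_by_cumulative_importance(importances, threshold=0.85):
    """Rank features by importance and keep the highest-ranked ones
    until their cumulative importance reaches the threshold."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, total = [], 0.0
    for name, score in ranked:
        if total >= threshold:
            break
        selected.append(name)
        total += score
    return selected, total


# Illustrative importance scores (placeholders that sum to 1.0)
illustrative_importances = {
    "T3": 0.50, "T4U": 0.10, "TT4": 0.10, "FTI": 0.09, "TSH": 0.09,
    "age": 0.05, "sex": 0.03, "sick": 0.02, "on_thyroxine": 0.02,
}
selected, covered = select_by_cumulative_importance(illustrative_importances)
```

With these placeholder scores, the five hormone-related attributes are kept and the remaining attributes are dropped.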

5.2 Classification methods

Random forest A popular and effective machine learning technique, random forest is a member of the ensemble learning family. To reduce overfitting and increase overall accuracy, it combines the predictions of several decision trees. Random forest is an effective and adaptable tool for handling a variety of machine-learning applications. Both practitioners and scholars favor it because of its high accuracy, resilience, and simplicity of use. Compared to simpler models such as linear regression, random forests may require more computing power. The random forest algorithm is also less transparent than some other algorithms because it can be difficult to interpret the individual decision trees inside the forest.

Support vector machine (SVM) is strong and adaptable; it is mainly applied to classification tasks. It finds the optimal hyperplane, or decision boundary, in high-dimensional space to separate the classes. It maximizes the margin between the hyperplane and each class’s closest data points, which are known as the support vectors. It can also use kernel functions to handle intricate, non-linear relationships between features. It is an effective technique with high accuracy for handling challenging classification problems. However, careful consideration is needed because of its high computational cost and sensitivity to hyperparameter tuning.
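For reference, the margin maximization described above can be written in its standard hard-margin, linear form. The symbols are introduced here for illustration and do not appear elsewhere in this paper: $x_i$ is a feature vector, $y_i \in \{-1, +1\}$ its class label, and $w$ and $b$ define the hyperplane $w^{\top}x + b = 0$.

$$ \mathop {\min }\limits_{w,b} \;\frac{1}{2}\left\| w \right\|^{2} \quad {\text{subject to}}\quad y_{i} \left( {w^{\top} x_{i} + b} \right) \ge 1,\quad i = 1, \ldots ,n. $$

The width of the margin is $2/\left\| w \right\|$, so minimizing $\left\| w \right\|$ maximizes the margin, and the training points that satisfy the constraint with equality are the support vectors.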

XGBoost It is a strong tool for obtaining high accuracy in various machine-learning tasks. It is a popular option for numerous applications due to its effectiveness and feature set. However, due to its intricacy and black-box design, it needs to be used with care. It is akin to Random Forest but replaces the independently grown decision trees with boosted ones. Every tree aims to correct the mistakes made by the trees before it, creating a stronger model over time. It is highly efficient and performance-optimized, and it frequently produces state-of-the-art results in a variety of tasks.
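The sequential correction of errors can be summarized by the generic additive update used in gradient boosting. The notation is an assumption made here for illustration: $F_{m}$ is the ensemble after $m$ trees, $h_{m}$ is the tree fitted at round $m$ to the errors of $F_{m-1}$, and $\eta$ is the learning rate.

$$ F_{m} (x) = F_{m - 1} (x) + \eta \,h_{m} (x),\quad m = 1, \ldots ,M. $$

XGBoost follows this scheme and additionally penalizes tree complexity through a regularization term in its objective.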

5.2.1 Ensemble technique

An ensemble can perform better and generate better predictions than a single model. An ensemble model also reduces the spread, or dispersion, of the model’s performance and predictions. Machine learning tasks such as anomaly detection, regression, and classification can all be handled with ensemble approaches. For example, stacking uses a meta-model to aggregate predictions from various models. It builds upon the advantages of every base model to form an even more potent ensemble. Ensembles are appropriate for a variety of applications because they can handle complicated relationships in data. However, they do have certain drawbacks, namely the possibility of overfitting and higher computing complexity.

The current work utilizes the ensemble technique. The current state-of-the-art models include SVM, KNN, and Naïve Bayes, each with specific advantages over the others. To utilize the collective benefits of all the models in a single unit, the ensemble technique is introduced, which enhances performance in the prediction of thyroid disease.

A voting classifier is a machine learning model that is trained on an ensemble of several models and predicts an output (class) based on the class with the highest probability of being chosen by those models.

It simply aggregates the results of every classifier passed into the voting classifier and predicts the output class that receives the largest majority of votes. The idea is to build a single model that trains on these classifiers and predicts the output from their combined votes for each output class, rather than building separate dedicated models and determining the accuracy of each one.

It supports two types of voting:

Hard voting In hard voting, the predicted output class is the class with the largest majority of votes, that is, the class predicted most often by the individual classifiers.

Soft voting In soft voting, the predicted output class is the class with the highest average predicted probability across the classifiers.
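The difference between the two voting schemes can be seen in a small example. This is a sketch only: the probabilities are hypothetical outputs of three classifiers for a single patient, the class names are placeholders, and the helpers `hard_vote` and `soft_vote` are not part of any library.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(class_probabilities):
    """Average the per-class probabilities over all classifiers and
    return the class with the highest average, plus the averages."""
    classes = class_probabilities[0].keys()
    n = len(class_probabilities)
    averaged = {c: sum(p[c] for p in class_probabilities) / n for c in classes}
    return max(averaged, key=averaged.get), averaged


# Hypothetical class probabilities from three classifiers for one patient
probabilities = [
    {"negative": 0.55, "sick": 0.45},
    {"negative": 0.60, "sick": 0.40},
    {"negative": 0.10, "sick": 0.90},
]
# Each classifier's own label is its most probable class
labels = [max(p, key=p.get) for p in probabilities]

hard_result = hard_vote(labels)
soft_result, averaged = soft_vote(probabilities)
```

In this example two of the three classifiers vote "negative", so hard voting returns "negative", while the one confident "sick" prediction raises the average probability of "sick" to about 0.58, so soft voting returns "sick". Soft voting therefore takes the confidence of each classifier into account, which hard voting ignores.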

6 Result and discussion

Thyroid detection requires the identification of significant attributes, which were identified using random forest feature importance. After the selected features were identified, the dataset was fed to three separate algorithms, namely Random Forest, SVM, and XGBoost. The models were implemented and their results compared on the basis of F1-score, accuracy, precision, and recall. XGBoost outperformed the other models when compared using the ROC curve. Figure 5 presents the ROC curves, which plot sensitivity against 1 − specificity. Three ROC curves, for the Random Forest, SVM, and XGBoost models, are displayed in this figure. Each curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold levels. The model’s overall performance is gauged by the area under the curve (AUC), where a higher AUC denotes better performance. Random Forest curve (area = 0.867): the comparatively high AUC suggests that the Random Forest model performs well in differentiating between true positives and false positives. It is not, however, the best-performing model in this instance. The SVM curve (area = 0.718) has a lower AUC than the Random Forest curve, which shows that the SVM model is less effective at differentiating between true and false positives. The XGBoost curve (area = 0.925) has the highest AUC of the three models, making XGBoost the best-performing model in this instance. It can minimize false positives while accurately identifying true positives. Based on the ROC curves displayed in the figure, the XGBoost model appears to be the top-performing model for this task overall. It is crucial to remember that a model’s performance can change based on the particular task and data. A diagonal line drawn from the bottom left corner to the top right corner represents a classifier that performs no better than random guessing (AUC = 0.5). The optimum ROC curve instead passes through the top left corner, which would indicate a flawless model that can accurately categorize every case (AUC = 1). The model performs better the closer its curve lies to the top left corner, that is, the farther it lies above the diagonal line. Figure 6 below shows the ROC curve of the ensemble method.
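How an ROC curve and its AUC are obtained from classifier scores can be sketched in a few lines. The labels and scores below are hypothetical and are unrelated to the curves in Fig. 5; the functions `roc_points` and `auc` are illustrative helpers that use the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points obtained by lowering the decision
    threshold step by step over the predicted scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples that share a score cross the threshold together
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = sick) and predicted scores for six patients
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_auc = auc(roc_points(y_true, scores))

# A classifier that gives every sample the same score lies on the diagonal
chance_auc = auc(roc_points([1, 0], [0.5, 0.5]))
```

Here `example_auc` is 8/9 (about 0.89), because eight of the nine positive-negative pairs are ranked correctly, and `chance_auc` is 0.5.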

Fig. 5 a and b ROC and confusion matrix of random forest, c and d ROC and confusion matrix of SVM, e and f ROC and confusion matrix of XGBoost

Fig. 6 ROC curve of the ensemble method

6.1 Confusion matrix

The F1-score represents the harmonic mean of precision and recall. Precision describes how accurate the positive predictions are, and recall indicates how effectively the model identifies the actual positives. The F1-score lies in the range of 0–1, and a higher F1-score indicates better model performance. A higher F1-score with low accuracy, or a lower F1-score with higher accuracy, are both undesired model outcomes. It is highly desirable to have both a high F1-score and a high accuracy to ensure credible model performance. Because it does not count true negatives, the F1-score is less easily inflated by class imbalance than accuracy, which makes it a more reliable indicator on imbalanced data. The parameters involved in the calculation of the F1-score and other significant performance metrics are presented in Fig. 7.

Fig. 7 Performance metric identification mechanism

The performance of various ML classifiers is examined using the performance metrics listed below.

Precision It is a metric that assesses how accurate a classifier’s positive predictions are, i.e., the fraction of cases predicted positive that are truly positive. It is also known as the positive predictive value (PPV) and is given by:

$$ Precision = \frac{TP}{{TP + FP}}. $$

The proposed algorithm achieves a precision of 96%, and the ensemble classifier achieves 96.9%. This is shown graphically in Fig. 8.

Fig. 8 Comparison chart of precision

Recall It assesses the completeness of the classifier, i.e., the fraction of actual positive cases that it identifies. It is also referred to as the true positive rate (TPR), hit rate, or sensitivity, and is given by:

$$ Recall = \frac{TP}{{TP + FN}}. $$

The proposed algorithm achieves a recall of 85%, which is higher than that of the other algorithms, and the ensemble classifier achieves 87.2%, as shown in Fig. 9.

Fig. 9 Comparison chart of recall (%)

The F1-score The F1-score summarizes model performance in a single value that remains informative when the classes are imbalanced. It is the harmonic mean of precision and recall, which are derived from the true positive, false positive, and false negative values shown in Fig. 7. It is given by the following expression:

$$ F1{\text{-}}score = 2 \times \frac{Precision \times Recall}{{Precision + Recall}}. $$

The proposed algorithm achieves an F1-score of 90%, which is higher than that of the other algorithms, and the ensemble classifier achieves 91.5%, as shown in Fig. 10.
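As a quick consistency check, the precision (96%) and recall (85%) reported for the proposed algorithm can be substituted into the F1-score expression. This is only an arithmetic illustration based on the rounded values quoted in the text:

$$ F1{\text{-}}score = 2 \times \frac{0.96 \times 0.85}{{0.96 + 0.85}} = \frac{1.632}{1.81} \approx 0.902, $$

which agrees with the reported F1-score of about 90%.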

Fig. 10 Comparison chart of F1-score considering the ensemble, random forest, XGBoost, and SVM classifier

Accuracy It is the percentage of test instances that the model predicts correctly, i.e., the total number of correct predictions divided by the total number of test instances, as given below:

$$ Accuracy = \frac{Correct\;Predictions}{{Total\;Predictions}} $$
$$ Accuracy = \frac{TP + TN}{{TP + TN + FP + FN}}. $$
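The four expressions above can be evaluated directly from the confusion-matrix counts. The following is a minimal sketch; the function name `classification_metrics` and the counts are hypothetical and chosen only to illustrate how, on imbalanced data, accuracy can stay high while the F1-score is low. They are not results from this study.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute precision, recall, F1-score and accuracy from confusion-matrix counts."""
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical imbalanced test set: 20 positive cases among 1000 instances.
metrics = classification_metrics(tp=5, fp=5, fn=15, tn=975)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# precision: 0.500
# recall: 0.250
# f1: 0.333
# accuracy: 0.980
```

In this illustrative case the accuracy of 98% is driven almost entirely by the 975 true negatives, while the F1-score of 0.333 exposes the poor detection of the positive class.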

The purpose of the ensemble classifier is to combine multiple classification models, using hard voting and soft voting, to potentially improve prediction accuracy. In this case, a hard voting accuracy of 99.10% and a soft voting accuracy of 99.41% were achieved, as shown in Fig. 11.

Fig. 11 Comparison chart of accuracy
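The difference between hard and soft voting can be illustrated with a small sketch. Hard voting takes the majority of the class labels predicted by the individual models, whereas soft voting averages their predicted class probabilities and picks the class with the largest mean. The function names and the probabilities below are hypothetical; the sketch is not the implementation used in this work.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(labels: Sequence[int]) -> int:
    """Return the class label predicted by the majority of the models."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities: Sequence[Sequence[float]]) -> int:
    """Average the per-class probabilities over the models and return the top class."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c])


# Hypothetical [P(class 0), P(class 1)] outputs of three models for one patient.
probs: List[List[float]] = [
    [0.55, 0.45],  # model 1: weakly favours class 0
    [0.60, 0.40],  # model 2: weakly favours class 0
    [0.10, 0.90],  # model 3: strongly favours class 1
]
# Each model's hard label is its most probable class.
labels = [max(range(len(p)), key=lambda c: p[c]) for p in probs]

print(hard_vote(labels))  # 0 (two of the three models predict class 0)
print(soft_vote(probs))   # 1 (mean probability of class 1 is about 0.583)
```

The two schemes can disagree, as in this example, when one model is much more confident than the others; soft voting takes that confidence into account, while hard voting counts every model equally.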

In Table 4, the proposed XGBoost algorithm is compared with the state-of-the-art algorithms, namely SVM and Random Forest. It is evident from the table that XGBoost outperforms the other algorithms, with the highest F1-score of 90%, accuracy of 98.5%, precision of 96%, and recall of 85%. In addition, an ensemble classifier is introduced that combines all three algorithms. The ensemble classifier gives the best overall result for the detection of thyroid disease.

Table 4 Comparison of algorithms

7 Conclusion

The presented work focuses on the prediction of thyroid disease with improved decision-making capability in a real-time scenario. To achieve this, we considered three algorithms, namely SVM, Random Forest, and XGBoost. The performance of the models was evaluated and analyzed separately. After the models were analyzed on various parameters, an ensembling technique was introduced. The ensemble combines the predictions of the three models through voting to improve on their individual performance. Both hard and soft voting were applied, and the results were analyzed. The results show an accuracy of 98.5% for the proposed XGBoost classifier and an accuracy of 99.1% for the ensemble classifier.

Our long-term goal is to support the development of novel machine-learning approaches that can be used for the detection of thyroid disorders. In recent years, several approaches to accurate thyroid disease diagnosis have been proposed and are in use. The literature shows that the individual studies use different sets of techniques and report varying accuracy levels. The majority of research studies demonstrate the superiority of neural networks over alternative methods. However, it should be noted that decision trees and support vector machines have also shown good performance. While there is no denying that researchers throughout the world have made significant progress in the diagnosis of thyroid disorders, it is advisable to use fewer patient attributes in the diagnosis. The more attributes that are required, the more clinical tests a patient must undergo, which costs more money and takes more time. Therefore, it is necessary to create algorithms and prediction models for thyroid disease that require a minimal number of patient attributes in order to identify the condition and save patients’ time and money. The major challenge encountered was data preparation, as it is the most crucial aspect of decision-making. During evaluation, to ensure that class bias did not distort the overall assessment of the model, the confusion matrix was applied consistently, with a focus on the F1-score.