Introduction

Background of study

The protein-coding part of the genome is referred to as the exome. Genetic abnormalities in the exome are known to trigger several types of cancer. With the worldwide incidence of cancer rising steadily, extensive research is being carried out to arrive at possible solutions for early diagnosis [1,2,3]. Because early diagnosis and the selection of suitable treatment strategies remain unresolved research problems, there is an urgent need to design and develop alternative approaches that provide faster and more precise predictions by making sense of the huge amount of existing cancer data. One important approach is to develop a decision support system (DSS) that predicts patient-specific cancer probabilities and overcomes the challenges arising from wrong treatment decisions and prognoses, massive data interpretation, and understanding patient-specific causes [4]. As an emerging and ever-evolving technology, DSSs are highly adept at improving the decision-making process, thereby providing support to clinicians and diagnosticians [5]. Several approaches currently exist to classify cancer types based on the exome datasets that are essential for designing a DSS for early diagnosis of cancers [6,7,8,9]. Applying artificial intelligence and machine learning to high-throughput data to design an improved DSS model is the premise of the present study.

Related works

Classification algorithms such as support vector machines (SVM), K-nearest neighbors (KNN), naïve Bayes, decision trees, and random forests are the methods primarily used for cancer classification with machine learning [10, 11]. Studies have previously classified cervical cancer datasets with such algorithms.

Contribution of present study

The major contribution of our study is the development of a highly accurate and improved decision support model which, when used in healthcare, will provide immense benefits to the diagnosis and control of cancers. Additionally, our model encompasses classification and prediction for five cancer types, making it a novel study with great potential for the early diagnosis of five different cancers. Dimensionality reduction of the datasets was also covered in our study to derive an appropriate derivative dataset; this is of utmost importance, since the retained features directly contribute to better and more accurate predictions. The present study also provides detailed insights into the workings of our proposed model, which achieved a much better overall accuracy than comparable previous work, satisfying the fundamental aim of our research: to support the management of healthcare.

Materials and methods

The proposed workflow comprises cleaning the data to obtain the derivative exome dataset, carrying out classification analysis with three classifiers, namely KNN, SVM, and a multilayer perceptron network, and finally applying a majority-voting-based ensemble classifier to obtain the proposed results; a block diagram summarizing these steps is shown in Fig. 1.

Fig. 1

Block diagram summarizing the workflow: cleaning and obtaining the derivative exome datasets, classification analysis with three classifiers (KNN, SVM, and a multilayer perceptron network), followed by a majority-voting-based ensemble classifier to obtain the expected results

Dataset analysis

A preliminary analysis of the exome datasets was carried out. These datasets were derived from a careful analysis of twenty cancer exome datasets belonging to five cancer types, processed in our previous work using a standardized workflow (Table 1) [25]. The five types were human diffuse-type gastric cancer, pancreatic adenocarcinoma, high-grade serous ovarian cancer, intrahepatic cholangiocarcinoma, and non-BRCA1/BRCA2 familial breast cancer.

Table 1 Twenty exome datasets for five cancer types that were analysed in our previous work for obtaining variant information that led to formation of derivative datasets

The five cancer types were chosen for the initial analysis in our previous studies because they were the major ones affecting the Indian population, for which we aimed to build a model. Although other cancer types such as hepatocellular carcinoma [29, 30] and bone cancer [31, 32] are also significant, the present study focused on model building for these five types as a continuation of our previous work, in which variant identification had already been performed specifically for these cancers. An extension of this work, however, will include more cancer types to stabilise the model further.

Moreover, a survey of previous studies showed that no other similar models built on these five cancer types were available, making our method unique. Please refer to Padmavathi et al. [25] for more information on the pipeline used and the justifications provided for arriving at the different variants. The datasets employed for the study are publicly available and can be downloaded from NCBI SRA (https://www.ncbi.nlm.nih.gov/sra) using their accession numbers.

Hyperlinks for the sample files that were employed in our previous work are provided.

Data clean-up and obtaining a derived dataset

The given exome dataset consisted of 4181 sample variants with 88 features. On initial analysis, performed with the “pandas” library in Python, most of the features were found to be filled with NaN (missing-value) entries. These features were dropped, as they could not be used, leaving 55 features that still contained a few NaN values. Categorical features with NaN values were dropped as well, since these features were not distinct, and filling them with the use of natural language processing (NLP) would not significantly improve the precision of predicting the five types of cancer [33]: high-grade serous ovarian cancer, pancreatic adenocarcinoma, human diffuse-type gastric cancer, intrahepatic cholangiocarcinoma, and non-BRCA1/BRCA2 familial breast cancer. Considering only the numerical features for the prediction model left 25 numerical features. The few NaN values still present were filled with a probabilistic distribution using probabilistic matrix factorization [34]. These 25 features, after handling of the missing data over the 4181 sample variants, constituted the derived dataset.
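
To illustrate, a minimal pandas sketch of this clean-up is given below; the file name and the exact drop criteria are illustrative assumptions, and the probabilistic-matrix-factorization imputation step is not shown.

    import pandas as pd

    # Load the exome variant table (the file name is hypothetical).
    df = pd.read_csv("variants.csv")  # 4181 rows x 88 columns in the study

    # Drop features that are entirely NaN.
    df = df.dropna(axis=1, how="all")  # 55 features remained in the study

    # Drop categorical (object-typed) features that still contain NaNs.
    categorical = df.select_dtypes(include="object").columns
    df = df.drop(columns=[c for c in categorical if df[c].isna().any()])

    # Keep only the numerical features for the prediction model; their
    # remaining NaNs were imputed with probabilistic matrix
    # factorization in the study (not shown here).
    numeric = df.select_dtypes(include="number")
    print(numeric.shape)  # expected (4181, 25)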

Exploratory data analysis

Principal component analysis (PCA) models were trained on the derived dataset, analyzed along one- and two-dimensional axes. The PCA models reduced the high variance in the dataset by distributing the weight of the features along two dimensions. Through this distribution, the high dimensionality of the dataset was reduced, as features that would have caused overfitting were removed [35] (https://colab.research.google.com/drive/1AypJYvigGnpCrhsmLkO6c3b-jSZTqqKN). The 14 features with the maximum weight were selected for training the subsequent classification models and were also used in the ensemble models trained later. The 14 selected features are: ‘shiftscore’ (score for sorting the variants from tolerant to intolerant), ‘TLOD’ (log odds that the variant is present in the tumor sample relative to the expected noise), ‘Sample.AF’ (allelic frequency of the sample), ‘MBQ’ (median base quality of each allele), ‘MFRL’ (median fragment length of each allele), ‘MMQ’ (median mapping quality of each allele), ‘Sample.AD’ (allelic depth of the sample), ‘Sample.F1R2’ (forward and reverse read counts for each allele), ‘Sample.F2R1’ (forward and reverse read counts for each allele), ‘DP’ (read depth), ‘GERMQ’ (phred-scaled posterior probability that the alternate alleles are not germline variants), ‘MPOS’ (median distance from the end of the read for each alternate allele), ‘POPAF’ (population allele frequency of the alternate alleles), and ‘Sample.DP’ (approximate read depth of the sample) (https://support.sentieon.com/appnotes/out_fields/) [36]. These parameters describe the variants identified in our previous analysis of cancer exomes, alleles being the alternative forms of genes that result from mutations and are present on the chromosomes [37]. Since these parameters were found to be the most informative for pointing towards specific cancer types, they were selected for building our model.

According to the two-dimensional PCA model, this allowed the authors to reduce the bias-variance trade-off that would otherwise have been caused by the use of irrelevant features [38].
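
A minimal sketch of this feature selection is shown below, assuming the derived dataset is held in a DataFrame X; the exact weighting scheme used by the authors is in the linked notebook, and ranking features by absolute component loadings is an illustrative assumption.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # X: the derived dataset of 25 numerical features (pandas DataFrame).
    X_scaled = StandardScaler().fit_transform(X)

    # Fit the two-dimensional PCA model described above.
    pca = PCA(n_components=2).fit(X_scaled)

    # Rank features by the magnitude of their weights (loadings) on the
    # two components and keep the 14 highest-weighted ones.
    weights = np.abs(pca.components_).sum(axis=0)
    top14 = pd.Series(weights, index=X.columns).nlargest(14).index
    print(list(top14))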

Oversampling using SMOTE

The synthetic minority oversampling technique, also referred to as SMOTE, is an oversampling technique for mitigating class imbalance in datasets. The exome dataset was found to be heavily imbalanced, with human diffuse-type gastric cancer as the majority class, having the highest number of sample variants (Fig. 2). This imbalance would make the classifiers insensitive to changes in the features of the dataset [39]. The minority class types were therefore oversampled with the SMOTE algorithm to match the number of sample variants in the majority class type, ensuring that the imbalance in the dataset was significantly reduced.
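
A minimal sketch using the imbalanced-learn implementation of SMOTE is given below; X and y stand for the selected features and the five cancer-type labels, and the random seed is an illustrative choice.

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    # X: feature matrix; y: cancer-type labels for the 4181 variants.
    print(Counter(y))  # heavily skewed towards diffuse-type gastric cancer

    # SMOTE synthesizes new minority-class samples by interpolating
    # between existing minority samples and their nearest neighbors,
    # raising every minority class to the size of the majority class.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    print(Counter(y_res))  # all five classes now equal in size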

Fig. 2
figure 2

Bar graph representing the distribution of the exome dataset across cancer types. The plot shows that the human diffuse-type gastric cancer class is in the majority by a large margin compared to the other classes, causing the dataset to be imbalanced

Cross validation

Cross-validation is a technique used to assess the bias-variance trade-off of a machine learning model, that is, to understand whether the model overfits or underfits on completely unseen data [40].

The approach followed for cross-validation in our proposed study was the hold-out technique. In this technique the dataset is divided into a training set and a test set (the test set can be further divided into test and validation sets). The model is trained on the training set, where its hyperparameters are adjusted to balance the bias-variance trade-off. The trained model is then subjected to the test set, and the results produced there are taken as the final statement of the performance metrics [40]. This approach was implemented in the present study to cross-validate and confirm the relevance of our model in real-world test scenarios.
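
A minimal sketch of such a hold-out split is shown below, using the 70:15:15 ratio reported for the ensemble experiments later in this study; the stratification and random seed are illustrative assumptions.

    from sklearn.model_selection import train_test_split

    # First carve out 70% for training; the remaining 30% is then
    # halved into test (15%) and hold-out validation (15%) sets.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X_res, y_res, test_size=0.30, stratify=y_res, random_state=42)
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)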

K-nearest neighbors classification model analysis

The K-nearest neighbors (KNN) machine learning algorithm is an important pattern-recognition-based classifier of great value for analyzing and predicting cancer types in exome datasets [41, 42]. The primary step in implementing the KNN classifier is to identify the correct number of neighbors, K. To identify it, the elbow-curve method was employed: the KNN classifier with default hyperparameters was applied for sequentially increasing values of K, and the error rate was plotted against K. The value of K at which the decrease in error rate is most significant was chosen as the optimal value [43] and used to train the KNN classifier, as sketched below.
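
The elbow-curve search can be sketched as follows; the range of K values and the use of the test-set error rate are illustrative assumptions.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import KNeighborsClassifier

    # Error rate for sequentially increasing K with default hyperparameters.
    ks = range(1, 41)
    error_rate = []
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        error_rate.append(np.mean(knn.predict(X_test) != y_test))

    # The "elbow", where the error rate stops dropping sharply,
    # marks the optimal K.
    plt.plot(ks, error_rate, marker="o")
    plt.xlabel("K")
    plt.ylabel("error rate")
    plt.show()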

$$D(d_{i}, d_{j}) = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(w_{ik} - w_{jk}\right)^{2}}$$

The above formula describes the Euclidean distance, where N is the dimension of the feature vectors, wik and wjk are the k-th components of the feature vectors di and dj, which denote the feature vectors of two samples under consideration in the training set [44].

The default hyperparameters relied on the Euclidean distance to differentiate the data points, which did not result in a better classification. To identify the correct hyperparameters, the “Grid Search” module was used [45]: the KNN classifier was trained over different hyperparameter combinations with a verbosity of 2, and the best hyperparameters were obtained. These involved using the Manhattan distance, reducing the leaf size, and using the “Ball Tree” algorithm instead of the “Brute Force” algorithm. The classification model was then obtained using these hyperparameters.

For two points (x1,y1), and (x2,y2), the Manhattan distance can be defined as:

$$\left| {x_{1} - x_{2} } \right| + \left| {y_{1} - y_{2} } \right|$$

where the absolute distances between the two points under consideration are summed. This calculation is repeated across the different points under consideration for the feature vectors present in the dataset, and the classification is carried out [46]. The grid-search values are provided at https://colab.research.google.com/drive/1oOBwnfbmy9yLngPSpsJyTCEEOGM_CkmE?usp=sharing#scrollTo=40STvZ9rx8s1 for understanding the ranges, which were kept as positive integer increments with a verbosity of 2; a sketch of this search is given below.
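
The grid search over these hyperparameters can be sketched as follows; the candidate values shown are illustrative, and the full ranges are in the linked notebook.

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    k_opt = 5  # optimal K from the elbow curve (illustrative value)

    # Candidate grid reflecting the hyperparameters named above.
    param_grid = {
        "n_neighbors": [k_opt],
        "metric": ["euclidean", "manhattan"],
        "algorithm": ["ball_tree", "brute"],
        "leaf_size": [10, 20, 30],
    }
    search = GridSearchCV(KNeighborsClassifier(), param_grid, verbose=2)
    search.fit(X_train, y_train)
    print(search.best_params_)  # Manhattan distance + Ball Tree in the study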

Support vector machine classification model analysis

The support vector machine (SVM) is another popular classification model, used for data that can be distinguished better with the use of hyperplanes and kernel substitution [47]. In this model the SVM classifier was used with default hyperparameters on the oversampled dataset. Hyperplane-based differentiation can be implemented very well for our dataset because of its high dimensionality [48].

$$H: w^{T} \Phi (x) + b = 0$$

where H represents the hyperplane, b is the bias term of the hyperplane equation, Φ(x) is the feature mapping of the input vector x, and w is the weight vector of the hyperplane [49].

$$d_{H} (\Phi (x_{0} )) = \frac{{\left| {w^{T} \Phi (x_{0} ) + b} \right|}}{{\left\| w \right\|_{2} }}$$

where dH is the distance of a mapped point Φ(x0) from the hyperplane, expressed in terms of the symbols defined above [49].

Furthermore, a “Grid Search” on the SVM classifier using “GridSearchCV” with a verbosity of 2 was performed to identify the best hyperparameters, based on the following value ranges:

$$\text{`C'}: [0.1,\;0.5,\;1,\;5,\;10,\;15,\;100,\;150,\;500,\;1000]$$
$$\text{`gamma'}: [1,\;0.1,\;0.01,\;0.001,\;0.0001,\;0.00001]$$
$$\text{`kernel'}: [\text{`rbf'},\;\text{`poly'},\;\text{`sigmoid'}]$$

where ‘C’ is the regularization strength, which acts as a penalty parameter, ‘gamma’ defines the suitable line of separation, and the ‘kernel’(s) are the dimensional modifiers. Within the kernels, ‘rbf’ stands for a Gaussian kernel based on the standard normal distribution, while ‘poly’ and ‘sigmoid’ retain their usual meanings.

It was found that the default hyperparameters were best suited for the classification of the dataset used in the present study, as sketched below.
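
This search can be sketched as follows, using the exact grid reported above; the trained split variables follow the earlier hold-out sketch.

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "C": [0.1, 0.5, 1, 5, 10, 15, 100, 150, 500, 1000],
        "gamma": [1, 0.1, 0.01, 0.001, 0.0001, 0.00001],
        "kernel": ["rbf", "poly", "sigmoid"],
    }
    search = GridSearchCV(SVC(), param_grid, verbose=2)
    search.fit(X_train, y_train)
    print(search.best_params_)
    # In this study the defaults (rbf kernel, C=1, gamma="scale")
    # remained the best-performing configuration.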

Implementing neural networks

An artificial neural network (ANN) is a complex system designed to function and learn like the human brain [50].

Fig. 6

Confusion matrix heatmap of the neural network with SMOTE oversampling. The primary diagonal elements show the true positives; the remaining cells are false classifications. The high counts on the primary diagonal show that the classifier achieved a good accuracy. 0–4 represent the five cancer classes. 0: High-grade serous ovarian cancer; 1: Human diffuse-type gastric cancer; 2: Intrahepatic cholangiocarcinoma; 3: Non-BRCA1/BRCA2 familial breast cancer; 4: Pancreatic adenocarcinoma. The light-to-dark color coding indicates the probabilities of true and false positives

Fig. 7

Train-validation accuracy versus epochs and train-validation loss versus epochs for the neural network with undersampling. The validation accuracy stalls at around 40 epochs and varies only slightly thereafter, so training for 40 epochs should provide the same performance as training for 100 epochs. The validation loss shows that after around 40 epochs the model starts to overfit the training data, so stopping at that point should prevent overfitting

Fig. 8

Confusion matrix heatmap of the neural network with undersampling. The primary diagonal elements show the true positives; the remaining cells are false classifications. The counts on the primary diagonal show that the classifier achieved a good accuracy, though performance was worse than with SMOTE oversampling. 0–4 represent the five cancer classes. 0: High-grade serous ovarian cancer; 1: Human diffuse-type gastric cancer; 2: Intrahepatic cholangiocarcinoma; 3: Non-BRCA1/BRCA2 familial breast cancer; 4: Pancreatic adenocarcinoma. The light-to-dark color coding indicates the probabilities of true and false positives

On the SMOTE-oversampled dataset, the individual precision for the above three cancers increased significantly: high-grade serous ovarian cancer increased to 0.75, human diffuse-type gastric cancer to 0.83, and pancreatic adenocarcinoma to 0.78. The precision obtained for intrahepatic cholangiocarcinoma was 0.85 and for non-BRCA1/BRCA2 familial breast cancer 0.89. This model showed 82.56% validation accuracy after 100 epochs and an average accuracy of 82% on the test set (Table 2c). It proved more stable than the one trained on the undersampled dataset and increased the precision and recall for all cancer types. Results and code can be found at https://colab.research.google.com/drive/1lH2tdApkHfqF_6C-d9Pe3o2ZR6oCjp-5 and https://colab.research.google.com/drive/1KSDKoxJmbNwW_hBElV2DP-CIlLDA-eP0.

Weighted ensemble learning classifier

As discussed in the “Ensemble Machine Learning Approach” section, the base classifiers identified as ideal had to be weighted according to their performance on the classification of the cancer types. To perform this step, the “tensordot” API available in the “NumPy” module was used (https://numpy.org); it computes the tensor product of the base classifiers’ predictions with their accuracy weights. The weighted accuracies of the KNN, SVM, and MLP classifiers were 0.754, 0.774, and 0.842, respectively, and the ensemble classifier achieved a weighted accuracy of 82.91% (Table 2d). The dataset was divided in a 70:15:15 ratio: 70% for training, 15% for testing, and the remaining 15% as the hold-out validation set. The performance metric was calculated by fitting the test set to the base classifiers and then measuring the true positives using majority voting. Using only the KNN and SVM classifiers as base classifiers, the ensemble estimator still performed better with soft voting, reaching a weighted accuracy of 78.288%; in this case, the KNN and SVM classifier models had weighted accuracies of 0.736 and 0.701, respectively (https://colab.research.google.com/drive/1mFcOy--VT1hQem8JhClh5TfSK5KnLKJL). The confusion matrix of the resulting ensemble classifier (Table 3) had much better evaluation metrics, with the precision values for high-grade serous ovarian cancer and pancreatic adenocarcinoma reaching 0.76 and 0.83, compared to the results in the “Comparison of KNN and SVM classifiers” section. The overall results are depicted in Table 4, with precision as the performance parameter. Precision was chosen because, as the statistical relation in the “Performance evaluation metrics” section shows, it gives an explicit weighting to false positives (FP), and controlling false positives is of particular significance when predicting cancer classes for a decision support system. The table therefore summarizes our proposed models and their respective precision values, presented in comparison with state-of-the-art (SOTA) methods. A sketch of the tensordot-based weighted voting is given below.
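
The sketch below shows weighted soft voting with np.tensordot, assuming trained base classifiers knn, svm (fitted with probability=True so that predict_proba is available), and mlp; the weights are the accuracies reported above.

    import numpy as np

    # Class-probability predictions of the three base classifiers,
    # stacked into an array of shape (3, n_samples, n_classes).
    probs = np.stack([knn.predict_proba(X_test),
                      svm.predict_proba(X_test),
                      mlp.predict_proba(X_test)])

    # tensordot contracts the classifier axis of `probs` with the
    # accuracy weights, yielding weighted probabilities per class.
    weights = np.array([0.754, 0.774, 0.842])  # KNN, SVM, MLP
    weighted = np.tensordot(weights, probs, axes=(0, 0))

    # The weighted soft vote picks the highest-scoring class.
    y_pred = weighted.argmax(axis=1)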

Table 3 Confusion matrix from resulting ensemble classifier
Table 4 Results based on precision for the proposed classifiers under study

CTGAN and TVAE generated dataset

The proposed CTGAN model was trained for 300 epochs with a batch size of 10, after which the generator loss was 0.2503 and the discriminator loss was −1.4397. When the synthetic dataset was evaluated against the real dataset with the CSTest and KSTest, the evaluation metric value was 0.92 and the overall comparison value was 0.66. The proposed TVAE model was also trained for 300 epochs with a batch size of 10 (https://colab.research.google.com/drive/1mFcOy--VT1hQem8JhClh5TfSK5KnLKJL); here the CSTest and KSTest evaluation metric was 0.93 and the overall comparison value was 0.63.
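
A minimal sketch of this training and evaluation is given below, assuming the pre-1.0 “sdv” tabular API and a DataFrame real_data holding the derived dataset.

    from sdv.tabular import CTGAN, TVAE
    from sdv.evaluation import evaluate

    # Train both generators with the settings reported above.
    ctgan = CTGAN(epochs=300, batch_size=10)
    ctgan.fit(real_data)

    tvae = TVAE(epochs=300, batch_size=10)
    tvae.fit(real_data)

    # Sample synthetic variants and compare them with the real data
    # using the Chi-squared (CSTest) and Kolmogorov-Smirnov (KSTest)
    # metrics; evaluate() returns an aggregate score in [0, 1].
    for model in (ctgan, tvae):
        synthetic = model.sample(len(real_data))
        print(evaluate(synthetic, real_data, metrics=["CSTest", "KSTest"]))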