1 Introduction

Melanoma, a highly aggressive form of skin cancer originating from pigment-producing melanocyte cells, often spreads rapidly and ranks among the most common cancer types [1]. Its main types include superficial spreading melanoma, nodular melanoma, lentigo maligna melanoma, and acral lentiginous melanoma. While UV radiation exposure is a primary cause, genetics, skin color, the presence of numerous moles, and weakened immunity also contribute. Early detection and treatment are crucial due to its hazardous nature and significantly boost patient recovery rates. Classification of melanoma, based on parameters such as asymmetry, irregular borders, color and shape changes, and size (typically larger than 6 mm), is vital for diagnosis, prognosis, and treatment planning [2]. Dermatologists rely on visual examination of skin lesions for the diagnosis and classification of melanoma. Nonetheless, factors such as person-to-person variation in observation techniques and the observer's momentary state can affect the accuracy of diagnosis and classification. In this regard, the use of machine learning methods has been promising. In particular, the deep learning approach has been observed to provide significant benefits in the diagnosis and classification of melanoma [3].

It has been observed that visual examination by dermatologists is not sufficient on its own, owing to the variety of lesion types and examination techniques used. Considering the importance of early diagnosis and classification for patient recovery, the use of machine learning, and especially deep learning methods, is an active area of research, and there are many studies on their use for the early diagnosis and classification of melanoma. The authors of [15] proposed the SkinLesNet algorithm, which contains several convolutional layers, a max-pooling layer, and feature extraction operations. They used different skin image datasets: PAD-UFES-20, HAM10000, and ISIC2017. The proposed method was compared with VGG16 and ResNet50, and the results showed that SkinLesNet achieves higher accuracy than the other methods. All these studies are summarized in Table 1.

Table 1 Comparison of literature work

1.1 Motivation

Early diagnosis and classification of melanoma, one of the most common types of cancer, is vital for patients. Physical methods used by dermatologists for diagnosis and classification are not sufficient. In the literature, it has been shown that machine learning, especially deep learning methods, have been successfully used for melanoma diagnosis and classification. However, the differences in the optimizer methods used in these studies and their contribution to success have not been evaluated. In this study, our motivation is to use a hybrid method that combines different deep learning methods for the classification of melanoma, a dangerous type of skin cancer, and to investigate the contribution of optimizer methods used in deep learning methods to classification success.

1.2 Novelties and contributions

This study examines the use of deep learning methods in an important medical application area such as melanoma detection and the role of optimizer methods to improve its performance. Melanoma is one of the deadliest types of skin cancers and early detection and accurate classification are critical for the treatment and management of the disease. In the literature, various machine learning and deep learning approaches have been proposed for melanoma classification. However, the selection of an appropriate optimizer method to ensure high performance of deep learning models is important and studies on this topic are limited. In this study, the impact of optimizer methods on the performance of deep learning methods is investigated for the first time. Different optimizer methods are used and their effects on classification accuracy are evaluated.

The most important contributions of this study are as follows:

  1. Investigating the effects of transfer learning architectures: Thirty-five experimental runs were performed with five different transfer learning architectures and seven different optimizer functions in melanoma classification, and the results were compared and interpreted.

  2. The first study of the impact of optimizer methods on the success of deep learning methods: This study is one of the first to focus on the role of optimizer methods in training deep learning models for melanoma classification. Different optimizer methods such as SGD, Adam, and RMSProp are extensively tested for their effect on the performance of the deep learning models. The results obtained help us understand the impact of different optimizer methods on model performance.

As a result, this study is an important step towards optimizing the performance of deep learning methods in the vital healthcare field of melanoma detection. By enabling the successful use of deep learning models in melanoma classification, it can make a great contribution to early detection and treatment processes. Moreover, the results of the study can be applicable to the optimization of deep learning models in general and to other medical imaging problems. Therefore, this research in the field of melanoma classification can open new doors in an important intersection between medicine and artificial intelligence and can be an important reference source for future studies.

The rest of the paper is organized as follows: Section 2, Materials and Methods, gives information about the dataset, transfer learning architectures, and optimizer methods used in this study. Section 3, Experimental Evaluations, presents the results of all analyses. Section 4 discusses these results, and Section 5, the Conclusion, summarizes the whole work.

2 Materials and Methods

In this study, the Melanoma Skin Cancer Dataset [16] was used to classify melanoma skin lesions as benign or malignant. The data set consists of 10,605 RGB images of size 300 × 300 pixels: 5500 benign and 5105 malignant. Classification is carried out on this data set using different deep learning architectures and optimizer methods.

2.1 Transfer Learning

Deep learning is a field that improves the learning and analysis capabilities of computer systems by focusing on complex tasks. This approach usually involves multilayer artificial neural networks trained on large data sets. Deep learning is particularly effective in processing complex visual and signal data, improving automatic learning capabilities by extracting learned features hierarchically [17, 18].

Transfer Learning is a machine learning approach that is often used in deep learning techniques. This approach involves adapting the features of a pre-trained model to another task to solve it. This way, a higher success rate can be achieved on a smaller data set and the training time of the model can be reduced [19, 20].

Transfer Learning is often used, especially in image processing. For example, many image classification tasks can be initiated from a pre-trained model trained on a large data set such as ImageNet. This model can then be used for a more specialized classification task by adapting it to a new data set [21].
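As an illustration, such an adaptation can be written in a few lines of Keras. The following is a minimal sketch under assumed settings (a MobileNet base, a small dense head, and a binary benign/malignant output); it is not the exact configuration used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load a convolutional base pre-trained on ImageNet, without its original classifier head.
base = tf.keras.applications.MobileNet(weights="imagenet", include_top=False,
                                       input_shape=(224, 224, 3))
base.trainable = False  # keep the learned features frozen; only the new head is trained

# Adapt the pre-trained features to a new task (illustrative head: benign vs. malignant).
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
```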

MobileNet, a convolutional neural network model developed by Google, features fast model convergence, low memory consumption, and low computational overhead [22]. In addition, it uses fewer parameters compared to other models. Thanks to these features, it is very useful for real-time classification on devices with limited computing power. Although it is designed for 224 × 224 inputs, it can also be used for any input size larger than 32 × 32 [23]. The MobileNet architecture is shown in Fig. 1.

Fig. 1 Architecture of MobileNet

DenseNet, proposed by Huang et al., provides cross-layer information flow in deep networks; the model won the best paper award at CVPR 2017. In DenseNet, each layer receives additional inputs from all previous layers and passes its own feature map to all subsequent layers [24]. The DenseNet architecture is shown in general terms in Fig. 2. Compared to other deep learning architectures, it has advantages such as fewer parameters, reuse of features, and alleviation of the vanishing gradient problem during backpropagation. Its disadvantages include long training times due to the layer density and high memory usage caused by the merging of feature maps [25, 26].
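The dense connectivity can be illustrated with the simplified block below, a hypothetical Keras sketch rather than the exact DenseNet configuration used here; each new layer receives the concatenation of all earlier feature maps and contributes its own map to the block output.

```python
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    features = [x]
    for _ in range(num_layers):
        # Each new layer takes all previous feature maps as additional inputs.
        merged = features[0] if len(features) == 1 else layers.Concatenate()(features)
        out = layers.BatchNormalization()(merged)
        out = layers.Activation("relu")(out)
        out = layers.Conv2D(growth_rate, kernel_size=3, padding="same")(out)
        features.append(out)
    # The block's output is the concatenation of every feature map it produced.
    return layers.Concatenate()(features)
```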

Fig. 2 Architecture of DenseNet

InceptionV3 is a deep neural network architecture developed by Google, building on the Inception (GoogLeNet) family that won the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) in 2014. Its basic principle is to reduce the number of connections without reducing network efficiency [27]. Thus, it offers a solution to the overfitting problem of traditional networks with densely interconnected layers, and its sparse connections also reduce the computational cost. The InceptionV3 architecture, which operates more efficiently with fewer parameters, is a state-of-the-art architecture used in image classification studies [28]. The architectural model is presented in Fig. 3.

Fig. 3 Architecture of InceptionV3

ResNet50 (Residual Network), which has a deeper structure than earlier deep learning architectures, offers a solution to the vanishing gradient problem experienced in those architectures [22]. By adding shortcut connections between layers, it prevents the degradation that can occur as the network becomes deeper and more complex. In addition, bottleneck blocks are used for faster training [29]. The ResNet50 architecture is shown in Fig. 4.
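The shortcut idea can be illustrated with the simplified residual block below; this is an illustrative Keras sketch, not the exact ResNet50 bottleneck design.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    # Assumes x already has `filters` channels so the shortcut can be added directly.
    shortcut = x  # identity shortcut carried around the stacked convolutions
    out = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    out = layers.Conv2D(filters, 3, padding="same")(out)
    out = layers.Add()([out, shortcut])  # the addition lets gradients bypass the block
    return layers.Activation("relu")(out)
```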

Fig. 4 Architecture of ResNet50

The InceptionResNetV2 deep learning architecture was proposed after examining the positive effect of introducing residual connections on the performance of the Inception architecture. A more advanced, 164-layer deep neural network model was created by combining two state-of-the-art designs, Inception and residual (ResNet) networks (shown in Fig. 5). Thanks to the residual connections, the model can be made deeper while avoiding the vanishing gradient problem, and the learning process is fast. The Inception architecture, on the other hand, provides advantages such as lower computational cost, better performance, and prevention of information loss through the ReLU activation function [30]. The InceptionResNetV2 architecture was trained on over one million images from the ImageNet dataset and can classify images into 1000 object categories [31].

Fig. 5 Architecture of InceptionResNetV2

2.2 Optimizer Methods

Optimizer methods are used to minimize the error, that is, the difference between the output value produced by the network and the actual value. In deep learning, optimizer methods are used to search for the minimum of the error (loss) function so that the learning process proceeds in a healthier way. One of the most important factors in the use of optimizer methods is the learning rate. However, setting an appropriate learning rate for each algorithm is difficult, and different variants of gradient methods have been developed to overcome this difficulty [32].

SGD (Stochastic Gradient Descent) is an optimizer method that works iteratively to find suitable parameter values in deep learning algorithms. It updates the parameter values frequently to reduce the error, computing the gradient on a single randomly selected training sample (or a small batch) at each iteration. The individual updates are therefore noisy, but each one is cheap to compute and memory-efficient. Its main advantage is that it is an efficient and easy-to-apply method [33].
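In its standard form, with weight vector $w$, learning rate $\eta$, and loss $L$ evaluated on a randomly selected sample $(x_i, y_i)$, the SGD update is:

$$ w_{t+1} = w_t - \eta \, \nabla_w L(w_t; x_i, y_i) $$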

A fixed learning rate is used in the SGD optimizer method. The AdaGrad optimizer aims to overcome this limitation by using a different learning rate at each step, adapting the rate for each parameter based on the sum of the squares of the past gradients [34].
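Denoting the gradient at step $t$ by $g_t$, the standard AdaGrad update accumulates squared gradients and scales each parameter's step accordingly (with a small constant $\epsilon$ for numerical stability):

$$ G_t = G_{t-1} + g_t^{2}, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t $$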

RMSProp (Root Mean Squared Propagation) was proposed as another solution to the constant learning rate problem. It differs from AdaGrad in how the gradients are accumulated: instead of summing all past gradients, it keeps an exponentially weighted average, so old gradient information is gradually discarded and the most recent gradients dominate. It is a gradient-based optimizer method whose learning rate changes over time, with a separate value applied for each parameter [35, 36].
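In its standard form, with decay rate $\rho$, RMSProp replaces AdaGrad's full sum with an exponentially weighted moving average of the squared gradients:

$$ E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^{2}, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t $$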

AdaDelta is an optimizer method proposed to overcome AdaGrad's learning rate problem: after many iterations, the accumulated squared gradients make the learning rate too small and cause slow convergence. When this method is used, there is no need to choose a learning rate. Instead, a running average of the squared delta values, i.e., the differences between the current weights and the updated weights, is used together with the running average of the squared gradients. Taking exponentially decaying averages in this way eliminates the slowness problem [34, 37].
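In its standard form, AdaDelta keeps running averages of both the squared gradients and the squared updates $\Delta w$, so no explicit learning rate appears:

$$ \Delta w_t = -\frac{\sqrt{E[\Delta w^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t, \qquad w_{t+1} = w_t + \Delta w_t $$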

Adam (Adaptive Moment Estimation) was designed mainly for deep learning studies. It aims to obtain the most suitable parameter values by adding momentum to the RMSProp method. It is an optimizer method that keeps exponentially decaying averages of past gradients and adjusts an adaptive learning rate for each parameter. Its most important feature is that it finds individual adaptive learning rates for the various parameters [38,39,40].
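In its standard form, Adam maintains decaying averages of the gradients (first moment $m_t$) and squared gradients (second moment $v_t$), applies bias correction, and adapts the step per parameter:

$$ m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \quad w_{t+1} = w_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}, $$

where $\hat m_t = m_t / (1-\beta_1^{t})$ and $\hat v_t = v_t / (1-\beta_2^{t})$.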

Nadam (Nesterov-accelerated Adaptive Moment Estimation) is formed by combining Adam with Nesterov momentum. It uses exponentially weighted moving averages of the gradients and thus accelerates the learning process during model training; Nadam typically converges faster than Adam. It is often preferred because it is simple to apply and efficient to compute [6]. Adamax, the remaining optimizer compared in this study, is a variant of Adam that scales the learning rate using the infinity norm of the past gradients instead of their second moment.

3 Experimental Evaluations

Within the scope of the study, the transfer learning models described above were trained. Model training was done in a cloud environment using Google Colab. The block diagram of the methods used in the study is given in Fig. 6.

Fig. 6 Block diagram of the methods used in the study

The optimizer algorithms are used to minimize errors during training. Experimental analyses were conducted to examine the effect of optimizers on deep learning models. Melanoma detection was performed from skin lesion images through simulation. The simulations combined seven different optimizer functions with five different deep learning architectures: DenseNet, InceptionV3, ResNet50, InceptionResNetV2 and MobileNet. In the experimental studies, the SGD, Adam, RMSProp, AdaDelta, AdaGrad, Adamax and Nadam optimizers were used. For the performance evaluation of the deep learning models, the data set was divided into 80% training (8484 images) and 20% testing (2121 images). Tables 3, 4, 5, 6 and 7 show the results of the DenseNet, InceptionV3, ResNet50, InceptionResNetV2 and MobileNet models, respectively, for detecting melanoma disease. In the experiments, the ReLU activation function and the cross-entropy error function were used.
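A single run of this comparison could be organized as in the sketch below, assuming a Keras/TensorFlow pipeline; the run_experiment helper, the classification head, and the hyperparameters shown are illustrative assumptions rather than the exact setup of the study. Only the optimizer string changes between runs.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

def run_experiment(images, labels, base_model_fn, optimizer_name, epochs=50):
    # 80% training / 20% testing split, as used in the study.
    x_train, x_test, y_train, y_test = train_test_split(
        images, labels, test_size=0.2, random_state=42)

    base = base_model_fn(weights="imagenet", include_top=False,
                         input_shape=images.shape[1:])
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    # The optimizer is the only factor varied between runs
    # ("sgd", "adam", "rmsprop", "adadelta", "adagrad", "adamax", "nadam").
    model.compile(optimizer=optimizer_name,
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=epochs)
    return model.evaluate(x_test, y_test)
```

For example, the DenseNet/SGD run would correspond to run_experiment(images, labels, tf.keras.applications.DenseNet121, "sgd") under these assumptions.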

Table 3 DenseNet model performance results

Table 3 reports the loss, accuracy, f-score and sensitivity values. These values were obtained from a simulation with 50 epochs. The change in accuracy was investigated by comparing the Adam function, which is widely used with deep learning architectures, against the other optimizer functions. In the experimental analyses performed on the DenseNet architecture, the SGD optimizer gives more successful results than the other optimizers.

Figure 7 displays the average validation loss values and their standard deviations for the different optimizers applied to the Melanoma dataset with the DenseNet model. The distance between the extreme values in the box plot is very small. This indicates that the performance of the different optimizers on the Melanoma dataset with the DenseNet model is relatively consistent, with minimal variation between the best and worst results. The boxes for the SGD and AdaGrad optimizers are shorter than the others; a shorter box suggests that the results obtained with these optimizers have less variability and are more concentrated around the median value. The whiskers are also close to the box for SGD and AdaGrad, meaning that the data points for these optimizers do not spread far beyond the whiskers, indicating stable results. The results of the simulation created with the InceptionV3 model and the seven different optimizer functions are presented in Table 4.

Fig. 7 Validation loss results for DenseNet

Table 4 InceptionV3 model performance results

In Table 4, the results of the performance criteria obtained in the classification of melanoma disease with the InceptionV3 model are given. When the results are examined, it is seen that the SGD optimizer is more effective than the other optimizer functions on the InceptionV3 architecture, and Adamax comes closest to the results obtained with SGD.

Figure 8 displays the average validation loss values and their standard deviations for the different optimizers applied to the Melanoma dataset with the InceptionV3 model. The distance between the extreme values in the box plot is very small. This indicates that the performance of the different optimizers on the Melanoma dataset with the InceptionV3 model is relatively consistent, with minimal variation between the best and worst results. The box for the SGD optimizer is shorter than the others; a shorter box suggests that the results obtained with this optimizer have less variability and are more concentrated around the median value. The whiskers are close to the box for SGD and AdaDelta, meaning that the data points for these optimizers do not spread far beyond the whiskers, indicating stable results. The median value for SGD lies in the middle of the box, indicating that its median performance is relatively central and not skewed towards extreme values. The results for RMSProp are less stable compared to the other optimizers, which is evident from the larger distance between the lower whisker and the box, as well as the median value lying far from the middle. These characteristics suggest that RMSProp shows more variability and inconsistency in its performance on the InceptionV3 model. The results of the simulation created with the ResNet50 model and the seven different optimizer functions are presented in Table 5.

Fig. 8 Validation loss results for InceptionV3

Table 5 ResNet50 model performance results

In Table 5, the results of the performance criteria obtained in the classification of melanoma disease with the ResNet50 model are given. When the results are examined, it is seen that the SGD optimizer is more effective than the other optimizer functions on the ResNet50 architecture, and AdaGrad comes closest to the results obtained with SGD. The AdaGrad optimizer function was inspired by the SGD optimizer, and their similar structures explain why their results are close to each other.

Figure 9 displays the average validation loss values and their standard deviations for the different optimizers applied to the Melanoma dataset with the ResNet50 model. The distance between the extreme values in the box plot is very small. This indicates that the performance of the different optimizers on the Melanoma dataset with the ResNet50 model is relatively consistent, with minimal variation between the best and worst results. The box for the SGD optimizer is shorter than the others, and its median value lies in the middle of the box, indicating that the median performance with SGD is relatively central and not skewed towards extreme values. The results of the simulation created with the InceptionResNetV2 model and the seven different optimizer functions are presented in Table 6.

Fig. 9 Validation loss results for ResNet50

Table 6 InceptionResNetV2 model performance results

In Table 6, the results of the performance criteria obtained in the classification of melanoma disease with the InceptionResNetV2 model are given. When the results are examined, it is seen that the SGD optimizer is more effective than the other optimizer functions on the InceptionResNetV2 architecture, achieving 0.9434 accuracy, 0.9479 f-score and 0.3457 loss.

Figure 10 displays the average validation loss values and their standard deviations for the different optimizers applied to the Melanoma dataset with the InceptionResNetV2 model. The distance between the extreme values in the box plot is very small. This indicates that the performance of the different optimizers on the Melanoma dataset with the InceptionResNetV2 model is relatively consistent, with minimal variation between the best and worst results. The box for the SGD optimizer is shorter than the others. The results for RMSProp, Adam, and Nadam are less stable compared to the other optimizers, which is evident from the larger distance between the lower whiskers and the boxes, as well as the median values lying far from the middle. These characteristics suggest that RMSProp, Adam, and Nadam show more variability and inconsistency in their performance on the InceptionResNetV2 model. The results of the simulation created with the MobileNet model and the seven different optimizer functions are presented in Table 7.

Fig. 10 Validation loss results for InceptionResNetV2

Table 7 MobileNet model performance results

In Table 7, the results of the performance criteria obtained in the classification of melanoma disease with the MobileNet model are given. When the results are examined, the AdaGrad optimizer is more effective than the other optimizer functions on the MobileNet architecture, achieving 0.9323 accuracy, 0.9351 f-score and 0.2089 loss. The AdaGrad optimizer function was developed with inspiration from the SGD optimizer.

Figure 11 displays the average validation loss values and their standard deviations for the different optimizers applied to the Melanoma dataset with the MobileNet model. The distance between the extreme values in the box plot is very small. This indicates that the performance of the different optimizers on the Melanoma dataset with the MobileNet model is relatively consistent, with minimal variation between the best and worst results. The box for the AdaGrad optimizer is shorter than the others. The results for RMSProp are less stable compared to the other optimizers, which is evident from the larger distance between the lower whisker and the box, as well as the median value lying far from the middle. The best results obtained with all the models are presented in Table 8.

Fig. 11 Validation loss results for MobileNet

Table 8 The best performance results of all architectures

Table 8 summarizes the effects of the five different transfer learning methods (DenseNet, InceptionV3, ResNet50, InceptionResNetV2 and MobileNet) combined with the seven different optimizer functions (SGD, Adam, RMSProp, AdaDelta, AdaGrad, Adamax and Nadam) on the detection of melanoma disease. The best melanoma detection is achieved with the DenseNet model and the SGD optimizer. A momentum parameter is added to the SGD optimizer in order to reduce oscillation and training time. As a result, the SGD optimizer gives high performance on all deep learning architectures except MobileNet, compared to the other optimizer functions. Table 8 also shows that DenseNet offers the highest performance with 0.9490 accuracy, 0.9492 f-score and 0.1809 loss.

4 Discussion

In deep learning architectures, the learning process is basically expressed as an optimization problem. Optimizer methods are commonly used to find optimal values when solving nonlinear problems. In deep learning applications, optimizer functions such as stochastic gradient descent (SGD), AdaGrad, AdaDelta, Adam, and Adamax are commonly used, and there are differences between these functions in terms of performance and speed. When a momentum value is used with the SGD optimizer, SGD can outperform adaptive methods. Performance comparisons of the optimizer functions were carried out with five different models: DenseNet, InceptionV3, ResNet50, InceptionResNetV2 and MobileNet. These models were used to classify melanoma images as benign or malignant. In light of the results obtained, the DenseNet model with the SGD optimizer is more successful than the other combinations, with an accuracy of 0.9490. The results of the study were compared with studies from the last two years in the literature, as shown in Table 9.

Table 9 Literature studies on classification of melanoma and comparison of proposed method

When studies conducted with different datasets for the same purpose in the literature are examined, it is seen that the results of our study are more accurate than most of the other studies. For the two studies that report higher accuracy, the number of images in their datasets is very small compared to our study, and those datasets are not as diverse as desired. In this context, the results obtained reveal that different optimizer functions positively affect the classification performance of transfer learning architectures, which is the main motivation of this study.

Optimizer methods are used to prevent deep learning architectures from oscillating too much during training and to minimize errors. Optimizer procedures are applied step by step in deep learning architectures. The back-propagation algorithm is used to update the parameters: the gradient of the error with respect to each weight is computed step by step, multiplied by the learning rate, and used to calculate the new weights of the architecture. In addition to the learning rate, a momentum value is added to the optimizer functions to reduce oscillation and perform a more consistent optimization process faster [46]. Thanks to the added momentum parameter, instead of taking the newly produced value as it is, the new value is calculated by combining it, weighted by the beta coefficient, with the previous value. Thus, noise and oscillations in the training curves are reduced with the SGD optimizer, and a faster method is obtained [47,48,49,50].
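In symbols, the momentum update keeps a velocity term $v$ weighted by the coefficient $\beta$, so each new step combines the current gradient with the previous update:

$$ v_t = \beta\, v_{t-1} - \eta\, \nabla_w L(w_t), \qquad w_{t+1} = w_t + v_t $$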

In this study, SGD (Stochastic Gradient Descent) stands out against other optimizers such as AdaGrad, AdaDelta, Adam and Adamax due to its simplicity, speed, generalizability and compatibility with large data sets. Its ability to perform a gradient update on a single randomly selected sample at each iteration, its tendency to reduce overfitting, and its suitability for situations where the entire data set cannot be processed at once due to memory limitations make SGD attractive. Its easy integration with momentum techniques provides a robust alternative in situations where optimizers with adaptive learning rates may be more prone to overfitting. Furthermore, its direct control over hyperparameters such as the learning rate allows researchers to tune model performance. These features make SGD a preferred optimizer method in deep learning projects.

5 Conclusions

Early diagnosis and classification of melanoma, one of the most common types of cancer, is vital for patients. The physical examination methods used by dermatologists for diagnosis and classification are not sufficient on their own. In the literature, machine learning and deep learning methods are frequently reported to play an effective role in the diagnosis and classification of the disease. In deep learning architectures, the learning process is basically expressed as an optimization problem. The crucial goal of machine learning in a particular set of scenarios is to create a model that performs effectively and offers thorough predictions; however, optimization strategies are required to achieve that. Melanoma detection was performed from skin lesion images through simulations created with seven different optimizer functions, used with five different deep learning architectures: DenseNet, InceptionV3, ResNet50, InceptionResNetV2 and MobileNet. We analyzed the most commonly used optimizers: SGD, Adam, RMSProp, AdaDelta, AdaGrad, Adamax and Nadam. The results of the study show that SGD performs better and more consistently in terms of convergence rate, training speed and performance compared to the other optimization strategies. SGD gives the highest accuracy, 94.90%, when used with the DenseNet architecture. A limitation of the study is that the experiments were applied only to the melanoma dataset and that only five deep learning models and seven optimizer functions were used. In future work, new optimizer functions that will increase the accuracy and time performance of deep learning architectures will be proposed and tested on different data sets. In addition, the effect of optimizer functions on Vision Transformer architectures will be examined.