Introduction

Plant leaves contain valuable information about plant health. Because the first symptoms of plant stress appear in the leaves, visual assessment of plant leaves is a valuable aid in the early detection of plant diseases and the prevention of crop failure. However, leaf disease identification by agricultural experts based on experience is a time-consuming, tedious and inefficient process. Conversely, the indiscriminate use of pesticides not only often fails to prevent disease but can also degrade product quality and cause environmental pollution.

The advancement of deep learning techniques, particularly convolutional neural networks (CNNs), has made them increasingly popular in precision agriculture Schirrmann et al. (2021); de Camargo et al. (2021). In recent years, automatic classification of crop diseases has been carried out using deep learning models to compensate for the lack of human expertise. Selvaraj et al. (2019) trained three different CNNs (ResNet50, InceptionV2, and MobileNetV1) using transfer learning to classify 18 different diseases from 18000 images of different parts of the banana plant; the ResNet50- and InceptionV2-based models performed better than MobileNetV1 on their dataset. Kumar et al. (2020) explored the InceptionV3 model to diagnose coffee leaf diseases. They collected a dataset of 1747 images of coffee leaves categorized into five classes (healthy and four different diseases), used transfer learning to reduce training time, and applied data augmentation to enlarge the training set and thus mitigate the limited-data problem. Liu et al. (2020) generated a dataset of 107366 grape leaf images using image enhancement and data augmentation techniques and proposed a Dense Inception-based Convolutional Neural Network (DICNN) to classify the images into seven classes. Ramcharan et al. (2017) collected a dataset of 11670 cassava disease images covering three diseases and two types of pest damage. They applied transfer learning to train an InceptionV3 model and analyzed its performance with three different classifiers: the original softmax, support vector machines (SVM), and k-nearest neighbors (KNN). Liu and Wang (2020) created a dataset of tomato diseases and pests under real natural conditions, with 15000 images across 12 classes of diseases and pests, and optimized the feature layer of the You Only Look Once version 3 (YOLOv3) model using an image pyramid to achieve multi-scale feature detection and improve detection accuracy. Fuentes et al. (2017) considered three detectors, the Faster Region-based Convolutional Neural Network (Faster R-CNN), the Region-based Fully Convolutional Network (R-FCN), and the Single Shot Multibox Detector (SSD), to explore the performance of deep learning in detecting tomato diseases and pests. To improve the detection and localization of bounding boxes, they combined each detector with deep feature extractors such as the VGG network and the Residual Network (ResNet), and evaluated their method on a dataset of 5000 images of tomato diseases and pests spanning nine classes. Liu et al. (2017) proposed an AlexNet-based deep model for apple leaf disease detection using a dataset of 13689 images of four different apple diseases. In addition to work on specific species, some researchers have focused on classifying different diseases across different species. Ferentinos (2018) explored CNN architectures such as AlexNet, VGG and GoogleNet to classify 58 plant diseases from 25 plant species. Too et al. (2019) evaluated VGG16, InceptionV4, ResNet (50 to 152 layers), and DenseNet121 for classifying 38 classes of diseased and healthy leaf images from 14 plant species in the PlantVillage dataset.

With the advances that have been made through the use of deep learning, some new challenges have also emerged. Deep learning techniques require large amounts of data to train the network, which is often impossible to obtain given the limited amount of annotated data in plant disease classification problems. In most cases, researchers fine-tune off-the-shelf deep models to cope with limited data. Contrary to popular belief, some research has shown that transfer learning does not always lead to better performance, because features cannot always be readily transferred to other tasks Raghu et al. (2019); He et al. (2019). In addition, some methods try to improve performance with complex scenarios and very large deep learning models. However, when data are limited, using a huge deep model with many parameters can lead to overfitting. Moreover, most models learn not only relevant disease features but unfortunately also irrelevant image features such as background noise or uninfected plant parts Ferentinos (2018); Mohanty et al. (2016); Toda and Okura (2019); Lee et al. (2020b). This leads to confusion between similar plants of different disease classes; the limited-data problem thus drives the model to learn irrelevant features. To address this, Fuentes et al. (2017, 2019) proposed a region-based deep neural network that focuses on the contaminated parts of leaves. This is a very time-consuming technique because it requires labour-intensive manual annotation of disease locations and also depends heavily on prior knowledge of plant diseases. Lee et al. (2020a) developed a method based on GoogleNet and a Recurrent Neural Network (RNN) to automatically locate infected regions and extract relevant features for disease image classification (20 disease classes and one healthy class); however, on the PlantVillage dataset, GoogleNet alone performed better than the combined GoogleNet and RNN model. Finally, oversized deep neural network models tend to produce many redundant features that are either shifted versions of one another or very similar, which reduces system performance Ayinde et al. (2019).

Because the size and shape of a leaf disease can differ significantly across growth stages, an attention mechanism can enhance disease feature extraction by highlighting disease information while suppressing non-disease leaf features and background information, resulting in better detection accuracy. The Convolutional Block Attention Module (CBAM) is an effective attention module for feedforward convolutional neural networks Woo et al. (2018). Given an intermediate feature map, CBAM sequentially infers attention maps along two separate dimensions (channel and spatial) and then multiplies each attention map with the input feature map for adaptive feature refinement. CBAM is a lightweight module with negligible parameters that can be plugged into the output of any CNN. In terms of both parameters and computation, the overall overhead of CBAM is quite small, yet its effect on CNN performance is remarkable.
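To make the mechanism concrete, the following is a minimal sketch of CBAM in TensorFlow/Keras, written from the description in Woo et al. (2018); the reduction ratio r=8 and all layer choices are illustrative assumptions, not details taken from our implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam_block(feature_map, reduction_ratio=8):
    """Channel attention followed by spatial attention (Woo et al. 2018)."""
    channels = feature_map.shape[-1]

    # Channel attention: a shared two-layer MLP scores avg- and max-pooled
    # channel descriptors; their sum passes through a sigmoid gate.
    dense_1 = layers.Dense(channels // reduction_ratio, activation="relu")
    dense_2 = layers.Dense(channels)
    avg_desc = dense_2(dense_1(layers.GlobalAveragePooling2D()(feature_map)))
    max_desc = dense_2(dense_1(layers.GlobalMaxPooling2D()(feature_map)))
    channel_att = layers.Reshape((1, 1, channels))(
        layers.Activation("sigmoid")(avg_desc + max_desc))
    x = feature_map * channel_att  # channel-refined features

    # Spatial attention: average- and max-pool along the channel axis,
    # then a 7x7 convolution produces a single-channel spatial gate.
    avg_map = tf.reduce_mean(x, axis=-1, keepdims=True)
    max_map = tf.reduce_max(x, axis=-1, keepdims=True)
    spatial_att = layers.Conv2D(1, kernel_size=7, padding="same",
                                activation="sigmoid")(
        layers.Concatenate()([avg_map, max_map]))
    return x * spatial_att  # fully refined feature map
```

The refined feature map simply replaces the original one before the pooling and classification layers, which is how CBAM is plugged into the output of a backbone.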

In this research, well-known CNN architectures (EfficientNetB0, MobileNetV2, ResNet50, InceptionV3, and VGG19) were trained with images of healthy and diseased plant leaves to develop an automatic system for plant disease diagnosis. To address the problems of limited data and learning irrelevant features, we plugged CBAM into the output feature map of the CNNs to highlight important local regions and extract discriminative features. To demonstrate the effectiveness of the proposed system in a real-world application, we trained and tested our system on the DiaMOS Plant dataset, which contains images acquired under real field growing conditions Fenu and Malloci (2021). DiaMOS Plant is a publicly available dataset of leaf diseases in pear trees; among other data, it contains 3006 images of leaves showing different stress symptoms. The main contributions of the proposed work can be summarized as follows:

  • We show that using a huge deep learning model with many parameters on small datasets is not an efficient solution.

  • We explore the benefit of the attention module for learning plant disease representations and its effect on classification performance.

  • We show the effectiveness of the attention mechanism in extracting more discriminative features and improving the performance of pre-trained CNNs with little training data.

The rest of the paper is organized as follows. Sect. 2 introduces the CNN models, the CBAM used to investigate how the attention mechanism Vaswani et al. (2017) affects system performance, and the proposed system. Sect. 3 presents the dataset used for training and testing, as well as the training procedure and performance evaluation. Sect. 4 presents the results of applying the proposed models to plant disease detection and diagnosis, and Sect. 5 concludes the paper with directions for future research to improve the proposed method.

Methodology

Pre-trained CNNs

In this study, we used the following five pre-trained CNNs for plant disease classification. We initialized the networks with weights from the ImageNet dataset Russakovsky et al. (2015) and then froze all convolutional and max-pooling layers so that their weights could not be modified. Table 1 presents brief information about the architecture of each network.
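As an illustration, the snippet below shows one way to build such a frozen backbone with a new classification head in TensorFlow/Keras; the four-class head matches the DiaMOS Plant classes, while the framework choice and the remaining details are our assumptions rather than the exact implementation.

```python
import tensorflow as tf

# Frozen ImageNet backbone; include_top=False removes the 1000-class head.
base = tf.keras.applications.EfficientNetB0(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # convolutional and pooling weights stay fixed

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)  # output feature map of the frozen backbone
# x = cbam_block(x)               # CBAM can be plugged in here (see Sect. 2)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(4, activation="softmax")(x)  # 4 disease classes

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

The same pattern applies to the other four backbones through their tf.keras.applications counterparts.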

Table 1 A brief comparison between the CNN architectures used in this study

EfficientNet

Using the concept of compound scaling, Tan and Le (2019) proposed the EfficientNet family of models, which has a great capability for feature extraction. EfficientNet is designed based on a multi-objective neural architecture search, with compound scaling used to uniformly scale the depth, width, and resolution of the network. The core component of the network is the mobile inverted bottleneck convolution module, which is inspired by the inverted residual and residual structures. Thus, instead of scaling only the depth of the network, EfficientNet scales the network's width and resolution as well. Compared to other CNN architectures, it has fewer parameters and higher accuracy. In this study, we used EfficientNetB0, which has 2 convolution layers and 16 mobile inverted bottleneck convolution modules; the total number of parameters for the whole network is about 5.3 million. EfficientNetB0 has a predefined \(224\times 224\times 3\) input size.
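For reference, the compound scaling rule of Tan and Le (2019) couples network depth \(d\), width \(w\), and input resolution \(r\) through a single coefficient \(\phi\); the constraint below is the one given in the original paper, not a quantity tuned in this study:

$$d=\alpha^{\phi},\qquad w=\beta^{\phi},\qquad r=\gamma^{\phi},\qquad \text{s.t.}\ \alpha\cdot\beta^{2}\cdot\gamma^{2}\approx 2,\ \ \alpha\geq 1,\ \beta\geq 1,\ \gamma\geq 1$$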

MobileNet

By replacing the standard convolution layers with depthwise separable convolution blocks, Howard et al. (2017) proposed MobileNet as a light-weight architecture. In this study, we used MobileNetV2, which has a predefined input size of \(224\times 224\times 3\).

Training Procedure and Performance Evaluation

The input images of the CNNs underwent one-to-one augmentation without duplication. To avoid long training times, we used models pre-trained on the ImageNet dataset together with a cross-entropy loss function. To avoid the plateau phenomenon, the model's validation loss was monitored and the learning rate was reduced when the loss stopped improving. The learning rate was set to \(2\times 10^{-5}\) and the momentum to 0.9, and we used the Adam algorithm for optimization. The performance of the models was evaluated by the macro-averaged and micro-averaged versions of precision, recall, and F-score, as well as the overall classification accuracy, all extracted from the confusion matrices as follows:

$$\text{ Precision }=\frac{\text{TP}}{\text{TP}+\text{FP}}$$
(1)
$$\text{ Recall }=\frac{\text{TP}}{\text{TP}+\text{FN}}$$
(2)
$$\text{F-Score }=2\times\frac{\text{ Precision }\times\text{ Recall }}{\text{ Precision }+\text{ Recall }}$$
(3)

where TP, TN, FP, and FN are true positives, true negatives, false positives, and false negatives, respectively. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations, whereas Recall (Sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual class. Thus, precision focuses on the predictions, whereas recall focuses on the ground truth. F-Score is the harmonic mean of Precision and Recall. The measure selected by the authors for ranking the systems was the overall classification accuracy.
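As a sketch of how these metrics can be computed in practice, the following uses scikit-learn on hypothetical label arrays; macro-averaging averages the per-class scores of Eqs. 1-3, while micro-averaging pools TP, FP, and FN over all classes before applying the formulas.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = np.array([0, 1, 2, 3, 1, 0, 2, 3])  # hypothetical ground-truth labels
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 3])  # hypothetical model predictions

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted

for avg in ("macro", "micro"):
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.3f} recall={r:.3f} F-score={f:.3f}")

accuracy = np.trace(cm) / cm.sum()  # overall classification accuracy
```

Per-fold confusion matrices can also simply be summed across the ten folds, as done for Fig. 4.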

Results and Discussion

The results obtained by the five pre-trained CNNs for classification of the pear disease images, with and without applying CBAM to the CNN outputs, are presented in Table 3. The values are means over the ten folds. We compared the performance of applying CBAM to the outputs of the ImageNet pre-trained models ResNet50, VGG19, InceptionV3, MobileNetV2 and EfficientNetB0, since these were reported as the baseline models for pear disease identification on the DiaMOS Plant dataset Fenu and Malloci (2021). As discussed in Sect. 3.2, the training and test sets provided by the DiaMOS Plant dataset collectors overlap, which is the main reason they achieved higher rates than ours. Therefore, for an unbiased comparison, we trained and tested the models on disjoint sets using ten-fold cross validation.

First, we compared the performance of the three deepest models, i.e. ResNet50, VGG19 and InceptionV3, which achieved accuracies of 69.90, 73.09 and 76.61%, respectively. These are very large models with many parameters to learn, whereas the DiaMOS Plant dataset is a small, imbalanced dataset of 3,505 images, including only 43 and 54 images for the healthy and curl classes, respectively. As discussed earlier, using a huge deep learning model with many parameters on a small dataset can lead to overfitting and consequently reduce performance on the test set. To demonstrate the role of CBAM in enhancing representation power and performance, CBAM was plugged into the baseline networks only. The networks integrated with CBAM outperform the baselines on all performance metrics except for VGG19: VGG19+CBAM performs close to VGG19 on the micro-averaged precision and micro-averaged F-score metrics, which could be due to the large size of VGG19 relative to the DiaMOS Plant dataset. Nevertheless, the networks with CBAM achieved the best accuracies compared to the baselines, demonstrating that CBAM generates a richer descriptor and spatial attention that complements the channel attention effectively. Applying CBAM to the outputs of the models improved the accuracies of ResNet50, VGG19, and InceptionV3 by 1.46, 0.33, and 1.26%, respectively.

Since the overall overhead of CBAM is quite small in terms of both parameters and computation, as shown in Table 4, CBAM was also applied to the light-weight networks MobileNetV2 and EfficientNetB0. These achieved accuracies of 82.06 and 85.82%, respectively, indicating that using light-weight backbone networks on the small DiaMOS Plant dataset leads to better performance. As shown in Table 3, MobileNetV2 and EfficientNetB0 improve significantly on all performance metrics when CBAM is applied: MobileNetV2+CBAM and EfficientNetB0+CBAM achieved accuracies of 83.99 and 86.89%, exceeding MobileNetV2 and EfficientNetB0 by 1.93 and 1.07%, respectively. Among the baselines, EfficientNetB0 has the best results on all performance metrics, and EfficientNetB0 integrated with CBAM outperforms EfficientNetB0 on the macro-averaged and micro-averaged versions of precision, recall, and F-score as well as overall classification accuracy.

Table 3 Micro- and macro-averaged measurements of the pre-trained and pre-trained+CBAM models on the test set for classification of the "healthy", "slug", "curl" and "spot" classes over ten-fold cross validation (mean \(\pm\) standard deviation). The best results for each network are indicated in bold
Table 4 Comparison of the different networks in terms of parameters and computation over ten-fold cross validation (mean \(\pm\) standard deviation) when integrated with CBAM. CBAM adds only a light parameter and computational overhead
Table 5 Results of the paired t-test between the pre-trained and proposed models

To make the role of CBAM in discriminative feature enhancement and performance improvement more transparent, gradient-weighted class activation mapping (Grad-CAM) Selvaraju et al. (2017) was applied to the EfficientNetB0 and EfficientNetB0+CBAM networks using images from the DiaMOS Plant test set to highlight important regions. Grad-CAM is a gradient-based visualization method that estimates the importance of spatial locations in convolutional layers with respect to a given class. We examined how CBAM helps the network enhance its discriminative power by highlighting the regions the network considers important for predicting a class. The visualization results of the CBAM-integrated EfficientNetB0 (EfficientNetB0+CBAM) were compared with the baseline (EfficientNetB0). Fig. 3 illustrates the visualization results, along with the softmax scores for the target class and the other classes.
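For readers who want to reproduce such maps, below is a condensed Grad-CAM sketch following Selvaraju et al. (2017), assuming a TensorFlow/Keras model; the layer name argument is a placeholder rather than our exact setup.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_index):
    """Return a [0, 1] heatmap of class evidence for one preprocessed image."""
    # Auxiliary model mapping the input to (last conv feature map, class scores).
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)   # d score / d feature map
    weights = tf.reduce_mean(grads, axis=(1, 2))   # global-average-pool gradients
    cam = tf.reduce_sum(conv_out * weights[:, tf.newaxis, tf.newaxis, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                       # keep only positive influence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```

The heatmap is then upsampled to the input resolution and overlaid on the leaf image, as in Fig. 3.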

Fig. 3

Grad-CAM visualization results highlighting the important regions for the trained model's prediction. We compared the visualization results of the CBAM-integrated EfficientNetB0 (EfficientNetB0+CBAM) with the baseline (EfficientNetB0). The Grad-CAM visualization was calculated for the last convolutional outputs. The ground-truth label is shown at the top of each input image, and P denotes the softmax score of each network for the different classes. The correctly predicted class and its score are shown in blue, and the incorrectly predicted class and its score in red. It is apparent that CBAM helps the network correct its predictions and increase the target class scores

From Fig. 3, it can be seen that EfficientNetB0+CBAM covers the plant symptom regions better than EfficientNetB0: CBAM helps EfficientNetB0 better exploit the information in leaf disease regions and aggregate features from them. Due to the high degree of similarity between leaf spot and slug damage in some images, the network predicts the target class incorrectly; as shown in Fig. 3, EfficientNetB0 predicted a slug input image as the leaf spot class. The feature refinement process of CBAM helps the network utilize the given features well and correct its prediction. In addition, for the leaf spot input, CBAM helps the network increase the target class score and decrease the other class scores accordingly. This leads to a more discriminative deep learning model, which can help classify real application data.

Confusion matrices for EfficientNetB0 (the best performing pre-trained CNN) and EfficientNetB0+CBAM are shown in Fig. 4; they were obtained by summing the confusion matrices of all ten folds. A confusion matrix summarizes the performance of a predictive model, showing which classes are predicted correctly and which are confused. As can be seen, EfficientNetB0 predicted some slug damage images as the leaf spot class because of the high degree of similarity between the classes and the imbalanced data. Integrating the network with CBAM helps to distinguish between the classes and improves classification accuracy.

Fig. 4

Confusion matrices related to a EfficientNetB0 and b EfficientNetB0+CBAM for classifying plant disease images of the DiaMOS Plant dataset (obtained by summing the confusion matrices of all the ten folds)

It is interesting that these improvements result from plugging CBAM into the pre-trained models with negligible parameter overhead, indicating that the enhancement is not due to a naive capacity increase but to CBAM's effective feature refinement. As a result, the networks integrated with CBAM outperform all the baselines, demonstrating the general applicability of CBAM across different architectures. CBAM can be seamlessly integrated into any CNN architecture and trained end-to-end to improve the network. In addition, the results with the light-weight backbone networks (MobileNetV2 and EfficientNetB0), together with the small overhead of CBAM, show that CBAM can be an effective module for improving network performance on low-end devices.

To determine whether there were statistically significant differences between the means of the pre-trained and pre-trained+CBAM models in Table 3, a paired t-test was conducted. If \(p\text{-Value}<\alpha=0.05\), the null hypothesis is rejected, meaning that the differences in model outcomes are convincing at the 95% confidence level and can be considered significant. Table 5 reports the results of the paired t-test: \(p\text{-Value}<0.05\) was found in all comparisons except for VGG19. Using a huge deep learning model with many parameters, such as VGG19, on a small dataset can lead to overfitting and consequently reduce performance on the test set; therefore, plugging CBAM into VGG19 does not improve its performance significantly. From Table 5, the null hypothesis is rejected for ResNet50, InceptionV3, MobileNetV2, and EfficientNetB0, so it can be concluded that the accuracy differences in Table 3 between the pre-trained and pre-trained+CBAM models are not due to chance for these networks. Accordingly, the hypothesis proposed in our contributions is supported.
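For illustration, the paired t-test over per-fold accuracies can be run as below with SciPy; the two arrays are hypothetical per-fold values, not the actual results behind Table 5.

```python
from scipy import stats

# Hypothetical per-fold accuracies for a baseline and its CBAM variant,
# paired by fold (same train/test split in each position).
acc_baseline = [0.850, 0.861, 0.849, 0.872, 0.858,
                0.860, 0.855, 0.867, 0.852, 0.859]
acc_cbam     = [0.868, 0.872, 0.861, 0.880, 0.869,
                0.871, 0.865, 0.878, 0.863, 0.872]

t_stat, p_value = stats.ttest_rel(acc_cbam, acc_baseline)
if p_value < 0.05:  # reject the null hypothesis at alpha = 0.05
    print(f"significant at 95% confidence (p = {p_value:.4f})")
```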

Conclusions and Future Directions

In this study, the Convolutional Block Attention Module (CBAM) was investigated as a way to improve the representation power of CNNs for automatic plant disease classification. We applied attention-based feature refinement with two different modules, channel and spatial, to the baselines' output feature maps to improve performance while keeping overhead low. CBAM helps the baselines learn what and where to attend in order to refine features effectively. We have shown that integrating CBAM with the baselines yields higher generalization ability than the baselines alone, especially in discriminating classes with similar symptoms. We conducted our experiments on the DiaMOS Plant dataset, which was collected under uncontrolled conditions. The baselines integrated with CBAM performed better than all the plain baselines. We also showed that CBAM causes the network to focus properly on the target plant disease class. CBAM can thus help networks overcome the lack of sufficient training data needed to learn deep models. In this study, we only plugged CBAM into pre-trained models to show the important role of the attention module in improving plant disease classification on an imbalanced small dataset. Future research directions include: (1) developing a novel deep model based on the attention module, e.g., CBAM with light-weight residual or dense blocks such as SoilNet Alirezazadeh et al. (2021), for plant disease classification; and (2) using a margin-based softmax loss Alirezazadeh et al. (2022); Tavakoli et al. (2021) instead of the original softmax to improve the discriminative power of the feature space.