1 Introduction

Agriculture plays a vital role in a country’s growth. Despite several initiatives launched by developing nations to eradicate malnutrition and feed the growing population, the daily dietary needs of approximately 800 million people worldwide remain unmet due to insufficient access to food [1]. This concern is also reflected in the United Nations 2030 Agenda for Sustainable Development, which includes the zero hunger (SDG 2) and good health and well-being (SDG 3) goals aimed at addressing food security issues [2]. To overcome these issues, extensive research is required to develop intelligent farming paradigms incorporating remote sensing and IoT applications. Several statistical data models and machine learning-based methods have been deployed to predict crop yields across various countries [3]. Such predictions help farmers take corrective action early to maintain their fields and help governments plan export–import policies.

In recent years, the application of artificial intelligence (AI) in agriculture has significantly benefited farmers. AI-based models have been developed to address various crop-related issues, reducing human intervention and enabling timely and accurate decision-making [2, 3]. Early crop yield prediction requires obtaining a crop’s spatial-temporal attributes during ripening. The use of machine learning or deep learning models for prediction is becoming more prevalent because they can capture linear and nonlinear associations among the attributes, thereby enhancing the prediction accuracy of the network [4]. Although several statistics-based yield estimation models have been developed in the literature, they are often time-consuming and tedious, and their accuracy is mainly influenced by the weather and phenological attributes of a crop during its growth period [5]. In earlier studies, data mining and regression techniques were widely used to predict crop yield based on production and climatic factors. Nevertheless, the main drawback of these methods is their inability to handle complex relationships among independent parameters [6, 7].

Crop datasets often contain irrelevant or duplicate features that do little to improve a model’s accuracy. Therefore, feature selection is a crucial step in the preprocessing phase of a model [8]. Metaheuristic algorithms, such as particle swarm optimization, genetic algorithms, and ant colony optimization, are used to search a dataset’s attributes for the best subset of features, and the best hyperparameters, that can improve a model’s accuracy [6, 9]. In contrast, deep learning approaches have the advantage of automatic feature learning over machine learning techniques because they can derive high-level features from the low-level features in a dataset [10]. While these techniques have gained popularity in various domains, their use in agricultural research is still limited to a few cases.

Thus, the primary motivation of this work is to introduce an integrated model that overcomes the drawbacks of machine learning and linear models and can determine nonlinear relationships among the attributes of the wheat dataset. In this study, we propose a novel PSO-optimized Conv-1D and Bi-LSTM approach for wheat crop yield prediction in the major wheat-producing states of India. The significant innovations and contributions of this work are summarized as follows:

  • Develop an optimized hybrid deep learning approach, namely, “PSO-CNN-Bi-LSTM,” for the optimal selection of hyperparameters for crop yield prediction based on spatial-temporal features, including the remote sensing-derived vegetation condition index, meteorological data, historical data, and soil factors.

  • Evaluate the performance of the proposed approach with publicly available large crop datasets in terms of the MAE, MSE, and RMSE to ensure its effectiveness and applicability compared with traditional approaches.

  • Perform a comparative analysis with baseline models and existing state-of-the-art hybrid models to demonstrate the credibility of the proposed approach.

The subsequent sections of this paper are arranged as follows. Section 2 discusses the existing related work. Section 3 explains the proposed methodology, study area, description of the dataset, and preprocessing steps. Section 4 illustrates the experimental outcomes, performance evaluation measures, and comparison with existing baseline methods. Section 5 describes the contributions and significance of the proposed work. Section 6 concludes the proposed study with the future direction and a discussion of the limitations of this work.

2 Related Work

This section provides an overview of the literature related to crop yield prediction research. With advancements in artificial intelligence, deep learning (DL) and machine learning (ML) approaches have been increasingly utilized to overcome the limitations of traditional linear models in investigating agricultural issues. In the era of smart farming, several authors have employed deep learning-based LSTM networks to predict and forecast crop yields [10,11,12,13,14]. For instance, in a study [10], the authors proposed a recurrent neural network (RNN) to predict wheat yield in the Punjab state of India using historical and weather attributes of a particular region during the October to March season. The outcomes of the proposed technique were compared with random forest and regression methods, and the RMSE value was found to be 147.12. Similarly, in a study [11], the performance of Kharif crop yield prediction was evaluated using a DNN-LSTM in the Rewari district of Haryana. The proposed method achieved an RMSE value of 81.9 and was compared with traditional ARIMA and regression models. The results suggested that the model accuracy could be improved by expanding the dataset used to train the network. However, the main drawback of the RNN-LSTM model is the vanishing gradient problem in learning yield dependencies over time. Some authors have utilized regression and neural networks to analyze crop prediction. In a study [15], the authors performed a comparative analysis among random forest (RF), multiple linear regression (MLR), and an artificial neural network (ANN) to predict sugarcane yield using a publicly available CAN dataset. A global filtering method was used to handle noise and outliers in the dataset. It was found that SFC was the most important attribute among the mentioned parameters in the data, and an RMSE value of 7.7 was obtained by using the random forest approach. Similarly, in a study [16], the authors evaluated the performance of random forest, polynomial regression, and SVR to predict potato and maize yields. They found that the random forest method performed best, with RMSE values of 510 and 129, respectively.

To address the limitations of individual methods, some researchers have combined multiple methods into hybrid architectures to improve model accuracy. A PSO-based regression model was employed in a study [6] to predict apricot yield and assess the essential attributes among soil and irrigation samples from 110 GPS locations in central Iran. The authors trained the model on 80% of the data using 61 variables and attained an RMSE value of 2.329. In another study [17], a hybrid approach combining an ANN with a grey wolf optimizer (GWO) and an ICA was proposed to optimize the model hyperparameters. The investigation was conducted on barley, wheat, potato, and sugar beet crops using agricultural statistical data with climatic factors. The analysis showed that the RMSE value of the ANN-ICA approach was 3.20, slightly higher than the ANN-GWO value of 3.19. In a study [14], the authors utilized a GLCM with the LSTM method to capture the spatiotemporal features of a crop during the growing phase and obtained an RMSE value of 4.40. Similarly, to improve the performance of LSTM, an IOF-LSTM hybrid model was introduced in [18] to handle the underfitting and overfitting issues of LSTM and attained an RMSE value of 2.19. In a study [19], the authors proposed a hybrid method for predicting rice yield in the districts of West Bengal by combining the ARIMA model with an ANN. They overcame the limitations of the ARIMA model by experimenting with 20 years of statistical data. The authors in [20] presented a generalized stacking model with LASSO regression for yield prediction based on 26 attributes related to the crop and climate of the study region in the US. The proposed ensemble model achieved 88.89% accuracy, surpassing traditional approaches. In a study [21], a hybrid MLP-ANN model was employed to predict maize yield and achieved an RMSE value of 71 kg/ha. The authors suggested that the model could be improved using a graph-based RNN structure to incorporate the spatial and temporal attributes of a crop. Numerous studies have demonstrated that optimization techniques enhance the accuracy of deep learning and machine learning models by selecting appropriate hyperparameters [22, 23]. With the introduction of AI, various paradigms have been proposed to handle yield prediction in smart farming. However, researchers still need to improve the analysis of nonlinear crop-related datasets. The major shortcomings of the related work on crop yield prediction are as follows:

  • Most existing yield prediction models employ linear or regression models to achieve higher accuracy. Nevertheless, their major limitation is that they cannot capture nonlinear characteristics and exhibit poor performance in long-term prediction.

  • Recently, RNNs and LSTM have gained popularity in processing long-term dependencies and forecasting production and yield. However, they have issues with vanishing gradients in long-term prediction. GRUs and Bi-LSTM models have been introduced to address this problem in the network.

  • Traditional crop yield prediction techniques employ SVM, ANN, MLP, or random forest approaches, which require long training times on large crop datasets because too many irrelevant parameters are included in the pursuit of higher accuracy.

The challenges mentioned above demand a sophisticated crop yield prediction model. Therefore, we propose a novel hybrid model that overcomes the drawbacks of traditional approaches and leverages the advantages of deep learning methods for predicting wheat yield in major wheat-producing regions. The proposed methodology and a description of the dataset are presented in the next section.

3 Materials and Methods

3.1 Study Region

The present study focuses on India’s major wheat-producing states, which together account for 13.5% of the world’s wheat production. In India, wheat is cultivated as a “rabi season” (winter) crop. In this study, data from Haryana, Punjab, Rajasthan, Uttar Pradesh, Madhya Pradesh, Bihar, and Gujarat were analyzed. Considering the latest statistics for 2021–22, Uttar Pradesh, Punjab, and Madhya Pradesh are India’s leading contributors to wheat production. Figure 1 presents the distribution of crop data in the dataset, revealing that Uttar Pradesh accounted for 21.8% and Madhya Pradesh for 16.9%, followed by Haryana and Punjab, according to statistical records.

Fig. 1
figure 1

Box plot representing the statewise data distribution in the dataset

3.2 Dataset and Preprocessing Steps

In this study, we gathered wheat crop-related temporal and spatial information, including meteorological and historical yield data from 2000 to 2018. All the relevant parameters were collected for the October–April crop season and combined in a CSV file containing 100,828 instances, with yield as the dependent (target) variable and the 90 other parameters as independent variables, as illustrated in Table 1. The dataset includes various climatic factors, such as temperature, wind direction, maximum temperature, minimum temperature, surface pressure, wind speed, specific humidity, relative humidity, precipitation, soil moisture, dew point, and profile soil moisture, in addition to vegetation condition indices (VCIs), which were derived from the MOD13A2 NDVI product on Google Earth Engine for the study regions [24, 25]. The historical yield data were sourced from the Directorate of Economics and Statistics [16].

Table 1 Characteristics of the wheat crop yield dataset

Approximately 5% of the data in the selected wheat dataset were missing and needed to be imputed. Some studies have suggested that average (mean) imputation performs well when less than 10% of the data are missing [28, 29]. Therefore, we used average imputation to fill the missing values in the wheat dataset. Then, we performed feature engineering and standardized the data values using the standard scaler method, which rescales each feature so that its mean is 0 and its standard deviation is 1. To implement this, we used the Feature-engine package in Python, which builds on the scikit-learn fit and transform functions. After standardization, we used the drop-correlated transformer in Feature-engine and applied the Pearson correlation coefficient with a 0.6 threshold to remove features that are weakly correlated with the yield attribute. This threshold was set to retain features with significant correlations and exclude the remaining features from further analysis, leaving 33 features significantly correlated with the target (yield) variable. After choosing the relevant features, we partitioned the dataset into training and testing sets with an 80:20 split to build and validate the proposed model.
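A minimal Python sketch of this preprocessing pipeline is shown below. The file name and column names are hypothetical, and plain pandas/scikit-learn calls are used in place of the Feature-engine transformers described above; it is an illustrative sketch rather than the exact pipeline used in this study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical file and column names; "yield" is the target variable.
df = pd.read_csv("wheat_dataset.csv")
df = df.fillna(df.mean(numeric_only=True))       # average imputation (<10% missing)

target = df["yield"]
features = df.drop(columns=["yield"])

# Keep only features whose absolute Pearson correlation with yield is >= 0.6.
corr = features.corrwith(target, method="pearson").abs()
features = features[corr[corr >= 0.6].index]     # 33 features remain for the wheat data

# Standardize to zero mean and unit variance (Eq. 1), then apply the 80:20 split.
X = StandardScaler().fit_transform(features)
X_train, X_test, y_train, y_test = train_test_split(
    X, target.values, test_size=0.2, random_state=42)
```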

3.3 Proposed Method (PSO-CNN-Bi-LSTM) for Yield Prediction

This section discusses the operation of the proposed methodology for yield prediction, and Fig. 2 shows the workflow of the proposed work, which will be discussed in detail in the subsequent section.

Fig. 2
figure 2

Methodology of the proposed PSO-CNN-Bi-LSTM method

In neural networks, hyperparameter selection is often a decisive factor in network performance. Training with stochastic gradient descent is commonly used, but tuning the associated hyperparameters by hand requires extensive resources and labor and does not always find the global optimum. Therefore, the proposed methodology uses a nature-inspired technique, particle swarm optimization (PSO), to optimize the hyperparameters of the hybrid deep learning network. PSO effectively avoids local optima during the training phase of the network. The PSO-optimized parameters of the CNN-Bi-LSTM model include the learning rate, number of hidden nodes, dropout rate, number of filters in the Conv1D layer, and number of neurons in the hidden layer. In the literature, many authors have illustrated the advantage of using CNNs for extracting relevant features from data; CNN applications are widely known in agriculture for seed classification, crop yield prediction, plant recognition, and fruit counting [12, 30,31,32,33]. At the same time, Bi-LSTM layers can capture long-term dependencies in both the forward and backward directions, and Bi-LSTM has been widely applied in smart farming for crop detection [34], intelligent irrigation systems [35], and in-field soil moisture forecasting [36]. The CNN-Bi-LSTM combination is also widely used for crop seed analysis [30], yield prediction [31], and crop prediction improvement [37]. The proposed method utilizes the advantages of both the CNN and the Bi-LSTM approaches.
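To illustrate how a PSO particle can encode these hyperparameters, the sketch below maps a continuous position vector to concrete values; the search-space bounds are illustrative assumptions rather than the ranges used in this study.

```python
# Hypothetical search space for the PSO-tuned hyperparameters listed above.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-2),
    "conv_filters":  (16, 128),
    "lstm_units":    (16, 128),
    "dense_units":   (16, 128),
    "dropout_rate":  (0.1, 0.5),
}

def decode_particle(position):
    """Map a continuous PSO position vector to concrete hyperparameter values."""
    hyperparams = {}
    for value, (name, (low, high)) in zip(position, SEARCH_SPACE.items()):
        value = min(max(value, low), high)              # clip to the search bounds
        if name in ("conv_filters", "lstm_units", "dense_units"):
            value = int(round(value))                   # integer-valued settings
        hyperparams[name] = value
    return hyperparams

# Example: a 5-dimensional particle position becomes one candidate configuration.
print(decode_particle([3e-4, 40.2, 64.7, 30.1, 0.25]))
```

Each candidate configuration would then be used to build and train the CNN-Bi-LSTM network, and the resulting validation error would typically serve as the particle’s fitness.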

  • Step 1. Data preprocessing and feature engineering of the original data. Initially, the original data are preprocessed to impute the values in the dataset, and after that, outliers are removed. Data scaling is performed using the standard scaler method represented by Eq. (1),

    $$W=\frac{y-\gamma }{\sigma },$$
    (1)

    where “\(\gamma\)” denotes the mean and “\(\sigma\)” the standard deviation of the sample “\(y\)”.

  • Step 2. Build the model and conduct the training process. After preprocessing, the CNN-Bi-LSTM network is represented by Eq. (2),

    $$Z(t)=F\left(Y(t),\,Z(t-1),\,Z(t-2),\dots,Z(t-n)\right),$$
    (2)

    where \(Z(t)\) is the predicted yield at time \(t\), \(Y(t)\) is the input feature vector at time \(t\), and \(F\) represents the proposed model’s function.

    Then, the dataset is partitioned into 80% for training and 20% for testing and validating the results, and the model is trained on the 80% portion of the dataset. The hyperparameters of the network are optimized using the PSO mechanism given in Algorithm 2, and the Conv1D and Bi-LSTM layers are built with the optimal parameters to enhance the accuracy of the proposed model.

  • Step 3. Evaluate the model. The predicted values are compared with the actual values of the test set to determine the efficiency of the proposed model using the evaluation metrics discussed in Eqs. (14) to (16). These three steps are elaborated in the following section.

After preprocessing the original data, the convolution layer of the CNN network extracts meaningful information from the crop yield dataset. We adopted the PSO method to iteratively tune the hyperparameters of the CNN-Bi-LSTM network, encoding them as “particles” updated using Eqs. (12) and (13). The fitness of each particle is evaluated and the parameters are adjusted automatically to obtain the optimal performance of the proposed approach. The CNN, a type of feedforward network, consists of an input layer, a convolution layer, a pooling layer, a fully connected layer, and an output layer, as represented in Fig. 3 [36]. The computations performed in the convolution layer are shown in Eq. (3),

$${Y}_{j}^{k}=f\left({\textstyle\sum }_{i\in {d}_{j}}{x}_{i}^{k-1}\times {v}_{ij}^{k}+{a}_{j}^{k}\right),$$
(3)

where \(f\) is an activation function, \({v}_{ij}^{k}\) is the kernel connecting the \(i\)th activation map of layer \(k-1\) to the \(j\)th activation map of layer \(k\), \({a}_{j}^{k}\) is the corresponding bias, and \({d}_{j}\) is the collection of input activation maps.
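As a concrete illustration of Eq. (3), the following NumPy sketch computes the activation maps of one 1-D convolution layer; the activation choice and array shapes are assumptions for illustration only.

```python
import numpy as np

def conv_layer(x, kernels, biases, activation=np.tanh):
    """Eq. (3): Y_j^k = f( sum_{i in d_j} x_i^{k-1} * v_ij^k + a_j^k ).
    x:       (in_maps, length)            activation maps of layer k-1
    kernels: (out_maps, in_maps, width)   kernels v_ij^k
    biases:  (out_maps,)                  biases a_j^k
    """
    out_maps, in_maps, width = kernels.shape
    out_len = x.shape[1] - width + 1
    y = np.zeros((out_maps, out_len))
    for j in range(out_maps):                 # j-th activation map of layer k
        for p in range(out_len):
            y[j, p] = np.sum(x[:, p:p + width] * kernels[j]) + biases[j]
    return activation(y)

# Example: 3 input maps of length 10 and 2 output maps with width-3 kernels.
x = np.random.rand(3, 10)
print(conv_layer(x, np.random.rand(2, 3, 3), np.zeros(2)).shape)  # (2, 8)
```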

Fig. 3
figure 3

Convolution neural network (CNN)

In convolution layer processing, the activation maps obtain the relevant features of the wheat dataset by extracting information from the lower layers to the higher layers of the input. The feature map of the convolutional layer becomes the input of the pooling layer, which captures the key points from the input to perform dimensionality reduction and discard irrelevant features. These layers reduce the computational overhead and improve the network’s speed. A maximum pooling layer is utilized to enhance the proposed method’s robustness in terms of yield prediction. The outcomes of this layer are fed into Bi-LSTM layers that are processed through three gates to control the output (\({o}_{t}\)) and cell state (\({z}_{t}\)). The Bi-LSTM network consists of two LSTM units that receive the input in the forward and backward directions, respectively, as represented in Fig. 4 [38, 39]. The Bi-LSTM can therefore capture long-term dependencies bidirectionally and selectively retain or forget information in the memory cells of the network [40]. Thereafter, a ReLU activation function is employed to distinguish the features of the respective classes and produce the desired prediction results. The processing of the proposed hybrid model includes data preprocessing and the convolution, Bi-LSTM, fully connected, and particle swarm optimizer operations, which are elaborated in Eqs. (4) to (11). The mathematical computation of the LSTM is given by Eqs. (4) to (8) and that of the Bi-LSTM by Eqs. (9) to (11). Let \({Y}_{t}=[{i}_{1},{i}_{2},{i}_{3},\dots ,{i}_{t}^{n}]\) represent the \(n\) inputs, \({h}_{t}=[{h}_{1},{h}_{2},{h}_{3},\dots ,{h}_{t}^{k}]\) represent the hidden units in the network, and \({z}_{t}=[{z}_{1},{z}_{2},{z}_{3},\dots ,{z}_{t}^{k}]\) represent the cell state of the LSTM at time \(t\).

Fig. 4
figure 4

Bi-LSTM network

The LSTM architecture includes “\({f}_{t}\)” as the forget gate, “\({i}_{t}\)” as the input gate, and “\({o}_{t}\)” as the output gate [36]. Initially, the data pass through the forget gate “\({f}_{t}\)”, which uses a sigmoid function to determine which information is discarded from the cell state of the LSTM and which is retained in memory, as computed in Eqs. (4) and (5) [41].

$${f}_{t}=\sigma \left({W}_{f}*\left[{o}_{t-1},{i}_{t}\right]+{b}_{f}\right),$$
(4)

Here, \(\sigma\) represents the sigmoid function, and \({W}_{f}\) and \({b}_{f}\) signify the weights and bias of the forget gate, respectively.

$${i}_{t}=\sigma \left({W}_{i}*\left[{o}_{t-1},{i}_{t}\right]+{b}_{i}\right).$$
(5)

Now, the tanh function creates a vector state (\({v}_{t}\)) that determines the new information that can be added, as shown in Eq. (6).

$${v}_{t}=\tanh \left({W}_{n}*\left[{o}_{t-1},{i}_{t}\right]+{b}_{n}\right).$$
(6)

The next step includes updating the state of a cell from \({z}_{t-1}\) to the new state \({z}_{t}\), as represented by Eq. (7).

$${z}_{t}={z}_{t-1}*{f}_{t}+{v}_{t}*{i}_{t}.$$
(7)

The final step considers the computation of the output gate of the LSTM using Eq. (8).

$${o}_{t}=\sigma \left({W}_{o}*\left[{o}_{t-1},{i}_{t}\right]+{b}_{o}\right).$$

Now, the hidden state is computed from the output gate and the tanh of the cell state (\({z}_{t}\)),

$${h}_{t}={o}_{t}*\mathrm{tanh }({z}_{t}),$$
(8)

The forward and backward states of the Bi-LSTM can be presented by Eqs. (9) to (11),

$$\overrightarrow{{h}_{t}}=\overrightarrow{LSTM}({z}_{t-1},{o}_{t-1}),$$
(9)
$$\overleftarrow{{h}_{t}}=\overleftarrow{LSTM}({z}_{t-1},{o}_{t-1}),$$
(10)
$${h}_{t}=LSTM(\overrightarrow{{h}_{t}},\overleftarrow{{h}_{t}}),$$
(11)

Here, \({h}_{t}\) combines the forward and backward LSTM outcomes to capture the temporal dependencies in the dataset in both directions. The proposed “PSO-CNN-Bi-LSTM” algorithm is presented in Algorithm 1.
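A plain NumPy sketch of one LSTM step (Eqs. 4–8) and of the bidirectional combination (Eqs. 9–11) is given below; the weight layout and the use of concatenation to merge the two directions are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, z_prev, W, b):
    """One LSTM step following Eqs. (4)-(8); W and b hold the four gate parameters."""
    u = np.concatenate([h_prev, x_t])            # [o_{t-1}, i_t]
    f_t = sigmoid(W["f"] @ u + b["f"])           # forget gate, Eq. (4)
    i_t = sigmoid(W["i"] @ u + b["i"])           # input gate, Eq. (5)
    v_t = np.tanh(W["n"] @ u + b["n"])           # candidate state, Eq. (6)
    z_t = f_t * z_prev + i_t * v_t               # cell-state update, Eq. (7)
    o_t = sigmoid(W["o"] @ u + b["o"])           # output gate
    h_t = o_t * np.tanh(z_t)                     # hidden state, Eq. (8)
    return h_t, z_t

def bilstm(sequence, params_fwd, params_bwd, hidden=8):
    """Run the sequence in both directions and combine the final hidden states."""
    h_f = z_f = h_b = z_b = np.zeros(hidden)
    for x_t in sequence:                         # forward pass, Eq. (9)
        h_f, z_f = lstm_step(x_t, h_f, z_f, *params_fwd)
    for x_t in sequence[::-1]:                   # backward pass, Eq. (10)
        h_b, z_b = lstm_step(x_t, h_b, z_b, *params_bwd)
    return np.concatenate([h_f, h_b])            # combined output, Eq. (11)

# Example: 5 time steps with 4 features each and a hidden size of 8.
rng = np.random.default_rng(0)
def rand_params(n_in, n_hid):
    return ({g: 0.1 * rng.normal(size=(n_hid, n_hid + n_in)) for g in "fino"},
            {g: np.zeros(n_hid) for g in "fino"})
seq = rng.normal(size=(5, 4))
print(bilstm(seq, rand_params(4, 8), rand_params(4, 8)).shape)  # (16,)
```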

Algorithm 1

Proposed PSO-CNN-Bi-LSTM network

figure a

PSO is a swarm intelligence-based metaheuristic that is utilized here to repeatedly select the hyperparameters of the CNN-Bi-LSTM network, removing the need for manual adjustment [42]. PSO was proposed by Kennedy and Eberhart in 1995, and it uses global and local information together with a fitness function to search for the best solution among the particles [43]. Algorithm 2 illustrates the PSO procedure used in our work [44]. The basic concept of PSO is that each particle accelerates toward its personal best position \(({p}_{best})\) and the global best position \({(g}_{best})\) with random weight adjustments at each step, where \({p}_{best}\) signifies the best solution attained by the particle itself and \({g}_{best}\) represents the best solution obtained by the particles in its neighborhood. Each particle updates its position and velocity using Eqs. (12) and (13):

$${x}_{i}={x}_{i-1}+{v}_{i},$$
(12)
$${v}_{i}=\Theta {v}_{i-1}+{co}_{1}\,{ro}_{1}\left[{p}_{best}-{x}_{i-1}\right]+{co}_{2}\,{ro}_{2}\left[{g}_{best}-{x}_{i-1}\right],$$
(13)

where \(\Theta {v}_{i-1}\) is the inertia term, \({co}_{1}\,{ro}_{1}\left[{p}_{best}-{x}_{i-1}\right]\) is the cognitive (personal) component, and \({co}_{2}\,{ro}_{2}\left[{g}_{best}-{x}_{i-1}\right]\) is the social (global) component, with \({co}_{1}\) and \({co}_{2}\) denoting acceleration coefficients and \({ro}_{1}\) and \({ro}_{2}\) denoting random numbers in \([0,1]\).
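The particle update in Eqs. (12) and (13) can be written compactly as follows; the inertia weight and acceleration coefficients shown are illustrative assumptions, not the values used in this study.

```python
import numpy as np

rng = np.random.default_rng(42)

def pso_update(x, v, p_best, g_best, theta=0.7, co1=1.5, co2=1.5):
    """One velocity and position update per Eqs. (12)-(13)."""
    ro1 = rng.random(x.shape)                    # random factors in [0, 1)
    ro2 = rng.random(x.shape)
    v_new = theta * v + co1 * ro1 * (p_best - x) + co2 * ro2 * (g_best - x)
    return x + v_new, v_new                      # new position and velocity

# Example: one update of a 5-dimensional particle.
x, v = rng.random(5), np.zeros(5)
x, v = pso_update(x, v, p_best=x.copy(), g_best=rng.random(5))
```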

Algorithm 2

PSO technique

figure b

4 Results and Discussion

This section discusses the experimental outcomes of the proposed framework, and a comparative analysis is presented in the subsequent subsections. The proposed methodology was implemented in Python using Jupyter Notebook on an Intel Core i7 processor with 16 GB of RAM.

During the experiment, 80% of the data were utilized for training, and the remaining data were used to test the accuracy of the obtained results. We utilized four datasets: Dataset 1 includes the wheat crop information discussed in Sect. 3.2, and the other three are publicly available datasets collected from Kaggle, FAOSTAT, and GitHub to evaluate the proposed model’s performance. The Agro_data dataset (Dataset 2) was collected from Kaggle and consists of rice yield information from 2000 to 2015 [45]. A crop and livestock product dataset (Dataset 3) was collected from FAOSTAT [46]. The Soybean_data dataset (Dataset 4) consists of 25,345 instances with 395 attributes representing soybean yields in Iowa, Illinois, and Indiana [12]. The accuracy of the proposed model was evaluated on these datasets, and the subsequent sections demonstrate that the proposed PSO-CNN-Bi-LSTM method obtained higher prediction accuracy and reliability than the other existing approaches tested. The developed model incorporated temporal and phenological information to predict crop yield.

4.1 Evaluation Metrics

We used the following metrics to evaluate the performance of the proposed model: root mean square error (\(\mathrm{RMSE}\)), mean absolute error (\(\mathrm{MAE}\)), and mean squared error (\(\mathrm{MSE}\)). The mean squared error measures the average squared difference between the observed and predicted values and thus how well the regression fits the data points. The root mean square error is the square root of the MSE and expresses the prediction error in the units of the target variable. The mean absolute error measures the average absolute difference between the predicted values and the ground truth. Their formulas are given in Eqs. (14) to (16):

$$\mathrm{MSE}=\frac{1}{S}\sum\nolimits_{j=1}^{S}{\left({q}_{j}-{\widehat{q}}_{j}\right)}^{2},$$
(14)
$$\mathrm{RMSE}=\sqrt{\frac{1}{S}\sum\nolimits_{j=1}^{S}{\left({q}_{j}-{\widehat{q}}_{j}\right)}^{2}},$$
(15)
$$\mathrm{MAE}=\frac{1}{S}\sum\nolimits_{j=1}^{S}\left|{q}_{j}-{\widehat{q}}_{j}\right|,$$
(16)

where \(S\) is the number of samples in the dataset, \({q}_{j}\) is the observed yield, and \({\widehat{q}}_{j}\) is the predicted yield.
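For reference, Eqs. (14)–(16) correspond to the following NumPy computations; the observed and predicted values in the example are illustrative only.

```python
import numpy as np

def evaluate(q_true, q_pred):
    """MSE, RMSE, and MAE as defined in Eqs. (14)-(16)."""
    q_true, q_pred = np.asarray(q_true, float), np.asarray(q_pred, float)
    mse = np.mean((q_true - q_pred) ** 2)
    return {"MSE": mse,
            "RMSE": np.sqrt(mse),
            "MAE": np.mean(np.abs(q_true - q_pred))}

# Example with observed vs. predicted yields (tonnes/ha).
print(evaluate([3.77, 4.68, 2.95], [3.67, 4.63, 3.10]))
```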

4.2 Experimental Analysis

Table 2 highlights the descriptive statistical results obtained for the yield attributes of Dataset 1, in which major wheat-producing state data were analyzed.

Table 2 Descriptive analysis of wheat crop yield (tonnes/hectare)

Table 2 presents the descriptive analysis of the wheat crop yield over the major states in India. The results illustrate that the highest mean values were observed in the Punjab and Haryana regions, with maximum wheat yields of 5.8 and 5.56 tonnes/ha, respectively. Most of the districts in the Bihar region exhibited low wheat yields. The wheat crop data distribution appeared approximately normal, as the skewness and kurtosis values were within the ranges of −2 to +2 and −7 to +7, respectively.

Figure 5 presents the analysis of wheat yield at the mentioned study sites from 2000 to 2018 and shows that the maximum values were observed in 2011 in the Panipat, Sirsa, and Fatehabad districts, at 5.56 tonnes/ha, 5.36 tonnes/ha, and 5.46 tonnes/ha, respectively. In contrast, the lowest value was observed in 2014, with 3.16 tonnes/ha in Gurgaon. The Sangrur district in Punjab exhibited the maximum yield in 2018, with 5.8 tonnes/ha, whereas the lowest was observed in Gurdaspur in 2014, with 3.5 tonnes/ha. The annual variability of rainfall in the Rajasthan region influenced crop growth; there, the maximum yield was 4.48 tonnes/ha, and the lowest was 1.1 tonnes/ha in Bikaner. In the Uttar Pradesh region, the highest yield was 4.76 tonnes/ha in the Hapur district and the lowest was 1.61 tonnes/ha in the Moradabad district. The Baruch district in Gujarat had the lowest wheat yield of 1.1 tonnes/ha in 2006, whereas the maximum yield of 4.56 tonnes/ha was observed in Junagadh. Among all the investigated districts, Sitamarhi had the lowest yield of 0.87 tonnes/ha.

Fig. 5
figure 5

Growth of the wheat crop in terms of yield (tonnes/ha)

Initially, feature engineering was performed to identify the correlation of the features with the target variable (yield) based on the Pearson correlation method with a threshold of 0.6 for all datasets, as presented in Fig. 6. Thereafter, noncorrelated features were removed, and the remaining features were partitioned into the training and testing sets of the network. In the wheat dataset, 33 correlated features were retained. For Dataset 2, 19 correlated features were fed into the training model, and for Dataset 3, this value was 7. For Dataset 4, the correlation analysis reduced the feature set to 196 features. The hyperparameters of the PSO-CNN-Bi-LSTM model are presented in Table 3.

Fig. 6
figure 6

Correlational analysis among dependent and independent attributes of wheat crop data

Table 3 Summary of the hyperparameters in the PSO-CNN-Bi-LSTM model

The framework of the proposed model includes one convolutional layer with 32 filters and a 3 × 3 kernel, followed by two bidirectional long short-term memory layers with 32 neurons per unit, which control the overall capacity of each layer. The network is trained with a dropout rate of 0.2 and a learning rate of 1e−5, with these hyperparameters selected by the PSO method. The model utilizes an optimizer that updates the weights and biases in the network to minimize the loss function.
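The architecture described above can be sketched in Keras as follows. The layer ordering, pooling size, padding, output activation, and choice of the Adam optimizer are assumptions; note also that a Conv1D layer takes a single kernel size, so 3 is used here, and the keyword defaults mirror the values reported above (in the proposed method they would be supplied by the PSO search).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_bilstm(n_timesteps, n_features, filters=32, lstm_units=32,
                     dropout_rate=0.2, learning_rate=1e-5):
    """Illustrative CNN-Bi-LSTM sketch; not the authors' exact implementation."""
    model = models.Sequential([
        layers.Input(shape=(n_timesteps, n_features)),
        layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(lstm_units)),
        layers.Dropout(dropout_rate),
        layers.Dense(1),                          # predicted yield (tonnes/ha)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="mse", metrics=["mae"])
    return model

model = build_cnn_bilstm(n_timesteps=7, n_features=33)   # shapes are illustrative
model.summary()
```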

4.3 Comparative Analysis

A comparative analysis was performed to confirm the credibility of the proposed PSO-CNN-Bi-LSTM method. This section presents the outcomes of the proposed model against existing baseline methods (Sect. 4.3.1) and recent state-of-the-art hybrid studies (Sect. 4.3.3) and discusses the validation of the computed yields against the actual yields (Sect. 4.3.2).

4.3.1 Comparison with Baseline Methods

This section highlights the comparison of the proposed method with existing machine learning, deep learning, and hybrid models, represented in Tables 4, 5, 6, and 7 for Dataset 1, Dataset 2, Dataset 3, and Dataset 4, respectively, in terms of the MSE, MAE, and RMSE.

Table 4 Comparison of the results of existing baseline methods with the proposed method for Dataset 1
Table 5 Comparison of the results of existing baseline methods with the proposed method for Dataset 2
Table 6 Comparison of the results of existing baseline methods with the proposed method for Dataset 3
Table 7 Comparison of the results of existing baseline methods with the proposed method for Dataset 4

For Dataset 1 (Table 4), the LSTM values for the evaluation metrics (MSE, MAE, and RMSE) are 0.89, 0.78, and 0.94, which are relatively better than those of the neural network, with values of 10.1, 2.57, and 2.17, and the CNN, with values of 6.8, 1.4, and 2.6, respectively. The CNN-Bi-LSTM model performs slightly better than the neural network, with values of 2.21, 1.17, and 1.48, respectively. The Bayesian-optimized CNN-Bi-LSTM model is superior to the aforementioned approaches, with values of 0.97, 1.02, and 0.98, respectively. The CNN-LSTM-PSO model performs reasonably well, with values of 0.57, 0.57, and 0.75, respectively. The PSO-CNN-Bi-LSTM model demonstrates the best performance among the evaluated approaches, with values of 0.18, 0.39, and 0.42, respectively, the lowest among all baseline methods. Hyperparameter tuning using PSO improved the efficiency of the hybrid deep learning model. It is also observed that the mean squared error is lowest with the proposed model and highest with the neural network.

For Dataset 2 (Table 5), we observe that the hybrid CNN-Bi-LSTM model exhibits poor performance, with values of 1.96, 1.20, and 1.40. The performance improves significantly when the network hyperparameters are optimized using particle swarm optimization, with attained values of 0.58, 0.55, and 0.76, respectively. Additionally, CNN-LSTM-BO performs slightly worse than the PSO-optimized CNN-LSTM model, with values of 0.72, 0.65, and 0.84, respectively. The proposed model’s results are close to those of the sigmoid-based optimized CNN-LSTM model, which attained values of 0.59, 0.55, and 0.77, respectively. It is also observed that the CNN performance is lower with Dataset 2, whereas the neural network approach performs better than the CNN approach, with values of 0.62, 0.63, and 0.78, respectively.

Table 6 demonstrates the analysis for Dataset 3, in which the LSTM values for the performance metrics are 0.85, 0.48, and 0.92, respectively, whereas the CNN performance is lower than that of the LSTM approach, with values of 0.83, 0.39, and 0.91, respectively. The accuracies of the CNN-LSTM-PSO and CNN-LSTM-sigmoid models are similar in terms of the MSE and RMSE. The Bayesian–Gaussian optimized hybrid model performs better than CNN-Bi-LSTM, with values of 1.07, 0.80, and 1.03, respectively. The PSO-based optimized CNN-Bi-LSTM model has higher accuracy than the existing baseline models, with values of 0.82, 0.33, and 0.90, respectively.

For Dataset 4 (Table 7), we observe that LSTM performs poorly alone compared to the CNN method, with values of 115.6, 8.5, and 10.7, respectively. The accuracy of the CNN method is closer to that of the neural network, with values of 16.3, 3.90, and 4.03, respectively. The CNN-LSTM optimized network with PSO performs better than the CNN-LSTM optimized with the sigmoid function, with values of 40.6, 37.7, and 39.2, respectively. The performance of the CNN-Bi-LSTM model improves significantly due to the presence of bidirectional layers with convolutional layers to process the data, with results of 3.06, 3.52, and 1.74, respectively. The performance of the same model after PSO is improved by 0.02%, with values of 2.98, 3.45, and 1.72, respectively.

Figure 7 represents the overall performance of the baseline approaches in comparison with the proposed approach for Wheat_data (Dataset 1), Agro_data (Dataset 2), Crop and livestock_FAO (Dataset 3), and Soybean_data (Dataset 4). The results show that the proposed model produces better (lower-error) results than the benchmark models for the considered datasets. Hence, hyperparameter tuning of the proposed model using particle swarm optimization increases the model accuracy, and its significance is observed in the analysis of all datasets.

Fig. 7
figure 7

Performance evaluation in terms of the a MSE, b MAE, and c RMSE

Figure 7a–c shows the performance of the baseline models in terms of the mean squared error, mean absolute error, and root mean squared error; the lowest values are obtained using PSO-CNN-Bi-LSTM in comparison with all other techniques. This confirms the superior performance of the proposed model. Hence, hyperparameter tuning in a PSO-based hybrid deep learning model can lead to significant improvements compared with other existing methods.

4.3.2 Validation of the Model

The results were validated by comparing the predicted values with the actual values of the original dataset using threefold cross-validation, represented graphically in Fig. 8. The computed results were evaluated on the yield attribute using 500–600 epochs and a validation split of 0.05, comparing the predict_test and Y_test values for the abovementioned datasets. For Dataset 1 (wheat), the actual versus predicted yield values (in tonnes/ha) were in the ranges of 3.77–3.67 tonnes/ha and 4.68–4.63 tonnes/ha.
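A sketch of this threefold validation loop is shown below, assuming the build_cnn_bilstm helper sketched in Sect. 4.2 and placeholder data shapes; the epoch count and validation split follow the values reported above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data with illustrative shapes: (samples, timesteps, features).
X = np.random.rand(300, 7, 33).astype("float32")
y = np.random.rand(300).astype("float32")

fold_rmse = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=42).split(X):
    model = build_cnn_bilstm(n_timesteps=X.shape[1], n_features=X.shape[2])
    model.fit(X[train_idx], y[train_idx], epochs=500,
              validation_split=0.05, verbose=0)
    y_pred = model.predict(X[test_idx]).ravel()
    fold_rmse.append(float(np.sqrt(np.mean((y[test_idx] - y_pred) ** 2))))

print("Mean RMSE across folds:", np.mean(fold_rmse))
```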

Fig. 8
figure 8

Validation results of the proposed model (PSO-CNN-Bi-LSTM) with the observed values for a Dataset 1, b Dataset 2, c Dataset 3, and d Dataset 4

For the wheat crop yield values in Dataset 1, differences of 0.1–1.6 were observed between the actual and predicted values, as shown in Fig. 8a. In contrast, for Dataset 2, the validated rice yield fell within a narrow range of 2734.2 to 2790.4, as shown in Fig. 8b. The accuracy of the validation results for Dataset 4, represented in Fig. 8d, was higher than that of the other publicly available datasets (Dataset 2 and Dataset 3), shown in Fig. 8b, c, respectively, owing to the availability of a larger training dataset.

4.3.3 Comparison with Existing Hybrid Models

In this section, we compare the proposed method with existing approaches for crop yield prediction, as summarized in Table 8. The proposed PSO-CNN-Bi-LSTM method achieved higher prediction accuracy and reliability than the existing approaches.

Table 8 Comparative analysis with existing studies

Based on the experimental analysis and the comparison with existing studies, we observe that the proposed PSO-CNN-Bi-LSTM performs best among all baseline models in the analysis of the four datasets. We compared the performance of our approach with the studies [6, 10, 12, 14,15,16,17,18, 21]. Figure 9 presents the comparative analysis of the proposed model with existing baseline and state-of-the-art hybrid studies in terms of the RMSE; the graph shows that the proposed method outperforms the other methods. Among these studies, we examined our results on the soybean crop dataset and found significant improvement when the network was trained with optimized hyperparameters fed to the convolutional and bidirectional hybrid model. The network was optimized using PSO to improve the efficiency of the model on Dataset 4, and a mean squared error value of 2.98 was obtained, whereas the referenced studies achieved RMSE values of 4.15–4.91 [12, 25].

Fig. 9
figure 9

Comparative performance of existing state-of-the-art models and the proposed PSO-CNN-Bi-LSTM models

5 Contribution and Significance

The experimental outcomes of this study illustrate the applicability of the PSO-CNN-Bi-LSTM model for estimating and predicting crop yield based on the spatial, temporal, meteorological, and remote sensing-derived nonstationary and nonlinear characteristics of wheat crops. We improved the effectiveness of this hybrid deep neural network by optimizing the hybrid deep learning layers with particle swarm optimization to overcome the need to manually select the optimal hyperparameters of the network. The PSO-CNN-Bi-LSTM model reduced the training time of the network by incorporating only relevant dataset characteristics and overcoming the limitations of traditional approaches [10, 16, 19, 47,48,49]. The convolutional neural network, neural network, and CNN-Bi-LSTM performed poorly compared with the proposed method, with root mean squared errors of 2.6, 3.1, and 1.48, respectively. In contrast, the PSO-optimized CNN-LSTM network result was close to that of the proposed approach, with a value of 0.57 tonnes/ha. We evaluated the yield estimates with publicly available large datasets to ensure their applicability, improving the training time and enhancing the results, with an RMSE value of 1.72 [12]. The results were evaluated from district- to state-level yield estimates for wheat crops. We also investigated the vegetation cover using remote sensing-derived MOD13A2-based vegetation indices during the wheat crop-growing season in the major wheat-producing states of India and found that the NDVI peaked from mid-March to mid-April, with values in the range of 0.4–0.6, whereas the average VCI value ranged from 40.8 to 58.6.

The comparison between the actual and predicted yields demonstrated that the proposed approach could estimate the yield in the wheat dataset closely, for example, a predicted 3.67 tonnes/ha against an observed 3.77 tonnes/ha. In contrast, the Dataset 4 results exhibited the smallest difference between the observed and predicted values owing to the large amount of available data. The proposed model overcomes the overfitting issue of a hybrid deep neural network by including a nature-inspired optimization approach that fits the function appropriately to the training data, thereby reducing the training time required by the network and avoiding the trial-and-error process otherwise needed to find the optimal set of hyperparameters.

The primary technical challenge of the proposed work is the data-intensive nature of deep learning methods, which cannot be utilized well for problems with data scarcity. The accuracy of the proposed model can be further enhanced by incorporating daily sensor-captured crop characteristics to provide more training samples to a comprehensive prediction model. In our study, we investigated wheat crops only; therefore, to generalize the applicability of the proposed model, this study can be extended to time series data from diverse seasons to evaluate crop growth and forecasting. In the future, advanced analytical tools can be considered to address the black-box nature of the hybrid deep model and build an inherently interpretable framework.

6 Conclusion

In this study, we performed yield prediction using an optimized hybrid deep learning model named PSO-CNN-Bi-LSTM and compared the proposed model with existing baseline models on our wheat dataset and three publicly available datasets. The proposed approach utilizes the advantages of convolutional and Bi-LSTM layers to capture the relevant features and to obtain long-term dependencies bidirectionally for yield prediction. Subsequently, the proposed model was validated by comparing the predicted yield with the observed yield. Finally, the performance of the proposed model was evaluated using the MSE, MAE, and RMSE, and the resulting values were lower than those of the baseline models for all four datasets. The evaluation metric results of the PSO-CNN-Bi-LSTM model for Dataset 1, Dataset 2, Dataset 3, and Dataset 4 demonstrate the good performance of the model, with values of 0.18, 0.39, and 0.42; 0.58, 0.55, and 0.76; 0.82, 0.33, and 0.90; and 2.98, 3.45, and 1.72, respectively.

The results show that the proposed model performed better than the existing models on the training and testing data. Despite our model’s promising results, the proposed work has some limitations. The first limitation concerns the total number of entries in Dataset 1, which can be increased in the near future by including data from a larger number of states rather than only the major wheat-producing states; introducing more training data can further enhance the efficiency of the model. The second limitation is the training time required by the proposed model for Dataset 3, which can be reduced by selecting more appropriate features from the dataset for the experiment. In the future, a more robust approach can be developed to overcome these limitations and deliver better performance in terms of the evaluation metrics.