Introduction

Forest biomass is a basic measure for evaluating forest ecosystems and an essential variable for quantifying ecosystem structure and function (Paulo et al., 2012; Rodríguez-Veiga et al., 2019). Because forests are an important part of the carbon cycle, effective monitoring of forest biomass helps us understand the interactions between the biosphere and the atmosphere (Pang et al., 2017; Rödig et al., 2017; Zhang et al., 2019). Deciduous broad-leaved forest is one of the most widely distributed forest vegetation types in the world, and it plays an important role in regulating climate and conserving water and soil (Souza & Longhi, 2019). As the climate changes, deciduous broad-leaved forests are facing unprecedented threats (Laurin et al., 2020; Pope et al., 2020). For example, a recent study used free satellite data on the Google Earth Engine (GEE) platform to examine the effects of climate change on rangelands and broad-leaved forests (Orusa & Mondino, 2021). Estimating deciduous broad-leaved forest biomass by remote sensing therefore plays an important role in the study of forest ecosystems and their contribution to the global carbon cycle.

Traditional biomass calculation methods, such as the clear-cutting method (Liu et al., 2020) and the standard wood method (Jiang et al., 2017), involve a heavy workload and high costs. The regression method is also commonly used (Li et al., 2012; Zaki et al., 2018; Zhang et al., 2020). However, these methods can hardly meet the requirements of estimating forest biomass at large scales (Han et al., 2019; Koju et al., 2019; Rödig et al., 2017; Wan et al., 2018). Remote sensing offers a wide detection range and a short update time, so combining remote sensing data with a small set of ground survey samples has become a useful approach to estimating forest biomass at large scales (Gwenzi et al., 2017; Kankare et al., 2013). In temperate and subtropical regions, deciduous forest is the most typical forest type, and the study of deciduous forest biomass change has important implications for climate change research (Ghosh & Behera, 2018; Landuyt et al., 2020; Raha et al., 2020). In terms of research data, the biomass of deciduous forest has been estimated mainly from optical remote sensing data and lidar data (Joshi & Dhyani, 2019; Kristen et al., 2018; Wang et al., 2020; Nandy et al., 2017; Senger et al., 2020). Environmental variables (e.g., rainfall, humidity and soil) can affect the horizontal distribution of species biomass (Fu et al., 2019). Additionally, some forest parameters, such as stand age, leaf area index and canopy closure, can improve the accuracy of biomass estimation (Li et al., 2020a, 2016; Mutanga et al., 2012; Nguyen et al., 2018; Yang et al., 2018). Compared with linear regression models, machine learning can improve model accuracy when the biomass is more than 120 Mg·ha−1 (Gao et al., 2018).
Most studies on aboveground biomass focus on coniferous forests, coniferous and broad-leaved mixed forests, and evergreen broad-leaved forests (Dai et al., 2016; Dimitrov & Roumenina, 2013; Hu et al., 2016; Luo et al., 2021; Nie et al., 2017; Shen et al., 2018; Stovall et al., 2017). However, there is limited research on combining optical remote sensing information with machine learning to estimate the biomass of natural deciduous broad-leaved forests.

This study develops quantitative models for biomass in the natural deciduous broad-leaved forest of Mazongling Nature Reserve in China. Vegetation indices and texture information were extracted from WorldView-2 remote sensing data. Terrain factors were extracted from a DEM (Digital Elevation Model), and ground-measured data were collected in the field. An optimal remote sensing quantitative inversion model for biomass was then constructed using machine learning algorithms. With this model, the study estimated the forest biomass and analyzed its distribution. The results provide a scientific reference for the protection and utilization of forest resources in Mazongling Nature Reserve.

Materials and Methods

Overview of the Study Area

Mazongling Nature Reserve is located in the southwest of **zhai County, Anhui Province, China (115°31′-115°50′E, 31°10′-31°20′N; Fig. 1). It is part of Anhui Tianma National Nature Reserve and has a total area of 4640.85 ha. The reserve lies in the north subtropical humid monsoon climate zone, and it protects north subtropical evergreen-deciduous broad-leaved mixed forest as well as rare wild animals and plants. Tree species occurring in the reserve include Cunninghamia lanceolata (Lamb.) Hook., Pinus taiwanensis Hayata, Quercus serrata var. brevipetiolata (A.DC.) Nakai, Castanea seguinii Dode, and Cyclobalanopsis glauca (Thunb.) Oerst.; shrubs include Loropetalum chinense (R. Br.) Oliv., Rhododendron simsii Planch., and Rhus chinensis Mill. The highest elevation in the reserve is 1671 m, valleys crisscross the terrain, and the natural vegetation is lush. The annual average temperature is 13.3 °C, and the average summer temperature is 20 °C. Annual sunshine totals 2225.5 h. Rainfall is abundant, with an annual total of 1480 mm.

Fig. 1 Location of the study area

Research Data

Sample Plot Data

The sampling survey was conducted from July 23 to 31, 2019. To comprehensively investigate the forest resources in the study area, stratified and typical sampling methods were used to establish 35 deciduous broad-leaved forest plots of different ages and site conditions. Each sample plot measured 20 m × 20 m. All living trees in the plots with a diameter at breast height (DBH) greater than 5 cm were measured, and tree heights were measured using a laser range finder. Differential GPS (DGPS) was used to determine the plot locations. The dominant species in the study area were hardwood tree species, so the forest biomass of the sample plots was calculated using the general method for hardwood biomass proposed by Li and Lei (2010). Based on the 6th and 7th Chinese National Forest Inventory data, Li and Lei proposed a calculation model for hardwood tree species after comparing three estimation methods (i.e., the Intergovernmental Panel on Climate Change method, the Continuous Biomass Expansion Factor method, and the Empirical (Regression) Model Estimation method). The model has been widely used in China because of its high accuracy and good applicability. Its specific formula is

$$W = 0.044\left( {D^{2} H} \right)^{0.9169} + 0.023\left( {D^{2} H} \right)^{0.7115} + 0.0104\left( {D^{2} H} \right)^{0.9994} + 0.0188\left( {D^{2} H} \right)^{0.8024} ,$$
(1)

where W (Mg·ha−1) is the forest biomass, D (cm) is the diameter at breast height, and H (m) is the tree height. The biomass estimated via Eq. (1) and the locations of the 35 sample plots (Fig. 1, Table 1) were used to establish a forest biomass model by machine learning.
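The arithmetic of Eq. (1) can be sketched as follows. This is a minimal illustration, not the authors' code: the coefficients are copied from Eq. (1), while the function names, the reading of Eq. (1) as a per-tree value in kg, and the scaling of a 20 m × 20 m plot to Mg·ha−1 are assumptions made for the example.

```python
def tree_biomass(dbh_cm: float, height_m: float) -> float:
    """Evaluate Eq. (1): the sum of four power-law terms in D^2 * H."""
    d2h = dbh_cm ** 2 * height_m
    # (coefficient, exponent) pairs copied from Eq. (1)
    terms = [
        (0.044, 0.9169),
        (0.023, 0.7115),
        (0.0104, 0.9994),
        (0.0188, 0.8024),
    ]
    return sum(a * d2h ** b for a, b in terms)


def plot_biomass_mg_per_ha(trees, plot_area_m2: float = 400.0) -> float:
    """Sum Eq. (1) over all trees and scale to Mg per hectare.

    Assumption: Eq. (1) yields kg per tree, so the plot total is divided
    by 1000 (kg -> Mg) and multiplied by 10000 / plot area (m^2 -> ha).
    """
    total_kg = sum(tree_biomass(d, h) for d, h in trees)
    return total_kg / 1000.0 * (10000.0 / plot_area_m2)


if __name__ == "__main__":
    # Hypothetical trees as (DBH in cm, height in m); not data from the study.
    example_plot = [(12.0, 9.5), (18.5, 13.0), (25.0, 16.2)]
    print(plot_biomass_mg_per_ha(example_plot))
```

If Eq. (1) is instead applied in other units, only the conversion in `plot_biomass_mg_per_ha` would need to change.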

Table 1 Biomass statistics of sample plots

Remote Sensing Data

WorldView-2 satellite images acquired on June 23, 2019 were used as the remote sensing data. The spatial resolutions of the panchromatic and multispectral images were 0.46 m and 1.85 m, respectively. Their band information is shown in Table 2. Radiometric calibration was conducted in ENVI5.3 to obtain radiance data. The MODTRAN4+ radiative transfer model was used for atmospheric correction of the radiance data to obtain reflectance data. The Gram-Schmidt transform was used to fuse the panchromatic and multispectral images into high-resolution true-color images. Geometric correction of the remote sensing data was based on 1:10,000 topographic maps, and the RMSE was kept within 1 pixel.

Table 2 Multispectral information for WorldView-2 remote sensing

Remote Sensing Classification of Forest Types

According to the size of the sample plots, the characteristics of forest resources in the study area, and field investigation results, forest resources were categorized into four types: deciduous broad-leaved forest, coniferous forest, coniferous and broad-leaved mixed forest, and non-forest land. After the WorldView-2 data were preprocessed, the random forest (RF), maximum likelihood, and Mahalanobis distance methods were applied in ENVI5.3 to classify forest types. Verification data and the Kappa coefficient were used to test the classification accuracy. After classification, majority/minority analysis was applied to reassign fragmented patches in the original classification results to the background category.
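For readers unfamiliar with the Kappa coefficient, the sketch below computes Cohen's kappa from a confusion matrix of verification samples. It is a generic illustration in Python rather than the ENVI workflow used in the study; the function name and the example matrix are hypothetical.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are reference classes and columns are predicted classes.
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the marginals.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(confusion[i]) for i in range(k)]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical 4-class matrix (four forest types); not data from the study.
    matrix = [
        [48, 1, 1, 0],
        [2, 45, 3, 0],
        [1, 4, 44, 1],
        [0, 0, 1, 49],
    ]
    print(round(cohen_kappa(matrix), 3))
```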

Feature Selection

The center point of each sample plot was taken as the center pixel, and the average pixel value within a 20 × 20 window served as the remote sensing feature. Vegetation indices and gray-level co-occurrence matrix (GLCM) texture information were extracted. After different sizes were compared, the GLCM window was set to 9 × 9, using the default 0° direction and a statistical interval of one pixel. Terrain factors, such as slope, aspect and elevation, were extracted from Digital Elevation Model data at a resolution of 12.5 m using the ArcGIS10.2 platform. Thirty-six candidate factors were selected: NDVI, RVI, EVI, DVI, SAVI, MSAVI, B532_entropy, B3_entropy, B4_entropy, B5_entropy, B532_secondary moment, B3_secondary moment, B4_secondary moment, B5_secondary moment, B532_dissimilarity, B3_dissimilarity, B4_dissimilarity, B5_dissimilarity, B532_mean, B3_mean, B4_mean, B5_mean, B532_homogeneity, B3_homogeneity, B4_homogeneity, B5_homogeneity, B532_correlation, B5_correlation, B532_contrast, B3_contrast, B5_contrast, B532_variance, B3_variance, B4_variance, B5_variance, and Slope. The types and detailed descriptions of the modeling factors are shown in Table 3.
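Several of the vegetation indices listed above have widely used textbook definitions, sketched below for a single pair of near-infrared and red reflectance values. The exact formulas and band assignments used in the study are given in Table 3, which is not reproduced here, so the definitions, the function name, and the soil adjustment factor of 0.5 for SAVI are assumptions for illustration only.

```python
def vegetation_indices(nir: float, red: float, soil_factor: float = 0.5) -> dict:
    """Common vegetation indices from NIR and red surface reflectance.

    Assumed textbook definitions:
      NDVI = (NIR - R) / (NIR + R)
      RVI  = NIR / R
      DVI  = NIR - R
      SAVI = (1 + L) * (NIR - R) / (NIR + R + L), with L = soil_factor
    """
    return {
        "NDVI": (nir - red) / (nir + red),
        "RVI": nir / red,
        "DVI": nir - red,
        "SAVI": (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor),
    }


if __name__ == "__main__":
    # Hypothetical reflectance values for a densely vegetated pixel.
    for name, value in vegetation_indices(nir=0.45, red=0.05).items():
        print(name, round(value, 3))
```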

Table 3 Biomass modelling factors of natural deciduous broad-leaved forest in Mazongling Nature Reserve

Model Variable Selection

The Boruta and Recursive Feature Elimination (RFE) algorithms in R were used to select the variable sets related to the dependent variable. The Boruta algorithm builds on the random forest classifier. It adds randomness to the system and collects results from an ensemble of randomized samples to assess the importance of each feature. This iterative process can reduce the misleading impact of random fluctuations and correlations (Amiri et al., 2019). The RFE algorithm first trains a model on the training set using all predictors. It then calculates the importance of each variable and ranks the variables to find an optimal variable set. RFE seeks to improve generalization performance by removing the least important features, whose deletion has the least effect on training error (Hayet et al., 2020). Because the variables retained by the Boruta algorithm can be highly correlated, we removed highly correlated variables using the Pearson correlation coefficient. We set the threshold to 0.9, so that the absolute value of the correlation coefficient between any two predictors was below 0.9. This relatively high threshold limits the number of predictors discarded because of collinearity. Finally, B3_mean, B3_secondary moment, B3_variance, B4_secondary moment, B5_mean, Slope, and NDVI were selected as predictors.
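The correlation filter described above can be sketched as a simple greedy pass. The study performed this step in R and does not state which member of a correlated pair was dropped, so the rule used here (keep the earlier feature in the given order) and all names and values are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features: dict, threshold: float = 0.9) -> list:
    """Keep a feature only if |r| with every already kept feature is below threshold.

    Assumption: `features` is ordered from most to least important, so the
    more important member of a highly correlated pair is the one retained.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical feature values for four plots; not data from the study.
    candidates = {
        "NDVI": [0.61, 0.72, 0.80, 0.85],
        "RVI": [4.1, 6.1, 9.0, 12.3],  # strongly correlated with NDVI
        "Slope": [25.0, 8.0, 17.0, 12.0],
    }
    print(drop_correlated(candidates))
```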

Machine Learning Algorithm

We used the k-NN, ANN, and RF machine learning algorithms in RStudio to construct forest biomass models.

k-Nearest Neighbour (k-NN) Method

The k-NN algorithm is a typical non-parametric algorithm that estimates biomass from the observations of neighboring sample points (Hoef & Temesgen, 2013). For a target object, k-NN finds the k training samples nearest to it in the space of predictor variables and predicts its value as the average of the response variables of these k nearest neighbors (McRoberts et al., 2016). The Euclidean distance \(d_{{(x_{a} ,x_{b} )}}\), the straight-line distance between two observations, is a common distance measure for constructing a k-NN forest biomass model. It is defined in Eq. (2).

$$d_{{\left( {x_{a} ,x_{b} } \right)}} = \sqrt {\sum\limits_{i = 1}^{P} {\left( {x_{ai} - x_{bi} } \right)}^{2} } ,$$
(2)

where \(x_{a}\) and \(x_{b}\) are two sample points, and \(P\) is the dimension of each sample.

The k-NN method is flexible and transparent, and it has strong generalization ability. However, when there are many features, a large number of feature combinations is generated, which reduces prediction efficiency and model accuracy. In addition, the hyperparameter ‘k’, the number of nearest neighbors used for a prediction, must be set when modelling in R. If k is too small, the model is too sensitive to the training data and its stability is poor. If k is too large, the average is taken over too wide a neighborhood, and the prediction error is large (Kumar et al., 2021). In practice, k ranges from 3 to 10.
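A minimal sketch of k-NN regression with the Euclidean distance is given below. It is written in plain Python for illustration and is not the R implementation used in the study; the function names and sample values are hypothetical. Because the Euclidean distance is scale dependent, predictors would normally be standardized before this step.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two predictor vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_x, train_y, query, k=5):
    """Predict the response at `query` as the mean of its k nearest neighbors."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


if __name__ == "__main__":
    # Hypothetical standardized predictors and plot biomass values (Mg/ha).
    xs = [[0.1, 0.3], [0.2, 0.1], [0.8, 0.9], [0.7, 0.6], [0.4, 0.5]]
    ys = [55.0, 48.0, 140.0, 120.0, 90.0]
    print(knn_predict(xs, ys, [0.5, 0.5], k=3))
```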

Artificial Neural-Network (ANN) Method

The ANN used here is a multi-layer feed-forward neural network with forward propagation of information and backward propagation of error (Fig. 2). First, information is processed layer by layer from the input layer through the hidden layer to the output layer, and the outputs are compared with the expected outputs. Backward propagation is performed when the error between them is greater than a predetermined value: the internal weights and thresholds of the network are adjusted according to the prediction error, and the network returns to forward propagation. This process is repeated until the error falls to the predetermined value, so that the outputs are close enough to the expected values (Dong et al., 2020; Mao et al., 2019).

Fig. 2 Artificial neural network

The ‘decay’ and ‘size’ parameters are required when using the ‘nnet’ package in R to build an ANN model. The ‘decay’ parameter is a penalty on the sum of squares of the weights. It can both help the optimization process and avoid over-fitting (Raji et al., 2020). ‘Decay’ was set to 0.001, 0.01, and 0.1 to reduce the possibility of over-training. ‘Size’ is defined as

$${\text{size}} = \sqrt {P + O} + m,$$
(3)

where ‘size’ is the number of hidden units, P is the number of nodes in the input layer, O is the number of nodes in the output layer, and m is an integer constant between 0 and 10.
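As a worked illustration of Eq. (3): with the seven selected predictors as inputs (P = 7) and a single biomass output (O = 1), √(P + O) ≈ 2.83. If this value is rounded down (an assumption, since the text does not state the rounding rule), adding m = 0 to 10 gives hidden-layer sizes from 2 to 12, which matches the range of sizes compared in the Results. The helper name below is hypothetical.

```python
import math


def candidate_hidden_sizes(n_inputs: int, n_outputs: int, m_values=range(0, 11)):
    """Candidate numbers of hidden units from Eq. (3): size = sqrt(P + O) + m.

    Assumption: the square root is rounded down to an integer before m is added.
    """
    base = math.floor(math.sqrt(n_inputs + n_outputs))
    return [base + m for m in m_values]


if __name__ == "__main__":
    # Seven predictors and one output, as in this study.
    print(candidate_hidden_sizes(7, 1))
```

Combined with the three decay values, this yields the grid of size and decay settings that would be compared during tuning.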

Random Forest (RF) Method

RF is an ensemble method that contains multiple decision trees and uses them to make repeated predictions for the same inputs (Dong et al., 2020). Several rounds of bootstrap sampling produce multiple random samples, and a decision tree is built on each of them. In this way, a random forest is formed.

The RF regression was carried out with the ‘randomForest’ package in R. Two key parameters are involved: ntree and mtry. ‘Ntree’ is the number of decision trees, which is also the number of bootstrap re-samples. ‘Mtry’ is the number of input variables randomly sampled as candidates at each split; for regression, it defaults to one-third of the number of input variables. However, ‘mtry’ needs to be tuned to achieve an optimal value (Tavares Júnior et al., 2020).

Model Accuracy Assessment

Model accuracy was verified using leave-one-out cross-validation. For a data set of N samples, each sample in turn is taken as the test set, and the remaining N−1 samples are used as the training set. This procedure is repeated N times, yielding N models, and the average of the N results is taken as the final performance index. This method uses almost all the samples to train each model, so the evaluation results are more reliable. It also involves no random splitting, so the entire process is repeatable (Wolfrum et al., 2020). The coefficient of determination (R2; Eq. (4)) and root mean square error (RMSE; Eq. (5)) were used to evaluate the models. Generally, greater R2 and lower RMSE indicate a better model fit.

$$R^{2} = \left( {\frac{{\mathop \sum \nolimits_{i = 1}^{N} (x_{i} - \overline{x})\left( {y_{i} - \overline{y}} \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{N} (x_{i} - \overline{x})^{2} \mathop \sum \nolimits_{i = 1}^{N} (y_{i} - \overline{y})^{2} } }}} \right)^{2} ,$$
(4)
$$RMSE = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{N} (y_{i} - x_{i} )^{2} }}{N}} ,$$
(5)

where \(x_{i}\) is the measured value of the i-th sample plot, \(y_{i}\) is the model-estimated value of the i-th sample plot, \(N\) is the number of sample plots, \(\overline{x}\) is the average of the measured values, and \(\overline{y}\) is the average of the estimated values.
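The validation procedure can be sketched as below. This is a minimal illustration rather than the authors' R code: R2 is computed as the squared Pearson correlation between measured and estimated values, RMSE follows Eq. (5), and the `fit_predict` callback and the mean-value model are hypothetical stand-ins for the k-NN, ANN, or RF models.

```python
import math


def r_squared(measured, estimated):
    """Squared Pearson correlation between measured and estimated values."""
    n = len(measured)
    mx, my = sum(measured) / n, sum(estimated) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(measured, estimated))
    sxx = sum((x - mx) ** 2 for x in measured)
    syy = sum((y - my) ** 2 for y in estimated)
    return sxy ** 2 / (sxx * syy)


def rmse(measured, estimated):
    """Root mean square error, Eq. (5)."""
    n = len(measured)
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(measured, estimated)) / n)


def leave_one_out(xs, ys, fit_predict):
    """Predict each sample from a model trained on the other N - 1 samples.

    fit_predict(train_x, train_y, query) must return one prediction.
    """
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predictions.append(fit_predict(train_x, train_y, xs[i]))
    return predictions


def mean_model(train_x, train_y, query):
    """Placeholder model: always predicts the mean of the training responses."""
    return sum(train_y) / len(train_y)


if __name__ == "__main__":
    # Hypothetical predictors and plot biomass values; not data from the study.
    xs = [[0.2], [0.4], [0.5], [0.7], [0.9]]
    ys = [50.0, 75.0, 90.0, 120.0, 150.0]
    preds = leave_one_out(xs, ys, mean_model)
    print(rmse(ys, preds))
```

Swapping `mean_model` for any function with the same signature evaluates that model under the same leave-one-out scheme.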

Results

Forest Type Classification in Mazongling Nature Reserve

The Kappa coefficients for the remote sensing classification of forest types using RF, maximum likelihood, and Mahalanobis distance methods were 0.97, 0.92, and 0.80, respectively. We selected the RF method with the greatest Kappa coefficient to classify the forest types (Fig. 3). Deciduous broad-leaved forest covered 2275.97 ha (49.04%), coniferous forest covered 1163.71 ha (25.08%), coniferous and broad-leaved mixed forest covered 735.38 ha (15.85%), and non-forest covered 465.78 ha (10.04%) of the total study area. Among these four different forest types, deciduous broad-leaved forest was primarily distributed in the Lingtou zone.

Fig. 3 Classification of forest types using the random forest method

Construction of Remote Sensing Quantitative Model of Forest Biomass

The greatest R2 and the smallest RMSE from three models were determined by using leave-one-out cross-validation. The results are shown in Table 4 and Fig. 4.

  1. For the RF model, the maximum RMSE was 36.83 Mg·ha−1 when mtry was set to 1, and the minimum RMSE was 32.27 Mg·ha−1 when mtry was set to 7. The model precision was highest when mtry was 7, with R2 and RMSE of 0.68 and 31.85 Mg·ha−1, respectively.

  2. For the k-NN model, the maximum RMSE was 46.11 Mg·ha−1 when k was 9, and the minimum RMSE was 40.74 Mg·ha−1 when k was 5. RMSE gradually increased as k increased, and the model was most accurate when k was 5, with R2 and RMSE of 0.48 and 40.74 Mg·ha−1, respectively.

  3. For the ANN model, three values of decay (0.001, 0.01, and 0.1) and hidden layers with 2 to 12 hidden units were compared. The model was most accurate when decay = 0.1 and size = 2, with R2 and RMSE of 0.69 and 31.53 Mg·ha−1, respectively.

Table 4 Comparison of results between three machine learning models in this study and other studies

Fig. 4 Root mean square error

Therefore, the most accurate ANN model was selected to construct the remote sensing quantitative estimation model of natural deciduous broad-leaved forest biomass in Mazongling Nature Reserve.

Spatial Distribution of Deciduous Broad-Leaved Forest Biomass in Mazongling Nature Reserve

The leave-one-out cross-validation results for the optimal regression model are shown in Fig. 5. The ANN model gave the most accurate prediction (R2 = 0.69, RMSE = 31.53 Mg·ha−1). Therefore, this optimal ANN model was applied to the WorldView-2 images to estimate the aboveground biomass (AGB) of natural deciduous broad-leaved forest in Mazongling Nature Reserve (Fig. 6). The estimated biomass was 90.34 ± 47.96 Mg·ha−1. The AGB of natural deciduous broad-leaved forest in Mazongling Nature Reserve was primarily distributed in the Lingtou and Heshang** zones, followed by the Dacao** and Dongshan zones. The lowest AGB (48 Mg·ha−1) was located in the Qian** Village zone.

Fig. 5 Scatter diagram of correlation between model predicted values and measured values of forest biomass in Mazongling

Fig. 6 Spatial distribution of broad-leaved deciduous forest biomass in Mazongling Nature Reserve

Discussion

Because of the complex vegetation and numerous tree species in the sample plots, we did not use the standard wood method. Instead, the biomass in the sample plots was calculated using the general method for hardwood biomass proposed by Li and Lei on the basis of the 6th and 7th Chinese National Forest Inventory data (Fu et al., 2022; Huang et al., 2022; Ju et al., 2022). The biomass of the different tree organs is calculated mainly from two parameters, tree height and DBH, and the R2 of the height curves of other hard broad-leaved trees reaches 0.95. Future research should increase the number of sample plots so that all age classes are covered. Model accuracy could also be improved by using a biomass model developed for the same zone, family, or genus.

The WorldView-2 remote sensing image in this study was acquired in June 2019, when vegetation in the study area was in the growing season and relatively lush. Because different objects can share the same spectrum and the same object can show different spectra in the image (Ashutosh & Roy, 2021), the classification contains omissions and errors even though its Kappa coefficient is very high. Examples are the separation of "coniferous forest" from "coniferous and broad-leaved mixed forest" and of "broad-leaved forest" from "coniferous and broad-leaved mixed forest". The WorldView-2 images used in this study had a small amount of cloud cover, which slightly affected the classification of forest types and the inversion of forest biomass. However, clouds accounted for only 4.8% of the study area, which met the cloud content requirement (< 10%) for analyzing remote sensing images. Thus, the WorldView-2 images were not de-clouded, so as to avoid the loss of detail caused by de-clouding. Regional image replacement can reduce the influence of cloud cover and improve image utilization. In addition, this study was based on only one year of remote sensing data (i.e., 2019). Therefore, further research on the spatial changes of forest biomass is necessary to improve the accuracy of model estimation.

The predictors selected in this study were able to support a remote sensing quantitative model of deciduous broad-leaved forest biomass in Mazongling Nature Reserve. However, collinearity among the predictors was weak, and the linear correlation between forest biomass and the factors was not high. Additionally, biomass was positively correlated with some predictors and negatively correlated with others. Therefore, a linear model was not suitable for capturing the relationship between biomass and the remote sensing and geographic factors. An ANN model, with its strong nonlinear fitting ability, was more suitable for describing this relationship.

The results showed that the accuracy of the ANN model was the highest, with R2 = 0.69. This is lower than that of the multiple linear regression biomass model (Wei, 2019). Multiple linear regression models need to be compared with machine learning models to fully study the differences between them and to provide a sufficient basis for selecting a more accurate inversion model. Therefore, this study reviewed many references on machine learning algorithms for estimating forest biomass, especially for broad-leaved forest. The comparison showed that R2 and RMSE differed considerably among studies. On the one hand, biomass from normal growth differs among forest types because of different site conditions (soil, climate, terrain, etc.). On the other hand, differences in the candidate modeling factors also play an important role in model construction. In addition, the ANN model was less accurate than the lidar-based model of Montagnoli et al. (2015) in the Alps. This could be due to light saturation in the WorldView-2 remote sensing images. The vegetation density of the deciduous broad-leaved forest in Mazongling Nature Reserve was so high that the electromagnetic radiation received by the sensor could no longer reflect changes in biomass. This saturation led to inaccurate estimates for areas with high biomass. As a result, the vegetation index and texture factor data fluctuated only slightly in some areas, which affected model accuracy and biomass inversion. Therefore, further research is needed to determine the saturation point of remote sensing and to improve the accuracy of remote sensing estimation of forest biomass. This study mainly focuses on the biomass modeling of deciduous broad-leaved forest.
Remote sensing inversion models of biomass for Pinus forest, Taxodium forest, coniferous and broad-leaved mixed forest, and mixed forest should be constructed separately, which would help discuss and compare the consistency and differences between the mixed inversion model and the single forest type biomass models (Raj & Jhariya, 2021; Wang et al.,

Acknowledgements

The authors are thankful to Taijun Guang and Hongchao Li for surveying and data processing in the study.

Funding

This research was funded by Anhui Provincial Natural Science Foundation (Grant NO.1808085QC74), Anhui Dabie Mountains Forest Ecosystem Research Station (Grant NO.2020132041), and Graduate Innovation Fund of Anhui Agricultural University (Grant NO.2020ysj-18).

Author information

Contributions

Conceptualization, Q.H., X.T. and D.Y.; Formal analysis, X.T., Q.H. and D.Y.; Funding acquisition, X.T. and Q.H.; Methodology, X.T. and Q.H.; Investigation, X.T., Q.O., M.X., P.F. and D.Y.; Software, D.Y., X.T. and Q.O.; Supervision, Q.H. and X.T.; Writing—original draft preparation, X.T. and H.L.; Writing—review and editing, Q.H. and X.T. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Qingfeng Huang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

About this article

Cite this article

Tang, X., Yu, D., Lv, H. et al. Construction of Remote Sensing Quantitative Model for Biomass of Deciduous Broad-Leaved Forest in Mazongling Nature Reserve Based on Machine Learning. J Indian Soc Remote Sens (2024). https://doi.org/10.1007/s12524-024-01901-6
