Introduction

Coffee is one of the most popular and extensively consumed beverages in the world, and it is distinguished by its flavor and aroma. In terms of finance, it is the second most important agricultural product after petroleum [23]. Global coffee consumption has risen steadily over the previous two decades, owing to the increasing number of coffee shops and coffee-based products and beverage formulations [14]. Coffee has little nutritional value and is solely drunk for its stimulant impact, i.e., the presence of caffeine, as well as its distinct aroma and flavor (Grosch et al., 2001).

The flavor is a very important property in determining the sensory quality of food products, hence assessing it is an important responsibility in the food industry. This mix is determined by aspects like species, growth and harvesting conditions, storage, intensity, and roaster type [15]. Coffee is a very important beverage because it is one of the most extensively consumed beverages in the world. As a result, it is a profitable commodity that has resulted in a billion-dollar corporation. The availability of analysis and control tools for the quality of coffee, starting with the beans, is critical for those involved in trading this product.

Many methods have been developed for the non-destructive evaluation of agricultural products. These methods have been developed based on the principles and qualitative characteristics, but only some of them have been able to be technically and industrially justified [5].

Optical methods, which use high-speed light detection and computer data processing to evaluate the quality and classification of items with remarkable accuracy, are one of the most essential non-destructive approaches. Near-visible infrared spectroscopy is one of the most precise optical technologies, and it is based on the placement of light sources, samples, and detectors in three different confrontational, transitory, and reflecting modes.

Chemical bonds have their frequency in infrared spectroscopy, and each of these frequencies may be identified according to the kinetic energy levels of these bonds, allowing the identification of functional groups in molecules [22]. The near-infrared area employs light with wavelengths ranging from 750 to 2500 nm to provide spectral information known as composite bands and overtones. Composite bands are linear combinations of normal frequencies or their integer multiples, and overtones are frequency values that correspond to the integer multiple of normal vibrations. (Pasquini et al., 2003).

There has been a lot of research done so far on the use of the NIR spectroscopic approach in the qualitative evaluation of agricultural goods all around the world (Nicola et al., 2007). Some researchers integrate the complete NIR spectrum (800–2500 nm) with the VIS band (380–800 nm), while others combine the SWNIR with the VIS spectrum. SWNIR is favored by most researchers due to a significant reduction in overall expenditures and, on the other hand, an increase in infrared spectrum penetration.

The results of the high-quality calibration models proposed in Esteban-Diez et al. (2004) study, which are comparable in terms of accuracy to the evaluations provided by a trained sensory panel, are promising and demonstrate the feasibility of using a similar methodology in on-line or routine applications to predict the sensory quality of unknown espresso coffee samples via their respective NIR roasted coffee spectra. Alessandrini et al. [2] investigated the use of NIR spectroscopy to predict coffee roast degree. The proposed calibration models produced results that were comparable to conventional analyses in terms of accuracy, demonstrating the promising feasibility of NIR methodology for on-line or routine applications to predict and/or control coffee roasting degree via NIR spectra. Correia et al. [7] conducted a study on the quality control of Brazilian coffee using portable near-infrared spectroscopy. The results demonstrated that this method is effective at predicting and detecting adulteration with a low level of quantification (LOQs of 5–8 wt %) and may also be used to commercially control the quality of coffee samples. Because of its high accuracy, infrared spectroscopy has become increasingly practical.

In this research, the spectroscopy absorbance method was used to evaluate the quality of coffee species using statistical analysis and modeling methods. These methods include principal component analysis (PCA), partial least squares (PLS), linear discriminant analysis (LDA), and artificial neural networks (ANN). Given that brewed Arabica coffee can be distinguished from brewed Robusta coffee by its high acidity and fruity aroma [9] and by the amount of caffeine it contains (Dobrzanski, 2009), the results of gas chromatography–mass spectrometry for both variables were taken into consideration in this study.

Materials and methods

Samples

Two species of coffee beans—Arabica and Robusta—were prepared to make samples. Three different classes of the Arabica and two classes of the Robusta species were chosen as various samples. Colombian variety of Arabica coffee beans was served to the first class (AC1). Also, the Kenyan "AA” variety of Arabica coffee beans (AK2) was chosen for the second class, and the Brazilian variety Arabica "Bracosta" coffee beans were chosen for the third class (AB3). Indian “Cherry” variety of Robusta coffee beans (Rch1) made up the fourth class, and the Vietnamese "Bach" variety of Robusta was chosen for the fifth (RV2). A combination of the first five classes was used for the sixth and seventh classes, for example, the sixth class used classes 1 and 5 (MA1), and the seventh class used classes 2, 3, and 4 (MA2) (MR1). Each sample was 50 g in weight, therefor we had 7 classes and 3 samples from each one of classes. Total of 147 samples were prepared. Every class of coffee beans roast level was medium. After the samples were prepared, data were collected using a spectroradiometer, and then destructive data were collected using GC–MS.

Gas chromatography–mass spectrometry (GC–MS)

In order to isolate and identify the composition of coffee compounds, gas chromatography-mass spectrometry (GC–MS) was used (Agilent 7890 B series GC, Agilent 5977A Series MSD, USA). At the time of injection into GC–MS, one microliter of espresso coffee was diluted in 2 ml of dimethyl ketone. The prepared samples (Fig. 1) were first injected into a gas chromatographic apparatus and the most suitable column thermal programming was obtained for the complete separation of the espresso soluble compounds.

Fig. 1
figure 1

a Preparing samples b Labeled and prepared samples from each class

The constituents of each sample and the retention time of each compound were also obtained using GC–MS. Finally, by using these two indicators and examining the proposed mass spectra of GC–MS (Wiley 275-L) computer libraries and comparing these parameters with standard compounds, the type and amount of compounds in the samples were identified [1]. Figure 2 shows the GC–MS result obtained for one sample. It includes 43 peaks which are different compounds.

Fig. 2
figure 2

GC–MS results for one of samples

The presence of a particular chemical compound is indicated by each peak point, so it is necessary to check each peak point in order to extract the desired data. The most frequent peaks in all samples, including caffeine and palmitic acid, were chosen for this research's destructive measurement data (calibration) because points with larger peaks are thought to indicate the presence of more of these compounds in the samples. Figure 3 displays the specifics of a peak in data obtained using a GC–MS device for one sample.

Fig. 3
figure 3

Specifics of a peak in GC–MS data

To identify the type of compounds in the samples a 7890B Series GC gas chromatography device connected to Agilent 5977A Series MSD mass spectrometer with HP-5MS column 30 m long, 0.25 mm in diameter and with 0.25 Micrometers of static phase layer thickness were used. The thermal programming of the column was selected from 60 to 325 °C with an increase of 4 degrees Celsius per minute and Helium 99% (Messer® CANGas) was used as carrier gas. Table 1 shows the average percentage of caffeine and palmitic Acid in every species.

Table 1 Average data obtained for each variety by GC–MS (%)

It is worth mentioning that the GC–MS device used the EI ionization method as the standard. Also, the electron energy and emission current were 5–241.5 eV and 0–315 µA, respectively. Table 2 shows details of properties of chemicals used. Also, Fig. 4 illustrate calibration curves for caffeine and palmitic acid.

Table 2 All chemicals used properties
Fig. 4
figure 4

a Caffeine calibration curve and equations. b Palmitic Acid calibration curve and equations

VIS/NIR spectroscopy

Vis/NIR spectroscopy tests were performed using a PS-100 spectroradiometer (Apogee Instruments, INC., Logan, Utah, USA) with a silicon detector, 2048 pixels, 1 nm resolution, and halogen-tungsten light source in the wavelengths range of 350–1100 nm (Fig. 5).

Fig. 5
figure 5

A schematic of spectrometry

Pre-processing of spectral data

After extracting spectral data from each sample. Due to the presence of noise in the data obtained from coffee beans, before performing the analysis on the data, Pre-processing and normalization methods should be used to reduce the amount of noise. After trying different normalization methods such as the Gaussian filter, Median filter, and Savitzky-Golay method. Due to the results, the best method suitable for the dataset of this study Savitzky-Golay was selected. All spectral wave samples after using the Savitzky-Golay smoothing method are shown in Fig. 6.

Fig. 6
figure 6

All samples spectral waves after preprocessing by Savitzky-Golay smoothing method

In the parameters section, the polynomial order was selected as zero and in the smoothing point section, a number of the left side and right side points were selected as one and we had 3 smoothing points in which the symmetric kernel method was used.

Model performance evaluation

The root mean square error and the coefficient of determination (R2) were used to evaluate the regression prediction models. In this case, the closer the R2 values are to one, the more acceptable the model results are, and on the other hand, the closer the RMSE values are to zero, the more accurate the regression models are because errors in calculations are closer to zero. Unscrambler X version 10.4 (2016, CAMO software) was used for data preprocessing.

$$R = \frac{{\sum\nolimits_{i = 1}^{n} {(x_{i} - \overline{x} ) \cdot (y_{i} - \overline{y} )} }}{{\sqrt {\sum\nolimits_{i = 1}^{n} {(x_{i} - \overline{x} )^{2} } \cdot \sum\nolimits_{i = 1}^{n} {(y_{i} - \overline{y} )^{2} } } }}$$
(1)

where Xi is values of X variables in a samples, X̅ is mean of the values of the x-variable,

Yi is values of the y-variable in a sample and Y̅ is mean of the values of the y-variable.

When the correlation is closer to + 1, it indicates that the relationship is linearly increasing, and when it is closer to –1, it indicates that the relationship is linearly decreasing. The numbers between 1 and –1 indicate the degree of linearity between model variables and measured (observed) variables. There is no linear relationship between the variables if the correlation coefficient is equal to zero.

The coefficient of determination (R2) determines how much of an absolute linear relationship exists between two variables.

The root of the mean square error (RMSE) (also known as the root of the mean square deviation) is a tool for comparing differences between measured and anticipated values in a modeling environment. These significant discrepancies are known as residuals, and the RMSE method is used to gather and total residuals on a single predictive power scale.

The root mean square error (RMSE) formula for predicting variable Ymodel is as follows:

$$RMSE = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} {(Y_{obs,i} - Y_{mo\,del,i} )^{2} } }}{n}}$$
(2)

where Yobs is measured values and Ymodel is predicted values of sample i.

Results and discussion

VIS/NIR spectra interpretation

Figure 7 shows the mean Vis/NIR absorbance spectra in 7 different sample groups tested. The spectral curves of all five sample groups had approximately the same direction of motion and behavior in the range of 400 to 1000 nm. The peaks in the diagrams were in the same zone and only their magnitude was different. In the spectral data, a downward trend with small peaks was observed, so that the movement trend, as shown in Fig. 7, can be more accurately described as a descending diagram. This is a result of a large number of chemical compounds present in coffee.

Fig. 7
figure 7

An example spectral (absorbance) data obtained by spectroradiometer. (Note: absorbance does not have units since it is the negative logarithm of a percent)

Principal component analysis (PCA) results

The principal component analysis is a type of unsupervised method in multivariate data analysis whose main purpose is to reduce the size of the data. This method is also known as Carhunen-tube expansion with singular vector decomposition (SVD). Mathematically, PCA is an orthogonal linear converter that transfers data from a vector space to another space with smaller dimensions on a new coordinate system. Thus, the first principal component (PC1) is an axis in a new space that, when any data is displayed on it, forms the results of a new variable whose variance varies among all axes that can be selected as the first major component has the highest value. The second major component (PC2) is another axis in the new space that is perpendicular to the first axis (Fig. 8).

Fig. 8
figure 8

PC scores for raw data of all samples

The purpose of using multivariate regression modeling methods is to establish a relationship between the measured qualitative properties and the spectral data. PCA is a factor analysis method that determines the factors directly and without estimating the similarities from a correlation matrix. After data preprocessing, using PCA is recommended to detect important factors so that we can count them as groups in spectral data [12]. It is also used to reduce the dimension of data that has multiple variables and to transfer the main variables to the subspace defined by the principal components (PCs). The main components are the linear compositions of the main varieties in the direction of maximum variance. Interpretation of each principal component independently provides a general overview of dataset structure by exhibiting the relationships between samples as well as identifying and distinguishing the outliers. Generally, the first few principal components are used to analyze common features between samples and their grou**.

Figure 9 shows the PCA for all data in which outliers are specified. Principal component analysis was performed to obtain the main components (PCs) in the coffee beans (Fig. 8).

Fig. 9
figure 9

PC’s influence and detected noises

The PCA is used to reduce data dimensions and obtain new PCs in smaller dimensions. PCA was performed to detect differences in caffeine and palmitic acid levels in different species and, according to the obtained form, PC1 and PC2 had the highest effects, with 28% and 13%, respectively (Fig. 8). According to the influence chart, there is a significant correlation between the data, but another clear thing is the presence of outliers (Fig. 9).

Noises were unavoidable in the data collection using the spectroradiometer used in this study. The reason for this can be stated in addition to the errors that are irreparably present in the device itself. There were also errors in probability when working with the device. According to this, a certain period of time had to elapse after the light source of the device was turned on until the light source was completely fixed on the samples. Therefore, the data recorded by the device during this period would be noisy. Also, at the end of the data collection of a sample, when the light source is removed from the sample, in a limited and shorter period of time than the time period mentioned at the beginning of data collection, the data recorded by the device will be noisy again. The movement of the light source and its heat are the main reasons for these noises. The outliers in the whole data set can be caused by the same noise generated during the data collection process. To deal with these outliers, the wavelengths that lead to the occurrence of these outliers can be removed from the overall dataset (Fig. 10).

Fig. 10
figure 10

Influence chart after removing outliers

It should be noted that outliers may not be corrected even with preprocessing operations and may disrupt the overall form of the data and the results. After removing outliers, PC1 and PC2 increased to 41% and 28%, respectively (Fig. 11).

Fig. 11
figure 11

PC scores after removing outliers

Partial least squares regression (PLSR) results

The PLS strategy calculates the values first, as opposed to the covariance-based technique, which meets the model parameters first and then the item values by returning them to the set of all markers (such as the values met for each variable hidden in each data set). To do this, the precise linear combination of their experimental markers is used to complete the latent variables, and PLS then uses these completed representations as ideal substitutes for hidden variables.

Partial least square regression (PLS-R) is generally used and is an efficient method used to carry out multivariate regression, [3] especially for solving linear regression quandaries.The non-parametric PLSR approach does not require data normalization [21]. It also has excellent statistical power despite the lost data. It concurrently describes numerous independent and dependent variables.

The PLSR was used to create calibration models based on GC–MS reference data and near-infrared spectral data for the entire dataset.

The results of calibration and prediction models for caffeine and palmitic acid using all specified wavelengths for all classes was calculated. Calculated and predicted coefficients of determination (R2) for the palmitic acid model and their very small difference indicate the accuracy of the obtained model (Fig. 12).

Fig. 12
figure 12

a Palmitic acid PLSR calibration model chart b Palmitic acid validation model chart

This difference between the coefficient of determination of caffeine for the calculated and predicted values is very, very small (Fig. 13), but in general, the coefficient of determination of the palmitic acid model is higher than caffeine. The distance between each sample and the main components in the score charts represents the impact of the sample's characteristics on the main component. The samples are increasingly similar the closer the sample distance is. Additionally, the information obtained is more reliable the higher the percentage of explained variance, which is 94% for palmitic acid (Fig. 14) and 90% for caffeine (Fig. 15).

Fig. 13
figure 13

Palmitic acid PLS factor scores

Fig. 14
figure 14

a PLSR calibration model chart for caffeine b PLSR validation model chart for caffeine

Fig. 15
figure 15

PLS factor scores for caffeine model

The calibration table for PLSR and validation (prediction) table for PLSR are presented in Tables 3 and 4, respectively. PLSR model accuracy for PLSR model when the reference (calibration) data was palmitic acid was 97.3% and the model accuracy for the caffeine reference data was 94.71%, and also, the accuracy of the validation data was calculated at 96.58% and 94.09%, respectively. As a result, the prediction power of the model is higher for palmitic acid than for caffeine.

Table 3 Calibration table for PLSR
Table 4 Validation (prediction) table for PLSR

SEC = standard error of calibration.

Linear discriminant analysis (LDA)

This method was first designed by Fisher in 1936 for a two-class problem [11] and later generalized to multi-class problems by Rao in 1948 [20]. The LDA is similar to the PCA in that it looks for a linear combination of variables that best describes class differences. A major application of both of these methods is to reduce the number of data dimensions. The LDA method, unlike the PCA method, can extract multiple wavelength information to improve the resolution between classes. As a result of the output response of the wavelength, this method was used to detect seven coffee species. The following is the outcome of the species identification:

LDA model prediction for all data sets displays that the accuracy when the category was palmitic acid was 97.92%, and for the caffeine category it was 95.83%. The results for the LDA confusion matrix chart for palmitic acid and the LDA model chart for caffeine are shown in Fig. 16 and Fig. 17, respectively.

Fig. 16
figure 16

LDA confusion matrix chart for palmitic acid

Fig. 17
figure 17

LDA model chart for caffeine

Artificial neural networks (ANN) results

ANNs are computational models made up of interconnected nodes that represent "features" or properties of the examined dataset. These nodes create a network that, when properly "trained," may encapsulate the relationship between inputs and outputs of interest. MLPs (multi-layer perceptrons) are one of the most often used neural networks for classification. The input, hidden, and output layers make up a multi-layer perceptron (MLP) [16] (Fig. 18).

Fig. 18
figure 18

ANN model specifications

A spectral dataset was employed as input data into the network in this investigation. Input data, a hidden layer with ten neurons, an output layer, and output data were all included in the network. To develop our Artificial Neural Network, 70% of all datasets were chosen as training data, 15% as test data, and 15% as validation data (data set selection was randomly done by software). The ANN model was completed using data gathered using the destructive approach of Gas Chromatography-Mass Spectrometry (GC MS). The adaptive learning function learnGDm was chosen once the training function was taught. MSE was the performance function. There were two layers, each with ten neurons. The neural network toolbox of Matlab software was utilized to calculate the ANN in this work. ANNs are a sort of parallel processing computing system based on distributional or connectionist methods, as previously stated. We employed a feed-forward back propagation neural network in this research. The most popular neural networks are MLF (Multilayer feed-forward) neural networks, which are taught using a back-propagation learning technique. They are used to solve a wide range of chemistry-related issues [26]. A feed-forward neural network is a type of artificial neural network in which nodes do not form a cycle. An input layer, hidden layers, and an output layer are all present in this type of neural network. It is the first and simplest type of artificial neural network [24].

As shown in Fig. 19, the training, test, and validation data R (regressions coefficient) for palmitic acid are 0.9971, 0.9976, and 0.9978, respectively. The regression for all data is 0.9973. ANN output data was sent to Excel software and R2 was calculated for our dataset, which was equal to 0.9730 (Fig. 20).

Fig. 19
figure 19

Palmitic acid ANN model results including training, validation and test

Fig. 20
figure 20

Palmitic acid ANN model’s equation and R2

All the methods mentioned before were applied to our data, but this time our target data was caffeine. As shown in Fig. 21, the training, test, and validation regressions were 0.9951, 0.9958, and 0.9936. The total regression was 0.9948. R2 was calculated by the output data of ANN and it was equal to 0.9900 (Fig. 22).

Fig. 21
figure 21

Caffeine ANN model results including training, validation and test

Fig. 22
figure 22

Caffeine ANN model’s equation and R2

Among the methods used to classify and identify different coffee species, the ANN method produced the most accurate and powerful result (98.96%). Also, considering that the R2 of caffeine-related data in the ANN model had the highest accuracy among all methods, it can be concluded that the caffeine content in coffee is the best feature for classifying and distinguishing species. Due to the small difference between the numbers of acidic (palmitic acid) and caffeine models, acidity can also be used as a suitable indicator for classification.

Tables 5 and 6 show the calibration table for ANN method calculations and all models used R2, respectively. The most accurate model for both reference data is ANN. For palmitic acid, as reference data LDA and PLSR models, accuracy are very close while using caffeine as reference data LDA model is more accurate than the PLSR model.

Table 5 Calibration table for ANN method calculations
Table 6 Describes the R-square of all models used

Conclusions

Caporaso et al. [4] found that sucrose content could be estimated (R2 = 0.65; prediction error of 0.7% w/w coffee, with an observed range of 6.5%), but the PLS model performed better for caffeine and trigonelline prediction (R2 = 0.85 and R2 = 0.82, respectively; prediction errors of 0.2 and 0.1%, on a range of 2.3 and 1.1% w/w coffee, respectively). In other research conducted by Yuwita et al. (2019) prediction results of calibration best fat obtained r = 0.999, R2 = 0.999, SEC = 0.008%, and RMSEC = 0.008%, validating the value of r = 0.998, R2 = 0.999, SEP = 0.009%, and RMSEP = 0.008%, of the model fat coffee beans, can be identified by non-destructive evaluation using spectral analysis. In other research conducted by Corriea et al. (2018), the results of microNIR spectroscopy indicated that this method is effective and prone to the detection of adulterations with minimum quantification levels (LOQs of 5–8 wt %), thus this method is a systematic way to control the quality of commercial coffee samples. Hence, microNIR can decrease and simplify the time of analysis and sample preparation, and also it can guarantee the effectiveness of real-time data acquisition owing to its portability.

The results of the analysis of regression methods for non-destructive detection of the caffeine and palmitic acid content of coffee beans in this research exhibited that near-infrared spectroscopy in the range of 400–1100 nm can be used to detect and predict coffee bean speciess and varieties. Also, results showed that the PLSR model has a good ability to predict palmitic acid and caffeine content. The ANN method has the best performance in determining the caffeine and palmitic acid content of coffee beans with the most accuracy and also has an acceptable performance in identifying and distinguishing different speciess and varieties of coffee due to the content of caffeine and palmitic acid.