1 Introduction

1.1 Motivation

With the modernization of industry, the security of industrial production and control has gradually gained attention. Intrusion detection is particularly important for the security of industrial control [1,2,3,4,5]. The intrusion detection system (IDS) is a commonly used defense method in cybersecurity and industrial control system (ICS) research [6]. Recently, IDSs have adopted machine learning [7,8,9] or deep learning as the main approach to ICS security. Using machine or deep learning [10,11,12], an IDS can obtain accurate detection results and raise more accurate alarms for attacks under different protocols [13,14,15,16,17], thereby protecting ICS safety.

However, the industrial control network differs from the traditional network: each factory's network is relatively independent and closed. As a result, the data on an industrial control network are not universal. The normal traffic generated in the actual industrial production process is far larger than the attack traffic. Such imbalanced datasets leave the machine and deep learning methods used in intrusion detection insufficiently trained, which degrades detection accuracy. Although facilities such as the testbed at Sandia National Laboratories can alleviate this difficulty, the number and types of attacks generated by a testbed are limited, so it is challenging to meet IDS training and testing requirements. Unlike traditional network traffic, the traffic protocol formats in an industrial control network are relatively fixed and largely composed of fixed-length fields, as in Modbus. Consequently, different attacks are highly similar in the feature engineering of machine learning IDSs, which interferes with model training. In addition, because feature engineering relies heavily on manual extraction, the prior knowledge of researchers easily leaks into the features and interferes with the model's classification process [7,8,9]. Therefore, increasing the quantity and quality of ICS attack traffic data has become a focal issue that attracts both industrial and scholarly attention.

1.2 Limitations of prior state-of-the-art methods

Combining machine or deep learning algorithms with an IDS has become the mainstream method for protecting ICSs. The intrusion detection models in an IDS involve feature engineering and model training, and both processes require a certain number of high-quality datasets. However, current ICS datasets are unevenly distributed owing to the lack of attack traffic data and cannot meet this requirement, which degrades the training effect and detection accuracy of the IDS.

To address the difficulty of imbalanced datasets, several methods have been proposed in the past few decades. They are broadly divided into two categories: algorithm-level and data-level methods [18]. Algorithm-level methods aim to reinforce existing classifier learning algorithms on the minority categories. Gu et al. [19] and Khabsa et al. [20] proposed improved support vector machines that allocate larger penalty weights to misclassified minority instances during training, thereby improving accuracy on the minority classes when dealing with unbalanced datasets. However, these methods do not consider the different contributions of individual minority class examples when learning decision boundaries, which leads to over-fitting and sensitivity to noise. From the aforementioned work, it can be observed that algorithm-level methods for handling imbalanced data depend greatly on the classification ability of the classifier. Moreover, they do not fundamentally solve the problem of the misclassified minority class, and the decision boundary is generated from the existing data only and therefore cannot represent the actual data distribution.

Another solution to such problems is data-level methods, which attempt to rebalance the category distribution by resampling the given imbalanced data. In Tao's investigation, data-level methods appeared more common for imbalanced datasets than algorithm-level methods [21], and it was reported that the balanced datasets obtained by resampling can be used by any classification algorithm. However, in many practical applications, minority class samples are often scarce in the learning phase. This scarcity makes it difficult for traditional resampling methods to handle imbalanced datasets and prevents the IDS from achieving better training results, leaving industrial equipment exposed to attacks that are represented only by a few minority class samples such as attack traffic data. Therefore, how to use the existing imbalanced ICS datasets to improve the detection accuracy of an IDS and better protect the industrial production process has become a limitation that researchers need to address.

1.3 Proposed approach

To address the aforementioned limitations, in this study, we propose an end-to-end system known as the data expansion intrusion detection system (DEIDS) for protecting ICS safety. As shown in Fig. 1, the structure of DEIDS includes format conversion, data expansion, and intrusion detection modules. First, the format conversion module adjusts the format of the raw attack traffic data to suit model training and feature extraction in the IDS. Then, a discriminator determines whether the traffic data of each category are in a balanced state and whether their amount is sufficient. The data expansion module contains a redesigned convolutional neural network (CNN) and a data expansion algorithm. The redesigned CNN learns and extracts features to enhance the discrimination of the attack traffic, and the algorithm expands the enhanced features extracted by the redesigned CNN to obtain generated attack traffic tensors. A conventional machine or deep learning model is used in the intrusion detection module, and the generated attack traffic tensors are used for its training. The trained model has a strong ability and robustness to detect ICS attacks, and the flexible architecture increases the generalization of DEIDS.

Fig. 1 Simple flow of DEIDS

Fig. 2 Schematic diagram of the distribution of the data generated by our method and the traditional method

In this study, the data expansion module receives the insufficient samples identified by the discriminator. This module has two submodules, the feature extractor and the sample expander. The insufficient samples in the converted format are first transmitted to the feature extractor, which extracts more accurate data features from the traffic data. Then, a new algorithm named CenterBorderline_SMOTE (CB_SMOTE) is designed and applied in the sample expander. It expands the data features produced by the feature extractor and strengthens their boundaries. Finally, this submodule builds the feature engineering from the expanded data features, which benefits the training of the intrusion detection model. In designing the algorithm, we refer to the interpolation idea of the SMOTE method and improve it. We abandon the K-nearest-neighbor idea, which tends to cause blindness in selecting the location and volume of data expansion when the distributions of the positive and negative samples are not well differentiated. The boundary distribution is enhanced to counter distribution marginalization, and the existing minority class data are used to ensure that the newly generated data remain within the decision boundary. The schematic diagram of the distribution of the generated data for different strategies is shown in Fig. 2. Our method ensures that the generated data always stay on the positive side, so that large amounts of data are not generated on the boundary or in areas that are difficult to demarcate, which would cause misclassification by the model.

1.4 Novelty and advantage of our approach

In this study, the main novelties and advantages of the proposed approach are as follows:

  • Enhanced feature extraction Traditional feature extraction depends entirely on establishing feature engineering by manual operation or by deep learning methods such as CNNs. This study designs a feature extractor that modifies and redesigns the structure of the CNN and uses the reconstructed CNN to extract more accurate features. The advantage of this method is that it improves the accuracy of feature extraction and avoids unnecessary interference caused by human misoperation or prior knowledge.

  • Efficient data expansion Traditional SMOTE-based methods use the K-nearest neighbor, but the ideal K value is difficult to determine, the amount of computation is significant, and the result is easily affected by noise. This study designs and proposes a new SMOTE-based data enhancement method, CB_SMOTE. Using the class boundary samples as the classification boundary, the seed samples are obtained by directly comparing each sample's distance to the fitting center and its aggregation degree. New samples are synthesized on the line between a seed sample and the fitting center to realize the over-sampling strategy.

1.5 Key contributions

The main contributions of this study are as follows.

  • DEIDS can accurately identify ICS attacks and achieve high-precision intrusion detection. The system is suitable for ICS-based intrusion detection. Compared with traditional intrusion detection systems, DEIDS has wider applicability and practicability: in the case of insufficient attack samples, it can use its internal algorithm to generate data and improve detection accuracy, thus protecting industrial devices from attacks more effectively.

  • We present a novel algorithm based on SMOTE combined with the data's fitting center (FC), known as CB_SMOTE. The algorithm addresses imbalanced ICS datasets by producing attack traffic data. It also handles the boundary and strengthens the boundary's distinction to enhance the whole dataset and improve its quality. The datasets expanded by this algorithm are suitable for various types of machine or deep learning intrusion detection models, and a model trained on such a dataset attains strong robustness and detection accuracy.

  • The experimental results show that DEIDS has a very high detection accuracy and can extract more valuable and accurate features. Compared with traditional over-sampling methods, the expansion algorithm used by DEIDS acquires higher quality attack data than conventional data generation methods. The detection accuracy of the intrusion detection model trained on the open ICS datasets is over 97\(\%\). The generated datasets can train the intrusion detection model and give it a more powerful detection ability to protect the ICS better.

2 Related work

Random over-sampling is a non-informative method, which rebalances the class distribution by randomly copying the minority class samples [22]. Its disadvantage is that the exact replication of samples repeats information in the training set and leads to over-fitting of the subsequent supervised classification algorithm. To overcome this defect, Chawla et al. [23] proposed an information-based over-sampling method known as the synthetic minority over-sampling technique (SMOTE). The algorithm generates new minority class instances by interpolating between the k-nearest minority class neighbors. This method provides more helpful information for classification than random over-sampling because it creates new artificial minority class instances rather than simply copying the original ones. To avoid SMOTE's sample coverage problem, Han et al. [24] proposed the Borderline-SMOTE algorithm, which searches for "dangerous" samples in the minority classes and generates new samples based on them. Building on this, He et al. [25] constructed a distribution function over the newly generated samples based on the degree of "danger" of each sample to determine how many new samples are generated from each "dangerous" sample. Jo et al. [26] implemented a clustering-based sampling algorithm known as cluster-based over-sampling (CBO), which is suitable for cases in which multiple disconnected aggregation points exist in the class distribution. Liu et al. [27] introduced the concept of the class average distance and proposed an unbalanced dataset learning algorithm known as DB_SMOTE using the center of gravity of the sample data. This method is simple to use and is well suited to datasets with clear boundaries.
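For concreteness, the SMOTE and Borderline-SMOTE baselines discussed above can be reproduced with the imbalanced-learn package. The following minimal sketch assumes an already-vectorized toy dataset and is intended only to illustrate the resampling interface, not the experimental setup of this paper.

```python
# Minimal sketch of the classical over-sampling baselines discussed above,
# assuming the imbalanced-learn package and an already-vectorized dataset.
import numpy as np
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))          # toy feature matrix
y = np.array([0] * 950 + [1] * 50)      # 95:5 class imbalance

# Plain SMOTE: interpolate between a minority sample and one of its k nearest
# minority neighbors.
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

# Borderline-SMOTE: only "dangerous" minority samples near the class boundary
# are used as seeds.
X_bl, y_bl = BorderlineSMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

print(np.bincount(y_sm), np.bincount(y_bl))   # both classes rebalanced to 950
```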

However, methods based on SMOTE face the following difficulties. First, as an interpolation method, SMOTE can quickly interpolate low-dimensional data to achieve over-sampling, but it is not suitable for high-dimensional data. Second, because of the K-nearest neighbor, SMOTE cannot guarantee the accuracy and effectiveness of the generated data: the selected interpolation space may be impure, so part of the generated data becomes noise.

With the advancements in machine and deep learning methods, samples can also be over-sampled by incorporating algorithm-level ideas. GANs [28,29,30], flow-based models, and other deep learning methods integrate the idea of algorithm-level methods and over-sample the samples to enhance the data [

Fig. 3 Structure of the DEIDS

The structure of the system designed here is shown in Fig. 3. First, the system converts the data format: the attack traffic data are converted to the tensors needed for model training, and the system uses these tensors to train the CNN. The tensors are then passed through the discriminator for the first time to judge whether the amount of each attack category is in a balanced state. If the attack data are balanced and the number of samples in each category is sufficient, they are transferred to the intrusion detection module for training and detection (a minimal balance check is sketched below). However, if one or more attack types have few samples, the corresponding tensors are passed to the data expansion module. Using our designed classification importance discrimination module (CIDM), we extract the relevant attack details of the attack traffic samples and construct detailed attack features based on their classification importance. Then, our proposed CB_SMOTE algorithm takes these attack features as input and increases the number of attack features of the minority class samples. After the generated features are obtained, they are restored to the standard tensor format to build the feature engineering. Finally, the expanded attack samples train the intrusion detection model, which effectively addresses low-quality and insufficient data and improves detection accuracy. With the whole system in operation, researchers can expand imbalanced ICS attack datasets effectively and accurately, and more accurate attack feature sets can be extracted for intrusion detection training so that the detection results become more accurate.
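A minimal sketch of the balance discriminator mentioned above is given below. The threshold rule (a category is flagged when its count falls below a fraction of the largest category) is our own assumption for illustration, since the text does not fix a specific criterion.

```python
# Illustrative balance discriminator: flag attack categories for expansion
# when their sample count is small relative to the largest category.
from collections import Counter

def needs_expansion(labels, min_fraction=0.2):
    """Return the attack categories whose sample count falls below a fraction
    of the largest category (the criterion here is an illustrative assumption)."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {c for c, n in counts.items() if n < min_fraction * majority}

# Example: "dos" and "scan" flows are far rarer than "normal" flows.
labels = ["normal"] * 900 + ["dos"] * 40 + ["scan"] * 60
print(needs_expansion(labels))   # {'dos', 'scan'} -> sent to the data expansion module
```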

3.1 Feature extractor based on the classification importance discrimination module

Because manual operations in feature engineering interfere with feature extraction and thereby reduce the detection accuracy of intrusion detection models, we design a feature extractor to extract data features instead of relying on manual feature engineering. It uses the reconstructed CNN's powerful feature extraction ability to extract features strongly associated with the attack type.

Figure 4 shows the generation process of the traffic flow tensors. In this study, the conversion rules are formulated based on the characteristics of industrial control network traffic. First, following these rules, the first fixed number of bytes of each traffic packet is intercepted in the original capture order and integrated into a flow. Second, the hexadecimal data in the flow payload are converted into decimal values and written into a flow tensor line. These tensors contain the payload of a flow of original ICS traffic over some period of time. Padding is used to make up for an insufficient number of bytes so that every flow has the same length. Finally, the flow tensor line is reshaped into a square flow tensor.
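The conversion can be sketched as follows. The 784-byte flow length and the 28 x 28 square shape are illustrative assumptions rather than the paper's exact conversion rules.

```python
# Sketch of the flow-tensor conversion described above, under the assumption
# of a 784-byte payload truncated/padded per flow and reshaped to 28x28.
import numpy as np

FLOW_BYTES = 784          # fixed number of bytes kept per flow (assumption)
SIDE = 28                 # 784 = 28 * 28 -> square flow tensor

def flow_to_tensor(packets):
    """packets: list of raw packet payloads (bytes) in capture order."""
    payload = b"".join(packets)[:FLOW_BYTES]          # truncate to fixed length
    decimal = list(payload)                           # hex bytes -> 0..255 integers
    decimal += [0] * (FLOW_BYTES - len(decimal))      # pad short flows with zeros
    return np.array(decimal, dtype=np.float32).reshape(SIDE, SIDE)

# Example: a toy flow made of two short Modbus-like packets.
pkts = [bytes.fromhex("000100000006010300000001"),
        bytes.fromhex("0001000000050103020abc")]
tensor = flow_to_tensor(pkts)
print(tensor.shape)        # (28, 28)
```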

Fig. 4 Generation process of the traffic flow tensor

The traditional CNN model is generally divided into convolutional and fully connected layers. The convolutional layers are responsible for extracting pixel regions from the image, and the fully connected layers perform a reshaping analysis of those regions to complete the classification. Zhou et al. [35] showed that the pixel area retained after several training iterations in the convolution process is the target part that helps the classification extract features, whereas the fully connected layers negatively affect the retained information and thus adversely affect feature extraction. They observed that a CNN using a global average pooling (GAP) layer has a significant localization ability and that the data information contained in the pixel positions is not negatively affected, meaning the original image is not lost.

Thus, to extract the attack features more accurately, we draw on Zhou's method [35] and replace the fully connected layer in the traditional CNN model with a weight layer, eliminating the negative impact of the fully connected layer on the feature extraction process. The reconstructed CNN uses a combination of the GAP layer and the weight layer. The weight layer can complete the classification task of the fully connected layer, and the data are not adversely affected.
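A minimal PyTorch sketch of such a reconstructed CNN is shown below. The layer sizes are illustrative assumptions; only the GAP-plus-weight-layer head follows the description above.

```python
# A minimal PyTorch sketch of the reconstructed CNN: the fully connected
# classifier is replaced by global average pooling (GAP) followed by a single
# bias-free "weight layer", in the spirit of Zhou et al.'s design.
import torch
import torch.nn as nn

class ReconstructedCNN(nn.Module):
    def __init__(self, num_classes: int, in_channels: int = 1):
        super().__init__()
        self.features = nn.Sequential(               # convolutional feature extractor
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)           # GAP replaces flatten + FC
        self.weight_layer = nn.Linear(64, num_classes, bias=False)  # weights w_o^c

    def forward(self, x):
        maps = self.features(x)                      # (B, 64, H, W) feature maps T_o
        pooled = self.gap(maps).flatten(1)           # (B, 64)
        logits = self.weight_layer(pooled)           # D_c = sum_o w_o^c * GAP(T_o)
        return logits, maps                          # keep maps for the CIDM

# Example: a batch of 28x28 flow tensors classified into 5 attack types.
model = ReconstructedCNN(num_classes=5)
logits, feature_maps = model(torch.randn(8, 1, 28, 28))
print(logits.shape, feature_maps.shape)              # [8, 5] and [8, 64, 28, 28]
```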

Fig. 5 Schematic diagram of the reconstructed CNN structure

The reconstructed CNN structure is shown in Fig. 5. To make this method suitable for network attack flows, the existing method needs to be adjusted. Although the prior knowledge of researchers can help reduce the weight matrix and extract more valuable attack features, for machines the effect of classification depends more on the feature attributes of the data, and these attributes are likely to contradict or conflict with human knowledge. Therefore, for the classification model, prior human knowledge is intrusive and misleading. Moreover, Zhou's method selects only a few features with a high degree of activation as classification features, which inevitably causes a loss of information. In this paper, we design a module known as the CIDM. Instead of taking only the most essential feature details as in Zhou's method, it establishes a large matrix that preserves the classification importance degree of every feature in each flow. After passing through the CIDM, each flow can accurately capture every feature that plays a decisive role in the classification. For each type of attack, the CIDM also reveals the precise details of the attack, that is, which characteristic information and locations play a role in the attack.

Because of the CIDM, the feature extraction steps of the CNN are adjusted to a certain extent. As shown in Eq. (1), the flow tensors filtered by the CIDM output a set of data features \(T_o^n\), where \(T_o^n\) denotes the nth feature T in dimension o. During training, the CNN assigns each feature a corresponding weight \(w_o^c\), which describes the importance of the features in dimension o when the flow tensor is classified as class c. Thus, when the image is classified into category c, the data feature set filtered by the CIDM can be summarized into the classification importance \(D_c\) using the importance assigned to each feature during the CNN's training. We use this importance to select the features that play a more decisive role in the classification, as shown in Eq. (1).

$$\begin{aligned} T_o^n &= \begin{bmatrix} T_0^0 & T_0^1 & \cdots & T_0^n\\ T_1^0 & T_1^1 & \cdots & T_1^n\\ \vdots & \vdots & \ddots & \vdots \\ T_o^0 & T_o^1 & \cdots & T_o^n \end{bmatrix} \\ D_c &= \sum _o w_o^c \sum _n F_o(T_o^n) = \sum _o \sum _n w_o^c F_o(T_o^n) \end{aligned}$$
(1)

\(F_o(T_o^n)\) denotes the importance degree of feature \(T^n\) in dimension o of the flow tensor. We perform an identical transformation: the importance tensors of the highest convolution layer of the detection model are not globally pooled; instead, all tensors are weighted and summed using the weights extracted from the weight matrix, and the classification importance degree \(D_{c}\) corresponding to the original tensors is obtained.
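The computation of Eqs. (1)-(2) can be sketched as follows, taking \(F_o(T_o^n)\) to be the activation value itself for simplicity and using random stand-ins for the trained CNN's feature maps and weight matrix.

```python
# Sketch of the classification-importance computation of Eqs. (1)-(2), with
# F_o taken as the activation value itself and random stand-ins for the
# feature maps T_o^n and the weight-layer matrix w_o^c.
import numpy as np

def classification_importance(maps, weights, c):
    """maps: (O, N) features T_o^n; weights: (C, O) weight-layer matrix w_o^c."""
    # Per-feature importance for class c: w_o^c * T_o^n, kept as a full matrix
    # by the CIDM instead of retaining only the most activated features.
    importance = weights[c][:, None] * maps          # (O, N)
    D_c = importance.sum()                           # Eq. (1): sum over o and n
    return importance, D_c

def class_probability(D):
    """Softmax over the per-class importance scores D_c (Eq. (2))."""
    e = np.exp(D - D.max())                          # numerically stable softmax
    return e / e.sum()

# Example with random stand-ins for the trained CNN's outputs.
rng = np.random.default_rng(0)
maps = rng.normal(size=(64, 784))                    # 64 channels, 28x28 map flattened
weights = rng.normal(size=(5, 64))                   # 5 attack classes
D = np.array([classification_importance(maps, weights, c)[1] for c in range(5)])
print(class_probability(D))                          # per-class probabilities P_c
```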

Because the introduced GAP layer reduces the dimension of the importance tensors of each dimension, when the flow tensors finally need to be classified, the probability of being classified as class c is expressed by Eq. (2).

$$P_c=\frac{\exp (D_c)}{\sum _{j=1}^{n}\exp (D_j)}$$
(2)

\(P_c\) represents the probability that the flow tensor is classified as class c, and n represents the total number of categories. To judge whether the tensor classification is correct, we retain the probability distribution over the classes so that various classification metrics can be calculated subsequently.

3.2 Sample expander based on CB_SMOTE algorithm

After all the attack flow tensors pass through the redesigned CNN, the CNN can accurately extract each attack feature that plays a decisive role in classifying the attack types. Therefore, the full feature importance degree of each attack type can be obtained by accumulating the classification importance degrees of all the flow tensors in that category. We then select the essential attack features extracted by the CNN under each category as the feature engineering of that attack type. Because the quality and quantity of this feature engineering cannot meet the requirements of adequate training and classification, it must be expanded and enhanced effectively. In this study, a module known as the sample expander is designed for this purpose. For this module, we propose a method based on data weights and distribution boundaries known as CB_SMOTE. It can effectively expand and enhance insufficient attack samples and takes corresponding measures for boundary samples to avoid cases where boundary samples are easily misclassified as noise. The method also simplifies the expansion steps and expands samples only at the key positions that affect model detection.

In the sample expander with the CB_SMOTE algorithm, we use the feature engineering of each attack type obtained by the feature extractor to make a targeted expansion. For the attack details extracted by the CNN, a set \(S=\left\{ F_i,i=1,2,\ldots ,n\right\}\) of attack details is built for each category of data, where \(F_i\) represents the feature engineering matrix of attack i, screened out by the importance obtained from the CIDM. Then, in all the flow tensors of each attack type, the features at the corresponding positions, selected according to the filtered importance, are regarded as the feature engineering of that flow and saved into \(F_i\). The matrix has the following form, where n represents the number of features selected from each flow tensor and m denotes the number of flow tensors under this attack type. To expand the attack detail feature set of the minority classes, we design the CB_SMOTE method. The method takes the attack details from the attack detail set of the same class and locates the eigenvalues at the exact positions in \(f_i\). It then extracts the eigenvalues at the same position and calculates their average value. This average value in the data space is recorded as the fitting center (FC) and is calculated as shown in Eq. (3), where m has the same meaning as in \(F_i\).

$$\begin{aligned} F &= \begin{bmatrix} f_1^1 & f_1^2 & \cdots & f_1^n\\ f_2^1 & f_2^2 & \cdots & f_2^n\\ \vdots & \vdots & \ddots & \vdots \\ f_m^1 & f_m^2 & \cdots & f_m^n \end{bmatrix} \\ FC &= \frac{1}{m}\sum _{j=1}^m f_j \end{aligned}$$
(3)

The average distance from the features at the exact location in the attack details of this type to the FC point is recorded as the average fitting distance (FD). This distance reflects the degree of aggregation of the details of this type of attack: the smaller the value, the more compact the distribution; otherwise, the distribution is sparse. The calculation is shown in Eq. (4).

$$FD=\frac{1}{m}\sum _{j=1}^m Dis(f_j,FC)$$
(4)
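A direct translation of Eqs. (3)-(4) into code, assuming the feature-engineering matrix F of one attack type is available as a NumPy array:

```python
# Fitting Center (FC) and average Fitting Distance (FD) of Eqs. (3)-(4) for one
# attack type's feature-engineering matrix F (m flows x n selected features).
import numpy as np

def fitting_center(F):
    return F.mean(axis=0)                            # Eq. (3): column-wise mean

def fitting_distance(F, FC):
    return np.linalg.norm(F - FC, axis=1).mean()     # Eq. (4): mean Euclidean distance

# Example: 6 flows with 4 selected features each.
F = np.array([[1., 2., 3., 4.],
              [1., 2., 3., 5.],
              [2., 2., 3., 4.],
              [1., 3., 3., 4.],
              [9., 8., 7., 6.],     # a boundary-like outlying flow
              [1., 2., 4., 4.]])
FC = fitting_center(F)
FD = fitting_distance(F, FC)
print(FC, FD)
```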

The key to the over-sampling strategy is to find the basic features and iteratively generate new features. During sample classification, samples at the boundary are the most prone to classification errors, so we must assign larger weights to boundary samples. To facilitate the description, we define the basic feature (BF): a BF is a feature whose distance from the same-position feature to the FC is greater than the FD. The calculation of BF is shown in Eq. (5), where \(T_i\) represents the feature set of the ith attack.

$$\begin{aligned} BF=\{T_i\mid Dis(f_m,FC)>FD\} \end{aligned}$$
(5)

Subsequently, all the BFs at the exact locations are found to build a candidate set. The FC is designated as the reference point to avoid introducing too much interference into the generated features. A line segment is formed between each attack feature in the candidate set and the reference point, and new attack features are generated on this line segment to ensure that they are located inside the correct class. Following the basic principle of the SMOTE algorithm, the generated new feature (GNF) is computed as shown in Eq. (6):

$$\begin{aligned} GNF=S_i+(S_i-FC)\times r \end{aligned}$$
(6)

where \(S_i\) belongs to the candidate set and r is a random number in [0, 1].
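Eqs. (5)-(6) can be sketched as follows; the candidate selection and the interpolation follow the formulas as given, with synthetic data standing in for the extracted attack features.

```python
# Candidate (Basic Feature) selection and new-feature synthesis of Eqs. (5)-(6):
# flows farther from FC than FD form the candidate set, and a new feature is
# generated on the line defined by a candidate S_i and the fitting center.
import numpy as np

def basic_features(F, FC, FD):
    dist = np.linalg.norm(F - FC, axis=1)
    return F[dist > FD]                              # Eq. (5): Dis(f, FC) > FD

def generate_new_feature(S_i, FC, rng):
    r = rng.random()                                 # r in [0, 1]
    return S_i + (S_i - FC) * r                      # Eq. (6) as given

rng = np.random.default_rng(0)
F = rng.normal(size=(20, 4))                         # toy minority-class features
FC = F.mean(axis=0)
FD = np.linalg.norm(F - FC, axis=1).mean()
candidates = basic_features(F, FC, FD)
print(generate_new_feature(candidates[0], FC, rng))
```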

As described above, the greater the distance between an attack feature of the candidate set and the FC, the more likely that feature is to be misclassified. Thus, the number of features generated for such a sample needs to be increased accordingly, which helps improve the accuracy of the classification model. Using the Euclidean distance, the distance Dis(\(S_i\), FC) from each feature to the FC can be obtained, and the sum of the Euclidean distances from all attack details to the FC, denoted S, is obtained by accumulation. Finally, we obtain the distribution function P as shown in Eq. (7).

$$\begin{aligned} P_i=\frac{Dis(S_i,FC)}{S}, \sum _{i=1}^kP_i=1 \end{aligned}$$
(7)

We can obtain the number of new samples generated for each candidate attack feature by multiplying its distribution probability by the total number of features to be generated.

The characteristics of ICS attack data and the correlation between information points are used to simplify the sample expansion process. This means that samples can be expanded at a specific location pixel of an individual attack detail, and the pixel information at the corresponding positions can then be expanded based on the change at the first position. This approach retains the characteristics of ICS attack data and is unlikely to produce large amounts of erroneous data when expanding samples.

The following shows the implementation of CB_SMOTE. Assume a minority class sample set \(DS=\left\{ (f_i,N_i),i=1,2,3,\ldots ,n\right\}\), where i indexes the attack detail categories of the samples, \(f_i\) represents the collection of specific attack details, and \(N_i\) represents the number of attack details in category \(f_i\).

Algorithm a CB_SMOTE

In the algorithm, the int() function rounds up. The balance factor determines the total number of generated samples and can initially be set to one here, as required, to ensure a balanced relationship between the over-sampled dataset and the majority sample set.
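Because the original pseudocode is rendered as an image, the following is an unofficial Python reconstruction of the CB_SMOTE procedure from Eqs. (3)-(7) and the description above. The balance factor and the round-up behavior of int() follow the preceding paragraph, and all data in the example are synthetic.

```python
# Unofficial reconstruction of CB_SMOTE from Eqs. (3)-(7): compute the fitting
# center and distance, select candidate (basic) features beyond FD, distribute
# the required number of new samples over the candidates in proportion to their
# distance from FC (Eq. (7)), and interpolate each new sample via Eq. (6).
import math
import numpy as np

def cb_smote(F_minority, n_majority, balance_factor=1.0, seed=0):
    rng = np.random.default_rng(seed)
    m = len(F_minority)
    FC = F_minority.mean(axis=0)                               # Eq. (3)
    dist = np.linalg.norm(F_minority - FC, axis=1)
    FD = dist.mean()                                           # Eq. (4)

    candidates = F_minority[dist > FD]                         # Eq. (5)
    cand_dist = dist[dist > FD]

    total_new = int(balance_factor * n_majority) - m           # samples still needed
    if total_new <= 0 or len(candidates) == 0:
        return F_minority

    P = cand_dist / cand_dist.sum()                            # Eq. (7)
    counts = [math.ceil(p * total_new) for p in P]             # int() as round-up

    new_samples = []
    for S_i, k in zip(candidates, counts):
        for _ in range(k):
            r = rng.random()
            new_samples.append(S_i + (S_i - FC) * r)           # Eq. (6)
    return np.vstack([F_minority, np.array(new_samples)])

# Example: 40 minority-class feature vectors expanded toward a 900-sample majority.
rng = np.random.default_rng(1)
F_min = rng.normal(size=(40, 8))
expanded = cb_smote(F_min, n_majority=900)
print(expanded.shape)       # roughly (940+, 8) after round-up
```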

3.3 Sample expander based on boundary enhancement CB_SMOTE

Because the proposed method generates new features at random positions on the line segment, CB_SMOTE may not produce enough new features near the boundary and thus may not enhance the quality of the boundary samples. To address this, we adopt the idea of improving the boundary attack data: each boundary datum is connected with its nearest original boundary datum, and the data generated between the boundary datum and its nearest neighbor are defined as the sample set to be generated. The expanded dataset is then calculated using the aforementioned CB_SMOTE algorithm. In this way, we can obtain sufficient generated data to enhance the quality of the boundary attack data and avoid misclassification.

In this study, the boundary dataset to be expanded is selected as described in the previous paragraph. The real center (RC) of the boundary attack dataset (the midpoint of the two boundaries of the dataset to be expanded) is calculated and compared with the FC; the resulting difference value (DV) is expressed by Eq. (8):

$$DV = \mid RC - FC \mid , \quad \text {where}\; RC = \frac{X_{\text {boundary}L} + X_{\text {boundary}R}}{2}$$
(8)

where \(X_{\text {boundary}L}\) and \(X_{\text {boundary}R}\) are the boundary points at the left and right boundaries, respectively. The distance DV between the RC and the FC is calculated by Eq. (8). A nonzero DV indicates an error between the actual and estimated centers, so the generated samples cannot well represent the distribution of the original data. We therefore use a sample expander specially designed for the boundary to enhance the boundary of the data distribution; the algorithm it uses is described below.

Algorithm b Boundary enhancement of the sample expander

If the DV indicates that further expansion is needed, new data are calculated and generated by the aforementioned algorithm. Each time data are generated, the FC is recalculated to obtain a new FC, and the distance between the FC and the RC is compared again until the difference is less than or equal to a particular threshold. This effectively improves the classification accuracy of the boundary features.
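One way to realize this loop is sketched below under stated assumptions: the left and right boundaries are taken as the per-dimension minimum and maximum of the feature matrix, the sparser boundary (farther from the FC) is enhanced first, and each new sample is generated between a boundary sample and its nearest neighbor as described above. The threshold and round limit are illustrative.

```python
# Illustrative boundary-enhancement loop of Sect. 3.3: compute RC and DV
# (Eq. (8)), and keep generating samples near the sparser boundary until DV
# drops below a threshold or a round limit is reached.
import numpy as np

def boundary_enhance(F, threshold=0.05, max_rounds=20, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(max_rounds):
        FC = F.mean(axis=0)                              # fitting center (Eq. (3))
        x_left, x_right = F.min(axis=0), F.max(axis=0)   # boundary features (assumption)
        RC = (x_left + x_right) / 2.0                    # real center of the two boundaries
        DV = np.linalg.norm(RC - FC)                     # difference value (Eq. (8))
        if DV <= threshold:
            break
        # Enhance the sparser side: pick the boundary farther from FC, connect its
        # nearest actual sample with that sample's nearest neighbour, and generate
        # one new sample on the connecting segment.
        far = x_left if np.linalg.norm(x_left - FC) > np.linalg.norm(x_right - FC) else x_right
        order = np.argsort(np.linalg.norm(F - far, axis=1))
        b, nn = F[order[0]], F[order[1]]
        r = rng.random()
        F = np.vstack([F, b + (nn - b) * r])
    return F

F = np.random.default_rng(2).normal(size=(30, 4))
print(boundary_enhance(F).shape)                         # grows by at most max_rounds rows
```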