1 Introduction

Bearings are critical and common components in rotating machinery, and their failures can severely affect overall system performance. Moreover, a bearing failure may damage other components of the machine. For instance, a bearing failure in a high-speed wind turbine [1] can initiate the degradation of other parts of the gearbox, with expensive replacement costs: the authors of [2] estimated the gearbox cost of a 5 MW wind turbine at more than USD 600,000. Thus, monitoring bearing degradation and estimating the remaining useful life (RUL) are essential preventive maintenance activities for rotating machinery [3, 4].

The bearing RUL estimation approach comprises data acquisition, feature extraction, and prognosis with remaining useful life estimation. Vibration signals are commonly acquired with accelerometers [1] and often incorporate noise. Besides, bearings degrade according to a stochastic process, and the degradation and failure information is embedded in the raw vibration data. It is thus helpful to extract features that carry the bearings' condition information. The extracted features carry degradation and failure information, among which some are inevitably redundant [5]. Feature selection identifies the most relevant subset of the original features while discarding redundant ones [6]; through feature selection, one can estimate RULs with sufficient accuracy based on appropriate features [7,8,9,10]. The authors of [19] worked out data fusion-based features for degradation prognosis. However, as mentioned in [15, 16, 18], to reduce the sensitivity to noise and outliers and to obtain monotonic features, it is necessary to preprocess the data by smoothing and then evaluate the monotonicity defined in [16]. Smoothing methods highlight the patterns while removing outliers from the raw data, so the performance of monotonicity-based feature selection for health prognosis relies largely on the smoothing method. Applying a smoothing method involves both method selection and parameter setting. Moreover, smoothing captures the global trend rather than reflecting local variations.

The bearing degradation process can be interpreted as a stochastic process involving several degradation regimes. Thus, we assume that the degradation features collected over the whole life cycle of a specific bearing are drawn from a mixture distribution, where each component distribution is associated with one degradation regime of the bearing degradation process. Hence, mixture distribution analysis is introduced in this work to reveal the structure of the bearing vibration signal-based feature space.

The main contributions of this paper are as follows: to overcome the drawbacks caused by smoothing methods, we propose a new feature selection metric that characterizes the monotonous evolution of a feature based on a mixture distribution-weighted assessment of the feature data instead of the smoothed feature. We call this new feature selection metric “distribution monotonicity.” Additionally, a new feature selection approach based on distribution monotonicity is proposed.

The remainder of this paper is organized as follows. Section 2 recalls the main background for estimating the RUL of bearings and the existing definition of a feature's monotonicity with its limitations, and then states the problem. Section 3 proposes the new feature selection metric, called distribution monotonicity, together with a new feature selection approach based on it. Section 4 then provides the performance assessment results of the proposed metric and method through a case study using an existing real benchmark dataset from the literature. Lastly, Sect. 5 concludes this work and discusses some directions for future research.

2 Background and problem statement

2.1 The RUL modeling

Health prognosis refers to remaining useful life estimation. The general scheme adopted here is the one in [20]: a RUL estimation model for bearings based on a fuzzy inference system, using a feature set extracted from vibration signals.

The scheme of the health prognosis process is shown in Fig. 1. This scheme involves neither a fixed bearing failure threshold nor any available information on failure modes. Feature data extracted from the monitored vibration signals is used as the model input, denoted by \({V}_{k}\), where \(k=1,2,\cdots ,K\) is the rank of the observation. As degradation processes differ among bearings, manually determining bearing failure thresholds would inevitably introduce noise and error. To avoid this issue, the estimated past useful life ratio, denoted by \({\widehat{\rho }}_{k}\), is adopted as the model output.

Fig. 1 The scheme of the RUL estimation

The accuracy of the RUL model depends on the quality of the features extracted from the vibration signals for health condition monitoring. Specifically, this paper addresses the selection of bearing degradation features for RUL estimation.

2.2 Existing monotonicity estimation metric

Mechanical components such as bearings have no self-healing ability during their degradation processes; in other words, the degradation of a bearing throughout its life cycle is monotonic. To achieve an accurate estimation of the bearing RUL, vibration signal-based features should therefore be selected according to their monotonicity [15].

An existing monotonicity estimation metric \({M}_{i}\), defined in Eq. (1) [16], can be used to evaluate the degree of monotonicity of the \({i}^{th}\) feature.

$${M}_{i}=\left|\sum_{k=1}^{K-1}\frac{sgn({v}_{\left(k+1\right),i}-{v}_{k,i})}{K-1}\right|$$
(1)

where \(v\) is a data point of \({V}_{k}\) and \(i=1,2,\cdots ,I\) is the feature index. The function defined by Eq. (1) takes values in \(\left[0,1\right]\). It corresponds to the statistical average of the trend over each pair of adjacent data points of the selected feature.

The presence of noise and outliers inevitably degrades the quality of the data used for feature extraction. An example is provided in [15], where the monotonicity value is low when the function in Eq. (1) is directly applied, without smoothing, to the root mean square (RMS), a time-domain feature. In practice, it is therefore necessary to adequately smooth the data before assessing the monotonicity [16]. There is a variety of smoothing methods, such as Savitzky-Golay filtering [21], adaptive LOESS filtering [22], and spline smoothing [23]. Smoothing methods highlight the trend patterns by removing outliers from the data. Meanwhile, smoothing the same feature with different smoothing methods, or with the same method under different parameter settings, yields different monotonicity values from Eq. (1). Thus, the performance of the monotonicity metric for feature selection largely depends on the smoothing method selection and parameter setting.
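For concreteness, the following minimal Python sketch computes the metric of Eq. (1) on a noisy degradation-like feature, with and without Savitzky-Golay smoothing; the feature series, noise level, and filter parameters are illustrative choices, not taken from the cited works.

```python
import numpy as np
from scipy.signal import savgol_filter

def monotonicity(v):
    """Monotonicity metric of Eq. (1) for one feature series v of length K."""
    return abs(np.sum(np.sign(np.diff(v)))) / (len(v) - 1)

# Illustrative feature: a rising degradation trend corrupted by noise.
rng = np.random.default_rng(0)
v = np.linspace(0.0, 1.0, 500) + 0.2 * rng.standard_normal(500)

print(monotonicity(v))                        # low: noise hides the trend
print(monotonicity(savgol_filter(v, 51, 3)))  # higher after smoothing
print(monotonicity(savgol_filter(v, 11, 3)))  # another window, another value
```

The last two lines illustrate the dependence discussed above: the same feature and the same smoothing method yield different monotonicity values under different window settings.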

In summary, the existing monotonicity estimation metric has two drawbacks due to the smoothing implementation. First, smoothing exclusively captures the global trend rather than reflecting local variations; a significant part of the meaningful information may be removed in the process. Second, the monotonicity assessment for feature selection relies largely on the choice of smoothing method and its parameter setting. Hence, this work aims to devise a method for selecting meaningful features for accurate bearing RUL estimation based on a monotonicity assessment that requires no smoothing.

3 Feature selection for bearing RUL estimation

3.1 Distribution Monotonicity

To overcome the drawbacks of the monotonicity metric defined in Eq. (1), we suggest considering each data point in the feature monotonicity assessment. The sample data \(V\) denotes the features extracted from the vibration signal collected throughout the whole life cycle of one bearing; it is a multivariate time series of multiple features. Since the bearing degradation evolution is complex, the sample data \(V\) is assumed to follow a finite multivariate mixture distribution. Let \(f({V}_{k})\) denote the probability of \({V}_{k}\) given the multivariate mixture distribution of the features.

The new monotonicity assessment function, called distribution monotonicity, is denoted by \(DM\). \(DM\) is a vector of weighted feature trends rather than a simple statistical trend of the features. The elements of \(DM\) are defined in Eq. (2):

$${DM}_{i}=\left|\sum_{k=2}^{K}\frac{f({V}_{k})sgn({v}_{k,i}-{v}_{(k-1),i})}{K-1}\right|$$
(2)

Thus, the impact term \(f({V}_{k})\), associated with each measurement, is introduced into the monotonicity analysis to characterize the impact of each data point on the monotonicity of a given feature. Hence, the distribution monotonicity defined in Eq. (2) analyzes not only the statistical monotonicity of the features but also the underlying distribution attribute of each data point. In other words, distribution monotonicity assesses the intrinsic monotonicity of the features.
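As a minimal sketch (assuming NumPy; the mixture density values \(f({V}_{k})\) are supplied as a precomputed vector, as derived later in Sect. 3.4), Eq. (2) can be implemented as:

```python
import numpy as np

def distribution_monotonicity(v, f):
    """Distribution monotonicity of Eq. (2) for one feature.

    v : (K,) feature series; f : (K,) mixture density f(V_k) at each sample.
    The trend sign of each step v_k - v_{k-1} is weighted by f(V_k)."""
    K = len(v)
    return abs(np.sum(f[1:] * np.sign(np.diff(v)))) / (K - 1)
```

Here `np.diff(v)` yields the differences \(v_{k}-v_{k-1}\) for \(k=2,\dots ,K\), which are aligned with the weights `f[1:]`, matching the summation range of Eq. (2).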

3.2 Multivariate mixture distribution analysis in the feature space

The multivariate mixture distribution is given by Eq. (3) with the constraint in Eq. (4); it is a weighted mixture of multivariate normal distributions [24]. Each single component distribution, defined in Eq. (5), is associated with a single regime component in the sample space [24]; in the bearing feature space, each component distribution can thus be associated with a single degradation stage. In addition, the probability of membership is defined in Eq. (6) using the Bayesian rule [24].

$$f({V}_{k})={\sum }_{j=1}^{J}{w}_{j}{g}_{j}({V}_{k};{C}_{j},{H}_{j})$$
(3)
$${\sum }_{j=1}^{J}{w}_{j}=1$$
(4)
$${g}_{j}\left({V}_{k};{C}_{j},{H}_{j}\right)=\frac{1}{{(2\pi )}^{I/2}{\left|{H}_{j}\right|}^{1/2}}\,{e}^{-\frac{1}{2}\left({V}_{k}-{C}_{j}\right){H}_{j}^{-1}{\left({V}_{k}-{C}_{j}\right)}^{T}}$$
(5)
$$P\left({\theta }_{{R}_{j}}|{V}_{k}\right)=\frac{P\left({\theta }_{{R}_{j}}\right)P\left({V}_{k}|{\theta }_{{R}_{j}}\right)}{P({V}_{k})}=\frac{{w}_{j}{g}_{j}\left({V}_{k};{C}_{j},{H}_{j}\right)}{f({V}_{k})}$$
(6)

where \({H}_{j}\) and \({C}_{j}\) are, respectively, the covariance matrix and the mean vector associated with the \({j}^{th}\) regime \({\theta }_{{R}_{j}}\).
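For illustration, a minimal sketch of the component density of Eq. (5) and the mixture of Eq. (3), assuming diagonal covariance matrices as adopted later in Sect. 3.4 (so that \(|{H}_{j}|\) is the product of the diagonal entries):

```python
import numpy as np

def component_density(V, C, H_diag):
    """Eq. (5) with a diagonal covariance: V and C are (I,) vectors,
    H_diag holds the diagonal of H_j, so |H_j| = prod(H_diag)."""
    I = len(V)
    quad = np.sum((V - C) ** 2 / H_diag)              # (V-C) H^{-1} (V-C)^T
    norm = (2.0 * np.pi) ** (I / 2) * np.sqrt(np.prod(H_diag))
    return np.exp(-0.5 * quad) / norm

def mixture_density(V, w, C, H_diag):
    """Eq. (3): w is (J,), C and H_diag are (J, I) arrays."""
    return sum(w[j] * component_density(V, C[j], H_diag[j])
               for j in range(len(w)))
```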

From the definition in Eq. (3), \(f({V}_{k})\) can be considered as a function of the parameters \({w}_{j}\), \({H}_{j}\), and \({C}_{j}\). Hence, the maximum-likelihood estimation problem of the parameter \({w}_{j}\) can be formulated with the Lagrange multiplier method as in Eq. (7) [24].

$${\widehat{w}}_{j}=\underset{{w}_{j}}{\mathrm{argmax}}\left({\sum }_{k=1}^{K}\mathrm{ln}\,f\left({V}_{k}\right)-\lambda \left({\sum }_{j=1}^{J}{w}_{j}-1\right)\right)=\underset{{w}_{j}}{\mathrm{argmax}}\left({\sum }_{k=1}^{K}\mathrm{ln}{\sum }_{j=1}^{J}{w}_{j}{g}_{j}({V}_{k};{C}_{j},{H}_{j})-\lambda \left({\sum }_{j=1}^{J}{w}_{j}-1\right)\right)$$
(7)

where \(\lambda\) is the Lagrange multiplier.

The expression in the brackets of Eq. (7) is concave in \({w}_{j}\), since \(f({V}_{k})\) is linear in the weights of the mixture of multivariate normal distributions and the logarithm is concave and nondecreasing. Hence, Eq. (8) is satisfied when \({w}_{j}\) takes its maximum-likelihood value \({\widehat{w}}_{j}\). Based on the derivation in [24], applying Eq. (6) to Eq. (8) yields \({\widehat{w}}_{j}\) as defined in Eq. (9), the maximum-likelihood estimate of the parameter \({w}_{j}\).

$$\frac{\partial ({\sum }_{k=1}^{K}\mathit{ln}{\sum }_{j=1}^{J}{w}_{j}{g}_{j}({V}_{k};{C}_{j},{H}_{j})-\lambda ({\sum }_{j=1}^{J}{w}_{j}-1))}{\partial {w}_{j}}=0$$
(8)
$${\widehat{w}}_{j}=\frac{1}{K}{\sum }_{k=1}^{K}P({\theta }_{{R}_{j}}\left|{V}_{k}\right.)$$
(9)

3.3 Probability of membership estimation

To obtain an estimate of \(P({\theta }_{{R}_{j}}\left|{V}_{k}\right.)\) in Eq. (9), the subtractive clustering method is applied to a \(K\times (I+1)\) matrix [25], which collects all the input and output data samples. All parameter settings of the subtractive clustering follow the work in [20].

Through subtractive clustering, the centroid of the \({j}^{th}\) cluster in the input space is obtained as in Eq. (10). The degree of membership \({\mu }_{j}({V}_{k})\) is also obtained; it measures the potential of the data point \({V}_{k}\) to belong to the \({j}^{th}\) cluster of the sample space.

$${C}_{j}={\left[{c}_{j,1}\;\;{c}_{j,2}\;\;\cdots \;\;{c}_{j,I}\right]}^{T}$$
(10)

where \(j=1,2,\dots ,J\) is the cluster index.

By definition in subtractive clustering, the degree of membership \({\mu }_{j}({V}_{k})\) satisfies the following conditions:

$${\mu }_{j}({V}_{k})\in [0,1],\;\forall j,k;\quad {\sum }_{j=1}^{J}{\mu }_{j}\left({V}_{k}\right)=1,\;\forall k;\quad 0<{\sum }_{k=1}^{K}{\mu }_{j}({V}_{k})<K,\;\forall j.$$

By comparison with the definition of the probability of membership in Eq. (6), \({\mu }_{j}({V}_{k})\) can be considered as an estimate of the probability of membership \(P({\theta }_{{R}_{j}}\left|{V}_{k}\right.)\). Hence, Eq. (9) can be rewritten as Eq. (11).

$${\widehat{w}}_{j}=\frac{1}{K}{\sum }_{k=1}^{K}\widehat{P}\left({\theta }_{{R}_{j}}\left|{V}_{k}\right.\right)=\frac{1}{K}{\sum }_{k=1}^{K}{\mu }_{j}({V}_{k})$$
(11)
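A minimal sketch of Eq. (11), assuming the membership matrix `mu`, of shape (K, J), has been produced by a clustering step such as subtractive clustering (the clustering itself is not reproduced here):

```python
import numpy as np

def estimate_weights(mu):
    """Eq. (11): w_hat[j] = (1/K) * sum_k mu_j(V_k), with mu of shape (K, J)."""
    w = mu.mean(axis=0)
    assert np.isclose(w.sum(), 1.0)  # follows from sum_j mu_j(V_k) = 1 for all k
    return w
```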

3.4 Estimation procedure of the function \(f({V}_{k})\)

The following steps describe the procedure for estimating the impact term \(f({V}_{k})\); a code sketch assembling these steps is given after Step 4.

Step 1. The centroid vector \({C}_{j}\) is obtained through subtractive clustering.

Step 2. The covariance matrix \({H}_{j}\) is formed as in Eq. (12) following the setting in [20].

$${H}_{j}=\left[\begin{array}{cccc}{\sigma }_{1}^{2}& 0& \cdots & 0\\ 0& {\sigma }_{2}^{2}& \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0& 0& \cdots & {\sigma }_{I}^{2}\end{array}\right]$$
(12)

Note that feature correlations are, in fact, not totally negligible. However, this section aims to estimate the impact corresponding to each data point of every single feature (variable); thus, the correlations between features are ignored, and the covariance matrix \({H}_{j}\) is defined as a diagonal matrix.

Step 3. The parameter \({w}_{j}\) is obtained using Eq. (11).

Step 4. Substitute the previous parameters into Eq. (5) to obtain \({g}_{j}({V}_{k};{C}_{j},{H}_{j})\). Then, substitute \({g}_{j}({V}_{k};{C}_{j},{H}_{j})\) into Eq. (3) to obtain \(f({V}_{k})\).
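Putting Steps 1–4 together, a minimal sketch is given below; it reuses `mixture_density` from the sketch in Sect. 3.2. The centroids `C` and memberships `mu` are assumed to come from a subtractive clustering step, and the construction of the diagonal entries of \({H}_{j}\) (e.g., as per-feature variances) follows the setting in [20], which is not reproduced here; both are assumptions of this sketch, not the paper's exact implementation.

```python
import numpy as np

def impact_term(V_all, C, mu, H_diag):
    """f(V_k) for every sample, following Steps 1-4.

    V_all : (K, I) feature samples; C : (J, I) centroids from clustering (Step 1);
    H_diag : (J, I) diagonal covariances (Step 2); mu : (K, J) memberships."""
    w = mu.mean(axis=0)                                 # Step 3, Eq. (11)
    return np.array([mixture_density(V, w, C, H_diag)  # Step 4, Eqs. (3) and (5)
                     for V in V_all])
```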

3.5 The feature selection algorithm

Step 1: Extract features from the vibration signals collected over the training bearings' lifespans.

Step 2: Calculate the distribution monotonicity value of each feature for every training bearing.

Step 3: Calculate the average value of the distribution monotonicity of the same feature across all training bearings.

Step 4: Sort the features in decreasing order of their average distribution monotonicity.

Step 5: Select the first sorted features to feed the RUL estimation model. (The number of selected features is fixed according to model complexity requirements, since model complexity increases with the number of input features.) A sketch of this procedure is given below.
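A minimal sketch of Steps 1–5, assuming the per-bearing feature matrices and impact terms have already been computed (Step 1 and Sect. 3.4), and reusing `distribution_monotonicity` from the sketch in Sect. 3.1:

```python
import numpy as np

def select_features(feature_sets, impact_terms, n_selected):
    """Steps 2-5: rank features by average distribution monotonicity.

    feature_sets : list of (K_b, I) feature matrices, one per training bearing;
    impact_terms : list of (K_b,) vectors f(V_k) for the same bearings."""
    I = feature_sets[0].shape[1]
    dm = np.zeros((len(feature_sets), I))
    for b, (V, f) in enumerate(zip(feature_sets, impact_terms)):
        for i in range(I):                                 # Step 2
            dm[b, i] = distribution_monotonicity(V[:, i], f)
    avg_dm = dm.mean(axis=0)                               # Step 3
    ranking = np.argsort(avg_dm)[::-1]                     # Step 4
    return ranking[:n_selected]                            # Step 5
```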

4 Case study

4.1 Benchmark datasets

We used the IEEE PHM 2012 benchmark dataset (bearing type: ball bearing) to assess the effectiveness of the distribution monotonicity proposed in this paper for bearing health prognosis. The IEEE PHM 2012 dataset was collected on the PRONOSTIA testbed [26] under three different load conditions with the same type of bearings. No information about failure modes and no fixed failure threshold are provided. In this paper, two ball bearings with the tags (1–1, 1–2) were used as training bearings under load condition 1, and one ball bearing (1–3) was used as a test bearing under the same load condition.

Condition 1 operates at 1800 rpm with a 4000 N radial load. Vibration signal data is collected over the entire run-to-failure period of each bearing. The sampling rate is 25.6 kHz, and each recording lasts 0.1 s, with 10 s between recordings. Two accelerometers collect the vibration signals in the vertical and horizontal directions. For simplicity, we consider only the vibration signals in the horizontal direction.
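Each recording thus contains 25.6 kHz × 0.1 s = 2560 samples per accelerometer channel.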

4.2 Feature selection

In this case study, we selected features with few or no parameters to reduce the influence of parameter selection during the feature extraction process. Table 1 displays the 24 features extracted from the vibration signals collected for each training bearing over the entire run-to-failure period (an illustrative extraction sketch is given after Table 1). Following the procedure introduced in Sect. 3, we applied the monotonicity metric with smoothing preprocessing and the distribution monotonicity without smoothing preprocessing, respectively, to implement the feature selection. Owing to its advantages in processing evenly spaced data, the Savitzky-Golay filter is adopted for the smoothing preprocessing [27].

Table 1 The list of features
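Table 1 itself lists the 24 features used. Purely as an illustration, a few common low-parameter time-domain features (not necessarily those of Table 1) can be computed from one vibration record as follows:

```python
import numpy as np

def time_domain_features(x):
    """A few common low-parameter time-domain features of one vibration record."""
    rms = np.sqrt(np.mean(x ** 2))                           # root mean square
    peak = np.max(np.abs(x))                                 # peak amplitude
    kurtosis = np.mean((x - x.mean()) ** 4) / np.var(x) ** 2 # impulsiveness
    crest_factor = peak / rms                                # peak-to-RMS ratio
    return np.array([rms, peak, kurtosis, crest_factor])
```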

4.3 Assessment

Methods based on clustering algorithms, on the one hand, and on neural networks and metaheuristics, on the other hand, form the two main categories of identification methods implemented in the T-S FIS [33]. The identification method consisting of subtractive clustering and least square estimation (FSC-LSE) is representative of the first category [25]. The method composed of least square estimation and backpropagation gradient descent (BPGD-LSE) belongs to the second category [34]. For convenience, in the sequel, we denote the FSC-LSE-based method by A and the BPGD-LSE-based method by B.

In this paper, the effectiveness of the distribution monotonicity for bearing health prognosis is evaluated with FIS prognosis models identified by these two identification methods (i.e., methods A and B). The parameter settings involved in methods A and B follow the work in [20]. Note that, to avoid the influence of the smoothing method choice and its parameter selection, the selected features fed into the models are used without smoothing preprocessing.

We use the relative root mean square error (RRMSE), Eq. (13), and the average RRMSE (ARRMSE), Eq. (14), to evaluate the models' performance.

$$RRMS{E}_{s}=\sqrt{\frac{1}{K}{\sum }_{k=1}^{K}{\left(\frac{{\rho }_{k}-{\widehat{\rho }}_{k}}{{\rho }_{k}}\right)}^{2}}$$
(13)
$$ARRMSE=\frac{1}{S}\sum_{s=1}^{S}RRMS{E}_{s}$$
(14)

where \(s=1,2,\dots ,S\) indexes the number of selected features. The features are selected from the whole feature set in descending order of their monotonicity assessment value.
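A minimal sketch of Eqs. (13) and (14), assuming `rho` and `rho_hat` hold the true and estimated past useful life ratios of one configuration:

```python
import numpy as np

def rrmse(rho, rho_hat):
    """Eq. (13): relative RMSE between true and estimated past useful life ratios."""
    return np.sqrt(np.mean(((rho - rho_hat) / rho) ** 2))

def arrmse(rrmse_values):
    """Eq. (14): average of the RRMSE values over the S configurations."""
    return np.mean(rrmse_values)
```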

4.4 Numerical results

Based on the training bearings of the IEEE PHM 2012 dataset, the features sorted by monotonicity are shown in Fig. 2, and the features sorted by distribution monotonicity are shown in Fig. 3.

Fig. 2 Features sorted by monotonicity (related to the IEEE PHM 2012 dataset)

Fig. 3 Features sorted by distribution monotonicity (related to the IEEE PHM 2012 dataset)

Tables 2 and 3 and Figs. 4 and 5 show the health prognosis results for the test bearing based on different numbers of features, selected using the monotonicity and the distribution monotonicity, respectively.

Table 2 Results with method A
Table 3 Results with method B

Fig. 4 The results based on method A (related to the IEEE PHM 2012 dataset)

Fig. 5 The results based on method B (related to the IEEE PHM 2012 dataset)

Based on the results shown in Tables 2 and 3 and Figs. 4 and 5:

  1. When the number of features exceeds 11, the RUL prediction performance fluctuates wildly. Therefore, we suggest using the first 5 to 11 features for bearing monitoring.

  2. Compared with using the monotonicity, fewer top-ranked features are needed to obtain results with acceptable accuracy when using the distribution monotonicity.

  3. Compared with using the monotonicity, the ARRMSE values obtained with the same type of model are smaller when using the distribution monotonicity.

  4. For a model identified by a particular method, increasing the number of input features does not always increase its accuracy. A plausible explanation is that some input features contain inaccurate information, causing a drop in model accuracy.

5 Conclusions

Feature selection for bearing degradation monitoring is a difficult but highly important preliminary step. The performance of the existing monotonicity-based feature selection approach for bearing RUL estimation relies heavily on the data smoothing preprocessing. However, a significant part of the meaningful information may be removed during smoothing. Moreover, the choice among the various smoothing methods and their parameter settings leads to uncertainty in the performance of the existing monotonicity-based feature selection approach.

To avoid these drawbacks, we proposed a mixture distribution analysis-based metric for feature selection for bearing RUL estimation, called distribution monotonicity. This metric uses the mixture distribution probability associated with each measurement to characterize the impact of each data point on the monotonicity of a given feature. Hence, the proposed metric analyzes not only the statistical monotonicity of the features but also the underlying distribution attribute of each data point; in other words, it assesses the intrinsic monotonicity of the features. Moreover, it requires no smoothing. Further, a distribution monotonicity-based feature selection approach was also proposed.

The performance of the proposed metric and feature selection approach was assessed using benchmark data from the literature. Compared with using the monotonicity with smoothing preprocessing, fewer features are needed to obtain results with acceptable accuracy when using the distribution monotonicity. In addition, the ARRMSE values obtained with the same type of model are smaller when using the distribution monotonicity.

The numerical results show that the mixture distribution analysis-based monotonicity evaluation, namely the distribution monotonicity, and the proposed feature selection approach are promising for feature selection in bearing RUL estimation.

Based on the results obtained, our future work will investigate an appropriate method for determining the number of selected features, toward a fully automatic feature selection approach.