1 Introduction

The detection of various events, and consequently the prevention of many dangers, can be greatly enhanced using information obtained from sound signals, one of the most important aids humans use to understand their surroundings. Vision influences many human reactions; however, sounds are often used to announce alerts or to make quick decisions, especially in everyday situations. The advantage of a sound signal is that it is not limited to the direct line of sight, a characteristic that makes it superior to an image in some situations. In many cases, by hearing sounds without seeing the event directly, one can infer details or the nature of the event and take the necessary action [1].

The emergence of various advances in hardware, software, and Machine Learning (ML) algorithms, mainly Deep Neural Networks (DNNs), has enabled researchers to tackle complex tasks. The ability of DNNs to address various complex applications has displaced many conventional ML methods in areas including image, natural language, and sound processing and analysis. While DNNs typically require substantial computational resources for tasks such as image and video processing, the demand is comparatively lower for one-dimensional (1D) data. Working with 1D audio data, or the corresponding 1D features, simplifies the computational requirements compared to the intricate calculations required for processing images and videos. Consequently, the computational burden is less prominent in 1D Sound Event Detection (SED) systems [2].

A fundamental aspect of sound processing and analysis is the detection of sound events, which has several applications, such as in security, medicine, and the monitoring of urban events, and can be used together with information acquired by security and traffic cameras to increase detection accuracy and coverage. For example, in security systems, namely in situations where a camera cannot fully capture the scene of an event, sound signals can be used in parallel to increase the accuracy and efficiency of the event detection system [3, 4]. In most sound event recognition systems based on Deep Learning (DL), researchers have attempted to improve efficiency and accuracy by using standard sound features and modifying the structure of the DL network. Few studies have focused on extracting useful sound features to optimize the performance of DL-based SED systems. However, many different features can be extracted from sound signals, usually requiring less computation than features extracted from images; on the other hand, sound features are inherently less stable over time and more sensitive to noise than image features. Therefore, it is worthwhile to increase the efficiency, speed, and accuracy of a DL-based system by using a new feature extraction pattern particularly suited to detecting sound events. In most SED systems, Mel coefficients and standard time-frequency features such as wavelets have been used for feature extraction. This study instead uses Empirical Mode Decomposition (EMD) and the resulting Intrinsic Mode Functions (IMFs) for feature extraction and confirms their effectiveness in the proposed SED method. The proposed approach reduces the number of features required to detect a sound event while maintaining the system's performance. The main contributions and advantages of the proposed method can be summarized as follows:

  1. Compared to conventional methods, the number of features required to detect a sound event is reduced;

  2. The EMD method shows more robust characteristics against noise and distortion than other sound features;

  3. The ability to detect multiple sound events simultaneously demonstrates the power of the EMD method in SED systems.

The article is organized as follows: the next section gives an overview of the related state of the art, including the advantages and disadvantages of current approaches; the third section describes the proposed method; the fourth section presents the results of the proposed method and compares them against those of other methods; finally, the findings of the current study are summarized and future work is suggested.

2 Literature review

The approaches usually proposed in this area are based on multiclass classifiers because SED is considered a multiclass classification problem. In the field of feature extraction, the most commonly used features have been Mel-based features such as Log-Mel [5, 6], Log-Mel Power Spectrograms (LMS) [7, 8], and Mel Frequency Cepstral Coefficients (MFCC) [9,10,11]. In addition to MFCC, features such as linear predictive coding [12], discrete cosine transforms [13, 14], wavelets [9, 15], Perceptual Linear Prediction (PLP) [16], Linear Prediction Cepstral Coefficients (LPCC) [17], and Line Spectral Frequencies (LSF) [18] have been used in various studies for SED. MFCC has been used as a standard feature in a wide range of acoustic and sound-based machine-learning methods, for example, in voice disorder detection [19], emotion recognition [20,21,22], singing voice separation [23], fault detection using acoustic and sound data [24, 25], leak detection [26], and tree-cutting event detection [27].

Several studies and experiments have shown that IMFs extracted from sound signals using EMD yield good results. For example, Pandya et al. [28] used sound signals in conjunction with IMF features and the K-nearest neighbor classifier to detect problems in ball bearings.

Table 1 Deep-learning approaches that have been used in SED systems

Amarnath et al. [56] used IMFs to detect faults in a helical gearbox using sound and vibration signals and obtained good results. Zahra et al. [57] used Multivariate Empirical Mode Decomposition to detect seizures from medical electroencephalogram signals with an Artificial Neural Network (ANN) classifier. Bagherzadeh [58] used IMFs to predict the sound signal envelope. Cheema and Singh [59, 60] used EMD to capture the nonlinear dynamics of phonocardiogram signals to detect stress. Yao et al. [61] applied EMD to extract features from sounds and detect faults in a planetary gearbox using the Random Forest (RF) classifier. Ning et al. [62] relied on EMD to extract sound features and detect gas pipe leakage from sound data with an RF classifier. Erdogan and Narin [63] used the cough signal, EMD, and a deep neural network to diagnose COVID-19. Vican et al. [64] detected the pulse in the fetal phonocardiography signal using EMD. Therefore, the EMD method has been successfully used for sound feature extraction in medicine and industry and has proven its efficiency. In most recent SED studies, DL algorithms have shown superior classification accuracy compared to conventional methods. Hence, conventional DL algorithms and Convolutional Neural Networks (CNNs) have frequently been used, both individually and combined. For example, CNNs were used in [33,34,35, 42,43,44, 53]. ResNet was used as a CNN with some modifications in [32, 42, 55]. A Recurrent Neural Network (RNN) combined with a CNN, called a CRNN, was considered for SED in several works, such as those presented in [29, 37,38,39,40, 42, 45,46,47,48,49,50,51, 54].

Meng et al. [5] used a bidirectional gated recurrent unit (BGRU), an RNN variant, for sound event detection. Politis et al. [65] analyzed the classifiers used in Sound Event Localization and Detection in the DCASE 2019 Challenge and concluded that most were CRNNs. Some studies, such as [66], have also used Generative Adversarial Networks (GANs). A limited number of researchers have used hidden Markov models [67], regular neural networks [68], support vector machines [69], cross-correlation [70], and ensemble learning [71], which is far fewer than the number of researchers that used DL algorithms. Table 1 identifies state-of-the-art DL-based approaches used in SED systems.

As a summary, the advantages of previous methods are:

  1. Widespread Use: MFCCs and spectrogram-based features have been widely used in SED systems for their simplicity and effectiveness;

  2. Interpretability: Those features are easily interpretable by humans, aiding in understanding the characteristics of sound events;

  3. Established Performance: Due to their extensive usage, there are well-established performance metrics and benchmarks, making comparing different approaches easier.

The disadvantages of previous methods can be summarized as:

  1. High Dimensionality: Conventional features such as MFCCs can result in high-dimensional feature vectors, leading to increased computational complexity and memory requirements;

  2. Limited Discriminative Power: While effective for many applications, conventional features may lack the discriminative power needed to distinguish between subtle variations in sound events;

  3. Fixed Representations: Features like MFCCs provide fixed representations of sound, which may not capture the dynamic nature of certain events or adapt well to changing environments.

By addressing these limitations, the proposed method aims to overcome the challenges associated with conventional feature extraction techniques. By using IMFs and extracting locally based features, the proposed scheme reduces the feature dimensionality while maintaining or improving the efficiency and accuracy of SED systems. In the current study, an LSTM was used as the classifier.

3 Proposed method

This study developed two event detection approaches: a segment-based approach and an activity-based approach. In the segment-based event detection approach, a sound clip is cut into multiple fixed-size segments, and the system processes each segment individually. Since multiple events may occur simultaneously, a practical solution in a segment-based event detection system is to train a binary classifier for each event separately. This classifier indicates whether or not an event occurred in a segment. Activity-based event detection specifically detects the start and end of an event in a sound clip and can estimate the duration of the event.

Fig. 1 Block diagram of the proposed method for segment-based event detection based on DL

3.1 Segment-based event detection

The proposed segment-based event detection method, depicted in Figs. 1 and 2, includes two main parts: feature extraction and classification. IMFs are used for feature extraction, and Long Short-Term Memory (LSTM) or ensemble learning is used for classification.

Fig. 2 Schematic of the proposed method for segment-based event detection based on ensemble learning

3.2 Feature extraction

The input sound is divided into several time intervals in the feature extraction phase, depending on the selected approach. The intervals can be chosen with or without overlap, and should not be so long that the online capability of the method is compromised, nor so short that the feature extraction process cannot provide the desired result. Moreover, excessively short time intervals amplify the impact of noise. It is important to note that all feature extraction methods inherently require a minimum number of samples, which depends on the sampling frequency and feature type. A minimal sketch of this segmentation step is given below.
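The sketch below, assuming MATLAB's Signal Processing Toolbox, splits a clip into fixed-length segments; the input file name and the mono conversion are illustrative assumptions, while the non-overlapping 1-second configuration follows the choice made later in this study.

```matlab
% Minimal segmentation sketch: split a mono signal into fixed-length
% segments, with optional overlap (set to zero here, as in this study).
[x, fs] = audioread('clip.wav');        % hypothetical input file
x = mean(x, 2);                         % force mono
segLen  = 1 * fs;                       % 1-second segments
overlap = 0;                            % no overlap
segments = buffer(x, segLen, overlap);  % one segment per column
% The last column is zero-padded if the clip length is not an exact
% multiple of the segment length.
```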

Algorithm 1 EMD for IMF Extraction

3.3 IMF

Sound is considered a quasi-linear, nonstationary signal; hence, time-series methods are required to model its nonlinear and nonstationary behavior. Due to the wide application of time series in various fields, such as economics, medicine, and industry, many methods have been proposed to analyze such signals efficiently, including those based on the spectrogram, wavelet analysis, the Wigner-Ville distribution, the evolutionary spectrum, and principal component analysis. Mel coefficients, which were developed specifically based on the human auditory system and are very efficient for feature extraction from sound signals, have also been used. All of these methods attempt to identify and extract the inherent characteristics of the nonlinear and nonstationary sound signal, that is, characteristics that change little over time and depend on the desired output. However, most of these methods have problems with nonstationary and nonlinear signals, mainly:

  1. When a signal is nonstationary, its harmonic components are generally computed, which requires a large amount of data to extract the characteristics of the signal over time;

  2. Most of those methods require a linear system to obtain signal information, and in nonlinear systems a large amount of data is needed to model the nonlinear components.

In contrast, the EMD method produces a collection of IMFs that allows the system to extract instantaneous frequencies from the signal at different time scales.

IMFs are functions with well-behaved Hilbert transforms that can extract the instantaneous frequencies of a system over short periods and model the phenomenon under study on the time-frequency plane, even if it is transient. The main concept underlying IMFs is the instantaneous frequency, which differs from the time-independent frequency defined in most transforms, such as the Fourier transform. In the concept of instantaneous frequency, the frequency can vary with time, similar to frequency modulation. To understand the concept of IMF, one must first understand the Hilbert transform. The Hilbert transform of a signal X(t) is defined as:

$$\begin{aligned} Y(t) = \frac{1}{\pi }\,\mathrm {p.v.}\int _{ - \infty }^{\infty } \frac{X(t')}{t - t'}\,dt', \end{aligned}$$
(1)

where \(\mathrm {p.v.}\) is the Cauchy principal value. The Hilbert transform of X(t) is combined with the signal itself as a complex function, Z(t):

$$\begin{aligned} Z(t) = X(t) + iY(t) = \alpha (t){e^{i\theta (t)}}, \end{aligned}$$
(2)

where \(\alpha \) and \(\theta \) are the absolute value and argument of the polar form of the complex function, Z(t). Based on \(\theta \), the instantaneous frequency of the signal X(t) can be defined as:

$$\begin{aligned} \omega = \frac{{d\theta }}{{dt}}. \end{aligned}$$
(3)
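For intuition, the quantities in Eqs. (1) to (3) can be approximated numerically; the sketch below assumes MATLAB's hilbert and chirp functions, with the frequency sweep serving as an illustrative test signal.

```matlab
% Sketch: instantaneous frequency via the analytic signal, Eqs. (1)-(3).
fs = 1000;                          % sampling frequency in Hz
t  = (0:1/fs:1)';
x  = chirp(t, 50, 1, 200);          % test signal: 50 Hz -> 200 Hz sweep
z  = hilbert(x);                    % analytic signal Z(t) = X(t) + iY(t)
theta = unwrap(angle(z));           % instantaneous phase theta(t)
omega = diff(theta) * fs / (2*pi);  % Eq. (3), expressed in Hz
```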

Even with the above definition, the instantaneous frequency remains ambiguous because calculating the Hilbert transform requires an infinite number of samples. To address this, EMD defines an alternative class of functions, the IMFs, for which the instantaneous frequency can be defined locally. Over the entire period, the number of extrema and the number of zero crossings of an IMF must be equal or differ by at most 1 (one). After applying EMD and extracting the IMFs, one obtains:

$$\begin{aligned} X(t) = \sum \limits _{i = 1}^{N} IMF_i(t) + r(t), \end{aligned}$$
(4)

where N is the number of IMFs and r(t) is the residual signal that remains after the decomposition. In most cases, the residual signal is monotonic and has low amplitude. The pseudocode for EMD is given in Algorithm 1.

In Algorithm 1, \(\gamma \) is assumed to be 0.2, as suggested by the Matlab implementation used. The maximum number of IMFs in this step is set to 10, and any remaining IMFs in the signal are discarded. After extracting the IMFs, each IMF's energy and average frequency are extracted as the final features. The energy is calculated as the sum of the squared amplitudes of each IMF. The average frequency is defined by [72] as:

$$\begin{aligned} \text {Average Frequency} = \frac{\sum \limits _{j = 1}^M f_j P_j}{\sum \limits _{j = 1}^M P_j}, \end{aligned}$$
(5)

where \(P_j\) is the power of the signal at frequency \(f_j\). If a signal has fewer than 10 IMFs, the energy and average frequency of the missing IMFs are set to 0 (zero). If a signal has more than 10 IMFs, the first 10 are used for feature extraction and the rest are discarded. A sketch of this feature extraction step is given below.
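The following MATLAB sketch extracts the 20 features of one segment, assuming the emd and meanfreq functions of the Signal Processing Toolbox; segment and fs denote a 1-second segment and its sampling frequency, all variable names are ours, and the EMD settings mirror the parameters listed in Sect. 4.3.

```matlab
% Sketch of the per-segment feature extraction (MATLAB Signal
% Processing Toolbox assumed).
maxIMF = 10;
imfs = emd(segment, ...
    'SiftRelativeTolerance', 0.2, ...   % Cauchy criterion (gamma)
    'SiftMaxIterations', 100, ...
    'MaxNumIMF', maxIMF, ...
    'MaxNumExtrema', 1, ...
    'MaxEnergyRatio', 20, ...
    'Interpolation', 'spline');
energy  = zeros(1, maxIMF);             % missing IMFs contribute zeros
avgFreq = zeros(1, maxIMF);
for k = 1:min(size(imfs, 2), maxIMF)
    energy(k)  = sum(imfs(:, k).^2);        % sum of squared amplitudes
    avgFreq(k) = meanfreq(imfs(:, k), fs);  % power-weighted mean, Eq. (5)
end
features = [energy, avgFreq];           % the 20 features of this segment
```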

3.4 Deep Learning

RNNs are particularly suited to processing sequential data, where subsequent samples depend on previous ones. In traditional RNNs, due to their simple structure and the limited recurrent coefficients in the hidden layers, the weights are updated using the gradient relationship, which fails for long time series. This limitation in retaining and understanding long-term patterns is a weakness of RNNs.

Several techniques have been proposed to address this weakness of RNNs, including non-gradient-based training patterns such as simulated annealing and discrete error propagation [73, 74], explicitly introduced time delays [75,76,77] or time constants [78], and hierarchical sequence compression [79], each having its own limitations and advantages.

Fig. 3 Reduction of the effect of distant samples in updating the hidden layer of an RNN

3.4.1 LSTM

LSTM belongs to the family of modified RNN architectures [80]. This DL model is considered the most effective for simultaneously capturing long-term and short-term patterns. Considering the importance of processing time series and video data, whose results depend on current and past data, LSTM is one of the most widely used Neural Network (NN) models. Its structure is a modified RNN designed for long-term data retention. In a typical RNN, as the data are updated, the influence of more distant samples decreases compared to closer samples until it eventually becomes almost 0 (zero). Figure 3 depicts the reduction of the effect of distant samples in updating the hidden layer of an RNN, which indicates the small impact of distant samples and their ineffectiveness over time.

Fig. 4 A typical RNN structure

Fig. 5 The usual LSTM network

Figure 4 shows a typical RNN, where the output depends on the previous states and the new input. Figure 5 shows the LSTM structure, whose cell is more complex than that of a simple RNN. In Fig. 5, \(\sigma \) denotes the activation function. The cell's input and output activation functions (\({\sigma _g}\) and \({\sigma _h}\)) are usually hyperbolic tangent (tanh) or logistic sigmoid functions, although in some cases \({\sigma _h}\) is the identity function. Dashed lines represent weighted 'peephole' connections. In peephole connections, in addition to the input and previous internal states, the hidden states are also used to control the input, output, and forget gate activations, which increases the degrees of freedom and capabilities of the LSTM. The forget gate determines which inputs and previous states affect the output and which should be ignored. The presence of the forget gate allows the cell to learn long-term patterns.

Fig. 6 Simplified LSTM network used in the current study

Equations (6) to (11) show the effect of the blocks depicted in Fig. 5 on the LSTM output, where \(x_t\) is the input, \(h_{t-1}\) the previous hidden state, \(c_{t-1}\) the previous cell state, W the weights, b the bias of each part, \(\sigma \) the activation functions, \(\odot \) the Hadamard product, and \(h_t\) the output [81]:

$$\begin{aligned} i_t = \sigma _f\left( W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \odot c_{t-1} + b_i \right) , \end{aligned}$$
(6)
$$\begin{aligned} f_t = \sigma _f\left( W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \odot c_{t-1} + b_f \right) , \end{aligned}$$
(7)
$$\begin{aligned} g_t = \sigma _g\left( W_{xc} x_t + W_{hc} h_{t-1} + b_c \right) , \end{aligned}$$
(8)
$$\begin{aligned} c_t = f_t \odot c_{t-1} + i_t \odot g_t , \end{aligned}$$
(9)
$$\begin{aligned} o_t = \sigma _f\left( W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \odot c_t + b_o \right) , \end{aligned}$$
(10)
$$\begin{aligned} h_t = o_t \odot \sigma _h\left( c_t \right) . \end{aligned}$$
(11)

To the best of our knowledge, most SED methods have used simplified LSTMs [31, 52], while peephole LSTMs have predominantly been used in other domains [82,83,84]. Hence, the proposed method uses the simplified LSTM shown in Fig. 6, for which the previous equations become:

$$\begin{aligned} i_t = \sigma _f\left( W_{xi} x_t + W_{hi} h_{t-1} + b_i \right) , \end{aligned}$$
(12)
$$\begin{aligned} f_t = \sigma _f\left( W_{xf} x_t + W_{hf} h_{t-1} + b_f \right) , \end{aligned}$$
(13)
$$\begin{aligned} g_t = \sigma _g\left( W_{xc} x_t + W_{hc} h_{t-1} + b_c \right) , \end{aligned}$$
(14)
$$\begin{aligned} c_t = f_t \odot c_{t-1} + g_t \odot i_t , \end{aligned}$$
(15)
$$\begin{aligned} o_t = \sigma _f\left( W_{xo} x_t + W_{ho} h_{t-1} + b_o \right) , \end{aligned}$$
(16)
$$\begin{aligned} h_t = o_t \odot \sigma _g\left( c_t \right) . \end{aligned}$$
(17)
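To make Eqs. (12) to (17) concrete, the following MATLAB sketch performs a single forward step of the simplified cell with randomly initialized weights; the sizes and the logistic/tanh choices for \(\sigma _f\) and \(\sigma _g\) are illustrative assumptions.

```matlab
% One forward step of the simplified LSTM cell, Eqs. (12)-(17).
nx = 20; nh = 200;                       % input features, hidden units
x  = randn(nx, 1); hPrev = zeros(nh, 1); cPrev = zeros(nh, 1);
W  = @(r, c) 0.1 * randn(r, c);          % hypothetical weight initializer
Wxi = W(nh, nx); Whi = W(nh, nh); bi = zeros(nh, 1);
Wxf = W(nh, nx); Whf = W(nh, nh); bf = zeros(nh, 1);
Wxc = W(nh, nx); Whc = W(nh, nh); bc = zeros(nh, 1);
Wxo = W(nh, nx); Who = W(nh, nh); bo = zeros(nh, 1);
sigm = @(z) 1 ./ (1 + exp(-z));          % logistic sigmoid

i_t = sigm(Wxi*x + Whi*hPrev + bi);      % Eq. (12): input gate
f_t = sigm(Wxf*x + Whf*hPrev + bf);      % Eq. (13): forget gate
g_t = tanh(Wxc*x + Whc*hPrev + bc);      % Eq. (14): candidate state
c_t = f_t .* cPrev + g_t .* i_t;         % Eq. (15): cell state update
o_t = sigm(Wxo*x + Who*hPrev + bo);      % Eq. (16): output gate
h_t = o_t .* tanh(c_t);                  % Eq. (17): hidden state/output
```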

3.4.2 Fully connected layers

The fully connected layer is equivalent to the hidden layer in typical NNs. This layer combines an affine function and a nonlinear activation function. The affine function is defined as \(y = Wx + b\). The nonlinear activation function can be chosen from a class such as sigmoid, tanh, or the rectified linear unit (ReLU). The fully connected layer in the proposed structure has no nonlinear activation and consists of a single affine function. The connections between the LSTM and middle layers, and between the middle and output layers, are usually made through this layer.

3.4.3 Softmax

The softmax layer, or softmax function, also known as softargmax or the normalized exponential function, maps the input vector to a set of numbers between 0 (zero) and 1 (one). This function provides a smooth, differentiable approximation of the maximum function. The outputs of this function form a probability distribution and necessarily sum to 1 (one). This layer is not trained and maps the input to the interval [0, 1]. The formula for the softmax function is:

$$\begin{aligned} y_i = \frac{e^{x_i}}{\sum \limits _{j = 1}^N e^{x_j}}, \end{aligned}$$
(18)

where \(x_i\) is the i-th element of the input vector, i.e., of the output of the fully connected layer, and \(y_i\) is the corresponding output.

3.4.4 Cross entropy loss

Cross entropy is used to maximize the accuracy of the entire classifier. The value of the cross entropy increases rapidly when the predicted probability deviates from the actual value; thus, minimizing the cross entropy is equivalent to bringing the predicted probability closer to the real value. A classifier trained using cross entropy is more accurate and effective than one trained using other optimization criteria, provided the last layer produces probability values. In information theory and pattern recognition, minimizing the cross entropy is equivalent to achieving maximum likelihood. Minimizing the cross entropy is also equivalent to minimizing the Kullback-Leibler divergence between the probability distribution of the real output and that of the classifier, corresponding to the maximum similarity between the ideal output and the classifier's output. A sketch of the full classifier stack is given below.
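The sketch below assembles the blocks of Sects. 3.4.1 to 3.4.4 using MATLAB's Deep Learning Toolbox; only the input size (20 IMF features) and the 200 hidden units are taken from the text, the rest are illustrative choices. Notably, the toolbox's lstmLayer implements the peephole-free cell, matching Eqs. (12) to (17).

```matlab
% Sketch of the segment-based DL classifier; classificationLayer applies
% the cross-entropy loss described above.
layers = [
    sequenceInputLayer(20)                 % 20 IMF features per segment
    lstmLayer(200, 'OutputMode', 'last')   % simplified LSTM, Eqs. (12)-(17)
    fullyConnectedLayer(2)                 % affine map y = Wx + b
    softmaxLayer                           % Eq. (18)
    classificationLayer];                  % cross-entropy loss
% net = trainNetwork(XTrain, YTrain, layers, options);  % options: Sect. 4.3
```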

3.4.5 Ensemble learning

In addition to DL, ensemble learning is used in the proposed method. Ensemble learning methods use a set of weak classifiers instead of a single classifier to improve efficiency. The parameters of an ensemble classifier are the number of weak classifiers, the type of classifiers, and whether they are similar or different. Bootstrap aggregation, i.e., bagging, performed better than other types of ensemble learning in our study. Finally, the results of all classifiers are combined, and the dominant class, i.e., the class selected by most classifiers, is chosen as the final output. The proposed method uses decision trees as the weak classifiers in the ensemble learning structure, as in the sketch below.
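A sketch of the bagging branch, assuming MATLAB's fitcensemble; the number of learning cycles is an illustrative assumption, as the text does not report it.

```matlab
% Bagged decision trees as the ensemble classifier.
mdl = fitcensemble(Xtrain, Ytrain, ...
    'Method', 'Bag', ...               % bootstrap aggregation (bagging)
    'NumLearningCycles', 100, ...      % illustrative number of trees
    'Learners', templateTree());       % decision trees as weak learners
yhat = predict(mdl, Xtest);            % majority vote over the trees
```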

3.5 Event Activity Detection System

A sound activity detection system determines the start and end of an event, or the duration of the event, in a sound clip. Due to the complexity of detecting the beginning and end of a sound event, and its structural differences from segment classification, this study considers changes in the extracted features instead of the approach normally used in conventional methods. The proposed sound activity detection method is divided into two steps:

  1. Detect if there is a sound event in the input clip;

  2. And, if there is, find the start and end of the event based on the change in the used features.

Since this is a hierarchical method, the efficiency of both steps, which involve detecting the sound occurrence and correctly labeling the start and end, directly impacts the method’s accuracy. The block diagram of the first step is depicted in Fig. 7.

Fig. 7 Block diagram of the first step of the sound activity detection method

The first step of the activity detection system is similar to that of the segment-based event detection system. In this case, the IMF features of the entire signal are obtained, so there are only 20 features for each clip \(x(t)\). According to the studies conducted in this case, an ANN is a better choice than DL and ensemble learning methods because of the small number of input features. The second step is initiated if the first step detects the occurrence of an event. Figure 8 depicts the block diagram of the second step's training phase.

Fig. 8 Block diagram of the training phase of the second step of the sound activity detection method

Only audio clips containing the selected events were used in the training phase. First, the input signal is split into 1-second segments without overlap. The end of the last segment before the event is taken as the beginning of the event, and the start of the first segment after the event is taken as its end. As in the segment-based approach, IMF features are extracted for each segment, giving 20 features per interval. Any feature that changes when a sound event starts or ends can be used to detect event activity. This is very challenging because background noise and other events occur in different segments and change the feature values without any meaningful relationship to the beginning and end of the selected event. To solve this problem, an averaging and regularization step is added to the proposed method to separate the effects of noise and other sound events from the selected events. In the regularization block, the change in each feature is measured relative to the overall signal by dividing the derivative value by the average value of the derivative over the entire signal. The regularization step removes the background noise effectively. In the averaging block, the pattern of the derivative vector is determined by averaging the absolute values of the regularized derivatives of the features at the beginning and end of the selected event, as sketched below.
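The following MATLAB sketch illustrates the regularization and averaging blocks for one training clip; F, iStart, iEnd, and all other names are our assumptions, with F holding the 20 features of each 1-second segment row-wise.

```matlab
% Sketch of the regularization and averaging steps for one clip.
% F is an (S x 20) matrix of per-segment features; iStart and iEnd index
% the derivative rows adjacent to the event boundaries.
D = abs(diff(F, 1, 1));                  % feature derivative over time
R = D ./ (mean(D, 1) + eps);             % regularize by clip-wide mean
pattern = (R(iStart, :) + R(iEnd, :)) / 2;  % average boundary responses
% Averaged over all training clips, features with a value > 1.3 are kept
% as the event's pattern; features near 1 did not react to the event.
keep = pattern > 1.3;
```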

Fig. 9 Block diagram of the final phase of the sound activity detection system

In some cases, another sound event coincides with the selected event, and the averaging removes the effects of these interfering sound events. When a feature does not change with the selected event, the average of its regularized value is approximately 1 (one). The threshold value used to select or discard a feature is 1.3: features whose averaged regularized value is greater than 1.3, together with their rate of change, i.e., the averaged regularized value itself, are stored as the pattern of the selected event. The activity detection phase is depicted in Fig. 9.

In the final phase, similar to the training phase, the signal is split into non-overlapping one-second segments. For each segment, 10 IMFs are calculated, and the average frequency and energy of each IMF are extracted as features. The derivative of the obtained features is computed in the time domain, followed by regularization, as in the training phase. The regularized features are selected based on the pattern stored in the training phase, and the correlation coefficient between the stored pattern and the regularized derivatives of all segments is calculated. The two maximum correlation coefficients are selected as the beginning and end of the event. If only one maximum is detected, it is considered the event's starting point, and the event is assumed to last until the end of the clip.
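A sketch of this matching step is given below, reusing R, pattern, and keep from the previous sketch; treating the two highest correlations as the event boundaries follows the text.

```matlab
% Sketch of the detection-phase matching: correlate each segment's
% regularized derivative with the stored pattern.
score = zeros(size(R, 1), 1);
for s = 1:size(R, 1)
    c = corrcoef(R(s, keep), pattern(keep));
    score(s) = c(1, 2);                % correlation coefficient
end
[~, order] = sort(score, 'descend');
bounds = sort(order(1:2));             % start and end segment indices
```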

4 Experimental settings

4.1 Dataset

The proposed method was tested on the URBAN-SED dataset, which is widely used in this field [85]. The URBAN-SED dataset contains 10,000 labeled samples of ten urban event classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music (Fig. 10). All samples in the dataset have the same length of 10 seconds. The dataset contains a total of 100,000 seconds (approximately 28 hours) of sound, with almost 50,000 tagged events. All sounds contain background and Brownian noise, which can be heard as the typical “hum” of most crowded urban environments.

The dataset was created using the Scaper soundscape synthesis and augmentation library. The event sounds were taken from the UrbanSound8K dataset, which consists entirely of real recordings, and the Scaper library added urban background sounds. The labeling was done automatically according to the time of the added event. The UrbanSound8K dataset contains 8732 urban events shorter than 4 seconds and is a modified version of the UrbanSound dataset, which contains 1302 samples totaling approximately 27 hours. To standardize the comparison between different methods, URBAN-SED is divided into three subsets, with 6000 samples in the training set and 2000 samples each in the validation and test sets.

4.2 Evaluation metrics

In both the segment-based sound event detection and sound activity detection approaches, the developed system was trained separately for each event. In the segment-based approach, a binary classifier determines whether an event has occurred in the segment. In the activity detection approach, each true label for the start and end of an event in a clip is treated as a true positive, making it possible to evaluate the system using binary classification metrics. The following outcomes were defined for the segment-based approach:

  1. True positive (TP): The sound event occurred and was correctly detected;

  2. True negative (TN): The sound event did not happen, and the non-event was correctly detected;

  3. False positive (FP): The system detected an event that did not occur;

  4. False negative (FN): The absence of an event was detected when the sound event occurred.

Fig. 10 Events included in the URBAN-SED dataset

In the sound activity detection approach, the evaluation parameters were defined as follows:

  1. True positive (TP): The beginning and end of the sound event were correctly detected;

  2. True negative (TN): The sound event did not happen, and the non-event was correctly detected;

  3. False positive (FP): The system detected an event that did not happen, or the event did occur but its beginning and end were incorrectly marked;

  4. False negative (FN): The absence of an event was detected when the sound event occurred.

It is essential to note the imbalance between the two classes: the presence or absence of sound events. After specifying TP, TN, FP, and FN, the Precision (PR), Recall (RE), F-score (F1), and Accuracy (ACC) can be calculated. The ideal value for all of these metrics is 1 (one). In the present study, the Segment F1 and Event F-scores [38, 86, 87] were used as primary assessment metrics, and precision, recall, and accuracy were used as secondary assessment metrics [38, 86].
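For reference, these metrics follow their standard definitions:

$$\begin{aligned} PR = \frac{TP}{TP + FP}, \quad RE = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot PR \cdot RE}{PR + RE}, \quad ACC = \frac{TP + TN}{TP + TN + FP + FN}. \end{aligned}$$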

4.3 Other parameters

Another important parameter of the proposed method is the segment length. In [42], the segment length was assumed to be 0.1 seconds, and in [55], 1 (one) second. A longer segment contains more background noise, and a shorter one carries less information valuable for event detection; the choice of segment length is therefore a trade-off between noise and information. In this study, the segment length was set to 1 (one) second. For IMF extraction, the parameters were selected as follows:

  1. The Cauchy-type convergence criterion (\(\gamma \) in the EMD pseudocode), which is one of the sifting stopping criteria, was set to 0.2;

  2. The maximum number of iterations, another sifting stopping criterion, was set to 100;

  3. The maximum number of IMFs, one of the decomposition stopping criteria, was chosen to be 10;

  4. The maximum number of extrema in the residual signal, another decomposition stopping criterion, was set to 1 (one);

  5. The ratio between signal and residual energy, i.e., the ratio between the energy of the signal at the beginning of the iteration and the average envelope energy, is another decomposition stopping criterion and was set to 20;

  6. Envelope construction is based on spline interpolation.

Table 2 Parameters of the ADAM method

For the RNN, the first parameter is the number of hidden units of the LSTM, which was set to 200. Two training methods, Adaptive Moment Estimation (ADAM) and Root Mean Square Propagation (RMSprop), were used to train the RNN, and their results were compared. The parameters of the ADAM method are listed in Table 2. For training on very large datasets, the ADAM method performs better than plain Gradient Descent, which suffers from a heavy computational load and from failing to reach the global minimum when there are many local minima. By simplifying the calculation of the learning rate for each parameter using the first and second moments of the gradient, the ADAM method reduces the computational volume and memory consumption of conventional stochastic gradient descent. RMSprop, on the other hand, is a modified version of gradient descent in which the step size for each parameter is adjusted using a decaying average of partial gradients. The decaying moving average allows the algorithm to discard early gradients and rely on the most recently observed partial gradients during the search.
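As a sketch, both solver configurations can be set up as follows with MATLAB's Deep Learning Toolbox; the decay factors shown are the toolbox defaults, used here as placeholders for the actual values listed in Table 2.

```matlab
% Sketch of the two solver configurations compared in this study.
optsAdam = trainingOptions('adam', ...
    'GradientDecayFactor', 0.9, ...          % first-moment decay
    'SquaredGradientDecayFactor', 0.999, ... % second-moment decay
    'Epsilon', 1e-8);
optsRms = trainingOptions('rmsprop', ...
    'SquaredGradientDecayFactor', 0.9);      % decaying gradient average
```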

5 Results and discussion

This section aims to analyze the impact of the different parameters on the proposed method’s efficiency and compare the final results with the ones obtained by related state-of-the-art methods. In the first step, the results of the segment-based event detection are reported, and the effects of data balancing and some other parameters on accuracy are discussed. In the second step, event-based results are reported and compared with the ones obtained by other methods.

Table 3 Number of training samples before and after data balancing
Table 4 F-score of proposed features using different classifiers on the URBAN-SED dataset in the segment-based approach

5.1 Segment-based event detection

In the first step, the proposed segment-based event detection method was tested with a segment length of 1 (one) second without overlap, using LSTM and ensemble learning. To show the efficiency and effectiveness of the proposed method, the results were compared with those obtained using Mel features such as MFCC and log-Mel, which are the most commonly used features in SED. URBAN-SED is an unbalanced dataset for all events, and this imbalance can significantly reduce classification efficiency. To solve this problem, similar negative segments were discarded using the correlation coefficient as a similarity measure, as sketched below. Table 3 presents the number of training samples in each class before and after the balancing process, assuming a segment length of 1 (one) second. Table 4 presents the acceptable F-score results of the proposed features, which indicate a strong correlation between the features extracted from IMFs and the events. In the worst case, which corresponds to the gunshot event, a significant imbalance between the positive and negative samples remains even after the balancing process (Table 3), which may explain the system's lower accuracy for this event.
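The sketch below illustrates the balancing step; applying the correlation to the 20-dimensional feature vectors and the 0.95 similarity threshold are our assumptions, as the text does not specify them.

```matlab
% Sketch: discard near-duplicate negative segments using the correlation
% coefficient as the similarity measure. Xneg is an (N x 20) matrix of
% negative-segment feature vectors.
keepIdx = true(size(Xneg, 1), 1);
for a = 1:size(Xneg, 1)
    if ~keepIdx(a), continue; end
    for b = a+1:size(Xneg, 1)
        if keepIdx(b)
            c = corrcoef(Xneg(a, :), Xneg(b, :));
            if c(1, 2) > 0.95          % too similar: drop the duplicate
                keepIdx(b) = false;
            end
        end
    end
end
Xneg = Xneg(keepIdx, :);
```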

Table 5 Number of clips with (P) and without (N) a sound event on the URBAN-SED dataset
Table 6 ANN training functions that were considered in the first part of the activity detection method

5.2 Sound activity detection

In this experiment, the training and test data were approximately balanced; therefore, a balancing process was unnecessary. Table 5 presents the number of clips of the two classes for each sound event separately. For this step, different ANN structures were tested, including Patternnet, Cascadeforwardnet, Feedforwardnet, and Fitnet, with Feedforwardnet being the best structure found. Various functions and values were considered when choosing the training function, the number of hidden layers, and the number of nodes in each layer. The studied training functions are listed in Table 6. Among them, trainbr and trainlm showed the best results (F-score); trainlm was chosen because of its shorter training time. According to the analysis performed, the structure with 1 (one) hidden layer and 10 nodes showed the best F-score, as in the sketch below. Table 7 presents the F-score for some of the studied situations.
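A minimal sketch of the chosen configuration, using MATLAB's feedforwardnet with one hidden layer of 10 nodes and the trainlm training function; the data variables are assumptions, with the 20 clip-level features of Sect. 3.5 as inputs.

```matlab
% Feed-forward ANN for the first activity-detection step.
net = feedforwardnet(10, 'trainlm');   % 1 hidden layer, 10 nodes
net = train(net, Xtrain', Ytrain');    % this API expects samples as columns
scores = net(Xtest');                  % outputs thresholded via the ROC
```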

Table 7 F-score values obtained for some of the investigated situations
Fig. 11 ROC curves of trained ANNs for different events

The results in Table 7 indicate that the feed-forward network with one hidden layer and 10 neurons is the best structure for this step. After training the feed-forward ANN with the above parameters, a Receiver Operating Characteristic (ROC) curve was used to choose the decision threshold, prioritizing the maximization of the F-score. Figure 11 shows the ROC curves for several events and the chosen threshold levels. The main finding depicted in Fig. 11, where the magnified area in the middle of the curve can be observed, is the divergent behavior of the events. Consequently, during the training phase, the threshold must be computed for each event independently. The chosen thresholds are indicated in the legend of Fig. 11, positioned at the top right corner.

Table 8 TN, FP, TP, and FN values of the proposed method after selecting the threshold based on the ROC for each event
Table 9 Average absolute value of regularized derivatives at the start and end of the considered events, with values exceeding the threshold level in bold

Table 8 presents the results of the proposed method after selecting the threshold based on the ROC for each event. When a sound event is detected in the first step, the proposed method passes it to the second step to determine the start and end points of the event. Therefore, any error in the first step propagates to the output, and reporting accuracy metrics for this step alone would be redundant, as it does not reflect the overall system's performance. The average absolute value of the regularized derivatives for the start and end segments of the selected event was calculated in the training phase of the second part. Table 9 lists the calculated values for the different events. A threshold value of 1.3 was used for selecting effective features, and the values above the threshold are indicated in bold in Table 9.

Based on the results presented in Table 9, the following deductions can be stated:

  1. Average frequency features are more important than energy features;

  2. The 9th and 10th IMFs are effective only for the gunshot event and are useless for the other events;

  3. In detecting a sound event's start and end times, i.e., in sound activity detection, the bold values of the table show that only 81 of the 200 features are significant.

Table 10 Final results obtained by the proposed sound activity detection system

In the last phase of the second part, the selected IMF features, represented by the bold values in Table 9, were compared with the corresponding features of all segments of the input clip. The two maximum similarities are marked as the beginning and end of the event. Any false marking in this step is added to the FP count and subtracted from the TP count. Tables 10 and 11 present the final results of the proposed sound activity detection system.

Table 11 Final values of the evaluation metrics obtained by the proposed sound activity detection system
Table 12 Details of the state-of-the-art methods used for comparison purpose

5.3 Comparison with existing methods

In this section, a comparison with state-of-the-art methods in this field is presented, which confirms the effectiveness of the proposed method. The number of features and the F-score are used in this comparison. The proposed method requires far fewer features (20) than the existing ones, making it easier to implement with ML methods. The studies selected for the comparison were: [38, 41, 44, 49, 88,

6 Conclusion

This study proposed a novel feature extraction approach based on IMF features for SED systems. Since the proposed method has fewer features than the Mel coefficients, the most common features in this field, it can easily be integrated with conventional ML methods. To prove the effectiveness of the proposed features, namely the average frequency and locally regularized energy of the IMFs extracted from sound segments, the features were used as input to LSTM, ensemble learning, and ANN structures, and their efficiency was analyzed in comparison with state-of-the-art methods proposed in this field. Next, a novel approach for detecting sound activity was proposed based on a statistical analysis of features and the detection of their changes. The proposed approach uses just the features extracted from the IMFs and achieves good results in detecting an event's start and end points. Finally, the proposed method was applied to various events in the URBAN-SED dataset, and its effectiveness was demonstrated for both segment-based event detection and sound activity detection. Comparison with state-of-the-art methods proposed in this field showed that the proposed features are as effective as those based on Mel coefficients despite their much smaller number. As a limitation, the effectiveness of IMF as the main part of the proposed method may vary depending on factors such as the signal-to-noise ratio, the complexity of the sound events, and environmental conditions. Another limitation of the proposed approach is the small number of extracted IMF features, which can cause issues in training DL methods; for instance, in cases of overfitting, the test and validation sets may show significantly lower accuracy compared to the training data.

Future developments could combine the proposed approach with approaches that use Mel coefficients to improve the efficiency of various SED tasks. In addition, IMF features can be applied to other speech-related tasks, such as Persian phoneme articulation [90, 91] or speech synthesis [92]. The analysis of IMF applicability in SED systems under varying signal-to-noise ratios is another suggested topic for future investigation.