1 Introduction

Sound signal classification, where patterns and features are extracted to categorize audio data, is fundamental for training models to recognize sounds automatically. This technology has become increasingly important beyond military applications, particularly in vessel classification based on their acoustic signature. This capability plays a vital role in monitoring maritime traffic and identifying noise sources within underwater environmental monitoring systems (Karbasi et al. 2011).

Significant efforts have focused on developing algorithms for classifying vessels according to their emitted noise, aiming to improve classification accuracy despite signal variability from machinery changes and ocean environment complexity (Das et al. 2013). The performance of these algorithms relies on effective feature extraction methods and the choice of an appropriate classification algorithm (Zhang et al. 1998; Karbasi et al. 2011; Zhang et al. 2021), making it challenging to develop efficient methods that extract relevant information from raw data (Liu et al. 2014). The integration of machine learning techniques into underwater acoustics research has led to the development of automated classification approaches, advocating for the use of frequency domain features extracted from underwater acoustic signals as input for machine learning models (Liu et al. 2014; Bianco et al. 2019; Zhang et al. 2021).

Various feature extraction methodologies have been developed to tackle the challenges of classifying and extracting features from maritime vessel noise. For example, Li et al. (2008) proposed an algorithm based on the wavelet packet transform, extracting energy features in each frequency band through a four-stage decomposition, which resulted in low-dimensional feature vectors. Meanwhile, Karbasi et al. (2011) presented a method exploiting spectral changes over time by dividing audio segments into frames and extracting a spectral dynamic feature (SDF) vector. Das et al. (2013) utilized a cepstral approach to mitigate signal distortion caused by shallow underwater propagation, showing a significant reduction in multipath distortion effects.

Liu et al. (2014) proposed a method based on the line spectrum of ship noise, achieving high accuracy rates, albeit limited to classifying two ship classes. Additionally, Chen and Xu (2017) introduced low-frequency analysis and recording (LOFAR) and detection of envelope modulation on noise (DEMON) for underwater acoustic signal feature extraction.

Lian et al. (2017) and Tong et al. (2020) introduced feature extraction approaches using Gammatone frequency cepstral coefficients (GFCCs) and Mel-frequency cepstral coefficients (MFCCs). These methods utilize Gammatone or Mel filter banks, respectively. Lian et al. (2017) proposed a modified version of GFCC specifically adapted for the underwater environment. Their experiments showed a decrease in accuracy with increasing background noise levels. Tong et al. (2020) used MFCCs extracted from real-world underwater targets as input to a k-nearest neighbor (k-NN) classifier. The results demonstrated improved classification accuracy with increasing MFCC order but at the cost of higher data size and computational complexity.

This work proposes a novel approach for classifying underwater vessels by introducing a feature extraction methodology based on the concept of SDFs presented in Karbasi et al. (2011) but specifically adapted to the challenges of underwater sound propagation. A new combined feature vector is proposed, fusing MFCCs extracted using a custom Mel filter design optimized for underwater vessel classification with the SDF vector. The results indicate that this combination can, in some cases, optimize results obtained using traditional feature extraction techniques.

The remainder of this article is structured as follows. Section 2 describes the methodology for extracting SDFs and forming the combined feature vector. Section 3 details the experimental sound data used in the study. Section 4 addresses the classification results and related discussion. Lastly, Section 5 concludes the article by summarizing the findings and suggesting avenues for future research.

2 Methodology

2.1 Classification system

A core challenge in machine-learning-based classification systems lies in representing high-dimensional raw audio signals with lower-dimensional feature vectors. This dimensionality reduction offers several advantages: improved computational efficiency, reduced risk of model overfitting to training data, enhanced feature extraction capabilities, and potentially increased data interpretability (Bianco et al. 2019). The resulting feature vector should capture essential information about the signal while remaining compact enough to facilitate efficient recognition.

A simplified version of the architecture proposed in this study for the preprocessing, feature extraction, and classification of underwater acoustic data is depicted in the flowchart in Fig. 1.

Fig. 1

Simplified flowchart depicting the proposed classification system using a combined SDF and MFCC approach. The system takes an audio signal as input and aims to classify the underwater target(s). The outputs from the Mel filter bank contribute to the extraction of both the MFCC vector and the SDF vector. These two individual feature vectors are then combined to form a single, comprehensive feature vector that serves as input for classification algorithms

The process begins with an audio signal captured by a hydrophone. The signal is digitized and represented as x[l], where l = 1, 2, ..., L and L is determined by the sampling frequency fs. The raw signal x[l] is then segmented into smaller time windows of duration t, each expressed as s[n], where n = 1, 2, ..., Ns and \(Ns=t \cdot fs\) represents the number of samples per segment. Segmentation facilitates the processing of long audio files by dividing them into manageable portions.

Each segment \(s[n]\) is preprocessed before Mel filter bank processing. This step typically involves transitioning the signal from the time domain to the frequency domain. Segmentation and preprocessing are crucial for handling the inherent variability in vessel configurations and the nonhomogeneity of emitted noise over time. This approach results in a diverse set of feature vectors, each capturing specific characteristics of the vessel’s acoustic signature.

Following preprocessing, the segmented signals are fed into a Mel filter bank. The outputs from this bank serve as the basis for generating the combined feature vector used for classification. This vector incorporates both the widely used MFCCs and the novel SDF vector proposed in this work (building upon the work of Karbasi et al. (2011)). A detailed explanation of each step within the flowchart is provided in the following sections.
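As a concrete illustration of the segmentation step described above, consider the following minimal NumPy sketch (the helper name `segment_signal` and the toy sampling rate are ours, not from the paper):

```python
import numpy as np

def segment_signal(x, fs, t=4.0):
    """Split a digitized signal x[l] into non-overlapping t-second
    segments of Ns = t * fs samples each (trailing remainder dropped)."""
    ns = int(t * fs)
    n_seg = len(x) // ns
    return x[:n_seg * ns].reshape(n_seg, ns)

# toy example: 10 s of noise at fs = 1000 Hz -> two 4-s segments
fs = 1000
x = np.random.randn(10 * fs)
segments = segment_signal(x, fs, t=4.0)
print(segments.shape)  # (2, 4000)
```

Each row of the returned array is one segment s[n], ready for the preprocessing stage described next.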

2.2 Preprocessing

When applying machine learning techniques to classification tasks, the sensitivity of raw data necessitates careful consideration. Proper data preprocessing significantly influences classification performance. Here, we employed preprocessing steps similar to those used for obtaining classical MFCCs, as depicted in Fig. 2.

Each segment s[n] undergoes the following preprocessing steps:

  1. Pre-emphasis: The initial step involves pre-emphasizing \(s[n]\) using a first-order digital filter with the transfer function \(H(z) = 1 - \alpha z^{-1}\), where \(0.9<\alpha <1\) and \(\alpha\) is the pre-emphasis coefficient. Pre-emphasis equalizes the frequency spectrum, highlighting higher frequencies relative to lower ones; higher values of \(\alpha\) correspond to more pronounced enhancement. The pre-emphasized signal is denoted as \(u[n]\) and defined as \(u[n] = s[n] - \alpha \cdot s[n-1]\).

  2. Framing: The pre-emphasized signal \(u[n]\) is then divided into smaller blocks called 'frames', typically 20 to 40 ms in duration. To ensure a smooth transition between frames, consecutive frames overlap by 50% (Rabiner and Juang 1999). The set of frames is denoted by \(\left\{ x_i[n],\;1\le i\le NF,\;1\le n\le N\right\}\), where NF is the number of frames and N is the frame size. Because the signal characteristics change slowly over the duration of a segment, restricting the analysis to short (millisecond-scale) frames allows each frame to be treated as approximately stationary before applying the Fourier transform (FT).

  3. Windowing: Each frame is multiplied by a window function to minimize spectral leakage caused by discontinuities at the frame boundaries introduced by framing (Majeed et al. 2015). The most commonly used window in signal processing is the Hamming window, defined as follows (Rabiner and Juang 1999):

    $$\begin{aligned} w[n]=0.54-0.46\cos \left( \frac{2\uppi n}{N-1}\right) . \end{aligned}$$
    (1)
  4. Spectral estimation: An additional preprocessing step computes the spectral estimate of each frame using the short-time FT (STFT). This results in spectral coefficients, which are complex numbers containing both magnitude and phase information (Majeed et al. 2015). As is common practice, the phase information is discarded, and we utilize power spectral coefficients (PSCs) defined as

    $$\begin{aligned} p[k]=\frac{\left| \sum _{n=1}^{N}\;x[n]w[n]\text{e}^\frac{-j2\uppi nk}{N}\right| ^{2}}{N},\quad \quad 1\le k\le N. \end{aligned}$$
    (2)
Fig. 2

a Simplified flowchart illustrating preprocessing steps for audio signal segments. The process begins with an input signal segment and ends with the generation of PSCs as the output. b Graphical representation of the preprocessing pipeline for audio signal segments. The process begins with a pre-emphasis of the input signal using a first-order digital filter. The pre-emphasized signal is then divided into smaller frames. These frames are often arranged in a matrix format, where each row represents the samples of a specific frame. Each frame is then multiplied by a Hamming window function. Lastly, the STFT is applied to the windowed frames. The resultant complex-valued coefficients are squared to obtain PSCs
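The four preprocessing steps above can be sketched in a few lines of NumPy. This is a simplified illustration; the values \(\alpha = 0.97\) and the 40-ms frame length follow the experimental settings reported later in the article, and the helper name is ours:

```python
import numpy as np

def preprocess(s, fs, alpha=0.97, frame_ms=40, overlap=0.5):
    # (1) pre-emphasis: u[n] = s[n] - alpha * s[n-1]
    u = np.append(s[0], s[1:] - alpha * s[:-1])
    # (2) framing: N-sample frames with 50% overlap
    N = int(frame_ms / 1000 * fs)
    hop = int(N * (1 - overlap))
    NF = 1 + (len(u) - N) // hop
    frames = np.stack([u[i * hop : i * hop + N] for i in range(NF)])
    # (3) windowing: Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n/(N-1))
    w = np.hamming(N)
    # (4) spectral estimation: PSCs p[k] = |STFT|^2 / N
    p = np.abs(np.fft.rfft(frames * w, n=N)) ** 2 / N
    return p  # shape (NF, N//2 + 1)

fs = 8000
s = np.random.randn(4 * fs)   # one 4-s segment
psc = preprocess(s, fs)
print(psc.shape)  # (199, 161)
```

Each row of the output is the PSC vector of one frame, i.e., one row of the matrix fed into the Mel filter bank.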

2.3 Mel filter bank

The Mel filter bank consists of a set of bandpass filters with triangular envelopes, whose centers are spaced according to the Mel scale, as illustrated in Fig. 3.

Fig. 3

Representation of a Mel filter bank, comprising a series of triangular bandpass filters with centers spaced according to the Mel scale and overlapping frequency responses. This filter bank transforms the linear power spectrogram obtained from the STFT into a nonlinear representation based on the Mel-frequency scale. In the figure, f refers to the initial, central, and final frequencies of each filter, which determine the range of frequencies analyzed by the filter, whereas H denotes the filter's frequency response, set by the filter coefficients. The output of the Mel filter bank is a set of values representing the energy within the frequency band of each filter, aiding in capturing the spectral characteristics of an audio signal

The Mel filter bank is an important component in feature extraction for audio classification tasks. It transforms the linear power spectrum obtained from the STFT into a nonlinear representation on the Mel-frequency scale (Tong et al. 2020).

The Mel scale is a perceptually motivated frequency scale that approximates how humans perceive pitch. Below 1 kHz, the Mel scale exhibits a roughly linear relationship with frequency in hertz, providing finer resolution for lower frequencies. In contrast, it adopts a logarithmic spacing above 1 kHz, resulting in coarser resolution for higher frequencies (Rao and Manjunath 2017). This characteristic is particularly beneficial for underwater vessel classification because significant noise from vessels often resides in the lower frequency range (McKenna et al. 2012; Yao et al. 2023).

The conversion between the linear frequency scale (Hz) and the Mel scale is expressed mathematically in Eq. (3):

$$\begin{aligned} f_{\mathrm{Mel}}=2595\log \left( 1+\frac{f_{\mathrm{Hz}}}{700}\right) . \end{aligned}$$
(3)

Here, \(f_{\mathrm{Mel}}\) represents the frequency in Mel, and \(f_{\mathrm{Hz}}\) denotes the frequency in hertz. This equation converts linear frequencies to the Mel scale using a logarithmic function; the constant 2595 scales the conversion so that 1000 Hz maps to approximately 1000 Mel, with lower frequencies mapping to lower Mel values and vice versa.
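The conversion and its inverse can be written directly from Eq. (3) (a minimal sketch; the function names are ours):

```python
import math

def hz_to_mel(f_hz):
    # Eq. (3): f_Mel = 2595 * log10(1 + f_Hz / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    # inverse of Eq. (3)
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))  # 1000
```

Note the near-linear behavior at low frequencies and logarithmic compression above 1 kHz, which is what gives the filter bank its finer low-frequency resolution.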

The Mel spectrum of the power spectrum \(p[k]\) is calculated by weighting the power spectrum with each triangular filter and summing, as in Eq. (4). The number of filters is established experimentally, as detailed in Section 2.4, and determines the dimension of the Mel filter output vector.

$$\begin{aligned} e[m]=\sum \limits _{k=1}^{N}p[k]H_m[k],\quad \quad 1\le m\le M, \end{aligned}$$
(4)

where \(H_m[k]\) are the coefficients of the \(m\text{-th}\) Mel filter, M is the number of Mel filters, and N is the number of samples per frame. The output energies of these filters can be interpreted as subsamplings of the spectrum.

For the set of frames (NF), we obtain a set of vectors \(\left\{ e_i[m],\;1\le i\le NF,\;1\le m\le M\right\}\). These vectors are visualized within a matrix denoted by E in Eq. (5):

$$\begin{aligned} \varvec{E}= \left[ \begin{array}{ccccc} e_1[1]&\cdots &e_1[m]&\cdots &e_1[M]\\ \vdots &\ddots &\vdots & & \\ e_i[1]&\cdots &e_i[m]&\cdots &e_i[M]\\ \vdots & &\vdots &\ddots & \\ e_{NF}[1]&\cdots &e_{NF}[m]&\cdots &e_{NF}[M]\end{array}\right] ,\quad 1\le i\le NF,\;1\le m\le M. \end{aligned}$$
(5)

The matrix E serves as the basis for the subsequent calculation of MFCCs, which capture cepstral features, and for obtaining the SDF vector.

2.4 Mel filter bank design

A fundamental requirement for obtaining the output of the Mel filter bank is its proper configuration, adapted to the context of our problem. These filters are commonly constructed in the frequency domain, with central frequencies usually distributed uniformly along the Mel scale (Dixit et al. 2016; Rao and Manjunath 2017). The selection of filter bank parameters, including the number of filters, their center frequencies, and their bandwidths, directly influences the coverage and resolution characteristics of the Mel filter bank.

In this work, we adopted an experimental approach to customize the filter parameters for both MFCC and SDF calculations. This approach allowed us to tailor the filters to the unique characteristics of our dataset, potentially improving classification performance.

Previous research by McKenna et al. (2012) and Yao et al. (2023) suggested that underwater-vessel-generated noise primarily concentrates in frequencies below 1 kHz. To achieve a frequency band distribution that prioritizes high resolution in the low-frequency range and low resolution for high frequencies (Majeed et al. 2015), we designed a set of 42 experimentally optimized triangular bandpass filters. These filters were limited to a frequency range of 0 to 7 kHz. Our experiments revealed that including filters for frequencies exceeding 7 kHz did not yield significant improvements in classification accuracy. Notably, half of the filters were centered at frequencies below 1 kHz to ensure high resolution in this critical range for vessel noise. The final configuration of the Mel filter bank is presented in Table 1 and Fig. 4.

Table 1 Mel filter bank configuration. Forty-two triangular filters were designed with variable center frequencies and bandwidths
Fig. 4

Mel filter bank specifically designed for underwater vessel classification. It comprises 42 experimentally optimized triangular bandpass filters tailored to capture the spectral characteristics of underwater vessel noise. The filter characteristics, including center frequencies and bandwidths, are defined in Table 1 to prioritize high resolution in the low-frequency range and low resolution for high frequencies

2.5 Mel-frequency cepstral coefficients

Figure 5 illustrates the phases involved in extracting MFCC feature vectors. The input signal \(s[n]\) undergoes the preprocessing steps described in Section 2.2, and the preprocessed signal passes through the Mel filter bank. Next, the output energies of each filter are converted to the logarithmic domain, which emphasizes spectral variations and reduces the impact of large amplitude values. Finally, the logarithm of the Mel filter bank energies is transformed using the discrete cosine transform (DCT). This transform decorrelates the spectral information, minimizes redundancies, and emphasizes the most distinctive characteristics of the signal (Rao and Manjunath 2017; Tong et al. 2020). Additionally, applying the DCT allows for dimensionality reduction by discarding higher-order terms, as the most relevant information is typically present in the lower-order coefficients (Li et al. 2008). The resulting coefficients are known as MFCCs and are mathematically represented by Eq. (6):

$$\begin{aligned} c[n]=\sum \limits _{m=1}^{M}\log \left( e[m]\right) \cos \left( \frac{\uppi \left( m-0.5\right) n}{M}\right), \quad n=1,2,...,p, \end{aligned}$$
(6)

where \(c[n]\) are the Mel cepstral coefficients, M denotes the number of Mel filters in the filter bank, and p is the number of DCT points, equal to the number of MFCCs extracted from each frame; each frame's coefficients form one row of the matrix C in Eq. (7):

$$\begin{aligned} \varvec{C}= \left[ \begin{array}{ccccc} c_1[1]&\cdots &c_1[n]&\cdots &c_1[p]\\ \vdots &\ddots &\vdots & & \\ c_i[1]&\cdots &c_i[n]&\cdots &c_i[p]\\ \vdots & &\vdots &\ddots & \\ c_{NF}[1]&\cdots &c_{NF}[n]&\cdots &c_{NF}[p]\end{array}\right] ,\quad 1\le i\le NF,\;1\le n\le p, \end{aligned}$$
(7)

where NF represents the total number of frames in the segmented signal.

Fig. 5

Process of extracting the MFCC feature vector for a frame

Finally, the MFCC feature vector for the entire segment is obtained by averaging the columns of matrix C, as shown in Eq. (8):

$$\begin{aligned} mean_{\mathrm{MFCC}}=[\mu _1\;\cdots \;\mu _i\;\cdots \;\mu _p\;],\;\;\;\;\;1\le i\le p, \end{aligned}$$
(8)

where \(\mu _i\) represents the average value of the i-th column of matrix C.
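Assuming the energy matrix E from Eq. (5) is available, Eqs. (6)-(8) reduce to a logarithm, a DCT along the filter axis, and a column mean. A minimal sketch follows; note that SciPy's DCT-II includes a 0th coefficient, whereas the index n in Eq. (6) starts at 1, so the slice below is an approximation of the paper's indexing:

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_energies(E, p=42):
    """c[n] = DCT of log filter-bank energies (Eq. 6); the segment-level
    feature mean_MFCC is the column mean of C (Eq. 8)."""
    C = dct(np.log(E), type=2, axis=1, norm='ortho')[:, :p]
    return C.mean(axis=0)   # one p-dimensional vector per segment

E = np.random.rand(199, 42) + 1e-3   # toy filter-bank energies (NF x M)
feat = mfcc_from_energies(E, p=42)
print(feat.shape)  # (42,)
```

Averaging over frames collapses the per-frame matrix C into a single compact vector for the whole 4-s segment.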

2.6 SDF vector

Inspired by Karbasi et al. (2011), who used the SDF vector for environmental sound classification in air, we modified the algorithm to adapt it to the underwater environment.

Most feature extractors used in the context of vessel classification are the same as those commonly used in speech and music processing, such as zero-crossing rate (ZCR), MFCC, root mean square energy, spectral centroid, chroma feature, spectral flux, linear predictive coding (LPC) coefficients, and linear prediction cepstral coefficients (LPCC). The work by Sharma et al. (2020) detailed the process of obtaining each of the mentioned extractors.

Traditional feature extraction methods focus on the temporal and spectral characteristics of a signal within a single frame, where each frame is processed separately, thus failing to provide information about variations between consecutive frames.

Underwater audio analysis presents additional challenges due to the temporal variation of spectral features. These variations arise from the combination of multiple short-duration spectra with distinct characteristics, evolving over time. Consequently, effective underwater audio analysis requires techniques that capture the dynamic and variable nature of signals, identifying relevant temporal patterns and spectral features.

The proposed algorithm addresses these challenges by extracting dynamic information over 4-s segments. This approach leverages the temporal continuity of signals, explores nonuniformity, and captures how features change between frames. By extending the analysis window, the method provides a more comprehensive view of feature behavior over time, contributing to improved classification accuracy.

To extract SDFs, we apply the NF-point fast Fourier transform (FFT) to the outputs of each Mel filter across all frames, i.e., to the columns of matrix E in Eq. (5). We define \(v_m[i]=\left\{ e_i[m],\;1\le i\le NF\right\}\) as the vector extracted from the \(m\text{-th}\) column, \(1\le m\le M\). The results of the FFT applied to the columns of E are grouped into matrix Z as shown in Eq. (9):

$$\begin{aligned} \varvec{Z}= \left[ \begin{array}{c} \mathcal {F}\left\{ v_1[i]\right\} \\ \vdots \\ \mathcal {F}\left\{ v_m[i]\right\} \\ \vdots \\ \mathcal {F}\left\{ v_M[i]\right\} \end{array}\right] \overset{\triangle }{=} \left[ \begin{array}{ccccc} z_1[1]&\cdots &z_1[i]&\cdots &z_1[NF]\\ \vdots &\ddots &\vdots & & \\ z_m[1]&\cdots &z_m[i]&\cdots &z_m[NF]\\ \vdots & &\vdots &\ddots & \\ z_M[1]&\cdots &z_M[i]&\cdots &z_M[NF]\end{array}\right] ,\quad 1\le i\le NF,\;1\le m\le M. \end{aligned}$$
(9)

To conclude the process of obtaining the SDF vectors, two additional steps are performed to reduce data redundancy and dimensionality. Similar to the calculation of MFCCs, we apply the logarithm to the matrix Z to equalize the spectrum. Next, we apply the DCT to extract cepstral parameters from the output energies of the filters (Karbasi et al. 2011). The result is the matrix of dynamic components (DC) in Eq. (10):

$$\begin{aligned} \varvec{DC}= \left[ \begin{array}{ccccc} dc_1[1]&\cdots &dc_1[n]&\cdots &dc_1[L]\\ \vdots &\ddots &\vdots & & \\ dc_m[1]&\cdots &dc_m[n]&\cdots &dc_m[L]\\ \vdots & &\vdots &\ddots & \\ dc_{M}[1]&\cdots &dc_{M}[n]&\cdots &dc_{M}[L]\end{array}\right] ,\quad 1\le m\le M,\;1\le n\le L, \end{aligned}$$
(10)

where L is the number of DCT points.

The DC matrix is subsequently rearranged, with its rows concatenated to compose the SDF vector, characterized by dimensions of 1 \(\times\) (LM), where M represents the number of Mel filters.
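Under the same assumptions as before, the SDF computation of Eqs. (9) and (10) can be sketched as follows. Taking the magnitude of the FFT and adding a small offset inside the logarithm are our choices for numerical robustness, not details specified in the text:

```python
import numpy as np
from scipy.fft import dct

def sdf_vector(E, L=4):
    """SDF sketch: NF-point FFT along the frame axis of each Mel-filter
    output (columns of E), log, L-point DCT per filter, then the rows of
    the DC matrix are concatenated into a 1 x (L*M) vector."""
    Z = np.abs(np.fft.fft(E, axis=0))        # Eq. (9): Z is NF x M here
    logZ = np.log(Z + 1e-12)                 # equalize the spectrum
    DC = dct(logZ.T, type=2, axis=1, norm='ortho')[:, :L]  # Eq. (10): M x L
    return DC.flatten()                      # SDF vector, 1 x (L*M)

E = np.random.rand(199, 42) + 1e-3   # toy filter-bank energies (NF x M)
sdf = sdf_vector(E, L=4)
print(sdf.shape)  # (168,)
```

With L = 4 DCT points and M = 42 filters, the flattened vector has 4 × 42 = 168 components, matching the dimension used in the experiments.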

2.7 Combined feature vector

As mentioned earlier, the combined feature vector is formed by fusing the MFCCs with the SDF. The schematic diagram in Fig. 6 details all the steps of extracting the combined feature vector. In summary, the matrix of PSCs for the set of frames is passed through a filter bank designed on the Mel scale, and the output vectors of the Mel filter bank are used as input vectors for obtaining both the MFCCs and the SDF.

Fig. 6

Process of extracting the combined feature vector from the input signal, going through all the stages discussed until obtaining the MFCC + SDF vector

3 Dataset

The dataset used for testing was obtained from recordings of underwater sounds in real environments, available at http://atlanttic.uvigo.es/underwaternoise (Santos-Domínguez et al. 2016). The ShipsEar database consists of recordings of 90 vessels of 11 different types, plus ambient noise, captured at various points along the northwest coast of Spain, totaling 2.94 h of recordings. ShipsEar also provides information on environmental conditions, images of vessels, location data, acquisition time, channel depth, and distance, among other details.

To optimize the testing process and ensure a balanced distribution of samples across vessel types, a subset of 53 audio files was selected from the database. These files were further segmented into 4-s intervals, and a combined feature vector was extracted for each segment. The 4-s segment duration was selected through simulations evaluating classification effectiveness for different segment durations.

The number of recordings, as well as the size and type of vessel, was a crucial factor in separating the data into distinct classes. The correct grouping of files that exhibit similar feature vectors influenced the algorithm's accuracy rate. We chose to subdivide the 53 files into four distinct classes, grouping vessels with similar feature vectors into the same class when necessary. This approach promoted a balance in the number of samples per class. The final configuration is presented in Table 2, similar to what was presented in Santos-Domínguez et al. (2016). Each class was manually assigned a label for identification.

Table 2 presents the subdivisions of the classes. The duration corresponds to the sum of the durations of all audio files belonging to the class, and the number of segments represents the number of feature vectors for each class, as one vector was extracted per 4-s segment.

Table 2 Data set with divisions into classes

Figure 7 presents the power spectral density (PSD) of noise emitted by Class C and Class D vessels, highlighting the concentration within the crucial frequency band below 1 kHz for underwater classification (McKenna et al. 2012; Yao et al. 2023). The figure provides valuable insights for the design of Mel filters tailored to underwater vessel classification, considering the relevant frequency bands.

Fig. 7

PSD of noise emitted by a Class C vessels (merchant ship) and b Class D vessels (passenger ferry). The blue line represents the power spectrum, indicating the relative power at different frequencies. The dashed red line highlights the 1-kHz frequency, a critical band for underwater vessel classification because of the dominance of noise in this region

4 Experimental results

In this section, we analyze the results obtained when evaluating the performance of the proposed algorithm. For this purpose, evaluation metrics such as accuracy are used, measuring the proportion of correctly classified examples relative to the total number of evaluated examples (Jordan and Mitchell 2015).

The performance of the proposed feature vector was evaluated using four supervised classification methods: support vector machine (SVM), k-NN, random forest (RF), and decision tree (DT) (Saravanan and Sujatha 2018; Bianco et al. 2019; Sabara et al. 2020). These methods were chosen because of their widespread adoption in various classification tasks.

The proposed feature vector’s performance was compared against established feature extraction methods commonly used in underwater sound classification, including MFCC, PSC, LPC, LPCC, and ZCR (Sabara et al. 2020; Sharma et al. 2020). The performance of the combination of SDF with these extractors was also evaluated.

As shown in Table 2, each audio file was divided into six nonoverlapping 4-s segments, resulting in a total of 318 feature vectors for training and testing. The original sampling rate of 52734 Hz was preserved for the audio signals. In the preprocessing stage, we adopted a pre-emphasis coefficient of 0.97, frames of 40 ms duration with 50% overlap, and a Hamming window applied to each frame (Rabiner and Juang 1999; Alcaraz Meseguer 2009; Trang et al. 2015; Winursito et al. 2018).

To obtain an unbiased performance estimate and guard against overfitting, the dataset was split into training and test sets using a 75:25 ratio, allocating 75% of the data for model training and 25% for performance evaluation. Twelve files were randomly chosen for testing, resulting in 72 segments (6 segments per file) and 72 corresponding feature vectors. The remaining 41 files (246 segments) composed the training set, giving a final ratio of 246:72.

To address potential class imbalance in the test set, a stratified sampling approach was employed: 4 files were drawn from Class D (24 segments), 3 from Class B (18 segments), 3 from Class A (18 segments), and 2 from Class C (12 segments). This ensured a more balanced representation of each class in the test set.

For comparative purposes, MFCCs were extracted in two distinct ways, both using 42 Mel filters. The first approach, referred to as the conventional MFCC, followed the standard distribution of the Mel filter bank, as described in Liu et al. (2014), Dixit et al. (2016), and Rao and Manjunath (2017). The second approach, referred to as the modified MFCC, utilized the custom Mel filter configuration detailed in Section 2.4.

The SDF vector was computed using a four-point DCT (L = 4), resulting in a matrix DC limited to four columns. Because the number of filters (M) was 42, the final dimension of the SDF vector was \(4\times 42=168\). Therefore, the combined feature vector 'MFCC + SDF' comprised a total of 210 components (42 + 168).
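The dimensions quoted above can be verified with a short computation (variable names are ours):

```python
# feature-vector dimensions used in the experiments
M = 42        # Mel filters
L_dct = 4     # DCT points kept for the SDF
p = 42        # MFCCs per segment

sdf_dim = L_dct * M         # SDF vector length
combined_dim = p + sdf_dim  # 'MFCC + SDF' length
print(sdf_dim, combined_dim)  # 168 210
```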

The parameters of each classifier were tuned to the dataset characteristics. For the SVM, a radial basis function kernel yielded the best results compared with linear, polynomial, and sigmoid kernels. The k-NN classifier was configured to use the five nearest neighbors. The RF algorithm employed 100 trees, whereas the DT used the best split at each node instead of a random split.
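A hypothetical scikit-learn configuration matching these settings is shown below; the data are random stand-ins for the ShipsEar feature vectors, and any hyperparameter not mentioned in the text is left at the library default:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# classifier settings as described above: RBF-kernel SVM, k = 5,
# 100 trees for RF, best split at each node for DT
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "DT": DecisionTreeClassifier(splitter="best", random_state=0),
}

rng = np.random.default_rng(0)
X = rng.normal(size=(246, 210))       # 246 training vectors, 210 features
y = rng.integers(0, 4, size=246)      # four classes A-D
X_test = rng.normal(size=(72, 210))   # 72 test vectors

results = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    results[name] = clf.predict(X_test)
    print(name, results[name].shape)
```

The 246/72 split mirrors the training/test partition described above.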

Table 3 presents the classification accuracy achieved by the proposed feature vector and 15 other feature vectors using four classification algorithms.

Table 3 Comparison of classification accuracy for various feature vectors. The table depicts the classification accuracy achieved by the proposed feature vector (MFCC + SDF) compared with 14 feature extraction methods

Table 3 demonstrates that the SVM classifier outperformed the other classifiers for most features, except for the PSCs, ZCR, and LPCCs. This superiority could be attributed to the lower dispersion of the feature vectors extracted by the proposed algorithm in the n-dimensional feature space, which facilitates the construction of separating hyperplanes between classes.

The MFCC + SDF feature vector, which fuses the modified MFCCs with the SDF, achieved the best results in most cases, reaching 95% accuracy with the SVM classifier. This performance highlights the effectiveness of combining the modified MFCC coefficients with the SDF for underwater audio classification.

Combining the modified MFCC, conventional MFCC, LPC, LPCC, PSC, and ZCR extractors with the SDF vector increased the correct classification rate in all scenarios. This improvement likely stems from the addition of the statistically uncorrelated features of the SDF vector: by enriching the data representation, these additional features aid the classification process.

The SDF vector, proposed by Karbasi et al. (2011), was originally designed for classifying environmental sounds in air acoustics under interference from multiple sources. In this work, a modified version of the SDF achieved 100% accuracy in separating underwater ambient noise (Class A) from the vessel classes, as shown in Table 4, which presents the confusion matrix resulting from SVM classification of the SDF vector. The main diagonal of the matrix, which lists the percentage of correct classifications for each class, reveals 100% accuracy for Class A: the algorithm correctly classified all 18 segments from the three audio files of underwater ambient noise in the test set. In contrast, Class D showed significantly lower performance, with only 54.16% accuracy. This ability to discriminate environmental noise from the other classes highlights the relevance of the SDF for monitoring applications that aim to detect the presence of targets in underwater ambient noise without classifying specific vessel types.

Designing the Mel filter bank according to the specific characteristics of the target classes can significantly improve the performance of the feature extractor, as the results in Table 3 show. SVM classification using the modified MFCC feature vector, which employed the custom Mel filter bank configuration, yielded a higher correct classification rate (86.5%) than the conventional MFCC vector (81.2%) (Table 5). This underscores the importance of optimizing the filter bank design for improved feature extraction and classification accuracy.

Table 4 Confusion matrix estimated from the SDF feature vector using the SVM classification algorithm

To improve classification accuracy among the vessel classes, the SDF vector, which performed well for Class A, was combined with the modified MFCC, MFCC, LPC, LPCC, and ZCR vectors. Tables 6 and 7 present the confusion matrices obtained after combining the modified MFCC and MFCC vectors, respectively, with the SDF, both classified by SVM. In both cases, the combination raised the percentage of correct classifications for Class A to 100%. Additionally, as shown in Table 3, the overall classification accuracy improved: used independently, the modified MFCC and MFCC achieved accuracies of 86.5% and 81.2%, respectively, whereas combining them with the SDF led to substantial improvements, reaching 95% and 90%.

Table 5 Confusion matrix generated using an SVM classifier with MFCC as the feature vector
Table 6 Confusion matrix estimated using the SVM classification algorithm and the combined feature vector (MFCC + SDF)
Table 7 Confusion matrix generated using the SVM classification algorithm and by combining MFCCs with the SDF as a feature vector
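The fusion described above amounts to concatenating the per-segment feature vectors before classification. A minimal scikit-learn sketch follows; the random arrays stand in for real MFCC and SDF features, and the dimensions, kernel, and label count are assumptions for illustration, not the paper's configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for per-segment feature vectors (dimensions are illustrative).
n_segments = 120
mfcc_feats = rng.normal(size=(n_segments, 13))    # cepstral features
sdf_feats = rng.normal(size=(n_segments, 20))     # spectral dynamic features
labels = rng.integers(0, 4, size=n_segments)      # four classes, e.g. A-D

# Fusion: horizontal concatenation of the two feature vectors.
combined = np.hstack([mfcc_feats, sdf_feats])

# Standardizing before the RBF-kernel SVM keeps features on a common scale.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(combined, labels)
pred = clf.predict(combined)
```

In practice the classifier would be fitted on a training split and evaluated on a held-out test split, from which confusion matrices such as Tables 6 and 7 are computed.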

The choice of 4-s segments was based on experimental tests in which the segment size was varied from 50 ms to 10 s. Accuracy increased with segment size, reaching its best performance at 4 s.

Figure 8 illustrates the distribution of the MFCC + SDF and MFCC feature vectors in a two-dimensional plane, obtained by reducing their dimensionality to two principal components. The principal components of the MFCC + SDF vectors associated with ambient noise, shown in Fig. 8a, form a cluster clearly separated from the vectors associated with vessels. This separation facilitates the classification task and explains the 100% accuracy achieved on ambient-noise segments when the MFCC + SDF vector is used as input to the classification algorithm, as shown in Table 6.

Fig. 8

Two-dimensional representation of the distribution of feature vectors extracted using two distinct methods. Each feature vector was reduced to two dimensions, represented as principal components X1 and X2. a MFCC + SDF feature vector; b MFCC feature vector. The MFCC + SDF feature vectors associated with Class A occupy a region of the feature space distinct from that of the vectors associated with the vessel classes
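The projection shown in Fig. 8 can be reproduced with principal component analysis. A minimal scikit-learn sketch follows; the random matrix stands in for the fused feature vectors, and its dimensions are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
features = rng.normal(size=(100, 33))  # stand-in for fused MFCC + SDF vectors

# Project onto the two directions of maximum variance (X1 and X2 in Fig. 8).
pca = PCA(n_components=2)
coords = pca.fit_transform(features)
```

Each row of `coords` gives the (X1, X2) position of one segment, which can then be scattered and colored by class to visualize cluster separation.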

5 Discussion and conclusions

This work presented a cepstrum-based feature extraction methodology for the classification of marine vessels. The SDF extractor, which leveraged the nonhomogeneous nature of vessel-generated noise, was introduced to capture feature changes across successive frames.

In this study, we compared the performance of a combined feature vector, formed by fusing the SDF and MFCC vectors, with that of traditional feature extraction methods under supervised learning algorithms, using a database of real recordings of maritime vessels. The results indicated that the combination of the SDF and MFCC vectors classified vessels better than several traditional extraction methods evaluated on the dataset analyzed.

Despite the various signal processing techniques employed, there is still room for improvement. Although a 4-s segment is a relatively long audio instance, the resulting SDF vector is high-dimensional and would benefit from dimensionality reduction, since larger feature vectors demand greater computational effort during classification.

The results also suggest that, for the analyzed dataset, cepstral-domain extractors outperform temporal- and spectral-domain features in characterizing underwater vessel sounds. Given the high accuracy achieved in classifying environmental noise, future work will evaluate the performance of the algorithm in classifying different underwater acoustic scenes with multiple sources. Future work could also explore variations of the proposed feature extractor, modifying preprocessing parameters such as the rectangular window duration and overlap rate; assessing the impact of these changes on classification accuracy is crucial to optimizing system performance. Another avenue for future research is a detailed evaluation of the computational complexity of the method in comparison with other approaches, considering the different stages of the process: feature extraction, model training, and classification.