1 Introduction

Sound signal classification, where patterns and features are extracted to categorize audio data, is fundamental for training models to recognize sounds automatically. This technology has become increasingly important beyond military applications, particularly in vessel classification based on their acoustic signature. This capability plays a vital role in monitoring maritime traffic and identifying noise sources within underwater environmental monitoring systems (Karbasi et al. 2011).

Significant efforts have focused on developing algorithms for classifying vessels according to their emitted noise, aiming to improve classification accuracy despite signal variability from machinery changes and ocean environment complexity (Das et al. 2013). The performance of these algorithms relies on effective feature extraction methods and the choice of an appropriate classification algorithm (Zhang et al. 1998; Karbasi et al. 2011; Zhang et al. 2021), making it challenging to develop efficient methods that extract relevant information from raw data (Liu et al. 2014). The integration of machine learning techniques into underwater acoustics research has led to the development of automated classification approaches, advocating for the use of frequency domain features extracted from underwater acoustic signals as input for machine learning models (Liu et al. 2014; Bianco et al. 2019; Zhang et al. 2021).

Various feature extraction methodologies have been developed to tackle the challenges of classifying and extracting features from maritime vessel noise. For example, Li et al. (2008) proposed an algorithm based on the wavelet packet transform, extracting energy features in each frequency band through a four-stage decomposition, which resulted in low-dimensional feature vectors. Meanwhile, Karbasi et al. (2011) presented a method exploiting spectral changes over time by dividing audio segments into frames and extracting a spectral dynamic feature (SDF) vector. Das et al. (2013) utilized a cepstral approach to mitigate signal distortion caused by shallow underwater propagation, showing a significant reduction in multipath distortion effects.

Liu et al. (2014) proposed a method based on the line spectrum of ship noise, achieving high accuracy rates, albeit limited to classifying two ship classes. Additionally, Chen and Xu (2017) introduced low-frequency analysis and recording (LOFAR) and detection of envelope modulation on noise (DEMON) for underwater acoustic signal feature extraction.

Lian et al. (2017) and Tong et al. (2020) introduced feature extraction approaches using Gammatone frequency cepstral coefficients (GFCCs) and Mel-frequency cepstral coefficients (MFCCs). These methods utilize Gammatone or Mel filter banks, respectively. Lian et al. (2017) proposed a modified version of GFCC specifically adapted for the underwater environment. Their experiments showed a decrease in accuracy with increasing background noise levels. Tong et al. (2020) used MFCCs extracted from real-world underwater targets as input to a k-nearest neighbor (k-NN) classifier. The results demonstrated improved classification accuracy with increasing MFCC order but at the cost of higher data size and computational complexity.

This work proposes a novel approach for classifying underwater vessels by introducing a feature extraction methodology based on the concept of SDFs presented in Karbasi et al. (2011) but specifically adapted to the challenges of underwater sound propagation. A new combined feature vector is proposed, fusing MFCCs extracted using a custom Mel filter design optimized for underwater vessel classification with the SDF vector. The results indicate that this combination can, in some cases, optimize results obtained using traditional feature extraction techniques.

The remainder of this article is structured as follows. Section 2 describes the methodology for extracting SDFs and forming the combined feature vector. Section 3 details the experimental sound data used in the study. Section 4 addresses the classification results and related discussion. Lastly, Section 5 concludes the article by summarizing the findings and suggesting avenues for future research.

2 Methodology

2.1 Classification system

A core challenge in machine-learning-based classification systems lies in representing high-dimensional raw audio signals with lower-dimensional feature vectors. This dimensionality reduction offers several advantages: improved computational efficiency, reduced risk of model overfitting to training data, enhanced feature extraction capabilities, and potentially increased data interpretability (Bianco et al. 2019). The resulting feature vector should capture essential information about the signal while remaining compact enough to facilitate efficient recognition.

A simplified version of the architecture proposed in this study for the preprocessing, feature extraction, and classification of underwater acoustic data is depicted in the flowchart in Fig. 1.

Fig. 1

Simplified flowchart depicting the proposed classification system using a combined SDF and MFCC approach. The system takes an audio signal as input and aims to classify the underwater target(s). The outputs from the Mel filter bank contribute to the extraction of both the MFCC vector and the SDF vector. These two individual feature vectors are then combined to form a single, comprehensive feature vector that serves as input for classification algorithms

The process begins with an audio signal captured by a hydrophone. The signal is digitized and represented as x[l], where l = 1, 2, ..., L and L is determined by the sampling frequency fs. The raw signal x[l] is then segmented into smaller time windows of duration t, each expressed as s[n], where n = 1, 2, ..., Ns and \(Ns=t \cdot fs\) represents the number of samples per segment. Segmentation facilitates the processing of long audio files by dividing them into manageable portions.

Each segment \(s[n]\) is preprocessed before Mel filter bank processing. This step typically involves transitioning the signal from the time domain to the frequency domain. Segmentation and preprocessing are crucial for handling the inherent variability in vessel configurations and the nonhomogeneity of emitted noise over time. This approach results in a diverse set of feature vectors, each capturing specific characteristics of the vessel’s acoustic signature.

Following preprocessing, the segmented signals are fed into a Mel filter bank. The outputs from this bank serve as the basis for generating the combined feature vector used for classification. This vector incorporates both the widely used MFCCs and the novel SDF vector proposed in this work (building upon the work of Karbasi et al. (2011)). A detailed explanation of each step within the flowchart is provided in the following sections.
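As a concrete illustration of the segmentation step described above, consider the following minimal NumPy sketch (the helper name `segment_signal` and the toy sampling rate are ours, not from the paper):

```python
import numpy as np

def segment_signal(x, fs, t=4.0):
    """Split a digitized signal x[l] into non-overlapping t-second
    segments of Ns = t * fs samples each (trailing remainder dropped)."""
    ns = int(t * fs)
    n_seg = len(x) // ns
    return x[:n_seg * ns].reshape(n_seg, ns)

# toy example: 10 s of noise at fs = 1000 Hz -> two 4-s segments
fs = 1000
x = np.random.randn(10 * fs)
segments = segment_signal(x, fs, t=4.0)
print(segments.shape)  # (2, 4000)
```

Each row of the returned array is one segment s[n], ready for the preprocessing stage described next.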

2.2 Preprocessing

When applying machine learning techniques to classification tasks, the sensitivity of raw data necessitates careful consideration. Proper data preprocessing significantly influences classification performance. Here, we employed preprocessing steps similar to those used for obtaining classical MFCCs, as depicted in Fig. 2.

Each segment s[n] undergoes the following preprocessing steps:

  1. Pre-emphasis: The initial step involves pre-emphasizing \(s[n]\) using a first-order digital filter with the transfer function \(H(z) = 1 - \alpha z^{-1}\), where \(0.9<\alpha <1\) and \(\alpha\) is the pre-emphasis coefficient. Pre-emphasis equalizes the frequency spectrum, highlighting higher frequencies relative to lower ones; higher values of \(\alpha\) correspond to more pronounced enhancement. The pre-emphasized signal is denoted as \(u[n]\) and defined as \(u[n] = s[n] - \alpha \cdot s[n-1]\).

  2. Framing: The pre-emphasized signal \(u[n]\) is then divided into smaller blocks called 'frames', typically 20 to 40 ms in duration. To ensure a smooth transition between frames, consecutive frames overlap by 50% (Rabiner and Juang 1999). The set of frames is denoted by \(\left\{ x_i[n],\;1\le i\le NF,\;1\le n\le N\right\}\), where NF is the number of frames and N is the frame size. Because the signal characteristics change slowly over the duration of a segment, restricting the analysis to short (millisecond-scale) frames allows each frame to be treated as approximately stationary before applying the Fourier transform (FT).

  3. Windowing: Each frame is multiplied by a window function to minimize spectral leakage caused by discontinuities at the frame boundaries introduced by framing (Majeed et al. 2015). The most commonly used window in signal processing is the Hamming window, defined as follows (Rabiner and Juang 1999):

    $$\begin{aligned} w[n]=0.54-0.46\cos \left( \frac{2\uppi n}{N-1}\right) . \end{aligned}$$
    (1)
  4. Spectral estimation: An additional preprocessing step computes the spectral estimate of each frame using the short-time FT (STFT). This results in spectral coefficients, which are complex numbers containing both magnitude and phase information (Majeed et al. 2015). As is common practice, the phase information is discarded, and we utilize power spectral coefficients (PSCs) defined as

    $$\begin{aligned} p[k]=\frac{\left| \sum _{n=1}^{N}\;x[n]w[n]\text{e}^\frac{-j2\uppi nk}{N}\right| ^{2}}{N},\quad \quad 1\le k\le N. \end{aligned}$$
    (2)
Fig. 2

a Simplified flowchart illustrating preprocessing steps for audio signal segments. The process begins with an input signal segment and ends with the generation of PSCs as the output. b Graphical representation of the preprocessing pipeline for audio signal segments. The process begins with a pre-emphasis of the input signal using a first-order digital filter. The pre-emphasized signal is then divided into smaller frames. These frames are often arranged in a matrix format, where each row represents the samples of a specific frame. Each frame is then multiplied by a Hamming window function. Lastly, the STFT is applied to the windowed frames. The resultant complex-valued coefficients are squared to obtain PSCs
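The four preprocessing steps above can be sketched in a few lines of NumPy. This is a simplified illustration; the values \(\alpha = 0.97\) and the 40-ms frame length follow the experimental settings reported later in the article, and the helper name is ours:

```python
import numpy as np

def preprocess(s, fs, alpha=0.97, frame_ms=40, overlap=0.5):
    # (1) pre-emphasis: u[n] = s[n] - alpha * s[n-1]
    u = np.append(s[0], s[1:] - alpha * s[:-1])
    # (2) framing: N-sample frames with 50% overlap
    N = int(frame_ms / 1000 * fs)
    hop = int(N * (1 - overlap))
    NF = 1 + (len(u) - N) // hop
    frames = np.stack([u[i * hop : i * hop + N] for i in range(NF)])
    # (3) windowing: Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n/(N-1))
    w = np.hamming(N)
    # (4) spectral estimation: PSCs p[k] = |STFT|^2 / N
    p = np.abs(np.fft.rfft(frames * w, n=N)) ** 2 / N
    return p  # shape (NF, N//2 + 1)

fs = 8000
s = np.random.randn(4 * fs)   # one 4-s segment
psc = preprocess(s, fs)
print(psc.shape)  # (199, 161)
```

Each row of the output is the PSC vector of one frame, i.e., one row of the matrix fed into the Mel filter bank.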

2.3 Mel filter bank

The Mel filter bank consists of a set of bandpass filters with triangular envelopes, whose centers are spaced according to the Mel scale, as illustrated in Fig. 3.

Fig. 3

Representation of a Mel filter bank, comprising a series of triangular bandpass filters with centers spaced according to the Mel scale and overlapping frequency responses. This filter bank transforms the linear power spectrogram obtained from the STFT into a nonlinear representation based on the Mel-frequency scale. In the figure, f refers to the initial, central, and final frequencies of each filter, which determine the range of frequencies analyzed by the filter, whereas H denotes the filter's frequency response, set by the filter coefficients. The output of the Mel filter bank is a set of values representing the energy within the frequency band of each filter, aiding in capturing the spectral characteristics of an audio signal

The Mel filter bank is an important component in feature extraction for audio classification tasks. It transforms the linear power spectrum obtained from the STFT into a nonlinear representation on the Mel-frequency scale (Tong et al. 2020).

The Mel scale is a perceptually motivated frequency scale that approximates how humans perceive pitch. Below 1 kHz, the Mel scale exhibits a roughly linear relationship with frequency in hertz, providing finer resolution for lower frequencies. In contrast, it adopts a logarithmic spacing above 1 kHz, resulting in coarser resolution for higher frequencies (Rao and Manjunath 2017). This characteristic is particularly beneficial for underwater vessel classification because significant noise from vessels often resides in the lower frequency range (McKenna et al. 2012; Yao et al. 2023).

The conversion between the linear frequency scale (Hz) and the Mel scale is expressed mathematically in Eq. (3):

$$\begin{aligned} f_{\mathrm{Mel}}=2595\log \left( 1+\frac{f_{\mathrm{Hz}}}{700}\right) . \end{aligned}$$
(3)

Here, \(f_{\mathrm{Mel}}\) represents the frequency in Mel, and \(f_{\mathrm{Hz}}\) denotes the frequency in hertz. This equation converts linear frequencies to the Mel scale using a logarithmic function; the constant 2595 scales the conversion so that 1000 Hz maps to approximately 1000 Mel, with lower frequencies mapping to lower Mel values and vice versa.
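The conversion and its inverse can be written directly from Eq. (3) (a minimal sketch; the function names are ours):

```python
import math

def hz_to_mel(f_hz):
    # Eq. (3): f_Mel = 2595 * log10(1 + f_Hz / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    # inverse of Eq. (3)
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))  # 1000
```

Note the near-linear behavior at low frequencies and logarithmic compression above 1 kHz, which is what gives the filter bank its finer low-frequency resolution.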

The Mel spectrum of the power spectrum \(p[k]\) is calculated by weighting the power spectrum with each triangular filter and summing, as in Eq. (4). The number of filters is established experimentally, as detailed in Section 2.4, and determines the dimension of the Mel filter output vector.

$$\begin{aligned} e[m]=\sum \limits _{k=1}^{N}p[k]H_m[k],\quad \quad 1\le m\le M, \end{aligned}$$
(4)

where \(H_m[k]\) are the coefficients of the \(m\text{-th}\) Mel filter, M is the number of Mel filters, and N is the number of samples per frame. The output energies of these filters can be interpreted as subsamplings of the spectrum.

For the set of frames (NF), we obtain a set of vectors \(\left\{ e_i[m],\;1\le i\le NF,\;1\le m\le M\right\}\). These vectors are visualized within a matrix denoted by E in Eq. (5):

$$\begin{aligned} \varvec{E}= \left[ \begin{array}{ccccc} e_1[1]&\cdots &e_1[m]&\cdots &e_1[M]\\ \vdots &\ddots &\vdots & & \\ e_i[1]&\cdots &e_i[m]&\cdots &e_i[M]\\ \vdots & &\vdots &\ddots & \\ e_{NF}[1]&\cdots &e_{NF}[m]&\cdots &e_{NF}[M]\end{array}\right] ,\quad 1\le i\le NF,\;1\le m\le M. \end{aligned}$$
(5)

The matrix E serves as the basis for the subsequent calculation of MFCCs, which capture cepstral features, and for obtaining the SDF vector.

2.4 Mel filter bank design

A fundamental requirement for obtaining the output of the Mel filter bank is its proper configuration, adapted to the context of our problem. These filters are commonly constructed in the frequency domain, with central frequencies usually distributed uniformly along the Mel scale (Dixit et al. 2016; Rao and Manjunath 2017). The selection of filter bank parameters, including the number of filters, their center frequencies, and their bandwidths, directly influences the coverage and resolution characteristics of the Mel filter bank.

In this work, we adopted an experimental approach to customize the filter parameters for both MFCC and SDF calculations. This approach allowed us to tailor the filters to the unique characteristics of our dataset, potentially improving classification performance.

Previous research by McKenna et al. (2012) and Yao et al. (2023) suggested that underwater-vessel-generated noise primarily concentrates in frequencies below 1 kHz. To achieve a frequency band distribution that prioritizes high resolution in the low-frequency range and low resolution for high frequencies (Majeed et al. 2015), we designed a set of 42 experimentally optimized triangular bandpass filters. These filters were limited to a frequency range of 0 to 7 kHz. Our experiments revealed that including filters for frequencies exceeding 7 kHz did not yield significant improvements in classification accuracy. Notably, half of the filters were centered at frequencies below 1 kHz to ensure high resolution in this critical range for vessel noise. The final configuration of the Mel filter bank is presented in Table 1 and Fig. 4.

Table 1 Mel filter bank configuration. Forty-two triangular filters were designed with variable center frequencies and bandwidths
Fig. 4

Mel filter bank specifically designed for underwater vessel classification. It comprises 42 experimentally optimized triangular bandpass filters tailored to capture the spectral characteristics of underwater vessel noise. The filter characteristics, including center frequencies and bandwidths, are defined in Table 1 to prioritize high resolution in the low-frequency range and low resolution for high frequencies

2.5 Mel-frequency cepstral coefficients

Figure 5 illustrates the phases involved in extracting MFCC feature vectors. The input signal \(s[n]\) undergoes the preprocessing steps described in Section 2.2, and the preprocessed signal passes through the Mel filter bank. Next, the output energies of each filter are converted to the logarithmic domain, which emphasizes spectral variations and reduces the impact of large amplitude values. Finally, the logarithm of the Mel filter bank energies is transformed using the discrete cosine transform (DCT). This transform decorrelates the spectral information, minimizes redundancies, and emphasizes the most distinctive characteristics of the signal (Rao and Manjunath 2017; Tong et al. 2020). Additionally, applying the DCT allows for dimensionality reduction by discarding higher-order terms, as the most relevant information is typically present in the lower-order coefficients (Li et al. 2008). The resulting coefficients are known as MFCCs and are mathematically represented by Eq. (6):

$$\begin{aligned} c[n]=\sum \limits _{m=1}^{M}\log \left( e[m]\right) \cos \left( \frac{\uppi \left( m-0.5\right) n}{M}\right), \quad n=1,2,...,p, \end{aligned}$$
(6)

where \(c[n]\) are the Mel cepstral coefficients, M denotes the number of Mel filters in the filter bank, and p is the number of DCT points, equal to the number of MFCCs extracted from each frame; each frame's coefficients form one row of the matrix C in Eq. (7):

$$\begin{aligned} \varvec{C}= \left[ \begin{array}{ccccc} c_1[1]&\cdots &c_1[n]&\cdots &c_1[p]\\ \vdots &\ddots &\vdots & & \\ c_i[1]&\cdots &c_i[n]&\cdots &c_i[p]\\ \vdots & &\vdots &\ddots & \\ c_{NF}[1]&\cdots &c_{NF}[n]&\cdots &c_{NF}[p]\end{array}\right] ,\quad 1\le i\le NF,\;1\le n\le p, \end{aligned}$$
(7)

where NF represents the total number of frames in the segmented signal.

Fig. 5

Process of extracting the MFCC feature vector for a frame

Finally, the MFCC feature vector for the entire segment is obtained by averaging the columns of matrix C, as shown in Eq. (8):

$$\begin{aligned} mean_{\mathrm{MFCC}}=[\mu _1\;\cdots \;\mu _i\;\cdots \;\mu _p\;],\;\;\;\;\;1\le i\le p, \end{aligned}$$
(8)

where \(\mu _i\) represents the average value of the i-th column of matrix C.
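Assuming the energy matrix E from Eq. (5) is available, Eqs. (6)-(8) reduce to a logarithm, a DCT along the filter axis, and a column mean. A minimal sketch follows; note that SciPy's DCT-II includes a 0th coefficient, whereas the index n in Eq. (6) starts at 1, so the slice below is an approximation of the paper's indexing:

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_energies(E, p=42):
    """c[n] = DCT of log filter-bank energies (Eq. 6); the segment-level
    feature mean_MFCC is the column mean of C (Eq. 8)."""
    C = dct(np.log(E), type=2, axis=1, norm='ortho')[:, :p]
    return C.mean(axis=0)   # one p-dimensional vector per segment

E = np.random.rand(199, 42) + 1e-3   # toy filter-bank energies (NF x M)
feat = mfcc_from_energies(E, p=42)
print(feat.shape)  # (42,)
```

Averaging over frames collapses the per-frame matrix C into a single compact vector for the whole 4-s segment.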

2.6 SDF vector

Inspired by Karbasi et al. (2011), who used the SDF vector for environmental sound classification in air, we modified the algorithm to adapt it to the underwater environment.

Most feature extractors used in the context of vessel classification are the same as those commonly used in speech and music processing, such as zero-crossing rate (ZCR), MFCC, root mean square energy, spectral centroid, chroma feature, spectral flux, linear predictive coding (LPC) coefficients, and linear prediction cepstral coefficients (LPCC). The work by Sharma et al. (2020) detailed the process of obtaining each of the mentioned extractors.

Traditional feature extraction methods focus on the temporal and spectral characteristics of a signal within a single frame, where each frame is processed separately, thus failing to provide information about variations between consecutive frames.

Underwater audio analysis presents additional challenges due to the temporal variation of spectral features. These variations arise from the combination of multiple short-duration spectra with distinct characteristics, evolving over time. Consequently, effective underwater audio analysis requires techniques that capture the dynamic and variable nature of signals, identifying relevant temporal patterns and spectral features.

The proposed algorithm addresses these challenges by extracting dynamic information over 4-s segments. This approach leverages the temporal continuity of signals, explores nonuniformity, and captures how features change between frames. By extending the analysis window, the method provides a more comprehensive view of feature behavior over time, contributing to improved classification accuracy.

To extract SDFs, we apply the NF-point fast Fourier transform (FFT) to the outputs of each Mel filter across all frames, i.e., to the columns of matrix E in Eq. (5). We define \(v_m[i]=\left\{ e_i[m],\;1\le i\le NF\right\}\) as the vector extracted from the \(m\text{-th}\) column, \(1\le m\le M\). The results of the FFT applied to the columns of E are grouped into matrix Z as shown in Eq. (9):

$$\begin{aligned} \varvec{Z}= \left[ \begin{array}{c} \mathcal {F}\left\{ v_1[i]\right\} \\ \vdots \\ \mathcal {F}\left\{ v_m[i]\right\} \\ \vdots \\ \mathcal {F}\left\{ v_M[i]\right\} \end{array}\right] \overset{\triangle }{=} \left[ \begin{array}{ccccc} z_1[1]&\cdots &z_1[i]&\cdots &z_1[NF]\\ \vdots &\ddots &\vdots & & \\ z_m[1]&\cdots &z_m[i]&\cdots &z_m[NF]\\ \vdots & &\vdots &\ddots & \\ z_M[1]&\cdots &z_M[i]&\cdots &z_M[NF]\end{array}\right] ,\quad 1\le i\le NF,\;1\le m\le M. \end{aligned}$$
(9)

To conclude the process of obtaining the SDF vectors, two additional steps are performed to reduce data redundancy and dimensionality. Similar to the calculation of MFCCs, we apply the logarithm to the matrix Z to equalize the spectrum. Next, we apply the DCT to extract cepstral parameters from the output energies of the filters (Karbasi et al. 2011). The result is the matrix of dynamic components (DC) in Eq. (10):

$$\begin{aligned} \varvec{DC}= \left[ \begin{array}{ccccc} dc_1[1]&\cdots &dc_1[n]&\cdots &dc_1[L]\\ \vdots &\ddots &\vdots & & \\ dc_m[1]&\cdots &dc_m[n]&\cdots &dc_m[L]\\ \vdots & &\vdots &\ddots & \\ dc_{M}[1]&\cdots &dc_{M}[n]&\cdots &dc_{M}[L]\end{array}\right] ,\quad 1\le m\le M,\;1\le n\le L, \end{aligned}$$
(10)

where L is the number of DCT points.

The DC matrix is subsequently rearranged, with its rows concatenated to compose the SDF vector, characterized by dimensions of 1 \(\times\) (LM), where M represents the number of Mel filters.
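Under the same assumptions as before, the SDF computation of Eqs. (9) and (10) can be sketched as follows. Taking the magnitude of the FFT and adding a small offset inside the logarithm are our choices for numerical robustness, not details specified in the text:

```python
import numpy as np
from scipy.fft import dct

def sdf_vector(E, L=4):
    """SDF sketch: NF-point FFT along the frame axis of each Mel-filter
    output (columns of E), log, L-point DCT per filter, then the rows of
    the DC matrix are concatenated into a 1 x (L*M) vector."""
    Z = np.abs(np.fft.fft(E, axis=0))        # Eq. (9): Z is NF x M here
    logZ = np.log(Z + 1e-12)                 # equalize the spectrum
    DC = dct(logZ.T, type=2, axis=1, norm='ortho')[:, :L]  # Eq. (10): M x L
    return DC.flatten()                      # SDF vector, 1 x (L*M)

E = np.random.rand(199, 42) + 1e-3   # toy filter-bank energies (NF x M)
sdf = sdf_vector(E, L=4)
print(sdf.shape)  # (168,)
```

With L = 4 DCT points and M = 42 filters, the flattened vector has 4 × 42 = 168 components, matching the dimension used in the experiments.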

2.7 Combined feature vector

As mentioned earlier, the combined feature vector is formed by fusing the MFCCs with the SDF. The schematic diagram in Fig. 6 details all the steps of extracting the combined feature vector. In summary, the matrix of PSCs for the set of frames is passed through a filter bank designed on the Mel scale, and the output vectors of the Mel filter bank are used as input vectors for obtaining both the MFCCs and the SDF.

Fig. 6

Process of extracting the combined feature vector from the input signal, going through all the stages discussed until obtaining the MFCC + SDF vector

3 Dataset

The dataset used for testing was obtained from recordings of underwater sounds in real environments, available at http://atlanttic.uvigo.es/underwaternoise (Santos-Domínguez et al. 2016). The ShipsEar database consists of recordings of 90 vessels of 11 different types, plus ambient noise, captured at various points along the northwest coast of Spain, totaling 2.94 h of recordings. ShipsEar also provides information on environmental conditions, images of vessels, location data, acquisition time, channel depth, and distance, among other details.

To optimize the testing process and ensure a balanced distribution of samples across vessel types, a subset of 53 audio files was selected from the database. These files were further segmented into 4-s intervals, and a combined feature vector was extracted for each segment. The 4-s segment duration was selected through simulations evaluating classification effectiveness for different segment durations.

The number of recordings, as well as the size and type of vessel, was a crucial factor in separating the data into distinct classes. The correct grouping of files that exhibit similar feature vectors influenced the algorithm's accuracy rate. We chose to subdivide the 53 files into four distinct classes, grouping vessels with similar feature vectors into the same class when necessary. This approach promoted a balance in the number of samples per class. The final configuration is presented in Table 2, similar to what was presented in Santos-Domínguez et al. (2016). Each class was manually assigned a label for identification.

Table 2 presents the subdivisions of the classes. The duration corresponds to the sum of the durations of all audio files belonging to the class, and the number of segments represents the number of feature vectors for each class, as one vector was extracted per 4-s segment.

Table 2 Data set with divisions into classes

Figure 7 presents the power spectral density (PSD) of noise emitted by Class C and Class D vessels, highlighting the concentration within the crucial frequency band below 1 kHz for underwater classification (McKenna et al. 2012; Yao et al. 2023). The figure provides valuable insights for the design of Mel filters tailored to underwater vessel classification, considering the relevant frequency bands.

Fig. 7

PSD of noise emitted by a Class C vessels (merchant ship) and b Class D vessels (passenger ferry). The blue line represents the power spectrum, indicating the relative power at different frequencies. The dashed red line highlights the 1-kHz frequency, a critical band for underwater vessel classification because of the dominance of noise in this region

4 Experimental results

In this section, we analyze the results obtained when evaluating the performance of the proposed algorithm. For this purpose, evaluation metrics such as accuracy are used, measuring the proportion of correctly classified examples relative to the total number of evaluated examples (Jordan and Mitchell 2015).

The performance of the proposed feature vector was evaluated using four supervised classification methods: support vector machine (SVM), k-NN, random forest (RF), and decision tree (DT) (Saravanan and Sujatha 2018; Bianco et al. 2019; Sabara et al. 2020). These methods were chosen because of their widespread adoption in various classification tasks.

The proposed feature vector’s performance was compared against established feature extraction methods commonly used in underwater sound classification, including MFCC, PSC, LPC, LPCC, and ZCR (Sabara et al. 2020; Sharma et al. 2020). The performance of the combination of SDF with these extractors was also evaluated.

As shown in Table 2, each audio file was divided into six nonoverlapping 4-s segments, resulting in a total of 318 feature vectors for training and testing. The original sampling rate of 52734 Hz was preserved for the audio signals. In the preprocessing stage, we adopted a pre-emphasis coefficient of 0.97, frames of 40 ms duration with 50% overlap, and a Hamming window applied to each frame (Rabiner and Juang 1999; Alcaraz Meseguer 2009; Trang et al. 2015; Winursito et al. 2018).

To obtain an unbiased performance estimate and guard against overfitting, the dataset was split into training and test sets using a 75:25 ratio, allocating 75% of the data for model training and 25% for performance evaluation. Twelve files were randomly chosen for testing, resulting in 72 segments (6 segments per file) and 72 corresponding feature vectors. The remaining 41 files (246 segments) composed the training set, giving a final ratio of 246:72.

To address potential class imbalance in the test set, a stratified sampling approach was employed: 4 files were drawn from Class D (24 segments), 3 from Class B (18 segments), 3 from Class A (18 segments), and 2 from Class C (12 segments). This ensured a more balanced representation of each class in the test set.

For comparative purposes, MFCCs were extracted in two distinct ways, both using 42 Mel filters. The first approach, referred to as the conventional MFCC, followed the standard distribution of the Mel filter bank, as described in Liu et al. (2014), Dixit et al. (2016), and Rao and Manjunath (2017). The second approach, referred to as the modified MFCC, utilized the custom Mel filter configuration detailed in Section 2.4.

The SDF vector was computed using a four-point DCT (L = 4), resulting in a matrix DC limited to four columns. Because the number of filters (M) was 42, the final dimension of the SDF vector was \(4\times 42=168\). Therefore, the combined feature vector 'MFCC + SDF' comprised a total of 210 components (42 + 168).
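The dimensions quoted above can be verified with a short computation (variable names are ours):

```python
# feature-vector dimensions used in the experiments
M = 42        # Mel filters
L_dct = 4     # DCT points kept for the SDF
p = 42        # MFCCs per segment

sdf_dim = L_dct * M         # SDF vector length
combined_dim = p + sdf_dim  # 'MFCC + SDF' length
print(sdf_dim, combined_dim)  # 168 210
```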

The parameters of each classifier were tuned to the dataset characteristics. For the SVM, a radial basis function kernel yielded the best results compared with linear, polynomial, and sigmoid kernels. The k-NN classifier was configured to use the five nearest neighbors. The RF algorithm employed 100 trees, whereas the DT used the best split at each node instead of a random split.
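A hypothetical scikit-learn configuration matching these settings is shown below; the data are random stand-ins for the ShipsEar feature vectors, and any hyperparameter not mentioned in the text is left at the library default:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# classifier settings as described above: RBF-kernel SVM, k = 5,
# 100 trees for RF, best split at each node for DT
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "DT": DecisionTreeClassifier(splitter="best", random_state=0),
}

rng = np.random.default_rng(0)
X = rng.normal(size=(246, 210))       # 246 training vectors, 210 features
y = rng.integers(0, 4, size=246)      # four classes A-D
X_test = rng.normal(size=(72, 210))   # 72 test vectors

results = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    results[name] = clf.predict(X_test)
    print(name, results[name].shape)
```

The 246/72 split mirrors the training/test partition described above.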

Table 3 presents the classification accuracy achieved by the proposed feature vector and 15 other feature vectors using four classification algorithms.

Table 3 Comparison of classification accuracy for various feature vectors. The table depicts the classification accuracy achieved by the proposed feature vector (MFCC + SDF) compared with 14 feature extraction methods

Table 3 demonstrates that the SVM classifier outperformed the other classifiers for most features, except for the PSCs, ZCR, and LPCCs. This superiority could be attributed to the lower dispersion of the feature vectors extracted by the proposed algorithm in the n-dimensional feature space, which facilitates the construction of separating hyperplanes between classes.

The MFCC + SDF feature vector, which fuses the modified MFCCs with the SDF, achieved the best results in most cases, reaching 95% accuracy with the SVM classifier. This performance highlights the effectiveness of combining the modified MFCC coefficients with the SDF for underwater audio classification.

Combining the modified MFCC, conventional MFCC, LPC, LPCC, PSC, and ZCR extractors with the SDF vector increased the correct classification rate in all scenarios. This improvement likely stems from the addition of the statistically uncorrelated features of the SDF vector: by enriching the data representation, these additional features aid the classification process.

The SDF vector, proposed by Karbasi et al. (2011), was originally designed for classifying environmental sounds in air acoustics under interference from multiple sources. In this work, a modified version of the SDF achieved 100% accuracy in separating underwater ambient noise (Class A) from the vessel classes, as shown in Table 4, which presents the confusion matrix resulting from SVM classification of the SDF vector. The main diagonal of the matrix, which lists the percentage of correct classifications for each class, reveals 100% accuracy for Class A: the algorithm correctly classified all 18 segments from the three audio files of underwater ambient noise in the test set. In contrast, Class D showed significantly lower performance, with only 54.16% accuracy. This ability to discriminate environmental noise from the other classes highlights the relevance of the SDF for monitoring applications that aim to detect the presence of targets in underwater ambient noise without classifying specific vessel types.

Designing the Mel filter bank according to the specific characteristics of the target classes can significantly improve the performance of the feature extractor, as the results in Table 3 show. SVM classification using the modified MFCC feature vector, which employed the custom Mel filter bank configuration, yielded a higher correct classification rate (86.5%) than the conventional MFCC vector (81.2%) (Table 5). This underscores the importance of optimizing the filter bank design for improved feature extraction and classification accuracy.

Table 4 Confusion matrix estimated from the SDF feature vector using the SVM classification algorithm

To improve classification accuracy among the vessel classes, the SDF vector, which performed well for Class A, was combined with the modified MFCC, MFCC, LPC, LPCC, and ZCR vectors. Tables 6 and 7 present the confusion matrices obtained after combining the modified MFCC and MFCC vectors, respectively, with the SDF, both classified by SVM. In both cases, the combination raised the percentage of correct classifications for Class A to 100%. Additionally, as shown in Table 3, the overall classification accuracy improved: used independently, the modified MFCC and MFCC achieved accuracies of 86.5% and 81.2%, respectively, whereas combining them with the SDF led to substantial improvements, reaching 95% and 90%.

Table 5 Confusion matrix generated using an SVM classifier with MFCC as the feature vector
Table 6 Confusion matrix estimated using the SVM classification algorithm and the combined feature vector (MFCC + SDF)
Table 7 Confusion matrix generated using the SVM classification algorithm and by combining MFCCs with the SDF as a feature vector
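The fusion described above amounts to concatenating the per-segment feature vectors before classification. A minimal scikit-learn sketch follows; the random arrays stand in for real MFCC and SDF features, and the dimensions, kernel, and label count are assumptions for illustration, not the paper's configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for per-segment feature vectors (dimensions are illustrative).
n_segments = 120
mfcc_feats = rng.normal(size=(n_segments, 13))    # cepstral features
sdf_feats = rng.normal(size=(n_segments, 20))     # spectral dynamic features
labels = rng.integers(0, 4, size=n_segments)      # four classes, e.g. A-D

# Fusion: horizontal concatenation of the two feature vectors.
combined = np.hstack([mfcc_feats, sdf_feats])

# Standardizing before the RBF-kernel SVM keeps features on a common scale.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(combined, labels)
pred = clf.predict(combined)
```

In practice the classifier would be fitted on a training split and evaluated on a held-out test split, from which confusion matrices such as Tables 6 and 7 are computed.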

The choice of 4-s segments was based on experimental tests in which the segment size was varied from 50 ms to 10 s. Accuracy increased with segment size, reaching its best performance at 4 s.

Figure 8 illustrates the distribution of the MFCC + SDF and MFCC feature vectors in a two-dimensional plane, obtained by reducing their dimensionality to two principal components. The principal components of the MFCC + SDF vectors associated with ambient noise, shown in Fig. 8a, form a cluster clearly separated from the vectors associated with vessels. This separation facilitates the classification task and explains the 100% accuracy achieved on ambient-noise segments when the MFCC + SDF vector is used as input to the classification algorithm, as shown in Table 6.

Fig. 8

Two-dimensional representation of the distribution of feature vectors extracted using two distinct methods. Each feature vector was reduced to two dimensions, represented as principal components X1 and X2. a MFCC + SDF feature vector; b MFCC feature vector. The MFCC + SDF feature vectors associated with Class A occupy a region of the feature space distinct from that of the vectors associated with the vessel classes
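The projection shown in Fig. 8 can be reproduced with principal component analysis. A minimal scikit-learn sketch follows; the random matrix stands in for the fused feature vectors, and its dimensions are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
features = rng.normal(size=(100, 33))  # stand-in for fused MFCC + SDF vectors

# Project onto the two directions of maximum variance (X1 and X2 in Fig. 8).
pca = PCA(n_components=2)
coords = pca.fit_transform(features)
```

Each row of `coords` gives the (X1, X2) position of one segment, which can then be scattered and colored by class to visualize cluster separation.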

5 Discussion and conclusions

This work presented a cepstrum-based feature extraction methodology for the classification of marine vessels. The SDF extractor, which leveraged the nonhomogeneous nature of vessel-generated noise, was introduced to capture feature changes across successive frames.

In this study, we compared the performance of a combined feature vector, formed by fusing the SDF and MFCC vectors, with that of traditional feature extraction methods under supervised learning algorithms, using a database of real recordings of maritime vessels. The results indicated that the combination of the SDF and MFCC vectors classified vessels better than several traditional extraction methods evaluated on the dataset analyzed.

Despite the various signal processing techniques employed, there is still room for improvement. Although a 4-s segment is a relatively long audio instance, the resulting SDF vector is high-dimensional and would benefit from dimensionality reduction, since larger feature vectors demand greater computational effort during classification.

The results also suggest that, for the analyzed dataset, cepstral-domain extractors outperform temporal- and spectral-domain features in characterizing underwater vessel sounds. Given the high accuracy achieved in classifying environmental noise, future work will evaluate the performance of the algorithm in classifying different underwater acoustic scenes with multiple sources. Future work could also explore variations of the proposed feature extractor, modifying preprocessing parameters such as the rectangular window duration and overlap rate; assessing the impact of these changes on classification accuracy is crucial to optimizing system performance. Another avenue for future research is a detailed evaluation of the computational complexity of the method in comparison with other approaches, considering the different stages of the process: feature extraction, model training, and classification.