Various methods such as sound source localization (SSL), sound source separation (SSS), and classification have been proposed in acoustic signal processing, robot audition, and machine learning for use in real-world environments containing multiple overlap** sound events [1,2,3].

Conventional approaches use the cascade method, incorporating individual functions based on array signal processing techniques [4,5,6]. The main problem with this method is the accumulation of errors generated at each function: because each function is optimized independently of the overall task, its output may not be optimal for the subsequent blocks.

Recently, deep learning-based end-to-end methods using a single-channel microphone have been proposed [7,8,9]. Environmental sound segmentation, which simultaneously performs SSS and classification, has been reported to achieve segmentation performance superior to that of the cascade method by avoiding accumulated errors [10, 11]. However, performance deteriorates with overlapping sounds from multiple sources, because a single-channel microphone provides no spatial features.

Multichannel-based methods have been proposed for automatic speech recognition (ASR) [15, 16], where the model has multiple output layers corresponding to different speakers and is trained to cope with all possible combinations of speakers. While these studies assume mixtures of two or three speakers, it is impractical to extend them to many classes of sounds, such as environmental sounds.

A multichannel environmental sound segmentation method has been proposed [17]. This integrated method deals with SSL, SSS, and classification in the same neural network. Although the method implicitly intends for SSL, SSS, and classification to be trained simultaneously in a single network, no loss function with respect to the direction of arrival (DOA) is used in training. Thus, spatial features may not be used effectively. Deep learning-based methods for sound event localization and detection (SELD) have been proposed [18,19,20,21]. These methods simultaneously perform SSL and sound event detection (SED) of environmental sounds. Many SELD methods have two branches that perform DOA estimation and SED and are trained using loss functions on both the SED and DOA outputs. However, the DOA and the class do not correlate unless the position and orientation of the microphones remain fixed. If a sufficient dataset is not available, the network overfits to the relationship between the DOA and the class.

Across multichannel-based methods, various features, such as complex short-time Fourier transform (STFT) coefficients, the interchannel phase difference (IPD), and the sine and cosine of the IPD, have been used as spatial features [14, 18, 20], but no study has compared them.

This paper proposes a multichannel environmental sound segmentation method comprising two discrete blocks: a sound source localization and separation (SSLS) block and a sound source separation and classification (SSSC) block, as shown in Fig. 1. This paper makes the following contributions:

  • It is not necessary to set the number of sound sources in advance, because sounds from all azimuth directions are separated simultaneously.

  • Because the SSLS block and the SSSC block are discrete, there is no overfitting to the relationship between the DOA and the class.

  • A comparison of various spatial features revealed the sine and cosine of IPDs to be optimal for sound source localization and separation.

Fig. 1

Proposed framework of the environmental sound segmentation method. Spectral and spatial features were input to the sound source localization and separation block. Then each separated spectrogram was segmented into each class

2 Related work

This section describes multichannel-based approaches to sound source localization, sound source separation and classification.

2.1 Multichannel automatic speech recognition

Multichannel-based methods have been proposed for ASR [15, 16], where the model has multiple output layers corresponding to different speakers and is trained with permutation invariant training (PIT) to cope with all possible combinations of speakers. While these studies assume mixtures of two or three speakers, it is impractical to extend them to many classes of sounds, such as environmental sounds.

2.2 Multichannel environmental sound segmentation

A multichannel environmental sound segmentation method has been proposed [17]. This method uses magnitude spectra and the sine and cosine of the IPDs as input features to train SSL, SSS, and classification simultaneously in the same network. Although this method implicitly intends for SSL, SSS, and classification to be trained simultaneously in a single network, no loss function with respect to the DOA is used in training; thus, spatial features may not be used effectively. Normally, the DOA and the class do not correlate unless the position and orientation of the microphone are always fixed. If a sufficient dataset is unavailable, the network overfits to the relationship between the DOA and the class.

A combination of SSL and SSS for the classification of bird songs has been proposed [24, 25]. The method comprises SSL, SSS, and classification blocks, and uses the SSL results as spatial cues for bird song classification. However, spatial cues are ineffective if the position and orientation of the microphone differ from those during training.

2.3 Sound event localization and detection methods for environmental sound

In the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge [22], deep learning-based methods for SELD have been proposed [18,19,20,21]. These methods simultaneously perform SSL and SED of environmental sounds containing many classes. Many SELD methods have two branches that perform DOA estimation and SED and compute losses on the DOA and SED outputs, respectively. A simple SELD method optimizes both losses simultaneously, but the DOA and the class typically do not correlate unless the position and orientation of the microphones are fixed. If a sufficient dataset is unavailable, the network overfits to the relationship between the DOA and the class. Therefore, many SELD methods have reported improved performance by training the two branches separately. However, these methods reduce the frequency dimension of the features by frequency pooling [23] and cannot perform SSS. Additionally, various features, such as complex STFT coefficients, IPDs, and the sine and cosine of IPDs, have been used as spatial information, but no study has compared them.

2.4 Issues of related works

Conventional multichannel-based methods have the following drawbacks:

  • For environmental sounds containing many classes, it is impossible to set a maximum number of sound sources in advance.

  • If a sufficient dataset is unavailable, the network overfits to the relationship between the DOA and the class.

  • Various features, such as complex STFT coefficients, IPDs, and the sine and cosine of IPDs, have been used as spatial information, but no study has compared them.

To address these issues, this paper proposes a multichannel environmental sound segmentation method that includes discrete SSLS and SSSC blocks. By separating the blocks, this method prevents overfitting to the relationship between the DOA and the class. The SSLS block separates the sound sources of each azimuth direction from the complex mixture, so it is not necessary to set the number of sound sources in advance. Additionally, we compared multiple types of spatial features.

3 Proposed method

Figure 2 shows the overall structure of the proposed method, which consists of four blocks: (a) feature extraction, (b) sound source localization and separation (SSLS), (c) sound source separation and classification (SSSC), and (d) reconstruction. (a) The STFT was applied to the mixed waveforms, and the STFT coefficients were decomposed into magnitude spectrograms, the sine and cosine of the IPDs, and phase spectrograms. (b) The magnitude spectrograms and the sine and cosine of the IPDs were input into the SSLS block, which separated a magnitude spectrogram for each azimuth direction from the mixture. (c) The outputs of the SSLS block were input into the SSSC block. Because the SSLS block could not fully separate the sources by azimuth direction alone, the SSSC block additionally separated magnitude spectrograms for each class from the output of the SSLS block. (d) The time-domain signals were reconstructed using the inverse STFT.

Fig. 2

Complete architecture of the proposed method, comprising feature extraction, SSLS, SSSC, and reconstruction. STFT was applied to the waveforms. The SSLS block predicted spectrograms of each direction from the input spectrograms. Since the SSLS block could not separate sound sources arriving from close directions, the SSSC block not only performed classification but also separated the magnitude spectrograms for each class from the output of the SSLS block. Inverse STFT was applied to reconstruct the time-domain signal

Normally, there is no correlation between the DOA and the class unless the position and orientation of the microphone array are fixed. If a sufficient dataset is unavailable, conventional environmental sound segmentation methods using a single network overfit to the relationship between the DOA and the class. In contrast, the proposed method prevents such overfitting by explicitly separating the SSLS block from the SSSC block. The method was trained in two stages: first, the SSLS block was trained alone; then, the SSSC block was trained using the output of the SSLS block as input, with the weights of the SSLS block fixed. It is not necessary to set the number of sound sources in advance, because the sound sources in all azimuth directions are separated simultaneously. Although the SSLS block could not separate sound sources arriving from close directions, the SSSC block not only performed classification but also separated the magnitude spectrograms for each class from the output of the SSLS block.
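To make the two-stage procedure above concrete, the following is a minimal PyTorch sketch of the training schedule, assuming hypothetical names: ssls and sssc are the two networks, and loader_ssls and loader_sssc yield the corresponding inputs and targets. It is a simplified illustration of the procedure described here, not the authors' exact training code.

```python
import torch

def masked_mse(mask, mag, target):
    # Masked MSE used for both blocks (cf. Eqs. (3) and (5)).
    return torch.mean((mask * mag.unsqueeze(1) - target) ** 2)

def train_two_stage(ssls, sssc, loader_ssls, loader_sssc, epochs=100, lr=1e-3):
    # Stage 1: train the SSLS block alone.
    opt = torch.optim.Adam(ssls.parameters(), lr=lr)
    for _ in range(epochs):
        for feats, mag, y_dir in loader_ssls:          # features, reference magnitude, direction targets
            loss = masked_mse(ssls(feats), mag, y_dir)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: freeze the SSLS weights, then train the SSSC block on its outputs.
    for p in ssls.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(sssc.parameters(), lr=lr)
    for _ in range(epochs):
        for x_dir, mag, y_cls in loader_sssc:          # one separated direction per sample
            x_in = torch.stack([x_dir, mag], dim=1)    # pair it with the mixture spectrogram
            loss = masked_mse(sssc(x_in), mag, y_cls)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return ssls, sssc
```

Freezing the SSLS parameters in the second stage is what removes any gradient path that could link the DOA to the class during SSSC training.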

3.1 Feature extraction

We used the following spectral and spatial features proposed in [26, 27]. The input signals were multichannel time-series waveforms with a sampling rate of 16 kHz. The STFT was applied using a window size of 512 samples and a hop length of 256 samples. A reference microphone, p, and non-reference microphones, q, were selected. The magnitude spectrograms of the reference microphone, normalized to the range [0, 1], were used as spectral features. The sine and cosine of the IPDs were used as spatial features:

$$ \mathrm{sinIPD}(t, f, p, q)=\sin(\theta_{t, f, p, q}), $$
(1)
$$ \mathrm{cosIPD}(t, f, p, q)=\cos(\theta_{t, f, p, q}), $$
(2)

where $\theta_{t,f,p,q}$ is the IPD between the STFT coefficients $x_{t,f,p}$ and $x_{t,f,q}$ at time t and frequency f of the signals at the reference microphone p and a non-reference microphone q.
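The following is a minimal sketch of this feature extraction using SciPy, assuming the input is a NumPy array of shape (channels, samples); the function name extract_features is illustrative, not part of the original work.

```python
import numpy as np
from scipy.signal import stft

def extract_features(wave, fs=16000, ref=0):
    """Magnitude of the reference channel plus sin/cos IPDs to every
    non-reference channel (Section 3.1); wave has shape (channels, samples)."""
    _, _, Z = stft(wave, fs=fs, nperseg=512, noverlap=256)   # (ch, F, T), hop = 256 samples
    mag = np.abs(Z[ref])
    mag = mag / (mag.max() + 1e-8)                           # normalize to [0, 1]

    # theta_{t,f,p,q}: phase difference between reference mic p and each mic q
    theta = np.angle(Z) - np.angle(Z[ref])                   # broadcast over channels
    others = [q for q in range(Z.shape[0]) if q != ref]
    sin_ipd = np.sin(theta[others])                          # Eq. (1)
    cos_ipd = np.cos(theta[others])                          # Eq. (2)
    return np.concatenate([mag[None], sin_ipd, cos_ipd], axis=0)
```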

3.2 Sound source localization and separation

Figure 3 shows an overview of the SSLS block. The SSLS block predicted 360/n spectrograms, one for each azimuth direction, at angular resolution n. We used Deeplabv3+, which was originally proposed for semantic segmentation of images [28]. Deeplabv3+ has been reported to improve segmentation performance for environmental sounds with various event sizes [17].

Fig. 3

Overview of the SSLS block. The SSLS block predicted 360/n spectrograms, one for each azimuth direction, at angular resolution n

Figure 4 shows the structure of Deeplabv3+ used for the SSLS block. Similar to U-Net [7], which is often used as a conventional model, it has an encoder-decoder structure. The encoder is a convolutional neural network that extracts high-level features; an Xception [29] module was used for feature extraction, and it outputs a feature map that is 1/16 of the original spectrogram size. The biggest difference from U-Net is a pyramid structure of dilated convolutions (Fig. 5), which expands the receptive field without increasing the number of parameters.

Fig. 4

Architecture of Deeplabv3+, which predicted spectrograms of each direction from the input spectrograms at angular resolution n

Fig. 5

Dilated convolution. The blue area shows the convolution filter. By increasing the dilation rate, the receptive field can be expanded without increasing the number of parameters
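For illustration, the sketch below implements a pyramid of parallel dilated convolutions in PyTorch in the spirit of Fig. 5 and the Deeplabv3+ design; the channel counts and dilation rates (1, 6, 12, 18) are assumptions for the example, not the exact configuration of the proposed model.

```python
import torch
import torch.nn as nn

class DilatedPyramid(nn.Module):
    """Parallel 3x3 convolutions with increasing dilation rates: larger rates
    widen the receptive field without adding parameters per branch (cf. Fig. 5)."""
    def __init__(self, ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(ch, ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.merge = nn.Conv2d(ch * len(rates), ch, kernel_size=1)

    def forward(self, x):
        # Each branch sees the same feature map at a different receptive field,
        # and the branches are fused by a 1x1 convolution.
        return self.merge(torch.cat([b(x) for b in self.branches], dim=1))
```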

The SSLS block predicted 360/n spectrograms, one for each azimuth direction, at angular resolution n. While PIT requires the number of sound sources to be set in advance [15, 16], this method does not. The spectrograms of directions in which no sound source exists were zero. If multiple sources exist in the same direction, the SSLS block cannot separate them; however, the sources that could not be separated by the SSLS block were separated by the SSSC block described below. Note that the SSLS block predicts a spectrogram for each azimuth angle regardless of the class, so the network does not overfit to the relationship between the DOA and the class.

Equation (3) represents the loss function used in the training. $\boldsymbol{X}$ denotes the input spectrograms of the mixed signal, and $\boldsymbol{Y}_{ssls}$ denotes the magnitude spectrograms of the target sounds. The mean squared error (MSE) between the output of the network and the target sounds was used for the training:

$$ L(\boldsymbol{X},\boldsymbol{Y}_{ssls})=\|f(\boldsymbol{X})\circ\boldsymbol{X}_{mag}-\boldsymbol{Y}_{ssls}\|_{2}, $$
(3)

where $f(\boldsymbol{X})$ denotes the mask spectrograms generated by the model, $\circ$ denotes the element-wise product, and $\boldsymbol{X}_{mag}$ is the magnitude spectrogram of the reference microphone. The model was trained for 100 epochs at a learning rate of 0.001 using the Adam optimizer [33].
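As a concrete reading of Eq. (3), the short sketch below applies the predicted masks f(X) to the reference magnitude spectrogram and computes the MSE against the direction-wise targets; the tensor names and shapes are assumptions for illustration.

```python
import torch

def ssls_loss(mask, x_mag, y_ssls):
    # mask:   f(X), shape (batch, 360/n, F, T) -- one mask per azimuth direction
    # x_mag:  X_mag, shape (batch, F, T) -- magnitude spectrogram of the reference mic
    # y_ssls: Y_ssls, shape (batch, 360/n, F, T) -- target magnitude spectrograms
    est = mask * x_mag.unsqueeze(1)          # f(X) applied element-wise to X_mag
    return torch.mean((est - y_ssls) ** 2)   # MSE between estimates and targets
```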

3.3 Sound source separation and classification

Figure 6 shows the structure of the SSSC block. The SSLS block produces one output per azimuth direction, but the spectrograms separated by the SSLS block were input into the SSSC block one by one and segmented into each class. Although the spectrograms of all directions could be input to the SSSC block simultaneously as multichannel inputs, the network would then overfit to the relationship between the direction and the class of the sound sources. Outputs of directions in which no sound source existed were not input to the SSSC block, as

$$ \boldsymbol{X}_{sssc}=\{\boldsymbol{Y}_{ssls,n} \mid \max \{\boldsymbol{Y}_{ssls,n}\}>0.2\}, $$
(4)

where n represents the angle index of the SSLS output. In the normalized magnitude spectra, 0.2 corresponded to a maximum volume of approximately -96 dB. Because the spectrograms separated by the SSLS block might be missing necessary information, the magnitude spectrogram of the mixed sounds was concatenated with each of them and input to Deeplabv3+. Unlike the Deeplabv3+ described in Section 3.2, this block takes the mixed spectrogram and a spectrogram separated by the SSLS block as input and outputs the spectrograms for each class. Note that the outputs of the SSLS block were input to the SSSC block one by one, so the SSSC block receives no spatial features. Since the SSSC block predicts the spectrogram for each class regardless of the DOA, the network does not overfit to the relationship between the DOA and the class.

Equation (5) represents the loss function used in the training. $\boldsymbol{X}_{sssc}$ denotes the input spectrograms, and $\boldsymbol{Y}$ denotes the magnitude spectrograms of the target sound. The MSE loss between the output spectrograms and the targets was used in the training:

$$ L(\boldsymbol{X}_{sssc},\boldsymbol{Y})=\|f(\boldsymbol{X}_{sssc})\circ\boldsymbol{X}_{mag}-\boldsymbol{Y}\|_{2}, $$
(5)

where $f(\boldsymbol{X}_{sssc})$ denotes the mask spectrograms generated by the model and $\boldsymbol{X}_{mag}$ is the magnitude spectrogram of the reference microphone. The model was trained for 100 epochs at a learning rate of 0.001 using the Adam optimizer [33].
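Returning to Eq. (4), the following sketch shows one way the thresholding could be implemented in PyTorch: directions whose separated spectrogram never exceeds 0.2 are discarded before being passed to the SSSC block. The function name and tensor layout are assumptions for illustration.

```python
import torch

def select_directions(y_ssls, threshold=0.2):
    # y_ssls: (360/n, F, T) spectrograms separated by the SSLS block for one mixture
    peaks = y_ssls.flatten(1).max(dim=1).values   # maximum value per direction
    keep = peaks > threshold                      # Eq. (4): keep only active directions
    # Return the kept spectrograms and their angle indices n for the SSSC block.
    return y_ssls[keep], torch.nonzero(keep, as_tuple=True)[0]
```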

Fig. 6

Architecture of the SSSC block. Spectrograms separated by the SSLS block were input to this block one by one and segmented into each class