1 Introduction

In the era of vacuum tubes, radios were equipped with luminous dials for tuning frequencies and locating stations, complemented by a scanning dial designed for exploring different frequency options. Today’s digital tuning retains this sense of scanning for content across stations. For radio listeners, the speedy identification of their preferred content closely resembles the quick channel selection that humans effortlessly perform within milliseconds through auditory perception [10]. The manuscript is motivated by the question of whether deep learning algorithms can similarly classify sounds accurately in durations shorter than those required for human recognition.

Over the past years, convolutional neural networks (CNNs) have had a significant impact on several image and audio processing tasks [23, 27, 30]. More recently, there has been a growing focus on the Transformer model as a viable alternative to CNNs [11, 16]. Alongside these architectures, data augmentation methods designed specifically for manipulating time-frequency domain features have been proposed. One such augmentation method blends two distinct data samples using time-frequency masks that preserve the spectral correlation within each audio sample. A different strategy, MixUp [38], blends two sounds from different classes in a randomized ratio, resulting in a sound that lies between classes. Additionally, the VH-MixUp technique augments data through a nonlinear blending process applied to multiple images [36].

However, it is important to acknowledge that the aforementioned methods have the potential to partially diminish the feature information inherent in audio files. To address this concern and mitigate the loss of critical feature information, the Squeezing-Toothpaste-Time-Shift (STTS) method is introduced. This approach aims not only to enhance data diversity and reduce overfitting but also to strengthen the generalization capabilities of neural network-based algorithms. The STTS method achieves this by selectively shifting a segment of the audio spectrogram’s feature information along the temporal dimension, thereby striking a balance between augmentation and preservation.

3 Instantaneous audio datasets

Most publicly available audio datasets consist of audio clips longer than 5sec, which may or may not contain silent parts [22, 25, 28, 32, 40, 41]. This research explores audio classification within remarkably brief time intervals, utilizing a Transformer model, the Instantaneous Audio Classification Transformer (i-ACT). Unlike previous methodologies, which classify sounds with a minimum duration of 5sec or 10sec [9, 18], i-ACT specializes in instantaneous audio classification over extremely short intervals, as brief as 0.025sec.

However, to effectively utilize i-ACT for instantaneous audio classification, thorough preprocessing of the existing datasets is required. Simply segmenting existing datasets into instantaneous portions proves insufficient; silent (unvoiced) sections must also be excised from the audio files. To create the updated datasets, decibels relative to full scale (dBFS) [29] is employed to isolate the non-silent portions of the audio data used in the experimental datasets. The dBFS is a unit of measurement for amplitude levels in digital systems and provides a standard reference value [20]. It determines the absence of vocal activity in an audio segment by comparing its sample values with a baseline value. The silent segments within the dataset are identified by analyzing the amplitude of the audio files using the Pydub library in Python.
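For illustration only, the snippet below sketches how such a dBFS-based silence filter can be assembled from Pydub’s silence utilities; the file name, the 16 dB offset below the clip’s overall dBFS, and the output naming are assumptions, since the paper derives its threshold per file from the RMS as described next.

```python
# A minimal sketch of silence filtering with Pydub (assumed usage, not the
# authors' exact code). Voiced regions are kept; silences >= 50 ms are dropped.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_file("example.wav")        # hypothetical input file

# detect_nonsilent returns [start_ms, end_ms] pairs of non-silent regions.
# min_silence_len=50 mirrors the 50 msec rule used in this study;
# the threshold relative to the clip's own dBFS is an assumed setting.
voiced_ranges = detect_nonsilent(
    audio,
    min_silence_len=50,
    silence_thresh=audio.dBFS - 16,
)

for k, (start, end) in enumerate(voiced_ranges):
    audio[start:end].export(f"segment_{k}.wav", format="wav")
```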

Fig. 1

Waveform of an audio file. The audio is divided into voiceless, low-amplitude, and filtered segments

Fig. 2

i-AESC-50, i-AUrbanSound8K, i-AAESDD, i-AReaLISED and i-ARAVDESS interval histograms of audio lengths at the millisecond level. The smaller subfigure is a zoomed-in view of the range 350 to 2700 of the larger subfigure

Table 1 The attributes of newly generated instantaneous datasets (for extremely short audio) through the proposed filter

Given that the waveform of an audio file consists of multiple periodic signals, calculating the average value over a time period involves taking the Root Mean Square (RMS) of the sample values in the signal. This process identifies non-silent audio segments by comparing the average value of the periodic function with the sample value over a given time or amplitude, as illustrated in Fig. 1. As shown in the figure, the two filtered segments obtained by the dBFS criterion discard the unvoiced and lower-amplitude segments. The RMS of the audio file is calculated as follows:

$$\begin{aligned} X_{RMS} = \sqrt{\frac{1}{n}\sum \limits _{i=1}^nx_i^2}, \end{aligned}$$
(1)

where \(x_{1}, x_{2}, \ldots , x_{n}\) are the sample values, \(x_{i}\in \left[ -1,1\right]\), and n represents the total number of samples of an audio file sampled at 44.1 kHz. After calculating the RMS, a threshold value T is defined as follows:

$$\begin{aligned} \begin{aligned} T&= 20\cdot \log _{10}\left( X_{RMS}\cdot \sqrt{2}\right) , \end{aligned} \end{aligned}$$
(2)

such that \(T=0\) for a full-scale sine wave. Since the maximum of \(X_{RMS}\) is \(\frac{1}{\sqrt{2}}\), multiplying \(X_{RMS}\) by \(\sqrt{2}\) ensures that the logarithm in (2) evaluates to 0 for the full-scale sine wave. To differentiate between silent and non-silent segments, the audio signal is divided into voiceless, low-amplitude, and high-amplitude segments. Therefore, the sample values are mapped to new values, \(I_i\), by the following equation:

$$\begin{aligned} I_i=20\cdot \log _{10}\left( \mid x_{i}\mid \right) , \end{aligned}$$
(3)

where \(1\le i\le n\). The audio file segment is recognized as the part to be discarded when the stream of modified sample values \(I_i\) remains less than the threshold T for a minimum of 50msec. In this way, several audio segments of higher amplitude (loudness) can be obtained from one audio file.
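Putting Eqs. (1)-(3) and the 50 msec rule together, a minimal NumPy sketch of the filter is shown below; the function name, the frame-by-frame scan, and the small epsilon that guards against log(0) are our own additions, under the stated assumptions of samples in [-1, 1] at 44.1 kHz.

```python
import numpy as np

def find_voiced_segments(x, sr=44100, min_silence_sec=0.05, eps=1e-10):
    """Return (start, end) sample indices of the segments kept by the filter.

    x  : 1-D array of samples in [-1, 1]   (the range assumed by Eq. (1))
    sr : sample rate; the audio used in this study is sampled at 44.1 kHz
    """
    x_rms = np.sqrt(np.mean(x ** 2))                   # Eq. (1)
    T = 20.0 * np.log10(x_rms * np.sqrt(2.0) + eps)    # Eq. (2): per-file threshold
    I = 20.0 * np.log10(np.abs(x) + eps)               # Eq. (3): per-sample level

    below = I < T
    min_run = int(min_silence_sec * sr)                # 50 msec in samples

    segments, start, i = [], None, 0
    while i < len(x):
        if below[i]:
            j = i
            while j < len(x) and below[j]:             # length of the quiet run
                j += 1
            if j - i >= min_run and start is not None: # long enough -> silence,
                segments.append((start, i))            # close the open segment
                start = None
            i = j
        else:
            if start is None:                          # a voiced region begins
                start = i
            i += 1
    if start is not None:
        segments.append((start, len(x)))
    return segments
```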

Contrary to the configuration presented by Iqbal et al. in [15], which sets the threshold to a fixed value of \(-48\) dBFS, this study determines the threshold from the RMS of each audio file in order to filter out silent parts. The exact silent parts of all audio files cannot, however, be detected. The duration distribution of the audio datasets used in this study is shown in Fig. 2. The horizontal axis shows the audio-length intervals, and the vertical axis shows the number of audio segments falling in each interval. The length of each interval is marked on the histogram. The datasets newly generated through the dBFS-based filter remain homogeneous with the original datasets.

4 Generated datasets for training and evaluation

The experiment utilized five datasets: ESC-50 [28], UrbanSound8K [32], Real-Life Indoor Sound Event Dataset (ReaLISED) [25], Ryerson Audio-Visual Database of Emotional Speech & Song (RAVDESS) [22], and Acted Emotional Speech Dynamic database (AESDD) [40, 41]. The proposed filtering method was applied to generate instantaneous audio datasets for each.

The values for category count, file count, training set size, testing set size, and average duration for each generated dataset are provided in Table 1. The average duration of all generated audio files combined (i-AESC-50 \(\sim\) i-ARAVDESS) is \(1.24\times 10^{-2}\) seconds.

5 Data augmentation and transformer architecture

This section explains the proposed data augmentation method and the Transformer architecture in detail.

5.1 Proposed data augmentation method

Following the procurement of the instantaneous audio datasets, the next procedural step is data augmentation. This phase is necessary for enhancing the robustness and generalization capabilities of the model, especially in diverse and potentially challenging acoustic environments. This manuscript introduces an innovative and effective data augmentation approach known as Squeezing-Toothpaste-Time-Shift (STTS). The proposed technique aims to elevate the model’s performance by expanding its ability to generalize across a spectrum of acoustic scenarios for the instantaneous audio datasets.

The STTS method leverages the understanding that the frequency content of an audio signal can be visually represented as a color image through a spectrogram. A spectrogram serves as a multidimensional representation, essentially a matrix in which rows r correspond to frequency, columns c correspond to time, and the intensity is depicted by numerical values. This graphical representation illustrates how sound elements evolve over time and frequency, similar to a snapshot of the audio landscape.

Fig. 3

Squeezing-Toothpaste-Time-Shift (STTS) applied to the base input matrix

Algorithm 1

STTS Method

The STTS process is a dynamic operation involving the manipulation of the spectrogram columns. Specifically, a “squeezing” action is applied to a designated number of leftmost data columns, determined by the absolute value of a random integer R drawn from the range \([-10,10]\). The operation is applied to the spectrogram matrix \(V_{r,c}\), generating a modified spectrogram matrix denoted \(V_{r,c}^{\star }\). The affected columns are then moved to the rightmost side of the spectrogram, which requires a simultaneous shift of the existing data on the right side to accommodate the relocated columns. The direction of this shift depends on the sign of R: a positive R indicates a squeeze and shift to the right, whereas a negative R implies the leftward operation. If R equals zero, the original spectrogram remains unaltered.

In the resultant spectrogram matrix \(V_{r,c}^{\star }\), the pixel values are updated using the variables k and d as indices. This updating process is designed to ensure that no data is lost or omitted during the transition. By maintaining the integrity of the original information while introducing variations through the squeezing and shifting operations, the STTS method injects controlled variations into the spectrogram representation. This controlled augmentation strategy is fundamental to the robustness and adaptability of the model, allowing it to learn and generalize across diverse acoustic scenarios without compromising the quality of the underlying audio data. Algorithm 1 gives the concrete steps of the STTS method.

To facilitate a comprehensive understanding of the proposed STTS method, Fig. 3 illustrates its application to the base input matrix. The figure aids in explaining the dynamic transformations that occur during the STTS process. In Fig. 3, a random value of 2 is chosen as an example, influencing the behavior of the STTS method. This specific value results in the squeezing of the leftmost two columns of the base input matrix, which are then transposed to the right side of the spectrogram. Simultaneously, the data that was originally on the right side shifts to make space for these newly transposed columns, as depicted in Fig. 3.

It is important to note that the central portion of the matrix remains unaffected by the squeezing operation. The data in the center merely shifts to the left or right, ensuring that no information is lost during this process. This careful design ensures that the accuracy of the audio data is maintained, keeping all the detailed features in the spectrogram intact.

For consistency and uniformity, the size of all audio spectrogram matrices in this context is maintained at \(128 \times 70\). This standardization underscores the applicability and scalability of the STTS method across diverse instantaneous audio datasets, promoting its versatility as a data augmentation technique.
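Under one plausible reading of the operation described above, namely treating the squeeze-and-shift as a circular shift of the time columns, STTS can be sketched in NumPy as follows; the sign convention for R and the helper name are assumptions, and the exact index bookkeeping with k and d is the one defined in Algorithm 1.

```python
import numpy as np

def stts(spec, rng=None, r_max=10):
    """Squeezing-Toothpaste-Time-Shift, sketched as a circular time shift.

    spec : 2-D spectrogram with frequency on the rows and time on the
           columns (128 x 70 in this work).
    """
    rng = rng or np.random.default_rng()
    R = int(rng.integers(-r_max, r_max + 1))   # random integer R in [-10, 10]
    if R == 0:
        return spec                            # spectrogram left unaltered
    # Positive R: the |R| leftmost columns are squeezed out and re-appended on
    # the right, with the remaining columns shifting over; negative R mirrors it.
    return np.roll(spec, shift=-R, axis=1)

# Usage: augment one 128 x 70 spectrogram during training.
spec = np.random.rand(128, 70).astype(np.float32)   # stand-in spectrogram
augmented = stts(spec)
assert augmented.shape == spec.shape                # no columns are lost
```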

Fig. 4

The Instantaneous Audio Classification Transformer (i-ACT) architecture with the Squeezing-Toothpaste-Time-Shift (STTS) method

5.2 Transformer architecture

Figure 4 illustrates the i-ACT architecture with the Squeezing-Toothpaste-Time-Shift (STTS) method. Initially, the input audio waveform, with a duration of \(\varDelta T\) seconds, is transformed into an audio spectrogram using a 15msec Hamming window computed at intervals of 6msec. Following the approach outlined in [11, 12, 18], the frequency dimension is consistently maintained at 128. To ensure uniformity before feeding the spectrograms into the Transformer, a fixed value for the time dimension of the spectrogram needs to be provided. Through a series of experiments, it was determined that the optimal Transformer performance is achieved when the time dimension is set to 70 for such brief audio files, i.e., an average duration of \(1.24\times 10^{-2}\) seconds. Consequently, most spectrogram matrices have a shorter time dimension. If the time dimension of the generated spectrogram is less than 70, the lower section of the spectrogram matrix is filled with zeros. Conversely, if it exceeds 70, the segment beyond 70 is omitted.
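As a rough sketch of this front end (not necessarily the authors’ exact feature extraction), a 128-bin mel spectrogram with a 15 msec Hamming window and a 6 msec hop can be computed with torchaudio and then zero-padded or truncated to 70 frames; the FFT size, the mono assumption, and padding along the time axis are our own choices.

```python
import torch
import torchaudio

def waveform_to_fixed_spectrogram(wav, sr=44100, target_frames=70, n_mels=128):
    """Mono waveform -> (128, 70) spectrogram; a sketch of the i-ACT front end."""
    win_length = int(0.015 * sr)                 # 15 msec Hamming window
    hop_length = int(0.006 * sr)                 # 6 msec hop
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr,
        n_fft=1024,                              # assumed FFT size >= win_length
        win_length=win_length,
        hop_length=hop_length,
        n_mels=n_mels,
        window_fn=torch.hamming_window,
    )
    spec = mel(wav).squeeze(0)                   # (n_mels, n_frames), mono input

    n_frames = spec.shape[-1]
    if n_frames < target_frames:                 # short clip: pad with zeros
        spec = torch.nn.functional.pad(spec, (0, target_frames - n_frames))
    else:                                        # long clip: drop frames past 70
        spec = spec[:, :target_frames]
    return spec                                  # (128, 70)

wav, sr = torchaudio.load("segment_0.wav")       # hypothetical filtered segment
spec = waveform_to_fixed_spectrogram(wav, sr=sr)
```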

The spectrogram undergoes a patch division process, as illustrated in Fig. 4, wherein it is segmented into smaller units, each measuring \(16\times 16\). In contrast to traditional CNNs, which process entire images simultaneously, ViT [6] operates by dividing images into smaller patches. This patch-based processing, inspired by the Transformer, enhances computational efficiency and facilitates parallelization.

Following Gong et al.’s insights [11], overlapping patches are employed, characterized by a six-unit overlap in both the time and frequency dimensions; this strategy improves classification accuracy. Subsequently, these patches are processed sequentially through a linear projection layer, producing the patch embeddings. To retain positional information, position embeddings are introduced. These embeddings signify the position of each patch embedding within the overall sequence and are incorporated into the patch embedding sequence using linear addition. This approach ensures the preservation of crucial positional details during the transformation process.

Both the patch embeddings and the position embeddings have a size of 768 elements, with dimensions of \(16_{height}\times 16_{width}\times 3_{item}\). The \(3_{item}\) components refer to the three components required by the Transformer encoder: Query, Key, and Value [6]. These components are essential for the attention mechanism of the Transformer, as they enable it to analyze the relationships and relevance of each sequence element in the audio signal.

Class tokens, initialized tokens without inherent information, are added to the beginning of the patch embedding sequence, following the approach by Devlin et al. in [5]. While the class token itself lacks specific information, it aggregates information from other tokens in the sequence. Subsequently, the sum of each patch embedding and its corresponding position embedding is fed into the Transformer. The Transformer architecture comprises two parts: an encoder and a decoder. In this study, only the encoder is utilized for classification tasks, as it analyzes the relationships between input patches. The architecture of the Transformer encoder aligns with the design presented by Dosovitskiy et al. in [6]. The Transformer encoder’s output is then directed to a softmax classifier, which assigns confidence scores to various sound events for classification purposes.
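A minimal PyTorch sketch of this encoder-only pipeline is given below: overlapping 16 x 16 patches extracted with a strided convolution (stride 16 - 6 = 10 in both dimensions), 768-dimensional patch embeddings, added position embeddings, a prepended class token, a stack of standard Transformer encoder layers, and a classifier on the class token. The depth, number of heads, and initialization are assumptions rather than the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class MiniIACT(nn.Module):
    """Encoder-only classifier in the spirit of i-ACT (a sketch, not the
    authors' implementation). Input: (batch, 128, 70) spectrograms."""
    def __init__(self, n_classes, embed_dim=768, depth=12, n_heads=12):
        super().__init__()
        # Overlapping 16x16 patches: stride 16 - 6 = 10 along both axes.
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=16, stride=10)
        n_patches = ((128 - 16) // 10 + 1) * ((70 - 16) // 10 + 1)   # 12 * 6 = 72
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, spec):                       # spec: (B, 128, 70)
        x = self.patch_embed(spec.unsqueeze(1))    # (B, 768, 12, 6)
        x = x.flatten(2).transpose(1, 2)           # (B, 72, 768) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed   # add position embeddings
        x = self.encoder(x)                        # Transformer encoder only
        return self.head(x[:, 0])                  # logits read from the class token

logits = MiniIACT(n_classes=50)(torch.randn(2, 128, 70))
probs = logits.softmax(dim=-1)                     # per-class confidence scores
```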

6 Experimental setup

For the experiments, the PyTorch library is used and the models are trained on a GeForce RTX 3060 GPU with 12GB memory. In this study, the datasets for training and testing were divided in a 5:1 ratio using stratified sampling based on the individual sound event categories, such that approximately 83.3% of each dataset is used for training and 16.7% for testing. The i-AUrbanSound8K dataset was additionally divided in a 3:1 ratio to verify performance consistency across different ratios.
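For illustration, such a stratified split can be reproduced with scikit-learn’s train_test_split; the file list and labels below are hypothetical placeholders.

```python
from sklearn.model_selection import train_test_split

# Hypothetical lists of segment paths and their sound-event class indices.
files  = [f"segment_{i}.wav" for i in range(600)]
labels = [i % 50 for i in range(600)]            # 50 classes, evenly represented

# Stratified 5:1 split (1/6 held out); use test_size=0.25 for the 3:1 split.
train_files, test_files, y_train, y_test = train_test_split(
    files, labels, test_size=1/6, stratify=labels, random_state=0)
```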

The scarcity of files in the training set was a notable constraint, which motivated allocating as much data as possible to model training. Therefore, a validation set is not employed in this analysis [8, 11, 18]. The focus of this study was on including as many datasets as possible while ensuring that only the essential training and testing sets were allocated for an effective evaluation of model performance.

To promote convergence and minimize the impact of varying scales within the dataset, the input audio spectrogram was normalized, resulting in a dataset with 0 mean and a 0.5 standard deviation. Given the brief audio duration, the dataset is treated as a single-label dataset, with only one event considered within a unit interval. Performance assessment relies on the accuracy metric for single-label classification, while for multi-label classifications, mean average precision (mAP) is utilized.

The Adam optimizer [17] is used for training, with an initial learning rate of \(10^{-5}\). The learning rate is subsequently decreased by a factor of 0.5 for every 10 epochs. To prevent underfitting, the model is trained for 100 epochs and each experiment is repeated at least five times. Accuracy results reported in all cases are the averages obtained at the \(100^{th}\) epoch.
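For illustration, a minimal PyTorch training setup reflecting these hyperparameters might look as follows; the model is the sketch from Section 5.2, the data loader is assumed, and the per-batch normalization is only a rough stand-in for the dataset-level normalization described above.

```python
import torch

model = MiniIACT(n_classes=50)                    # sketch model from Section 5.2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)     # initial LR 1e-5
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
criterion = torch.nn.CrossEntropyLoss()           # single-label classification

for epoch in range(100):                          # 100 epochs, as in the paper
    for spec, target in train_loader:             # train_loader is assumed
        spec = (spec - spec.mean()) / (2 * spec.std())  # roughly 0 mean, 0.5 std
        loss = criterion(model(spec), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                              # halve the LR every 10 epochs
```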

7 Results and discussion

This section describes the results and a performance comparison of STTS under varied settings. The performance of i-ACT, examined across multiple datasets, is comparatively analyzed in the context of the pre-training model and diverse data augmentation methods. Moreover, i-ACT is compared with preceding methodologies, namely AST [11] and PSLA [12]. AST is a purely transformer-based algorithm, while PSLA is an ensemble of model training techniques that have been shown to significantly enhance model accuracy. The core architecture underpinning PSLA is a convolutional neural network called EfficientNet [37]. PSLA and AST are contrasted with i-ACT to underscore the advantages of the latter, which employs a transformer architecture with the objective of accurately classifying extremely short audio files. The standard deviation of all results obtained from the subsequent experiments remains less than 0.4.

7.1 Performance analysis of PSLA & AST using the ESC-50 & i-AESC-50 datasets

This section evaluates the performance of AST and PSLA and highlights the crucial functionality of i-ACT in the classification of extremely short audio clips within the i-AESC-50 dataset, leveraging optimal configurations initially devised for the ESC-50. According to preceding studies [11, 12], AST has demonstrated impressive outcomes, notably when utilizing pre-training models derived from ImageNet and AudioSet, and implementing masking techniques across both time and frequency domains. It is critical to acknowledge that, while AST achieves an accuracy of \(90.92\%\) when deployed on the ESC-50 dataset, its accuracy perceptibly recedes to \(89.20\%\) when applied to the more demanding i-AESC-50 dataset, as indicated in Table 2.

In contrast, PSLA uses ImageNet as a pre-training model and registers an accuracy of \(82.75\%\) on the ESC-50 dataset, but its accuracy decreases to \(78.89\%\) when applied to the i-AESC-50 dataset, as shown in Table 2. This table demonstrates that both AST and PSLA encounter a dip in classification accuracy when interfacing with the i-AESC-50 dataset, thereby highlighting the complex nature of classifying extremely short audio files within this dataset, relative to the ESC-50 dataset.

Table 2 Performance comparison of PSLA and AST from the ESC-50 to the i-AESC-50
Table 3 Classification accuracy of Squeezing-Toothpaste-Time-Shift method, frequency shift and non-random time shift

7.2 Performance analysis of STTS across various configurations

The Squeezing-Toothpaste-Time-Shift (STTS) method is investigated under diverse configurations, with accuracy evaluated across varied scenarios: different ranges of the random value R, frequency-shift, and non-random time-shift. Three specific ranges are considered for R: \([-10, 10]\), \([-5, 5]\), and \([-10, 15]\). In the case of the non-random time-shift, R is fixed and allocated according to the sound event classes; for instance, i-AESC-50 comprises 50 types of sound events, each assigned a fixed value corresponding to its category. The experimental findings reveal that the highest accuracy is achieved when R resides within the range \([-10, 10]\), as detailed in Table 3. As mentioned previously, all audio spectrogram matrices maintain a size of \(128 \times 70\). The optimal time-shift range for achieving the highest classification accuracy is therefore \(0.1429\times\) the number of columns in the spectrogram matrix (approximately 10 of the 70 time columns).

Although both frequency and time can be squeezed along their respective axes, the performance of STTS declines when the frequency axis is manipulated. This could potentially be ascribed to the frequency diversity inherent in i-AESC-50, suggesting that random frequency shifts may lead to overfitting. Conversely, non-random time-shift squeezing based on sound event categories was assumed to enhance model performance due to fixed labels. However, the results contradict this assumption, indicating that the model benefits from greater randomness, which increases its robustness.

7.3 Performance of i-ACT under pre-training model

This section presents a comparative assessment of i-ACT, exploring its performance against previously proposed pre-training models. In the study by Gong et al. [11], the incorporation of AudioSet as a pre-training model, used in conjunction with ImageNet, augmented the system’s performance by 7% compared to utilizing only ImageNet as a pre-training model with the original ESC-50 dataset. It is important to note that both ImageNet alone and in combination with AudioSet exhibit equivalent efficacy as pre-training models for the i-AESC-50, which was derived from the original ESC-50, as evidenced in Table 4.

Table 4 Classification accuracy of pretraining models with i-ACT
Table 5 Classification accuracy of i-ACT with data augmentation methods on i-AESC-50
Fig. 5

The confusion matrix of i-ACT + ImageNet + MixUp + Masking + STTS on i-AESC-50

Fig. 6

Learning curve of STTS compared with MixUp and Masking on i-AESC-50

7.4 Performance analysis of i-ACT with data augmentation methods on i-AESC-50

Table 5 presents the accuracy of i-ACT classification when combined with the ImageNet pretraining model and various data augmentation methods, including MixUp, Masking, and the proposed STTS. For the ablation study, i-ACT is compared with the original AST and PSLA, i.e., without any pretraining model or data augmentation technique. A notable enhancement in performance on i-AESC-50 was observed when the pretrained ImageNet model was deployed [11, 39]. Consequently, the performance of i-ACT is compared to that of the ImageNet-pretrained i-ACT. Furthermore, time/frequency masking [26] and MixUp [38] are analyzed in conjunction with STTS, as both have demonstrated improved performance in audio classification tasks [11, 18]. The experimental findings indicated that a time/frequency masking parameter of 25 and a mix-ratio of 0.4, respectively, resulted in optimal performance. Based on the results presented in Table 5, STTS achieves higher accuracy than the other data augmentation methods.

The learning curve of STTS, when compared with MixUp and Masking on i-AESC-50, is illustrated in Fig. 6. This figure shows accuracy values on the horizontal axis and epochs on the vertical axis. Examination of the learning curve indicates that STTS achieves better generalization for the i-ACT classifier.

To elucidate the classification capability of the i-ACT with data augmentation methods within the various categories of the i-AESC-50, refer to the confusion matrix illustrated in Fig. 5.

Additionally, Fig. 7 illustrates the classification accuracy of i-ACT with different data augmentation methods, as well as their combinations, across various sound events. The outcomes of i-ACT are uniformly colored to distinguish the effects of the proposed method from those of ImageNet and the other data augmentation methods. The proposed method shows improved results compared to the previous methods.

Table 6 Classification accuracy of AST and PSLA with and without STTS on the normal ESC-50

7.5 Performance analysis of STTS on the original ESC-50

STTS was also applied to the conventional ESC-50 dataset to evaluate its efficacy on standard-length audio data. The results, displayed in Table 6, reveal that neither the self-attention-based transformer nor the CNN-based PSLA exhibited a substantial performance enhancement when STTS was implemented; on the contrary, STTS somewhat diminished the performance of these models. Performance improved marginally only when Masking was utilized. Complementing the insights provided in Table 5, the findings in Table 6 affirm that STTS is effective when managing extremely short audio segments but does not produce satisfactory results when applied to audio files of standard length.

Fig. 7

The classification accuracy of i-AESC-50 by i-ACT with different data augmentation methods and pretraining model

Table 7 Classification accuracy of i-ACT with data augmentation methods based on i-AUrbanSound8K
Fig. 8

The confusion matrix of i-ACT + ImageNet + MixUp + Masking + STTS on i-AUrbanSound8K

Table 8 Classification accuracy of SOTA and proposed strategy based on i-AUrbanSound8K

7.6 Performance analysis of i-ACT with data augmentation methods on i-AUrbanSound8K & i-AAESDD

To verify the scalability of the proposed scheme, i-ACT is implemented on the i-AUrbanSound8K and i-AAESDD datasets. Table 7 demonstrates the classification accuracy of i-ACT in conjunction with various data augmentation methods as compared to prior algorithms on i-AUrbanSound8K. It should be noted that while i-AUrbanSound8K predominantly comprises urban sounds, it contains fewer sound categories than the ESC-50 dataset. According to the results presented in Table 7, i-ACT with data augmentation methods outperforms the previous methods.

Table 7 also reveals a noteworthy finding: while the latest SOTA accuracy for UrbanSound8K stands at 90% [8], i-ACT, when paired with data augmentations, achieves an accuracy of 96.4% on i-AUrbanSound8K. This discrepancy is partially attributed to the possibility that the two datasets, despite their similarities, are not identical. Therefore, while they may not be directly comparable, this insight could lay the groundwork for future investigations, such as exploring amplitude-based local data augmentation. Nonetheless, i-ACT provides a substantial enhancement on the i-AUrbanSound8K dataset.

Additionally, to demonstrate the superiority of our proposed strategy, it is compared with a SOTA method known as End-to-End Audio Transformer (EAT) [8], which is a combination of CNN and Transformer and has previously achieved the best results on the UrbanSound8K dataset. As shown in Table 8, while EAT achieved an accuracy of 86.32% on i-AUrbanSound8K, our method surpassed EAT by nearly 10%, achieving an accuracy of 96.40%.

Table 9 reveals that i-ACT, when employed alongside data augmentation methods on the emotional speech dataset i-AAESDD, achieves an accuracy of 70.98%, surpassing both PSLA and AST.

To delineate the classification capability of the i-ACT with data augmentation methods across different categories in both i-AUrbanSound8K and i-AAESDD, refer to the confusion matrices shown in Figures 8 and 9.

7.7 Performance analysis of i-ACT with data augmentation methods on i-AReaLISED & i-ARAVDESS

In contrast to the i-AUrbanSound8K dataset, the i-AReaLISED dataset encompasses indoor sound events and features a broader array of sound types. The outcomes of implementing i-ACT on i-AReaLISED are depicted in Table 10. The original i-ACT presents an accuracy of 53.85%, whereas PSLA and AST exhibit accuracies of 50.30% and 53.24%, respectively. However, i-ACT, when harmonized with data augmentation strategies, transcends both PSLA and AST, achieving an accuracy of 70.98% as opposed to the 65.58% and 59.87% achieved by PSLA and AST (when utilizing identical data augmentation strategies), respectively, as evidenced in Table 10.

To outline the classification capability of i-ACT with data augmentation methods across diverse categories in both i-AReaLISED and i-ARAVDESS, consult the confusion matrices depicted in Figures 10 and 11.

7.8 Performance analysis of Speech emotion datasets i-AAESDD & i-ARAVDESS

The analysis of the i-ARAVDESS dataset indicated an accuracy rate slightly above 50%. This relatively lower performance can be attributed to the complex nature of emotional expression in speech, which often requires analysis of the entire utterance. Short audio clips make it difficult to distinguish emotions, which are conveyed through tone, pitch, rhythm, and language. The proposed model, designed primarily for short-duration audio, struggles to capture these intricate emotional cues within such short snippets. As with i-ARAVDESS, the unique characteristics of emotional speech in i-AAESDD pose challenges for the current model. Despite these challenges, the proposed method demonstrated approximately 22% higher accuracy on the i-ARAVDESS dataset compared to the other methods, as shown in Table 11. A parallel improvement of 20% was noted for the i-AAESDD dataset, as detailed in Table 9.

In contrast to i-AAESDD, i-ARAVDESS is created from English speech and contains more varieties of sound events. It can be seen from Table 11 that classifying the emotional speech in i-ARAVDESS is more challenging than in i-AAESDD. Nevertheless, both the original i-ACT and i-ACT equipped with the augmentation approaches demonstrate superior performance compared to AST and PSLA.

Table 9 Classification accuracy of i-ACT with data augmentation methods based on i-AAESDD
Fig. 9

The confusion matrix of i-ACT + ImageNet + MixUp + Masking + STTS on i-AAESDD

Table 10 Classification accuracy of i-ACT with data augmentation methods based on i-AReaLISED
Fig. 10

The confusion matrix of i-ACT + ImageNet + MixUp + Masking + STTS on i-AReaLISED

Table 11 Classification accuracy of i-ACT with data augmentation methods based on i-ARAVDESS
Fig. 11

The confusion matrix of i-ACT + ImageNet + MixUp + Masking + STTS on i-ARAVDESS

8 Conclusion

This study introduces i-ACT, a Transformer-based algorithm for instantaneous audio classification. Initial work involved employing a dBFS-based filter to extract truncated audio recordings from publicly accessible datasets by removing silent segments; this filter can be applied to a variety of audio datasets. A new data augmentation method, STTS, is also employed to further improve classification accuracy. The proposed methodology is evaluated on different sound events and proves effective in classifying very short-duration sounds. A comparison between the proposed data augmentation scheme and standard methods was conducted on the five instantaneous audio datasets, yielding accuracy rates of 94.16%, 96.40%, 70.98%, 89.28%, and 53.51%, respectively. This study establishes a foundation for the rapid and accurate interpretation of auditory signals.

The implementation of instantaneous audio classification holds substantial potential for a wide variety of applications and future studies across diverse domains. These include real-time applications in emergency response and healthcare, enhanced user experiences in gaming and smart technologies, and advancements in academic, industrial, and environmental research.