Introduction

Background

Poor quality of sleep is well known to not only cause mood disorders, but also lead to a higher incidence of traffic accidents. Obstructive sleep apnea (OSA), a type of sleep disorder, adversely affects health, and an increasing body of evidence has demonstrated that OSA has a high correlation with hypertension, coronary artery disease, heart failure, arrhythmia, and stroke. The most common reason for OSA is an oxygen supply shortage caused by pharyngeal collapse during sleep, which increases the burden on the cardiovascular system and causes problems related to the system [1,2,3]. To enable diagnosis of OSA, polysomnography (PSG) was developed, with nasal airflow and thoracic abdominal effort used as the main measurements, through which the average number of apnea and hypopnea events per hour of sleep (the apnea–hypopnea index [AHI]) can be obtained. PSG methods include electrocardiography (ECG), peripheral oxygen saturation (SPO2) monitoring, electroencephalography (EEG), and electrooculography (EOG) [4, 5]. However, traditional methods for diagnosing OSA can only be employed under specific conditions. In addition to equipment for at least 12 types of measurements, traditional methods require a suitable slee** environment and qualified personnel to manage the measurement processes, which raises the threshold for OSA diagnosis. To solve these problems, scholars have sought methods for identifying apnea patterns through single-signal measurements, such as SPO2 [32, 33].

In our study, we aimed to develop an apnea-detection workflow for processing extensive single-lead ECG data from the China Medical University Hospital (CMUH) sleep center. To enhance our model’s performance and save computation consumption, we only utilized short-time Fourier transform (STFT) to gain features of the signal in the frequency and time domain. In terms of the model, we employed machine learning (ML) and EfficientNet-based models for classification and feature extraction, respectively. The simple preprocessing method and hybrid architecture perform well in our extensive data set.

The rest of the paper is structured as follows: Sects. “Results” and “Discussion” test the system’s performance with different data and compare its performance with similar systems. Sect. "Conclusions" concludes the paper and offers suggestions for future models. Finally, Sect. "Materials and methods" introduces our training data, presents a simple method for data preprocessing, and outlines the roles of ML and DL in the proposed system.

Results

The outcomes of this study is detailed in this section. The functions of the model (apnea event detection and OSA level screening) are tested using the ECG records of the 276 patients with OSA, which were obtained from the CMUH sleep center. The effectiveness of the model is evaluated using ACC, SP, SE, and AUC.

Per-segment classification of apnea and nonapnea

Initially, we examine the performance of the models trained under different conditions. To observe their functioning on patients with varying levels of severity, we also divided the test data into groups according to the physician-diagnosed severity of OSA patients. The performance of the model is presented in Tables 1, 2, 3. When we trained the model using the class weight with non-overlap** slicing data (Table 1), the average ACC, AUC, and SP reached 0.874, 0.900, and 0.963, respectively, although the SE achieved only 0.612. For the low-level OSA groups, the SE was low. For example, the SE in the AHI < 5 group was only 0.198. This indicates that the ratio of apnea to nonapnea segments differed considerably between OSA levels, which created problems when we trained the model. Therefore, to solve the problem, we used the sample weight rather than the class weight.

Table 1 Results of class weight and non-overlap** slicing training
Table 2 Results of sample weight and non-overlap** slicing training
Table 3 Results of sample weight and overlap**-slicing training

Through this, the ratio of the positive and negative (P/N) segments in each patient’s signal was taken into account to adjust the weight, which also enabled the model to learn rare apnea features present in low-level OSA signals. This significantly improved the sensitivity of the model, as indicated in Table 2, in which the sensitivity is notably improved for low OSA levels, with a gain of 26.5%.

Overlap** data slicing also enabled the model to learn effectively from the training data. In the sample-weight, non-overlap** data trial, only ACC = 0.854, AUC = 0.900, and SP = 0.897 were achieved. Overlap** slicing can increase the probability of complete apnea signals being included in segments, and it enables models to be fed a double amount of data. For the overlap** trial (Table 3), we developed a postprocessing smoothing method that normalizes the prediction score of each segment with those of the previous and subsequent sections to create a continuous record for the entire night. This prevents potential contradictions, such as one segment containing an apnea signal when the neighboring, overlap** segments do not. Overlap** data slicing yielded the most favorable performance, with ACC = 0.855, AUC = 0.917, and SP = 0.878. The results of the subsequent, more-detailed trial are displayed in Table A.1 in the Online Appendix.

Per-record OSA level screening

Rather than using the detected apnea events divided by the total recorded time to assess the severity of OSA, as AHI has been performed in the literature [18, 22, 25, 28, 34], we used our trained DL model to extract features of patient sleep records for more accurate identification of the patients’ sleep states. The results of the different trials are displayed in Tables 4, 5. Under the same ML model setting, the results demonstrate that the model’s performance improved as the screening OSA threshold increased, indicating that signals from patients with higher levels of OSA have clearer OSA patterns. This also explains why the sensitivity of the per-segment classification improves as the severity of the groups increases. Additionally, the results revealed that the model performed best with overlap** slicing and the sample weight set to the results of per-segment classification. When we used the SVM model as a classifier, for the AHI > 15 screening, ACC = 0.899 and AUC = 0.949, and for the AHI > 30 screening, ACC = 0.917 and AUC = 0.972. When we used the XGBoost model, for the AHI > 15 screening, ACC = 0.909 and AUC = 0.961, and for the AHI > 30 screening, ACC = 0.928 and AUC = 0.975. These results demonstrate that the XGBoost model performed more favorably than the SVM model did. In our previous study, we applied a method for OSA level screening that combines the statistical results of SPO2 data with an SVM model [7]. However, in this study, we used a trained DL model as a feature extractor in combination with an ML model to deal with ECG data. Through this proposed model, the significance of ECG data may be comparable to that of SPO2 data in OSA level screening, with the results for SPO2 data indicating ACC = 0.880 and AUC = 0.941 for AHI > 15 screening and ACC = 0.904 and AUC = 0.958 for. AHI > 30 screening [7].

Table 4 OSA level screening in SVM model
Table 5 OSA level screening in XGBoost model

Discussion

Comparison with overlap**-slicing and general slicing

Through the mathematic derivation, we can get the probabilities that a window can completely cover a particular apnea signal under different slicing methods. For the general slicing, the probability is

$$P\left({w}_{s},x\right)=\left\{\begin{array}{c}\frac{{w}_{s}-x}{{w}_{s}}, x\le {w}_{s}\\ 0, x>{w}_{s}\end{array},\right.$$
(1)

where \({w}_{s}\) is the slice window length and the x is the apnea signal length. For the overlap** slicing, the probability is

$$P\left(w,x\right)=\left\{\begin{array}{c}1, x<\frac{1}{2}{w}_{s}\\ \frac{2\left({w}_{s}-x\right)}{{w}_{s}}, {w}_{s} \ge x >\frac{1}{2}{w}_{s}\\ 0, x>{w}_{s}\end{array}.\right.$$
(2)

With the apnea length distribution of our data set (see Fig. 2), we can estimate the posterior probability that our window could cover the whole apnea signal with the marginalization method [35]:

$$P\left({w}_{s}\right)={\int }_{10}^{\infty }P\left({w}_{s},x\right)D\left(x\right)dx,$$
(3)

where D(x) is the probability distribution of the apnea length. For the general slicing, the result showed that the window with a length of 60 s could only cover the 54.3% apnea event in the slicing. And for the window length = 30 s only has 19.4% to cover the complete apnea signal. To make most of the signal could be fully covered in at least one segment, we adopted the overlap** slicing in our literature. Under this method, the probabilities that at least one segment can fully cover the apnea signal are 89.2% with window length = 60 s and 38.0% with window length = 30 s.

With the above calculation, we could know that the overlap** slicing helps cover the entire apnea signal. And the test results shown in Table 2 and Table 3 also prove the setting with overlap** slicing has superior performance to the non-overlap** slicing one.

Models applied to data from PhysioNet Apnea-ECG database

We tested the proposed model on data from the PhysioNet Apnea-ECG database, a well-established benchmark dataset commonly used for evaluating apnea-detection models [36, 37]. In order to adapt the model effectively to the features of the PhysioNet data, we employed a fine-tuning approach. Initially, we initialized the model weights with pre-trained weights obtained from the CMUH Model and utilized the Adam optimizer with an initial learning rate of 1e−5. The fine-tuning process allowed the model to better adapt to the specific features of the PhysioNet dataset. By leveraging the fine-tuning approach we adopted, we aimed to minimize potential biases or discrepancies arising from using data from different sources.

Per-segment classification

Although the overlap**-slicing training could not be applied to this 1 min, labeled data set, the results of the trial reveal AUC = 0.944 and ACC = 0.888 for apnea detection, which is higher than the results for the CMUH data set. This is likely due to the population of the data having a strong influence on the results. Compared with those in the CMUH data, most of the PhysioNet cases were distributed at normal and severe levels, making it much easier for the model to distinguish between normal and OSA segments. In addition, the large population of the CMUH data was challenging for the model to process because the diversity in the ECG signals can increase the likelihood of distortion in apnea detection, although the results of large populations are much closer to those of real-world applications.

Per-record classification

The confusion matrix for the AHI > 5 screening for the PhysioNet data is displayed in Fig. 1. Only 1 FP occurred in the 35 test records, indicating ACC = 0. 971, SE = 1.000, and SP = 0.917. In adopting only linear SVM for this small data set, we achieved results comparable to those of other studies, the performances of which are summarized in Table 6.

Fig. 1
figure 1

Confusion matrix of AHI > 5 OSA level in PhysioNet data using the EfficientNet + SVM classifier

Table 6 Comparison of our model with elated studies for per-record screen (threshold: AHI = 5) of PhysioNet data

Comparison with other studies

In comparison with other studies, our approach to addressing similar objectives differs significantly in several key aspects. We applied intuitive techniques, such as short-time Fourier transform (STFT), to convert raw electrocardiogram (ECG) data into two-dimensional images, thereby enhancing training effectiveness through innovative overlap** slicing methods. Additionally, we mitigated the pronounced imbalance in positive and negative class ratios within our dataset by carefully selecting sample weights instead of relying on traditional class weight adjustments.

Furthermore, we observed that many studies do not transform ECG data into images but rather analyze raw signals and extracted features [18,19,20,21,22,23], such as R-peaks (RA, RRI, RAID, etc.), achieving satisfactory results. Regarding model architecture, while we utilized the EfficientNet framework for image classification, other studies, exemplified by Feng et al. [23], demonstrated significant achievements using different Convolutional Neural Network (CNN) models. In non-CNN deep learning architectures, pioneering research by the studies [22, 23, 31, 33] explored the use of attention-based Transformer frameworks, yielding promising results in classification tasks.

Similarly to our labeling preprocessing, Hu et al. [32] tackled an issue by varying the map** labeling length (including surrounding signal segments while kee** the labels of the central 1-min segment). They found that increasing the map** labeling length positively affected the model’s behavior. These outcomes suggest that the completeness of the signals indeed influences the model’s capability.

Hence, the methodological differences underscore the multifaceted nature of addressing similar research objectives, with each approach offering unique advantages and challenges. This highlights the importance of further exploration and research in diverse directions within the computational analysis of ECG data.

AHI value evaluation

For the per-record classification, it is common to evaluate the predicted apnea segments divided by the total record time to assess the AHI value, as has been performed in other studies. However, some potential problems should be considered:

  1. a.

    In the real world, knowing the total sleep time is essential to calculate the AHI value. Unfortunately, if the model cannot recognize the sleep stage, it is impossible to get this value needed in the AHI formula.

  2. b.

    Undercount might happen when there are over two apnea events in a segment. And the overcount would occur when the same apnea signal is distributed in two neighborhood pieces.

To avoid the aforementioned worries, we used features extracted by the DL model to do the classification. It enabled us to directly diagnose the severity of OSA through patients’ physical states, as indicated by their sleep records. The method finally showed good prediction in both CMUH and the PhysioNet data set.

Conclusions

Our study evaluated different methods of detecting OSA signals by using DL combined with ML models. The results reveal that the method with combined overlap** slicing in preprocessing and a sample-weight setting in the training process is the most suitable for large data sets, such as those from the CMUH sleep center. Overlap** slicing increased the probability that the apnea signals could be completely sliced from 54 to 89.2%, and the sample-weight setting solved the problem of the imbalance in positive/negative segments in each sleep record. Under this setting, the ACC and AUC for apnea detection per segment reached 0.855 and 0.917, respectively. For severe OSA level (AHI > 30) screening, our developed method made the ACC and AUC reach 0.928 and 0.975, respectively. To ensure that a model achieves similarity to real-world application, the model was trained with a data set of a large population and range. Our study identified problems that may arise from use of such data sets and proposed viable solutions to solve them. Because an increasing amount of data are being systematically collected with portable ECG devices, we believe that the results of this study may be applicable in the development of ECG OSA signal detection. However, application of sleep ECG data need not be limited to apnea detection. In the future, we plan to combine ECG data with PSG records of sleep stages, which are also strongly related with sleep quality, to develop a more complete model for analyzing sleep states.

Materials and methods

Overview

Our study methodology included the following 5 steps: (1) ECG data collection from the CMUH sleep center; (2) data preprocessing using STFT and slicing; (3) model development; (4) model training; and (5) performance evaluation.

Data collection

The data used in our study were obtained from the sleep center of CMUH, and our study obtained institutional review board approval in 2020. After the data of 4 individuals who did not experience respiratory events, sleep arousal, decreases in SPO2, and periodic leg movement were excluded, data from the overnight ECG signal records of 1656 patients were included for model training and testing. Normal and apnea event periods were delineated by the technician of the center by using PSG data, which is the gold standard for diagnosing OSA. The demographics of the patient population of the collected data were of broad range, which indicates that the model was potentially trained with most types of regular cases. For example, the ages of the patients ranged from 2 to 90 years, and their body mass indexes (BMIs) ranged from 10 to 55. The demographics of the included population are listed in Table 7.

Table 7 Demographics of the entire, training, and test groups

Data preprocessing

In order to make the input data more informative and suitable for our proposed model, we do the data preprocessing as follows.

Short-time Fourier transform (STFT)

STFT is commonly used in signal processing as a method that identifies changes in frequency over time. The processed formula is shown as follows:

$${\text{X}}\left( {\text{t,f}} \right){ = }\mathop \smallint \limits_{{{ - }\infty }}^{\infty } w_{H} \left( {{\text{t - }}\tau } \right) \times \left( \tau \right){\text{e}}^{{{\text{ - i2}}\pi {\text{f}}\tau }} {\text{d}}\tau ,$$
(4)

where \({w}_{H}\) represents the window function in the STFT, which is the Hamming window [38]. X is the transformed signal, x represents the original ECG signal, and f and t are the frequency and the time, respectively. Once the signal has been transformed, the resulting output demonstrates how frequency varies over time, and can potentially reveal distinct characteristics of the event of interest, thus aiding in model recognition. To provide our model with more informative features, we transformed the overnight ECG record of each patient from a 1-dimensional signal into a 2-dimensional spectrogram. Here, we set the window length = 0.5 s by considering the spectrogram’s frequency and time resolution balance. The window length should be as large as possible to provide enough frequency information while still performing with the instant wave change in the signal. On the other hand, the overlap is chosen as 0.25 s to simultaneously shorten the preprocessing time and make the spectrogram smooth.

Slicing methods

The sleep data were then sliced into segments to input the model. The sliced length should be multiples of 30 s, which is restricted to clinical usage in the CMUH sleep center. Considering the fixed input pixel of the DL model, it is practical to set the window length under 1 min to avoid more information loss from a higher compression ratio. We also hope the model can learn the complete apnea signal patterns in sliced segments. As a result, most of the pieces should cover the apnea event. By statistics of our data (Fig. 2), it shows that apnea length < 30 s accounts for 70.6% of our data, but the apnea length < 60 s accounts for 98.4%. Finally, we chose the 60 s as our window length.

Fig. 2
figure 2

The distribution of the apnea events’ length in entire data and different severity group of the patient

Furthermore, it is still possible for the slice to accidentally cut the apnea signal into pieces. Here, we adopted the overlap** slicing, which shifts the slice window to half of the segment’s length for every step to increase the chance of covering the complete apnea signal and double the training data set. The preprocessing with non-overlap** slicing is also tried, and its outcomes are also mentioned in the results for comparison.

Data labeling

Since our study has two tasks: per-segment classification and per-record classification, we prepare our input data and labels in different forms. For the per-segment classification, we prepare the sliced 1-min segments with the labels if the segments contain apnea events. For the per-record classification, we take a set of sliced segments of a certain patient’s whole night sleep record as the input data with the label if the patient’s OSA level is larger than a certain degree. All the preprocessing data were divided into six parts equally according to the patients’ number, with 5 for training (1380 people) and validation tasks and 1 for testing (276 people).

Models

We developed the architecture of our system with Keras, an open-source code that provides modules for artificial neural networks [39]. Our method, which contains the DL and ML, performs apnea event detection in the per-segment classification and OSA level screening in per-record classification. Under the architecture (Fig. 3), the DL model is not only a classifier in apnea detection. It also extracts features from the segments in the OSA level screening task. The ML model then simply uses extracted features to classify the OSA severity. The related details are as follows.

Fig. 3
figure 3

Architecture of the proposed system: a flowchart of apnea event detection and b OSA level screening (GAP = global average pooling)

Deep learning model background

The EfficientNet is truly the backbone of our model structure. It is a kind of CNN-based model introduced by the Google AI team in 2019 [40]. The model comprises several functional layers which are the same as the standard CNN, such as pooling, convolutional, and dense layers [41]. However, their construction should follow the rules of “Compound Model Scaling”. It systematically increases these modules to ensure deep and wide balancing, which can enable scaling up the accuracy without wasting computing resources. Here, we chose the EfficientNET-B0 by comparing it with another testing B7 structure (see Table. A2 in Online Appendix), which shows almost the same performance but the lowest time consumption due to the convolution design minimizing parameters and increasing its processing speed.

Machine learning model background

Two ML models were used as a classifier in the task: the support vector machine (SVM) and XGBoost [42, 43]. XGBoost employs ensemble learning, which, due to its tree algorithm, can easily adapt data. XGBoost is often used in competitions because of its high speed and data adaptation in dealing with large samples. By comparison, the SVM model is a traditional ML model that uses a generic kernel to project data. The SVM is most effective when the number of dimensions is lower than the number of samples.

Architecture

The whole architecture can be easily separated as two parts: per-segment classification and per-record classification (see Fig. 3). For per-segment classification, the sliced 1 min spectrogram would be the input of the EfficientNet. And the inference result is the probability of whether the segment included an apnea event. For the per-record classification, we take the part of EfficientNet without its dense layers as the feature extractor. All 1-min ECG spectrograms of a patient’s whole night sleep record would then be transformed as the set of N*1280 features through it, where N is the number of the segments input. Next, we do an average on N*1280 features and get the 1280 mean features as the input to the ML model. Finally, the ML inference result enabled us to identify patients’ OSA levels.

Model training

Firstly, we trained the EfficientNet, which is mainly for the per-segment classification, with the spectrogram data. The model, which has been pre-trained with the noisy student method [44], is the starting point in our training task. In the training process, we tuned the whole parameters of the architecture, enabling the model to distinguish if the segment includes apnea or not. On the other hand, weight adjustment methods are also adopted here to solve the individual and overall imbalances in the positive/negative ratio, which brought the prediction bias of the model. The results of the sample weight and class weight adjusted in training were shown in the literature. The classic weight value of each segment was decided according to the ratio of the amounts of positive and negative pieces for the entire group. In contrast, the sample weight values depend on the personal data distribution:

$${\text{weight} }_{i}=\left\{\begin{array}{c}{\left(\frac{{A}_{i}{+N}_{i}}{{A}_{i}}\right)}^{0.8} if\, the\, slice\, has\, apnea \\ { \left(\frac{{A}_{i}{+N}_{i}}{{N}_{i}}\right)}^{0.8} if\, the\, slice\, has\, no\, apnea\end{array}\right.$$
(5)

where the \({A}_{i}\) and \({N}_{i}\) represent the number of the apnea and normal segments of the ith patient.

After EfficientNet was trained as an apnea classifier, we took the model without a fully connected layer as an OSA feature extractor in the per-record classification training task. Here, the parameters of the extractor did not need to be tuned again. The 1280 features, which are extracted from the patients’ whole night record through the preprocessing and extractor, would be used to train the machine learning model and enable it to screen the patient’s OSA level.

Performance evaluation

In the following section, we employ four common metrics: accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) to evaluate the model’s performance. The definitions of the first three parameters are presented as Eqs. (6) to (8):

$$\text{Accuracy }\left(\text{ACC}\right)=\frac{TP+TN}{TP+TN+FN+FP},$$
(6)
$$\text{Sensitivity }\left(\text{SE}\right)=\frac{TP}{TP+FN},$$
(7)
$$\text{Specificity }\left(\text{SP}\right)=\frac{TN}{TN+FP},$$
(8)

where TP and TN represent true positive and true negative, respectively. FP and FN are false positive and false negative. The receiver operating characteristic curve (ROC) measures how accurately the model can categorize the data points as positive and negative. AUC measures the area underneath the ROC curve, which makes it easy to show how well the classifier will perform the given task [45].