Introduction

Driving behavior is a vital factor in road safety. Road accidents result in approximately 1.35 million fatalities each year, making them one of the leading causes of death globally [1]. Various abnormal situations can arise while on the road [2,3,4]. A significant percentage of these accidents, approximately 74%, are linked to erratic driving behavior [5]. Therefore, understanding driver behavior can significantly reduce the rate of these accidents.

Extensive studies have been conducted on driver behavior in recent decades, with a particular focus on driving habits [6] (categorized as aggressive, normal, and cautious), driver emotions [7], dangerous driving behavior [8,9,10], and other related aspects. Such studies require capturing information about the driver’s state [11], including head posture [12], eye state [13], hand movements [14], foot movements [15, 16], and physiological signals [17, 18].

Cameras are widely used as detection devices due to their intuitiveness. Video-based behavior recognition methods use image sequences as input and combine intra-frame appearance information with inter-frame motion information to derive features for representation and classification. In [19], data on the driver’s head posture, hand movements, foot movements, and forward view were collected to detect distracted driving activity. In [20], cameras were used to monitor the driver’s eyes, head, and mouth to determine whether the driver was fatigued. In [21], ten driving behaviors, including safe driving, drinking, and talking to passengers, were effectively identified by segmenting video and applying deep convolutional neural networks. However, video-based approaches may invade the user’s privacy and may not be suitable for installation in family cars.

In addition to vision-based feature extraction methods, physiological signals including electroencephalography (EEG) [22] and electrooculography (EOG) [23] have also been employed for real-time driver status monitoring. In [24], six traffic flow conditions were designed in a simulated car-following experiment and a two-layer EEG-based driving behavior recognition system was proposed; the average accuracy is 69.5% and the highest accuracy reaches 83.5%. In [25], electrooculographic signals and a one-dimensional convolutional neural network were applied to driver behavior recognition, achieving an average accuracy of 80%. Moreover, wearable acoustic sensors, wearable wristbands for heart rate detection, and wearable EMG sensors are also widely used in Driver Activity Recognition (DAR) systems. However, the inconvenience of wearing such devices limits their potential for further applications.

In recent years, millimeter wave radar has gained widespread usage in the realm of human movement recognition. This is primarily thanks to its benefits, including its resistance to environmental influences (such as sunlight), its reliability [26], its capacity to ensure privacy protection [27, 28], and its high degree of accuracy [29, 30]. In [31, 32], linear frequency-modulated continuous-wave (FMCW) radar is effectively utilized to recognize basic movements that are typical in everyday life, such as walking, sitting, and standing up. In the domain of driver activity recognition, a study [33] evaluating driving behavior recognition using FMCW radar has successfully yielded the anticipated outcomes. The usage of radar technology is rapidly increasing, and numerous studies are underway to investigate its potential in recognizing human movement. Therefore, this paper presents a device that fuses two radar signals to detect driving behavior precisely and conveniently. In terms of methods for DAR, deep learning (DL) models have achieved state-of-the-art results in object detection, classification, generation, and segmentation tasks due to the continuous development of DL techniques, and they have been successfully applied to a number of driver monitoring tasks [34,35,36]. However, as technology evolves, deep learning networks can be further improved to provide better recognition results. This paper introduces the Radar Fusion Driver Activity Recognition Net (RFDANet) model for driver activity recognition, which employs two different types of radar (FMCW in combination with TOF radar) and focuses specifically on five common driving activities while maintaining user privacy. By pre-processing the radar data and avoiding manual feature extraction, the RFDANet model effectively addresses DAR detection and achieves strong results with a pre-trained CNN model.

The main contributions of this work are as follows:

  1. (1)

    A method for multimodal data fusion using non-intrusive radar technology is proposed for DAR. This approach improves recognition accuracy by approximately 10% compared to using a single sensor.

  2. (2)

    A novel deep learning-based approach to identify driver behavior is proposed, which combines multi-level attention fusion and a hybrid CNN-LSTM model for detecting driving behavior. This branch-attention-based multimodal fusion model automatically extracts and weighs features from different modalities, resulting in improved DAR performance.

  3. (3)

    Experiments were conducted to validate the proposed model on the driving behavior of previously unseen persons. The experimental results show that radar-based detection of driver behavior is highly promising, and the attention-based multimodal fusion model surpasses other DNN models.

The framework of this paper is illustrated in Fig. 1. The structure of this article is as follows: “Measurement setup” introduces the working principle of FMCW radar and TOF radar. The data processing methods and the developed RFDANet are presented in “Methods”. The experimental verification of the development method is performed in “Experiment”. “Analysis of results and discussion” extends further discussions. The conclusions are drawn in the final Section.

Fig. 1

General framework of the article

Measurement setup

FMCW radar system

FMCW radar systems use a chirp signal for transmission. The chirp’s frequency increases or decreases linearly over time, and a set of chirps produces a frame serving as a radar processing observation window. The system’s performance is influenced by various chirp parameters, such as frequency slope and scanning band.

The FMCW radar signal, as employed in this paper, can be represented by a simplified sawtooth modulated waveform, as shown in Fig. 2. The period of the chirp signal transmitted by the antenna is the sum of the idle time and the ramp end time. By default, the signal period is set to 160 µs, with the idle time and the ramp end time set to 100 µs and 60 µs, respectively. The signal’s maximum bandwidth is 4 GHz, which yields a range resolution of 3.75 cm. The transmit signal can be mathematically expressed as follows:

$$ T_{x} (t) = \exp \left( {j \cdot 2\pi \cdot \left( {f_{{\text{c}}} t + \frac{1}{2}\frac{B}{T}t^{2} } \right)} \right), $$
(1)

where \(B\) is the bandwidth of the transmitted signal, \(f_{c}\) indicates the carrier frequency, and \(T\) is the sweep time. Assuming an object at a distance \(R\) from the radar and moving with a radial velocity \(v\), the reflected signal is as follows:

$$ R_{x} (t) = \exp \left( {j \cdot 2\pi \cdot \left( {f_{{\text{c}}} \left( {t - \tau } \right) + \frac{1}{2}\frac{B}{T}\left( {t - \tau } \right)^{2} } \right)} \right), $$
(2)

where \(\tau = 2\left( {R + vt} \right)/c\) is the round-trip delay and \(c\) is the speed of light. In the radar sensor, by mixing the transmitted signal with the received signal, the mixed output signal can be obtained as follows:

$$ B(t) = \exp \left( {j \cdot 2\pi \cdot \left( {f_{c} \tau + \frac{B}{T}t\tau - \frac{1}{2}\frac{B}{T}\tau^{2} } \right)} \right) $$
(3)
Fig. 2

The transmitted and received signal for the FMCW radar

As depicted in Eq. (3), the mixed signal is a beat signal. The sampled mixed signal is stored on the computer and subsequently processed to obtain a micro-Doppler signature of driving behavior.
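
As a rough numerical check (not part of the original processing chain), the chirp parameters quoted above imply the following range resolution and beat frequency; the target distance of 0.6 m below is an assumed value for illustration only.

```python
# Chirp parameters taken from the text: 4 GHz sweep over a 60 µs ramp.
c = 3e8          # speed of light (m/s)
B = 4e9          # sweep bandwidth (Hz)
T_ramp = 60e-6   # ramp end time, used here as the sweep time (s)

range_resolution = c / (2 * B)            # c / 2B = 0.0375 m = 3.75 cm

R = 0.6                                   # assumed driver distance (m)
tau = 2 * R / c                           # round-trip delay (stationary target)
f_beat = (B / T_ramp) * tau               # beat frequency from the slope term in Eq. (3)

print(f"range resolution: {range_resolution * 100:.2f} cm")    # 3.75 cm
print(f"beat frequency at R = {R} m: {f_beat / 1e3:.1f} kHz")  # ~266.7 kHz
```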

TOF radar system

The TOF radar uses the zero-difference detection principle to measure distance. It achieves this by measuring the correlation between the reflected light and a reference signal. To be more specific, it transmits a near-infrared optical signal that is modulated by a sinusoidal wave. This signal is then reflected by the target surface and received by an infrared detector. By calculating the time delay of the received signal in relation to the transmitted signal, the target distance information can be determined. Figure 3 shows this principle.

Fig. 3

TOF radar transmit-receive signal schematic

Assuming that the transmitted infrared optical signal is \(g\left( t \right) = a\cos \left( {2\pi f_{0} t} \right)\), with an amplitude of \(a\) and a signal modulation frequency of \(f_{0}\), the received optical signal \(s\left( t \right)\) is then expected to be:

$$ s\left( t \right) = a_{r} \cos \left( {2\pi f_{0} t + \varphi } \right) + b, $$
(4)

where \(a_{r}\) is the amplitude of the received (reflected) signal light, \(\varphi\) is the phase delay caused by the distance to the target, and \(b\) is the offset caused by the ambient light.

The correlation equation between the transmitted and received light signals is shown as follows:

$$ c_{\tau } \left( \varphi \right) = \mathop {\lim }\limits_{T \to \infty } \frac{1}{T}\int\limits_{ - T/2}^{T/2} {s(t)g(t + \tau )} {\text{d}}t, $$
(5)

where \(\tau\) is the time delay.

To recover the amplitude \(a_{r}\) and phase \(\varphi\) of the reflected light signal, four sequential amplitude images are generally acquired, defined as:

$$ C_{i} = c_{{\tau_{i} }} \left( \varphi \right),\tau_{i} = i \cdot \frac{\pi }{{2 \times 2\pi f_{0} }},i \in \left\{ {0,1,2,3} \right\}. $$
(6)

The distance \(d\) between the driver and the TOF radar can be expressed as:

$$ d = \frac{1}{2}c_{{{\text{light}}}} \frac{\varphi }{{2\pi f_{0} }}. $$
(7)
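
A minimal sketch of how Eqs. (6) and (7) can be evaluated in practice is given below. The arctangent relation used to recover the phase from the four correlation samples is the standard four-phase (four-bucket) demodulation and is an assumption here, as are the modulation frequency and target distance.

```python
import numpy as np

f0 = 20e6        # assumed modulation frequency (Hz); not specified in the text
c_light = 3e8    # speed of light (m/s)

def tof_distance(C0, C1, C2, C3):
    """Recover phase and distance from the four correlation samples of Eq. (6).

    With C_i proportional to cos(phi - i*pi/2), the phase follows from
    phi = atan2(C1 - C3, C0 - C2); the distance then follows from Eq. (7).
    """
    phi = np.arctan2(C1 - C3, C0 - C2) % (2 * np.pi)
    d = 0.5 * c_light * phi / (2 * np.pi * f0)
    return phi, d

# Synthetic example: a target at 1.5 m with reflected amplitude 0.8.
true_phi = 2 * np.pi * f0 * (2 * 1.5 / c_light)
C = [0.5 * 0.8 * np.cos(true_phi - i * np.pi / 2) for i in range(4)]
print(tof_distance(*C))   # phase ~1.26 rad, distance ~1.5 m
```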

The TOF radar measures distance based on the time-of-flight principle, while the FMCW radar calculates distance from the frequency difference. Both capture driving behavior information, but in different domains, namely time and frequency. The TOF radar excels at measuring static driving behavior information, while the FMCW radar is more adept at detecting dynamic driving behavior information. Integrating the information obtained from both sources allows driving behavior to be identified more precisely and efficiently.

Methods

FMCW radar data processing

The acquired FMCW radar echo signal undergoes I/Q quadrature demodulation, sampling, and multi-cycle combination to yield a 128 × 256 data matrix per frame. The matrix comprises 128 rows of echo data (one per chirp cycle) and 256 columns of frequency samples. However, data obtained from actual vehicle driving often contains static clutter and DC drift, so clutter removal and suppression are necessary before further processing. To address this, an MTI filter is used in this study to eliminate DC clutter and static clutter from the data.

The MTI filter is designed to enhance the detection of driving actions by removing DC clutter, i.e., the zero-velocity components of the Doppler spectrum. Although motion information is the key indicator for detecting driving behavior, static information between consecutive movements can also provide valuable insight into different driving behaviors. Consequently, the MTI filter must purge static clutter while retaining the driver’s body information. To accomplish this, the MTI filter assigns weights of 1 or 0.5 to static interference depending on the detection distance: DC clutter representing self-interference is filtered with a weight of 1, whereas static or moving data pertaining to the target is filtered with a weight of 0.5.
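
The following is a minimal sketch of one plausible implementation of this weighted MTI step, assuming the static clutter estimate is the slow-time mean of each range bin; the range-bin indices assigned the 0.5 weight are placeholders.

```python
import numpy as np

def mti_filter(frame, target_bins, w_dc=1.0, w_target=0.5):
    """Weighted slow-time mean subtraction.

    frame          : complex FMCW data matrix of shape (128, 256) -> (chirps, range bins)
    target_bins    : range-bin indices assumed to contain the driver
    w_dc, w_target : suppression weights (1 and 0.5 as stated in the text)
    """
    weights = np.full(frame.shape[1], w_dc)        # full suppression of DC/self-interference
    weights[target_bins] = w_target                # partial suppression where the target sits
    clutter = frame.mean(axis=0, keepdims=True)    # per-range-bin slow-time average
    return frame - weights * clutter

# Example with placeholder data; in practice `frame` is one 128 x 256 radar matrix.
frame = np.random.randn(128, 256) + 1j * np.random.randn(128, 256)
filtered = mti_filter(frame, target_bins=np.arange(10, 25))
```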

To obtain more Doppler features related to driving behavior in the time–frequency domain, the FMCW radar data is transformed with a Short-Time Fourier Transform (STFT) after the MTI filtering described above. The STFT applies a sliding window to the Fourier transform (FFT), restricting the analysis to a short time interval within which the signal can be assumed to be stationary. The STFT is

$$ {\text{STFTs}}\left( {m,f} \right) = \sum\limits_{k = - \infty }^{ + \infty } {s\left( {k,n} \right)w\left( {m - k} \right)e^{{\left( { - j2\pi kf} \right)}} } . $$
(8)

The processed radar signal is denoted as \(s\left( \cdot \right)\) and the window function as \(w\left( \cdot \right)\). Although the STFT is useful in characterizing the signal distribution, there is a trade-off between temporal and frequency resolution: decreasing the window length improves temporal resolution, while increasing it enhances frequency resolution. This study investigates five driving behaviors, and a window length of 10 is chosen for the STFT analysis. As illustrated in Fig. 4, the time–frequency spectrogram resulting from MTI and STFT effectively removes DC and static clutter and reveals a clear Doppler signature.
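
A minimal sketch of the STFT step in Eq. (8) is shown below, using SciPy on a placeholder slow-time signal; the window length of 10 follows the value chosen above, while the window type, overlap, and chirp period used for the sampling rate are assumptions.

```python
import numpy as np
from scipy.signal import stft

chirp_period = 160e-6                     # 160 µs chirp period from the text
fs = 1.0 / chirp_period                   # slow-time (Doppler) sampling rate

# Placeholder MTI-filtered slow-time signal for the range bin containing the driver.
s = np.random.randn(128) + 1j * np.random.randn(128)

f, t, spec = stft(s, fs=fs, window="hann", nperseg=10, noverlap=8,
                  return_onesided=False)
spectrogram_db = 20 * np.log10(np.abs(spec) + 1e-12)   # micro-Doppler signature (dB)
```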

Fig. 4

Time spectrum diagram (before and after MTI and STFT)

To improve the efficiency of the subsequent classification model and reduce its size, the input of the designed network consists solely of one-dimensional signals. The spectral maps transformed through MTI and STFT are then downscaled using principal component analysis (PCA) to remove redundant features while preserving the important dimensions in the feature vectors.

Suppose the spectrogram is a two-dimensional set of feature vectors \(P\). The PCA transformation maps \(P\) onto a one-dimensional representation \(Q\):

$$ Q = kP. $$
(9)

The PCA method finds the linear mapping \(k\) by maximizing the following objective:

$$ \max \left( {k^{T} cov(P)k} \right), $$
(10)

where \(cov(P)\) is the covariance matrix of the spectrogram feature vectors.
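
One plausible reading of Eqs. (9) and (10) is a projection of the spectrogram onto its leading principal component, as sketched below with scikit-learn; the spectrogram size is a placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA

P = np.abs(np.random.randn(118, 10))      # placeholder spectrogram (time x frequency)

pca = PCA(n_components=1)                 # k is the leading eigenvector of cov(P)
Q = pca.fit_transform(P).ravel()          # one-dimensional feature vector Q = kP
print(Q.shape)                            # (118,)
```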

TOF radar data processing

The TOF radar information \(x,y,z\) covers several areas, such as the driver and the seat backrest, whereas only the target region concerning the driver is of interest. For this purpose, this paper uses a threshold segmentation method to divide the image pixels into regions of different grey levels: since there is a clear difference in intensity between the target and background pixel grey values, the set of target regions that meet the requirements can be extracted by setting a suitable threshold \(I\left( {x,z} \right) > T\). Additionally, the captured TOF radar data may contain a significant number of zero points caused by reflected light. These points do not contain useful information and are eliminated to streamline processing. Figure 5 displays the resulting output, with unnecessary data omitted to reduce the overall computational workload.
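
A minimal sketch of this segmentation step is given below, assuming the TOF frame is available as aligned intensity and depth arrays; the array shapes and threshold value are placeholders.

```python
import numpy as np

def segment_driver(intensity, depth, threshold):
    """Keep pixels whose grey level exceeds the threshold and whose depth is non-zero."""
    mask = (intensity > threshold) & (depth > 0)   # drop background and zero-return points
    return depth[mask], mask

intensity = np.random.rand(120, 160)               # placeholder TOF amplitude frame
depth = np.random.rand(120, 160) * 2.0             # placeholder TOF depth frame (m)
driver_depth, mask = segment_driver(intensity, depth, threshold=0.6)
```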

Fig. 5

TOF radar data of the X, Y, and Z axes

RFDANet (Radar Fusion Driver Activity Recognition Net)

To fully leverage the potential of FMCW radar information and TOF radar information, two independent forward propagation channels are established, one for each type of information. These channels are then merged through a branching structure, facilitating the fusion of FMCW radar information and TOF radar information within the model. Furthermore, to enhance the integration of the two types of information and to improve the flexibility of feature integration from the different branches, an attention mechanism is implemented in each branch. This attention mechanism assigns dynamic weights to different features. Unlike classical channel [37] and spatial attention [38], branch attention [39] is not restricted to measuring the inter-feature relationships within the model but instead emphasizes the significance of different branch features. The FMCW radar and TOF radar information is processed in separate branches, which are fused at the end of the model and passed to the LSTM classification layer. The final output of the model is used to differentiate driving behaviors. Figure 6 portrays the branching fusion strategy for integrating FMCW radar information and TOF radar information.

Fig. 6

Diagram of the integration strategy

Each branch contains several basic units consisting of regular convolutional layers together with a ReLU activation function and a max pooling layer. In addition, to further improve the performance of the model, a batch normalization (BN) layer is added between the convolutional layer and the ReLU activation in the proposed RFDANet.

By inserting BN layers into the proposed RFDANet, the features of each layer can be normalized to a fixed distribution (mean of 0 and standard deviation of 1). This helps in mitigating the feature distribution drift that may occur with increasing iterations, thereby preventing gradient vanishing during model training.
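
A minimal Keras sketch of one such basic unit is shown below; the kernel size, stride, and pooling size follow the values listed later in “Model parameters and hyperparameter settings”, and the one-dimensional layout reflects the one-dimensional network input described above.

```python
from keras.layers import Conv1D, BatchNormalization, Activation, MaxPooling1D

def basic_unit(x, filters, kernel_size=2):
    """Conv - BN - ReLU - MaxPooling basic unit; BN sits between the convolution and ReLU."""
    x = Conv1D(filters, kernel_size, strides=1, padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPooling1D(pool_size=2, strides=2)(x)
    return x
```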

To comprehensively evaluate the significance of distinct branches during the feature fusion process, RFDANet extends the attention mechanism to the branch level and integrates the features of each basic unit in the corresponding branch to constitute the global attention weights. Unlike classical channel or spatial attention, the branch attention mechanism in this paper focuses on the importance of different branch features. Specifically, assume that \(F_{l}^{m}\) is the output feature of the \(m{\text{th}}\) basic unit convolutional layer in the \(l{\text{th}}\) branch. Then the local attention weights of the \(m{\text{th}}\) basic unit in the \(l{\text{th}}\) branch, the global attention weights of the \(l{\text{th}}\) branch, the weighted feature map of the \(l{\text{th}}\) branch, and the final feature map after the fusion of different branches are given by Eqs. (11), (12), (13), and (14), respectively. Figure 7 illustrates the general architecture of the developed RFDANet.

$$ W_{l}^{m} = AWM_{l}^{m} \left( {F_{l}^{m} } \right) $$
(11)
$$ W_{l} = Fusion\left( {W_{l}^{1} ,W_{l}^{2} ,...W_{l}^{m} } \right) $$
(12)
$$ F_{l}^{final - weighted} = F_{l}^{final} \otimes W_{l} $$
(13)
$$ F^{final} = Fusion\left( {F_{1}^{final - weighted} ,F_{2}^{final - weighted} } \right), $$
(14)

where \(AWM_{l}^{m}\) is the attention weight module on the \(m{\text{th}}\) basic unit of the \(l{\text{th}}\) branch that outputs the local attention weights, \(F_{l}^{final}\) is the unweighted feature map output by the \(l{\text{th}}\) branch, \(\otimes\) denotes element-wise multiplication of the features, and \(Fusion\) denotes the feature fusion process.
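
The sketch below illustrates how Eqs. (12)–(14) can be realized, assuming each local attention weight has already been reduced to a scalar by the attention weight module (as in the auxiliary network described in “Model parameters and hyperparameter settings”); using multiplication to fuse the local weights and concatenation to fuse the two branches follows the fusion choices discussed below, but the exact combination is an assumption.

```python
import numpy as np

def branch_weight(local_weights):
    """Fuse the scalar local attention weights of one branch into its global weight (Eq. 12)."""
    return float(np.prod(local_weights))

def fuse_branches(f_fmcw, w_fmcw, f_tof, w_tof):
    """Weight each branch's final feature map (Eq. 13) and fuse the branches (Eq. 14)."""
    return np.concatenate([w_fmcw * f_fmcw, w_tof * f_tof], axis=-1)

# Placeholder example: two branch feature maps and their per-unit local weights.
f_fmcw, f_tof = np.random.rand(16, 32), np.random.rand(16, 32)
w_fmcw = branch_weight([0.9, 0.7, 0.8])
w_tof = branch_weight([0.6, 0.8, 0.9])
fused = fuse_branches(f_fmcw, w_fmcw, f_tof, w_tof)   # shape (16, 64)
```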

Fig. 7

RFDANet Network General Architecture

In this paper, we mainly consider the common methods of addition, multiplication, and concatenation (splicing) for fusing FMCW radar information and TOF radar information. For the fusion of AWM feature maps, multiplication is used.

In the developed RFDANet, a multi-level feature fusion strategy is employed to obtain the branch-level attention weights and to integrate the influence of each layer’s feature maps. In each basic unit, the feature maps generated by the convolutional layers are first converted into local attention weights using the local attention module. Then, the branch attention weights representing the importance of the corresponding branches are obtained by multi-level local attention fusion. Specifically, in the local attention module, the output feature maps undergo a compression process through global average pooling (GAP) and global maximum pooling (GMP) along the channel direction. Afterward, the compressed features are fed into a multilayer perceptron (MLP) with shared weights, generating two sets of one-dimensional weightings. It is worth noting that the main purpose of using a weight-sharing MLP in the local attention module is to reduce the number of training parameters and the computational cost of the developed RFDANet; MLPs with non-shared weights or other network structures may also be considered in practical applications. Finally, the weights of the corresponding branches are obtained by summing the two sets of weights.

Assuming that \(F_{l}^{m}\) is the output feature of the \(m{\text{th}}\) basic unit convolutional layer of the \(l{\text{th}}\) branch, the calculation of the \(m{\text{th}}\) attention weight module \(AWM_{l}^{m}\) of the \(l{\text{th}}\) branch is given by the following equation. Figure 8 shows the internal structure of the attention module.

$$ \begin{aligned}AWM_{l}^{m} \left( {F_{l}^{m} } \right) & = Sigmoid\big( MLP\left( {GAP\left( {F_{l}^{m} } \right)} \right) \\ &\qquad \oplus MLP\left( {GMP\left( {F_{l}^{m} } \right)} \right) \big)\end{aligned}, $$
(15)

where \(\oplus\) denotes the addition operation for the corresponding position element and \(Sigmoid\) denotes the activation function that allows branch attention weights to be restricted to the range (0, 1).
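
A Keras sketch of the attention weight module in Eq. (15) is given below. The shared-weight MLP is implemented by reusing the same Dense layers for both pooled descriptors; the layer widths (one-half, one-quarter, and one neuron) follow the auxiliary-network description in “Model parameters and hyperparameter settings”, and the choice of pooling over the temporal axis is an assumption.

```python
from keras.layers import (GlobalAveragePooling1D, GlobalMaxPooling1D,
                          Dense, Add, Activation)

def attention_weight_module(feature_map, channels):
    """sigmoid(MLP(GAP(F)) + MLP(GMP(F))) with a weight-sharing MLP (Eq. 15)."""
    dense1 = Dense(channels // 2, activation="relu")
    dense2 = Dense(channels // 4, activation="relu")
    dense3 = Dense(1)                                  # one scalar weight per basic unit

    def shared_mlp(x):
        return dense3(dense2(dense1(x)))               # same layer objects -> shared weights

    gap = GlobalAveragePooling1D()(feature_map)
    gmp = GlobalMaxPooling1D()(feature_map)
    summed = Add()([shared_mlp(gap), shared_mlp(gmp)])
    return Activation("sigmoid")(summed)               # weight restricted to (0, 1)
```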

Fig. 8

Internal module diagram

Experiment

This section delves into the measurement of driving behavior through radar technology in real-life driving scenarios. It covers the measurement environment, system, and parameter settings. In this paper, two radar systems based on different principles (FMCW radar and TOF radar) are used for the measurement of driving behavior.

The measurement system relies on a Texas Instruments (TI) millimeter wave radar evaluation board, illustrated in Fig. 9. Two evaluation boards were employed: the IWR6843BOOST, responsible for signal transmission and reception, and the DCA1000EVM, responsible for real-time data transmission. With the assistance of the mmWave Studio software system, radar signal waveform parameters, transmit, and receive procedures were configured. No additional hardware beyond a mobile power supply and a laptop was necessary for the measurements.

Fig. 9

Measurement systems and scenarios

The main measuring parameters are shown in Table 1. The measurement campaign was conducted in the 77 GHz band with an effective bandwidth of 4 GHz for signal transmission and reception.

Table 1 Measuring parameters of the radar

The tested vehicle was a Volkswagen Magotan 1.8 TSI, which has both front and rear rows of seats. The front seats were occupied by the driver and the safety supervisor, while the rear seat held one passenger who was responsible for operating the laptop to record data. During the measurement, the FMCW radar board was fixed with a bracket in front of the driver’s right side, while the TOF radar was placed above and in front of the driver, where the sun visor is typically located. The software parameters and the transmit and receive commands were configured manually on the laptop.

The vehicle was driven along the lane of an open, unoccupied campus environment during a holiday period, under typical driving conditions. The driver navigated a straight line, random left and right turns, as well as U-turns. The vehicle travel route is shown in Fig. 10.

Fig. 10

Vehicle route map

During the measurement process, each volunteer was asked to adjust their seat to the position that was most comfortable for them. The distance between the radar and the volunteer ranged from 0.4 to 0.8 m, while the height of their head was between 0.9 and 1.2 m. The three volunteers performed each behavior 20 times, 10 of them in a parked car and 10 while the car was moving at approximately 20 km/h.

A total of approximately 30 min of data collection was carried out to ensure that the large amount of data gathered was sufficiently random. Three volunteers participated in this experiment, performing 5 types of movements under real driving conditions: normal driving, head-up, head-turning, looking at a mobile phone, and dancing to music. These 5 driving behaviors are described in Fig. 11. All procedures involving human research conformed to the ethical standards of Zhejiang Sci-Tech University and followed the Declaration of Helsinki.

Fig. 11

Diagram of experimental data

  1. (1)

    Normal driving: the driver keeps the upper body upright and drives normally, with little to no other movement;

  2. (2)

    Head up: the head leans forward slowly and then tilts up, with a controlled period of about 4 s, to simulate driving fatigue;

  3. (3)

    Head twisting: the driver turns the head back to the left and to the right while keeping the torso still, twice back and forth, with a cycle of about 4 s;

  4. (4)

    Picking up the phone: the driver holds the phone in the left hand, picks it up, and places it in front of him while driving;

  5. (5)

    Dancing to music: popular songs are played in the car and the driver cannot help but move to the music while driving.

As shown in Fig. 11a, the distance-Doppler diagram is typically smooth during regular driving. There are no visible periodic waveforms, only slight fluctuations, which may be due to vibrations caused by the motion of the vehicle; this becomes apparent when comparing with the interior of a vehicle that has not been started.

As shown in Fig. 11b, the distance-Doppler trajectory shows a positive Doppler component during the head-up action. The distance decreases, leading to the eventual disappearance of the Doppler signal. However, when the driver tilts his head forward before raising it again, the Doppler and distance features of the trajectory increase again after the initial drop.

As shown in Fig. 11c, only the lateral rotation to the right side was captured during the head-twisting maneuver, as the left and right rotations produce identical Doppler and distance signals. Apart from the expected negative Doppler components, certain positive Doppler components were also observed. This is because, while rotating the neck, one side of the face turns away from the radar whereas the other side moves closer to it. However, a trajectory point was still identifiable from the distance-Doppler diagram, which makes the positive Doppler components more challenging to account for.

As shown in Fig. 11d, dancing to music shows a periodic character on the distance-Doppler diagram, and the periodicity of the waveform is obvious.

As shown in Fig. 11e, the distance-Doppler diagram of picking up a mobile phone while driving shows that the Doppler power keeps increasing with a steeper slope as the phone is picked up.

Model parameters and hyperparameter settings

The RFDANet model consists of two branches, each containing three Convn—BN—ReLU—Pooling basic units. The first and third convolutional layers have 32 convolutional kernels, while the second convolutional layer has 16 convolutional kernels of size 2 with a stride of 1. The pooling region and stride in the MaxPooling layer are both set to 2. Additionally, between the convolutional and MaxPooling layers, a BN layer and a ReLU function are incorporated. The features from the last basic unit in the two branches are combined and fused by tiling, and the fused feature map is then fed into the MLP for classification. The MLP consists of three fully connected (FC) layers, with 256, 128, and 5 neurons per layer, respectively. Additionally, the output feature map of each convolutional layer is passed through an auxiliary network to generate the local attention weights before proceeding to the next convolutional layer. The auxiliary network employed in this paper comprises three fully connected layers, with the number of neurons set to one-half of the number of feature-map channels, one-quarter of the number of channels, and one, respectively. The implementation of the deep learning model discussed in this paper relies on TensorFlow 1.13.1 and Keras 2.2.4. The computing platform used is Win10 Professional (64-bit) with a Core (TM) i7-6700HQ CPU and 12 GB RAM (2133 MHz).

To further compare the performance of different information fusion methods, a diagnostic model based on data-level information fusion, namely the 1D-CNN method [40], is also constructed in this section. The 1D-CNN model consists of 3 Convn—BN—ReLU—Pooling basic units, with the convolutional layer and MLP parameters being the same as those of RFDANet. The difference is that the 1D-CNN has only a single branch and does not contain an attention module. Table 2 outlines the specifics of each model. Cross-entropy is employed as the classification loss function during network training. A mini-batch strategy was used for training, with 64 samples per mini-batch. In addition, the Adam optimizer was used with consistent parameter settings, and dropout with a retention rate of 0.5 was applied between the last two fully connected layers to mitigate overfitting.
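
The classification head and training configuration described above can be sketched as follows in Keras; `fused_features`, the branch inputs, and the epoch count are placeholders, and the dropout position follows the statement that it sits between the last two fully connected layers.

```python
from keras.layers import Dense, Dropout, Flatten
from keras.optimizers import Adam

def classification_head(fused_features):
    """MLP classifier with 256, 128, and 5 neurons, and dropout between the last two FC layers."""
    x = Flatten()(fused_features)
    x = Dense(256, activation="relu")(x)
    x = Dense(128, activation="relu")(x)
    x = Dropout(0.5)(x)                       # retention rate 0.5 from the text
    return Dense(5, activation="softmax")(x)  # five driving behaviors

# Assumed usage once the two-branch model is assembled:
# model.compile(optimizer=Adam(), loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit([fmcw_train, tof_train], labels, batch_size=64, epochs=100)
```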

Table 2 Model parameters

Analysis of results and discussion

In this section, we investigate the accuracy of driving behavior recognition with different information combinations. Firstly, the results of fusing FMCW radar data with TOF radar information of different dimensions (x, y, z) are compared using the 1D-CNN method, as shown in Fig. 12 (the vertical axis represents accuracy and the horizontal axis the fused data). The fusion of FMCW radar data with TOF radar x-axis information is denoted as "radar_x", while the fusion with TOF radar y-axis information is denoted as "radar_y". "radar_xy" represents the fusion of FMCW radar data with both TOF radar x-axis and y-axis information, and "radar_xz" denotes the fusion with TOF radar x-axis and z-axis data. Similarly, "radar_yz" denotes the fusion with TOF radar y-axis and z-axis data, and "radar_xyz" represents the fusion with TOF radar x-axis, y-axis, and z-axis information. After conducting 10 validations, we conclude that fusing FMCW radar information with TOF radar x- and y-axis data yields accuracy comparable to fusing FMCW radar information with all three TOF radar axes. Therefore, to simplify the computational process and improve the classification efficiency of the model, we use the fusion of FMCW radar data with TOF radar x- and y-axis information, referred to as "FMCW + TOF data" in this paper.

Fig. 12

Comparison of FMCW radar and TOF radar x, y and z axis information fusion results, respectively

Secondly, the proposed RFDANet model was evaluated through a comparison of three recognition schemes, as illustrated in Fig. 13: the single FMCW radar with RFDANet, the single TOF radar with RFDANet, and the fusion of FMCW radar and TOF radar with RFDANet. Table 3 reveals that the recognition scheme based on the fusion of FMCW radar and TOF radar information achieves the highest recognition accuracy for driving behavior. The combination of the two types of radar information results in a recognition accuracy that is approximately 10% higher than that achieved when using each radar type separately.

Fig. 13

Box plot of FMCW and TOF information fusion results compared to single FMCW radar information and single TOF information, respectively

Table 3 Accuracy of different recognition schemes

The box plot shows the distribution of test accuracy for each case. The upper and lower borders of the blue boxes represent the third and first quartiles of all accuracy rates, denoted as \(Q_{3}\) and \(Q_{1}\), respectively; half of the test accuracy rates therefore lie within the blue boxes. The height of the box, the interquartile range \(IQR = Q_{3} - Q_{1}\), reflects the robustness of the result. The red line in the box indicates the median \(Q_{2}\). Points with values greater than \(Q_{3} + 1.5 \times IQR\) or less than \(Q_{1} - 1.5 \times IQR\) are identified as outliers. Figure 13 shows that the single FMCW radar and single TOF radar results span a wider range, reaching a higher maximum but also a lower minimum accuracy. A smaller gap between the upper and lower boundaries indicates more consistent performance across individuals, and the narrower blue box of the fusion result illustrates the improved robustness of the technique.
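
For reference, the box-plot statistics described above can be computed as in the short sketch below; the accuracy values are illustrative only, not the paper's results.

```python
import numpy as np

def box_plot_stats(accuracies):
    """Quartiles, IQR, and outlier bounds used to read the box plots in Fig. 13."""
    q1, q2, q3 = np.percentile(accuracies, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [a for a in accuracies if a < lower or a > upper]
    return {"Q1": q1, "median": q2, "Q3": q3, "IQR": iqr, "outliers": outliers}

print(box_plot_stats([0.91, 0.93, 0.94, 0.95, 0.95, 0.96, 0.97, 0.88]))
```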

A random selection of 70% of the final dataset was used as the training set and the remaining 30% as the test set. In addition, the average accuracy (and time) over 10 training trials was used as the final evaluation metric of this paper. As shown in Fig. 14, the average training accuracy of the developed RFDANet can be maintained at around 94.5%, where Nd denotes normal driving, Lu head up, Th head twisting, Puh picking up the phone and looking at it, and Dm dancing to music. As shown in Fig. 15, RFDANet converges quickly; after about 80 epochs, the loss plateaus and fluctuates only within a small range.

Fig. 14

RFDANet Classification Confusion Matrix

Fig. 15

RFDANet’s loss function

Table 4 illustrates the model size and computational cost of RFDANet compared with other methods; the proposed RFDANet is the smallest. Table 5 presents the average accuracy and recognition speed of the AlexNet [41], 1D-CNN, CNN-LSTM, CNN-Channel attention, and CNN-Spatial attention methods as benchmarks against the RFDANet method proposed in this paper. The results indicate that RFDANet achieves higher recognition accuracy than the other methods, without any compromise on speed. These experimental findings demonstrate the promising benefits of the RFDANet classification model.

Table 4 The model size and computational cost of the RFDANet compared with other methods
Table 5 Results of the RFDANet compared with other methods

Discussion

Installation position of the radar system

The mounting position of the FMCW radar and TOF radar has a significant impact on the detection of driving behavior and warrants further experimental study. There are currently two options: one is to mount the FMCW radar on the steering wheel and the other is to place the radar on the dashboard. In the present setup, the FMCW radar is placed on the steering wheel and the TOF radar on the dashboard, detecting driving behavior from above without requiring direct light. However, some data may be missed, resulting in limited recognition accuracy. As a next step, we are considering placing a radar at the window to supplement the missing data on the side of the driving action.

Subject differences impact

The generalization of the model was investigated by using a model trained on the driving behavior of known persons to classify and identify the driving behavior of unknown persons. For this purpose, data on the driving behavior of three additional volunteers were collected separately in a laboratory setting as a test group. Each maneuver was repeated 30 times to obtain reliable results. Figure 16 shows the box plots of the test results, which indicate that the model achieves highly robust classification of the same driving behaviors performed by unknown persons.

Fig. 16

Results of the RFDANet model comparing identification on known-person data with unknown-person data

Conclusion

This paper proposes a solution to the challenge of detecting and measuring driving behaviors in real-world situations. The solution, referred to as RFDANet, utilizes FMCW radar and TOF radar systems to identify five prevalent driving behaviors. RFDANet has two feature branches that process FMCW radar information and TOF radar information, respectively. It also extends the attention mechanism to the branch level and adjusts the importance of features in different branches, which facilitates the automatic fusion of FMCW radar information and TOF radar information, thereby enhancing the flexibility and accuracy of driving behavior detection. It is also capable of integrating FMCW radar information and TOF radar information more flexibly than general external information fusion methods. The experiments have validated the superiority of the method, which has demonstrated strong performance in both real-world driving scenarios and laboratory-simulated driving environments. The proposed RFDANet has the potential to improve driving behavior recognition accuracy while maintaining a rapid detection time. There are still certain limitations in scenarios involving in-vehicle factors and varied road conditions due to the lack of data. To improve the model’s accuracy and robustness, future research will focus on extending the driving behavior dataset to include more diverse and complex traffic situations.