1 Introduction

An intelligent vehicle can be viewed as an intelligent agent that integrates a perception layer, a decision-making layer and an execution layer. Perception technology is an important means for the intelligent vehicle to acquire information about its surroundings. In a traffic scene, a moving vehicle uses a series of lamp signals, prescribed by traffic rules, to remind or prompt other drivers. Therefore, whether an intelligent vehicle can understand the intention behind these lamp signals through its perception module and predict or intervene in behavior in advance is crucial to improving safety.

Recently, with the development of deep learning, target detection technology has become increasingly mature. In particular, vehicle detection has been commercialized, and computer vision allows inferences based on the salient color features of vehicle lamps; combining the two will undoubtedly advance intelligent vehicle perception technology. Li et al. [1] combine Haar features with an Adaboost cascade classifier to detect the vehicle position, then segment the tail lamps in the red, green, blue (RGB) space, and recognize the lamp signal based on a set of recognition rules. Nevertheless, the Adaboost cascade classifier in this algorithm is prone to missed and false detections of vehicle positions, which lowers the lamp signal recognition accuracy. Cui et al. [2] use clustering techniques to extract tail lamp candidates and estimate the tail lamp state; by combining state history information, they infer the current lamp signal meaning of the vehicle ahead. However, the algorithm depends on image texture information for lamp positioning and recognition, and its robustness under complex working conditions is poor. Chen et al. [3] propose to use scattering modeling and reflection direction to segment the vehicle lamp area and complete tail lamp recognition, but the method cannot judge the lamp signal from the tail lamp information. Fröhlich et al. [4] first detect light spots in the image to select the vehicle lamp area, then extract features from a feature-transform-based temporal analysis of the light spot behavior, and finally classify the extracted features with Adaboost to recognize the lamp signal intention. However, an Adaboost classifier trained on tail lamp frequency information can recognize only a few lamp signal types. He et al. [5] build a deep neural network, add an attention mechanism to optimize the network, and thoroughly learn the tail lamp details of vehicles. Nonetheless, the convergence rate of the algorithm is not fully guaranteed, and the recognition accuracy does not generalize well. Based on the YOLOv3-tiny backbone network, Yoneda et al. [6] use a convolutional neural network to detect lamp states and then calculate the flicker frequency with a fast Fourier transform, which detects small-target vehicle lamps well. However, because of the large scale of the convolutional neural network, over-fitting occurs easily if the data set is insufficient.

In summary, the traditional computer vision and machine learning algorithms used in previous work have certain deficiencies in recognizing the semantics of tail lamps, while lamp signal recognition based purely on deep learning relies on a large number of training samples and has high complexity. To this end, this paper proposes a new lamp signal recognition algorithm that combines deep learning with computer vision. The paper is structured as follows: Section 2 uses YOLOv4 [7] to detect the vehicle tail and sets the potential area of the vehicle tail lamps. On this basis, Section 3 proposes a region-based adaptive threshold to segment the lit tail lamp in the hue, saturation, value (HSV) space. In Section 4, a deep neural network (DNN) model is constructed and trained on collected samples, and the lamp signal is classified and predicted according to the pixel information in the HSV space. Section 5 evaluates the performance of the proposed algorithm with real vehicle experiments, and Section 6 gives the summary and discussion. The algorithm proposed in this paper determines the current meaning of the lamp signal of the vehicle ahead, which helps intelligent vehicles understand lamp signals, improves lamp signal recognition accuracy, and lays a foundation for the driving of intelligent vehicles in complex real traffic scenes.

2 Vehicle tail detection

In lamp signal recognition, whether the vehicle tail can be recognized and detected accurately and quickly from the image information captured by the camera is crucial [8]. In this paper, a YOLOv4-based target detection model is established and trained on the KITTI [9] dataset to detect the vehicle tail quickly.

The KITTI dataset was jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago. It is currently the largest computer vision dataset in the world for autonomous driving scenarios. KITTI contains real image data collected from urban, rural, and highway scenes, with up to 15 vehicles per image, which meets the requirements for training and testing the 2D object detection of YOLOv4 in this paper.

2.1 YOLOv4 network structure

As shown in Fig. 1, the YOLOv4 network is mainly composed of four parts. First, the CSPDarknet53 backbone feature extraction network down-samples the input image and stacks residual structures. Then, spatial pyramid pooling applies maximum pooling of three different kernel sizes, \(13\times 13\), \(9\times 9\) and \(5\times 5\), to the feature map output by the backbone. Next, the path aggregation network (PANet) fuses the three effective feature layers in both directions: up-sampling and stacking for the higher-level features, and down-sampling and stacking for the lower-level features. Finally, a three-layer feature pyramid with strong semantics is obtained, which is used to classify the target and regress its bounding box.

Fig. 1

YOLOv4 network structure
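To make the pooling step above concrete, the sketch below builds the spatial pyramid pooling block with the three stated kernel sizes in PyTorch; it is a generic illustration under those assumptions, not the paper's exact implementation.

```python
import torch
from torch import nn

class SPPBlock(nn.Module):
    """Spatial pyramid pooling as used in YOLOv4: parallel max pooling at three
    kernel sizes, concatenated with the input along the channel dimension."""
    def __init__(self, kernel_sizes=(13, 9, 5)):
        super().__init__()
        # stride 1 with padding k//2 keeps the spatial size unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        return torch.cat([pool(x) for pool in self.pools] + [x], dim=1)

# Example: a 512-channel 13x13 feature map becomes 2048 channels after SPP.
features = torch.rand(1, 512, 13, 13)
print(SPPBlock()(features).shape)   # torch.Size([1, 2048, 13, 13])
```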

2.2 Vehicle tail detection based on YOLOv4

This paper uses the KITTI dataset for network training; this data set is currently the world's largest computer vision algorithm evaluation data set for autonomous driving scenes [9]. A total of 7,024 pictures are selected and labeled with the vehicle tail, and then divided into training, validation and test sets to build the YOLOv4 model according to the training network parameters. As shown in Fig. 2, an MV-SUA134GC monocular industrial camera from MindVision is used for real vehicle image acquisition experiments. Figure 3 shows the detection of road vehicles.

Fig. 2

Installation of the vehicle-mounted camera

Fig. 3

YOLOv4 vehicle tail detection results
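For readers who want to reproduce the tail detection step, the following is a minimal inference sketch using OpenCV's DNN module with Darknet-format YOLOv4 files; the paper itself trains YOLOv4 on OpenMMLab (Section 5.1), so this is an alternative, hedged illustration, and the file names are placeholders.

```python
import cv2

# Placeholder config/weights for a YOLOv4 model trained on the labeled vehicle tails.
net = cv2.dnn.readNetFromDarknet("yolov4-tail.cfg", "yolov4-tail.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

frame = cv2.imread("road_scene.jpg")   # placeholder test image
class_ids, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
for box, score in zip(boxes, scores):
    x, y, w, h = map(int, box)
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)   # vehicle tail box
cv2.imwrite("detection_result.jpg", frame)
```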

3 Vehicle tail lamp detection

Vehicle tail lamp detection includes determining the position and the state of the tail lamp. First, the ROI (region of interest) containing the vehicle lamp information is obtained according to the structured characteristics of the vehicle appearance, and then the lamp images in different states are segmented by computer vision.

3.1 Determination of the vehicle tail lamp position

With reference to the National Standard GB 4785-2019 of the People's Republic of China, "Installation Regulations for External Lighting and Lamp Signaling Devices of Automobiles and Trailers" [10], and as shown in Fig. 4(a), the vehicle tail lamp area can be divided by appearance and color into three areas A, B and C, each representing a different lamp color and meaning. At the same time, there are certain requirements on the tail lamp installation position. As shown in Fig. 4(b), taking the reversing lamp as an example, the height D of the tail lamp top above the ground is less than 1200 mm, the height E of the tail lamp bottom above the ground is greater than 250 mm, and the distance F between the tail lamps on the two sides is greater than 600 mm. From this information and the vehicle dimensions, the ROI containing the vehicle tail lamps can be calculated.

Fig. 4

Schematic diagram of vehicle tail lamp. A Brake lamp in red area, B turn signal lamp in yellow (orange) area, C other tail lamp areas, D the height of the tail lamp top from the ground, E the height of the tail lamp bottom from the ground, F the distance between the tail lamps on both sides
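As a rough illustration of how the lamp ROIs can be derived from a detected tail bounding box and the installation constraints above, the sketch below crops a left and a right lamp region from the box; the fractional offsets are illustrative assumptions, not values given in the paper.

```python
def tail_lamp_rois(box, left_frac=0.35, right_frac=0.35, top_frac=0.25, bottom_frac=0.75):
    """Estimate left/right tail lamp ROIs inside a vehicle-tail box (x, y, w, h).

    The vertical band [top_frac, bottom_frac] mimics the GB 4785 height limits
    (lamp top below 1200 mm, lamp bottom above 250 mm) scaled to the box; the
    horizontal fractions keep the two ROIs near the outer edges, consistent with
    the minimum 600 mm lamp separation. All fractions are illustrative.
    """
    x, y, w, h = box
    y0, y1 = y + int(top_frac * h), y + int(bottom_frac * h)
    left_roi = (x, y0, int(left_frac * w), y1 - y0)                           # (x, y, w, h)
    right_roi = (x + w - int(right_frac * w), y0, int(right_frac * w), y1 - y0)
    return left_roi, right_roi

# Example with a detected tail box of 200x150 pixels at (320, 180).
print(tail_lamp_rois((320, 180, 200, 150)))
```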

3.2 Hue, saturation, value space-based vehicle lamp detection

Hue, saturation, value (HSV) [11] space is a color space that describes color characteristics. Compared with RGB space, HSV space can better characterize human visual perception. Usually, the image collected by the optical sensor is in RGB format. Specifically, this paper first normalizes the R, G, and B channels of the pixel, and then calculates the maximum and minimum channel values \({C}_{max}\) and \({C}_{min}\) according to formulas (1) and (2). After obtaining the channel difference \(\Delta \) (formula (3)), the corresponding \(\mathrm{H}\), \(\mathrm{S}\) and \(\mathrm{V}\) channel values of pixels can be calculated according to the values of \(\Delta \) and \({C}_{max}\) (formulas (4), (5) and (6)). Through the above operations, the color space conversion is completed.

$${C}_{max}=\mathrm{max}\left(\frac{R}{255},\frac{G}{255},\frac{B}{255}\right)$$
(1)
$${C}_{min}=\mathrm{min}\left(\frac{R}{255},\frac{G}{255},\frac{B}{255}\right)$$
(2)
$$\Delta ={C}_{max}-{C}_{min}$$
(3)
$$\mathrm{H}=\left\{\begin{array}{c}{0}^{^\circ } , \Delta =0\\ {60}^{^\circ }\times \left(\frac{G-B}{255\times \Delta }+0\right) ,{C}_{max}=\frac{R}{255}\\ {60}^{^\circ }\times \left(\frac{B-R}{255\times \Delta }+2\right) ,{C}_{max}=\frac{G}{255}\\ {60}^{^\circ }\times \left(\frac{R-G}{255\times \Delta }+4\right) ,{C}_{max}=\frac{B}{255}\end{array}\right.$$
(4)
$$\mathrm{S}=\left\{\begin{array}{c}0 ,{C}_{max}=0\\ \frac{\Delta }{{C}_{max}} ,{C}_{max}\ne 0\end{array}\right.$$
(5)
$$\mathrm{V}={C}_{max}$$
(6)

In the above formulas, \(R\), \(G\) and \(B\) respectively represent the pixel values in the red, green and blue channels of the RGB space, and \(\mathrm{H}\), \(\mathrm{S}\) and \(\mathrm{V}\) represent the hue, saturation and brightness value of the pixel in the HSV space.
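For reference, formulas (1)–(6) can be written directly in Python; the per-pixel sketch below follows them, with a standard mod-6 wrap (not shown in formula (4)) guarding against negative hues. Note that OpenCV's cv2.cvtColor with COLOR_BGR2HSV stores H as H/2 (0–179) and scales S and V to 0–255, which is the convention assumed by the thresholds later in this section.

```python
import numpy as np

def rgb_to_hsv(r, g, b):
    """Convert one 8-bit RGB pixel to (H in degrees, S in [0,1], V in [0,1]),
    following formulas (1)-(6)."""
    rp, gp, bp = r / 255.0, g / 255.0, b / 255.0        # normalized channels
    c_max, c_min = max(rp, gp, bp), min(rp, gp, bp)     # formulas (1), (2)
    delta = c_max - c_min                               # formula (3)

    if delta == 0:                                      # formula (4)
        h = 0.0
    elif c_max == rp:
        h = 60.0 * (((gp - bp) / delta) % 6)            # mod 6 keeps H in [0, 360)
    elif c_max == gp:
        h = 60.0 * ((bp - rp) / delta + 2)
    else:  # c_max == bp
        h = 60.0 * ((rp - gp) / delta + 4)

    s = 0.0 if c_max == 0 else delta / c_max            # formula (5)
    v = c_max                                           # formula (6)
    return h, s, v

# Example: a bright red pixel; OpenCV would report H/2, S*255, V*255.
print(rgb_to_hsv(220, 30, 25))
```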

In vehicle lamp recognition, the key lies in the difference between the on and off states of the vehicle lamp in the HSV space. As shown in Fig. 5, the HSV space conversion is performed on the ROI area of the vehicle tail lamp in the two states to generate a three-dimensional point cloud image. The point cloud image represents the distribution of hue, saturation and brightness of each pixel in the image.

Fig. 5

Distribution of pixels in HSV space. a The lighting state of the vehicle lamp. b The lamp off state

By comparing Fig. 5a and b, it can be seen that when the vehicle lamp is on, the value (brightness) component changes significantly compared with the off state, and the hue component of the image presents a bimodal characteristic. The division of the color thresholds in the HSV space is detailed in Table 1 [14]. It can be seen that the three colors yellow, orange and red that appear in the tail lamp occupy different, widely separated ranges in the hue component, while the saturation component has a range similar to that of the value component.

Table 1 HSV space color data

Therefore, according to the above characteristics, a specific threshold can be set to segment the pixels in the lighting state [12, 13]. In this paper, the upper cut-off threshold is set as \(\left\{{\widehat{H}}_{max}=180, {\widehat{S}}_{max}=255, {\widehat{V}}_{max}=255\right\}\). In the lower cut-off threshold, \({\widehat{H}}_{min}=0\) and \({\widehat{S}}_{min}=43\) are fixed, and the choice of \({\widehat{V}}_{min}\) determines the segmentation effect. As shown in Fig. 6, segmentation results are given for \({\widehat{V}}_{min}\) equal to 100, 150, 200 and 250. Comparison with the original image reveals that if \({\widehat{V}}_{min}\) is too small, the lighting area cannot be distinguished, and if \({\widehat{V}}_{min}\) is too large, image details are lost, so the selection of a reasonable segmentation threshold is crucial. Since a fixed threshold cannot meet the segmentation accuracy requirement and its manual tuning is time-consuming and labor-intensive, this paper proposes a region-based adaptive threshold segmentation algorithm.

Fig. 6

HSV space threshold segmentation results
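As a minimal illustration of the fixed-threshold segmentation compared in Fig. 6 (not the adaptive method proposed next), the sketch below applies cv2.inRange in HSV space for several candidate \({\widehat{V}}_{min}\) values; the input file name is a placeholder.

```python
import cv2

roi_bgr = cv2.imread("tail_lamp_roi.png")            # placeholder path to a tail lamp ROI
roi_hsv = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2HSV)   # OpenCV HSV: H in [0,179], S,V in [0,255]

for v_min in (100, 150, 200, 250):
    # Lower/upper cut-off thresholds from Section 3.2; only V_min varies.
    mask = cv2.inRange(roi_hsv, (0, 43, v_min), (180, 255, 255))
    segmented = cv2.bitwise_and(roi_bgr, roi_bgr, mask=mask)   # keep only the lit pixels
    cv2.imwrite(f"segmented_vmin_{v_min}.png", segmented)
```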

3.3 Region-based adaptive threshold segmentation algorithm

According to the color characteristics of the vehicle tail lamp and its brightness characteristics in the lighting state, and based on the distribution of each pixel in the HSV space, this paper sets three areas: the red area, the yellow (orange) area and the other areas. The red area is set as (\({H}_{max}^{R}\sim {H}_{min}^{R}\), \({S}_{max}^{R}\sim {S}_{min}^{R}\), \({V}_{max}^{R}\sim {V}_{min}^{R}\)), the yellow (orange) area as (\({H}_{max}^{Y}\sim {H}_{min}^{Y}\), \({S}_{max}^{Y}\sim {S}_{min}^{Y}\), \({V}_{max}^{Y}\sim {V}_{min}^{Y}\)), and the other areas as (\({H}_{max}^{E}\sim {H}_{min}^{E}\), \({S}_{max}^{E}\sim {S}_{min}^{E}\), \({V}_{max}^{E}\sim {V}_{min}^{E}\)). Here, \({H}_{max}\sim {H}_{min}\) limits the color in the hue channel, \({S}_{max}\sim {S}_{min}\) limits it in the saturation channel, and \({V}_{max}\sim {V}_{min}\) limits it in the brightness channel. The value \({V}_{min}\) is tied to \({V}_{max}\) by \({V}_{min}={V}_{max}-a\), where \({V}_{max}\) is the maximum \(\mathrm{V}\) channel value of the pixels of the corresponding area that fall within its \(\mathrm{H}\) and \(\mathrm{S}\) ranges. This paper sets the constant \(a=15\) to limit the extent of each area in the \(\mathrm{V}\) channel.

According to the areas established above, the \(\mathrm{V}\) channel threshold \({\widehat{V}}_{min}\) can be selected adaptively; formula (7) and the following steps describe the selection:

$${\widehat{V}}_{min}=\left\{\begin{array}{c}255 , \mathrm{arg}{V}^{R}<\mathrm{arg}{V}^{E}+c , \mathrm{arg}{V}^{Y}<\mathrm{arg}{V}^{E}+c\\ {V}_{max}^{R} , \mathrm{arg}{V}^{Y}>\mathrm{arg}{V}^{R} , \mathrm{arg}{V}^{R}<\mathrm{arg}{V}^{E}+c\\ {V}_{max}^{Y} , \mathrm{arg}{V}^{R}>\mathrm{arg}{V}^{Y} , \mathrm{arg}{V}^{Y}<\mathrm{arg}{V}^{E}+c\\ {V}_{max}^{E} , \mathrm{arg}{V}^{R}>\mathrm{arg}{V}^{E}+c , \mathrm{arg}{V}^{Y}>\mathrm{arg}{V}^{E}+c\end{array}\right.$$
(7)
(1) Calculate the average \(\mathrm{V}\) channel value in each of the three areas: \(\mathrm{arg}{V}^{R}\) in the red area, \(\mathrm{arg}{V}^{Y}\) in the yellow (orange) area, and \(\mathrm{arg}{V}^{E}\) in the other areas.

(2) Lamp off state: according to formula (7), this paper sets the pixel perturbation constant \(c=10\), determined from a large number of experiments, to reduce the influence of noise pixels on the segmentation. If both \(\mathrm{arg}{V}^{R}\) and \(\mathrm{arg}{V}^{Y}\) are smaller than \(\mathrm{arg}{V}^{E}+c\), the threshold is set to \({\widehat{V}}_{min}={\widehat{V}}_{max}=255\). No pixels are segmented, and the image is converted into a binary image in which every pixel value is 0, that is, black.

(3) Lamp lighting state: there are three lighting states: the yellow (orange) lamp is on, the red lamp is on, or the two colored lamps are on together. According to formula (7), if \(\mathrm{arg}{V}^{Y}>\mathrm{arg}{V}^{R}\) and \(\mathrm{arg}{V}^{R}<\mathrm{arg}{V}^{E}+c\), the threshold is set to \({\widehat{V}}_{min}={V}_{max}^{R}\), where \({V}_{max}^{R}\) is the maximum \(\mathrm{V}\) channel value in the red area. The high-brightness yellow (orange) pixels are segmented from the image and converted into a binary image: the segmented pixels take the value 1 (white, the foreground) and the remaining pixels take the value 0 (black, the background), and the yellow (orange) lamp is judged to be on. In the same way, if \(\mathrm{arg}{V}^{R}>\mathrm{arg}{V}^{Y}\) and \(\mathrm{arg}{V}^{Y}<\mathrm{arg}{V}^{E}+c\), the threshold is set to \({\widehat{V}}_{min}={V}_{max}^{Y}\), the high-brightness red pixels are segmented, and the red lamp is judged to be on. When both \(\mathrm{arg}{V}^{R}\) and \(\mathrm{arg}{V}^{Y}\) are greater than \(\mathrm{arg}{V}^{E}+c\), the threshold is set to \({\widehat{V}}_{min}={V}_{max}^{E}\), where \({V}_{max}^{E}\) is the maximum \(\mathrm{V}\) channel value in the other areas; the high-brightness red and yellow (orange) pixels are both segmented, and the red lamp and the yellow (orange) lamp are judged to be on at the same time.

(4) Take the segmented binary image as a mask and perform an "AND" operation with the original image: white mask pixels correspond to pixels kept from the original image, and black mask pixels correspond to pixels removed from it. A code sketch of this selection procedure is given below.
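The following sketch shows one possible implementation of steps (1)–(4), assuming an ROI already converted to OpenCV's HSV convention (H in 0–179, S and V in 0–255). The constants \(a=15\) and \(c=10\) follow the definitions above, but the exact H/S ranges of the three areas and the input file name are illustrative assumptions.

```python
import cv2
import numpy as np

A, C = 15, 10   # V-channel span per area and pixel perturbation constant (Section 3.3)

# Illustrative H/S ranges (OpenCV H in [0,179]); the paper fixes these per area.
REGIONS = {
    "R": [((0, 43), (10, 255)), ((156, 43), (180, 255))],   # red: two hue intervals
    "Y": [((11, 43), (34, 255))],                           # yellow / orange
    "E": [((35, 0), (155, 255))],                           # other hues
}

def region_stats(hsv):
    """Return (mean V over the area's top slice, max V) for each area.
    Each area spans its H/S ranges and V in [V_max - a, V_max] (Section 3.3)."""
    h, s, v = cv2.split(hsv)
    stats = {}
    for name, ranges in REGIONS.items():
        mask = np.zeros(h.shape, bool)
        for (h_lo, s_lo), (h_hi, s_hi) in ranges:
            mask |= (h >= h_lo) & (h <= h_hi) & (s >= s_lo) & (s <= s_hi)
        vals = v[mask].astype(float)
        if vals.size == 0:
            stats[name] = (0.0, 0)
            continue
        v_max = vals.max()
        top = vals[vals >= v_max - A]          # restrict to V in [V_max - a, V_max]
        stats[name] = (float(top.mean()), int(v_max))
    return stats

def adaptive_v_min(stats):
    """Select the threshold according to formula (7)."""
    (avg_r, max_r), (avg_y, max_y), (avg_e, max_e) = stats["R"], stats["Y"], stats["E"]
    if avg_r < avg_e + C and avg_y < avg_e + C:
        return 255                     # step (2): lamps off, nothing is segmented
    if avg_y > avg_r and avg_r < avg_e + C:
        return max_r                   # step (3): only the yellow (orange) lamp is lit
    if avg_r > avg_y and avg_y < avg_e + C:
        return max_y                   # step (3): only the red lamp is lit
    return max_e                       # step (3): red and yellow (orange) lamps lit together

roi_bgr = cv2.imread("tail_lamp_roi.png")                   # placeholder path
hsv = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2HSV)
v_min = adaptive_v_min(region_stats(hsv))
mask = cv2.inRange(hsv, (0, 43, v_min), (180, 255, 255))    # binary image of lit pixels
lit_pixels = cv2.bitwise_and(roi_bgr, roi_bgr, mask=mask)   # step (4): mask "AND" original
```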

As shown in Fig. 7, this algorithm is used to detect vehicle lamps in different states. It can be seen that, with the adaptive \(\mathrm{V}\) channel threshold, the lit lamp pixels of different colors are segmented well.

Fig. 7

Threshold segmentation results of this algorithm

4 Semantic recognition of vehicle tail lamp based on deep neural network

A driver usually predicts the behavior of the vehicle ahead from the state of its tail lamps. Therefore, an intelligent vehicle needs to learn this reasoning and perform semantic recognition of tail lamps. The lamp signal semantics of a lit vehicle tail lamp are usually described by the tail lamp color, single/double side, and state, as shown in Table 2 [15].

Table 2 Common manifestations of lamp signal

According to the different expressions of the lamp signal shown in Table 2, tail lamp images of vehicles are collected as the training, validation and test samples of the deep neural network (DNN) [16]. According to the HSV space color data, the range of the yellow (orange) pixels of the left tail lamp in the \(\mathrm{H}\) channel is defined as \(\left(11-34\right)\), and the average value of the \(\mathrm{H}\) channel pixels within this range is recorded as \({X}_{1}\). The range of the red pixels of the left tail lamp in the \(\mathrm{H}\) channel is defined as \(\left(0-10\right)\cup \left(156-180\right)\), and the average value of the \(\mathrm{H}\) channel pixels within this range is recorded as \({X}_{2}\). Similarly, the average value \({X}_{3}\) of the yellow (orange) pixels and the average value \({X}_{4}\) of the red pixels of the right tail lamp in the \(\mathrm{H}\) channel are calculated. \(X=\left({X}_{1},{X}_{2},{X}_{3},{X}_{4}\right)\) is recorded as the input feature vector. According to the lit pixels in the image and the lamp signal expression form, the name of the lamp signal is recorded: the left turn signal lamp is defined as \({y}_{1}\), the right turn signal lamp as \({y}_{2}\), and the brake lamp as \({y}_{3}\); if the brake lamp and the left turn signal lamp are on at the same time, it is recorded as \({y}_{4}\); if the brake lamp and the right turn signal lamp are on at the same time, it is recorded as \({y}_{5}\); and if the lamps on both sides are off, it is recorded as \({y}_{6}\). \(Y=\left({y}_{1},{y}_{2},{y}_{3},{y}_{4},{y}_{5},{y}_{6}\right)\) is recorded as the true value.
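A minimal sketch of this feature extraction is given below, assuming the segmented HSV image from Section 3.3 has already been split into left and right tail lamp ROIs; the hue boundaries follow Table 1 and the helper names are illustrative.

```python
import numpy as np

YELLOW = [(11, 34)]              # yellow (orange) hue range (OpenCV scale), per Table 1
RED = [(0, 10), (156, 180)]      # red hue ranges

def mean_hue(hsv, ranges):
    """Average H value of the lit pixels (V > 0 after segmentation) whose hue
    falls inside the given ranges; returns 0 if no such pixel exists."""
    h, v = hsv[:, :, 0], hsv[:, :, 2]
    lit = v > 0                                   # background was zeroed by the mask
    in_range = np.zeros(h.shape, bool)
    for lo, hi in ranges:
        in_range |= (h >= lo) & (h <= hi)
    vals = h[lit & in_range]
    return float(vals.mean()) if vals.size else 0.0

def lamp_features(hsv_left, hsv_right):
    """Build the DNN input X = (X1, X2, X3, X4) from the two segmented lamp ROIs."""
    return np.array([
        mean_hue(hsv_left, YELLOW),    # X1: left lamp, yellow (orange) pixels
        mean_hue(hsv_left, RED),       # X2: left lamp, red pixels
        mean_hue(hsv_right, YELLOW),   # X3: right lamp, yellow (orange) pixels
        mean_hue(hsv_right, RED),      # X4: right lamp, red pixels
    ], dtype=np.float32)
```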

Figure 8 shows the 3-layer DNN structure used in this paper, which is composed of an input layer, a hidden layer and an output layer. Here, \(n=100\) represents the number of neurons in the hidden layer; the input layer takes the 4 features and the output layer outputs the 6 categories. The layers are fully connected, and such a DNN can be applied to multi-classification problems; in this paper, the lamp signal class is judged from the pixel features.

Fig. 8

3-layer DNN structure

The DNN training process includes forward propagation of the signal and back propagation of the error [17]. In forward propagation, the DNN linear operation is given by formulas (8), (9) and (10), where \({A}^{k}=\left({a}_{1}^{k} {a}_{2}^{k} {a}_{3}^{k} \cdots {a}_{100}^{k}\right)\) is the result of the k-th layer linear operation, \(X=\left({X}_{1} {X}_{2} {X}_{3} {X}_{4}\right)\) is the input signal, \({W}^{k}\) is the weight of the k-th layer, and \({B}^{k}\) is the bias of the k-th layer.

$${A}^{k}=X{W}^{k}+{B}^{k} k=\mathrm{1,2}$$
(8)
$${W}^{k}={\left({w}_{1n}^{k} {w}_{2n}^{k} {w}_{3n}^{k} {w}_{4n}^{k}\right)}^{T}$$
(9)
$${B}^{k}={\left({b}_{1}^{k} {b}_{2}^{k} {b}_{3}^{k} \cdots {b}_{n}^{k}\right)}^{T} n=100$$
(10)

As shown in formula (11), an activation function is applied to the result of the linear operation, where \({Z}^{k}\) is the output of the k-th layer after activation. The sigmoid function is used in the hidden layer, and the output is normalized by the softmax function given in formula (12), where \({a}_{j}\) represents the input value of the \(j\)-th neuron: the numerator is the exponential of the current signal and the denominator is the sum of the exponentials of all input signals.

$${Z}^{k}=h\left({A}^{k}\right)$$
(11)
$$\mathrm{h}\left({a}_{j}\right)=\frac{\mathrm{exp}\left({a}_{j}\right)}{\sum_{i=1}^{n}exp\left({a}_{i}\right)} j=\mathrm{1,2},3,\cdots ,n$$
(12)

In error back propagation, the gradient descent method is used to minimize the cost function and update the weights. Formula (13) gives the sample cost function; in order to prevent over-fitting of the model, an \({L}^{2}\) regularization term is added to it.

$$\mathrm{J}\left(w\right)=\frac{1}{2m}\left(\sum_{i=1}^{n}{\left({\widehat{y}}_{i}-{y}_{i}\right)}^{2}+\lambda \sum_{j=1}^{n}{w}_{j}^{2}\right)$$
(13)

where \(m\) is the number of samples, \({\widehat{y}}_{i}\) is the predicted value of the model, \({y}_{i}\) is the true value (the mean square error between the two is computed), and \(\uplambda \) is the regularization hyperparameter. In this paper, the Adam adaptive learning rate optimizer [15] is used to improve the weight update efficiency, with the initial learning rate set to 0.001. To reduce the computational cost, mini-batch learning is used: the batch size is set to 15 and the maximum number of epochs is set to 80, so that the model converges quickly and completes training in a short time. Figure 9 shows the relationship between the number of iterations and the accuracy on the training set and test set. The curve is monotonically increasing on the whole; after 58 iterations the accuracy of the test set and training set reaches its extreme value and remains stable as the number of iterations increases, indicating that the DNN training is complete.

Fig. 9

Accuracy of test data and training data
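A minimal PyTorch sketch of this classifier and training setup is shown below, under the stated hyperparameters: one hidden layer of 100 sigmoid units, a softmax output, the mean square error cost of formula (13) with the L2 term expressed as weight decay, Adam with learning rate 0.001, batch size 15 and 80 epochs. The weight-decay value and the `X_train`/`y_train` tensors are placeholders, not values from the paper.

```python
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 4 hue features per sample, 6 lamp signal classes (y1..y6).
X_train = torch.rand(334, 4)                    # hypothetical feature tensor
y_train = torch.randint(0, 6, (334,))           # hypothetical class indices

model = nn.Sequential(
    nn.Linear(4, 100), nn.Sigmoid(),            # formulas (8) and (11), k = 1
    nn.Linear(100, 6), nn.Softmax(dim=1),       # formulas (8) and (12), k = 2
)

criterion = nn.MSELoss()                        # mean square error term of formula (13)
# weight_decay stands in for the L2 term of formula (13); its value is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=15, shuffle=True)

for epoch in range(80):                         # maximum number of epochs
    for xb, yb in loader:
        optimizer.zero_grad()
        pred = model(xb)                        # signal forward propagation
        target = F.one_hot(yb, num_classes=6).float()
        loss = criterion(pred, target)          # cost of formula (13)
        loss.backward()                         # error back propagation
        optimizer.step()                        # Adam weight update
```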

5 Experimental results

5.1 Sample training and evaluation

In order to verify the validity of the lamp signal recognition algorithm proposed in this paper, an experimental platform was built for real-vehicle tests. The camera frame rate was 211 FPS and the resolution was \(1920\times 1200\). The computer used in the experiment was configured with an i7-9700 CPU and an RTX 2080 SUPER graphics card, running Windows 10, and the algorithm was implemented in Python. 7,024 images were selected from the KITTI dataset. As shown by the green boxes in Fig. 10, labelImg v1.4.0 is used to label the vehicle tails in the images, and the labeled images are divided into a training set and a test set at a ratio of 9:1.

Fig. 10

Labeling of the vehicle tail

YOLOv4 was built on the OpenMMLab [18] platform. Figure 11 shows the accuracy of YOLOv4 on the test samples: after 64 iterations the accuracy reaches 95.3%, and the single-frame detection speed reaches 63 FPS, which meets the real-time detection requirements of real scenes.

Fig. 11

Accuracy of YOLOv4 against the validation set

The DNN data set is collected from videos of daytime urban traffic scenes. 326 pictures containing a total of 418 vehicle driving scenes are selected as samples. As shown in Table 3, 80% of the samples are used as the training set and 20% as the test set, classified according to the tail lamp state of the vehicles in the collected pictures.

Table 3 Sample classification

5.2 Visualization of test results

Figure 12 shows the results when the proposed algorithm is used to detect actual scenes in which the brake lamp, the left turn signal lamp, the right turn signal lamp, or the brake lamp together with a turn signal lamp are on; single targets and multiple targets are detected respectively. The red box represents the vehicle tail detection result, the red title represents the true lamp signal, and the green title represents the lamp signal detected by the algorithm.

Fig. 12

Tail lamp recognition results

5.3 Verification of detection accuracy

In this paper, videos of three different road sections were collected, each 20 s long, giving 4,220 image frames in total. Based on the tail lamp behavior of the vehicles in the videos, the vehicles were labeled with ground-truth lamp signal tags. As shown in Fig. 13, a visualization page developed with Python's Tkinter package was used to record the continuous-frame detection results of the algorithm. The detection results are shown in Tables 4, 5 and 6.

Fig. 13

Visual interface in real vehicle test

Table 4 Brake lamp (two-side red signal lamp) detection
Table 5 Left turn signal lamp (left signal lamp on and right signal lamp off) detection
Table 6 Right turn signal lamp (right signal lamp on and left signal lamp off) detection

In order to quantitatively evaluate the performance of the lamp signal recognition system, the accuracy rate is defined in formula (14). In both the quantitative and qualitative evaluation, this paper stipulates that the detection result represents the state of the overall lamp signal at the current moment (the signal lamps on both sides jointly determine this state).

$$\mathrm{Accuracy}=\frac{Correct\,detection\,number}{Number\,of\,lamps}$$
(14)

It can be seen from Tables 4, 5 and 6 that the algorithm achieves an accuracy of more than 70% when recognizing vehicle lamp signals in complex urban traffic scenes, and the average accuracy in recognizing the brake lamp, the left turn signal lamp and the right turn signal lamp reaches 81.3%, 75.8% and 76.4%, respectively. To further verify the superiority of the proposed method, it is compared with [1] and [2]. All experiments use the same samples collected in this paper, and the turn signal recognition accuracy is computed over both the left and right sides. The corresponding Adaboost and SVM models are trained for [1] and [2], respectively, for tail lamp region determination and lamp signal classification. As shown in Table 7, on the 4,220 test frames, the algorithm proposed in this paper achieves a turn signal recognition accuracy of 76.3%, a gain of 3.2% and 10.5% over [1] and [2], respectively. At the same time, the proposed algorithm achieves the highest brake lamp recognition accuracy of 81.3%, which is 3.9% and 7.7% higher than [1] and [2].

Table 7 Algorithm comparison

6 Conclusions

In traffic scenes, lamp signals are the language of vehicle-to-vehicle communication, and accurately recognizing the semantics of vehicle tail lamps is key to the development of intelligent driving. For the detection of vehicle tail lamp pixels and the understanding of their semantics, this paper proposes a lamp signal recognition algorithm that combines deep learning with computer vision.

(1) The vehicle tail in the KITTI data set is labeled, and YOLOv4 is trained to complete vehicle tail detection. The ROI is then determined from the tail lamp positions on the vehicle to reduce the computational cost of the tail lamp pixel search.

(2) HSV space segmentation is performed on the tail lamps in the ROI. By comparing the distribution characteristics of the tail lamps in the HSV space in the off and on states, a region-based adaptive threshold is proposed: with the H and S channel ranges fixed, the optimal V channel threshold is searched for to segment the lit pixels of tail lamps of different colors.

(3) The average H channel value of the lit lamp pixels is collected after segmentation, and the DNN model is trained based on the lamp signal expression form. The model then completes the semantic recognition of the brake lamp, the turn signal lamps, and the lamp-off state.

(4) Experiments on actual urban traffic scenes show that the algorithm has an average accuracy of 81.3%, 75.8%, and 76.4% for the brake lamp, left turn signal lamp, and right turn signal lamp, respectively. However, missed and false detections may still occur due to the vehicle location and camera imaging quality. In future work, we will consider building a larger database of vehicle tail lamps, use deep learning to infer tail lamp occlusion, and further improve the lamp signal recognition accuracy.