1 Introduction

At the end of 2019, the sudden outbreak of the COVID-19 epidemic had a huge impact on the whole world, and the growth of COVID-19 cases worldwide is far from over. Compared with people not wearing masks, the infection rate of people wearing masks can be as low as 1.5%. Wearing masks correctly in public places has therefore become the most feasible and effective way to prevent virus transmission: it is a simple, practical and low-cost method of blocking the source of infection. Detecting whether a face is wearing a mask correctly is an object detection task with two main targets: the mask and the face. By determining the positions of and relationship between the key targets in a face image, the various features of the masked-face image are comprehensively analyzed, compared, reasoned about and judged, and finally the target information for each class of masked face is extracted.

At present, object detection algorithms based on deep learning mainly fall into two families: two-stage algorithms based on region proposals and one-stage algorithms based on regression [1]. One-stage algorithms are simple in structure and computationally efficient. Two-stage algorithms achieve higher detection accuracy but have lower real-time performance and more difficulty detecting small targets. The Region-based CNN (R-CNN, convolutional neural network) algorithm [2], proposed by Girshick et al. in 2014, uses the Selective Search method to pre-extract and scale candidate regions, applies a CNN only to these candidate regions to extract features, and uses an SVM (support vector machine) classifier for category judgment; however, R-CNN performs repeated convolutional computations, so it suffers from low speed and large memory usage. The SPP-Net (Spatial Pyramid Pooling in deep convolutional networks) object detection model was proposed by He et al. in 2015 [3]; compared with R-CNN, SPP-Net improves accuracy through bounding-box regression after the processing layer and reduces the amount of computation. Girshick et al. then proposed Fast R-CNN and Faster R-CNN [4, 5], which greatly improved detection speed and accuracy through the RoI pooling structure. Lin et al. [6] proposed the FPN (feature pyramid network) algorithm, which uses a feature pyramid to combine the semantic information of high-level features with the high resolution of low-level features. Qiao et al. [7] proposed the DetectoRS algorithm, built on a backbone network and FPN, which uses a recursive feature pyramid and SAC (switchable atrous convolution) to convolve features with different atrous rates.

Face mask detection is influenced by various factors, and the recognition rate of existing systems drops sharply when users do not cooperate or collection conditions are not ideal. For non-cooperative facial image acquisition, occlusion is a serious problem. Especially in surveillance environments, the monitored subject may wear accessories such as glasses and hats, so the collected facial images may be incomplete, affecting subsequent feature extraction and recognition and even causing face detection algorithms to fail. How to effectively remove the influence of occlusions is therefore an urgent research topic. To improve the predictive performance of SVMs, genetic algorithms can be used to select the optimal kernel function [8, 9]. Among existing approaches, Mhnt et al. [10] combined the ResNet50 network with an SVM, achieving a significant improvement in accuracy; however, the model requires a large amount of computation and lacks efficiency. Nagrath et al. [16] compared a DNN-based detector with the MobileNetV2 image classifier for face mask detection; compared with existing models such as VGG-16, both accuracy and F1 score were improved, and the proposed model is easy to deploy on embedded devices. However, when the SSD algorithm is combined with MobileNetV2, the network model becomes more complex while its actual performance falls short of YOLO-v4. This paper therefore uses YOLO-v4 as the benchmark network because it is lightweight and easy to embed.

YOLO-v4 is a one-stage object detection algorithm. A one-stage algorithm does not need a region proposal stage and therefore has high detection speed: the final result is obtained after a single stage of detection. By adding an improved PANet (path aggregation network) structure, the CSP1_X block was introduced into the backbone feature network CSPDarkNet53 [11], reducing the complexity of the model. A mask detection algorithm was proposed by Kumar et al. [12], developed by integrating tiny YOLO-v4 with an SPP module to detect occluders on facial images; mAP (mean average precision) improved by 6%. Wu et al. [13] proposed a face mask detection framework named FMD-YOLO, in which feature extraction is enhanced by combining the Res2Net module with the Im-Res2Net-101 residual network in the feature extractor; benchmark evaluation on two datasets achieved best precisions of 92.0% and 88.4% at a threshold of 0.5. Jiang et al. [14] performed detection on the PWMFD face mask dataset and proposed integrating SENet [15] into YOLO-v3, which enhanced the robustness of the model and achieved 8.6% higher mAP than YOLO-v3. Beyond the YOLO series, many other object detection models have been applied to face mask detection. The single shot multibox detector with a MobileNetV2 backbone uses the OpenCV DNN (deep neural network) module to provide a more lightweight way of detecting whether a face is wearing a mask [16]. Sanjaya et al. [17] used MobileNetV2 to detect face masks in 25 different cities with an accuracy of 96.85%. Other face mask detection methods [18] include max pooling, average pooling and the MobileNetV2 architecture. Relatively speaking, some existing network models are unsuitable for deployment under real-time conditions or on embedded devices; YOLO-v4 itself is a one-stage object detection algorithm designed to meet real-time requirements.

YOLO-v4 is a real-time object detection model that can be trained on a single GPU, lowering the training threshold. A trained YOLO-v4 model can infer key feature point information from an image in a real scene, based on the extraction of a large number of facial feature points [19], and balances accuracy and speed when detecting faces wearing masks. However, real application scenes are complex, with problems such as varying lighting angles, multi-target detection, and non-mask occlusions. This paper proposes an improved YOLO-v4 face mask detection algorithm: within the YOLO-v4 framework, the network structure is improved by adding attention modules at appropriate network layers to better extract facial feature information and optimize the detection algorithm.

2 Related Theories

2.1 YOLO-v4 Algorithm

Compared with YOLO-v3, YOLO-v4 is nearly 10 percentage points higher in AP and 12% faster. YOLO-v4 introduces many improvements and practical training tricks over YOLO-v3, achieving a good combination of speed and precision. YOLO-v4 is mainly composed of the following elements: CSPDarknet53 as the backbone network, SPP as an additional neck module, PANet as the neck's feature fusion module, and finally the YOLO head. At the input side, YOLO-v4 adds techniques such as Mosaic data augmentation. For the backbone, YOLO-v4 introduces Cross Stage Partial connections (CSP) and combines them with Darknet53 to form CSPDarknet53, which reduces computational complexity to a certain extent; CSPDarknet53 accelerates inference and enhances the learning ability of the network. The SPP spatial pyramid module in the neck significantly enlarges the receptive field, addresses the multi-scale problem, and separates out important contextual information without affecting running speed. The PANet path aggregation network serves as the neck: features are first extracted, then a feature pyramid (FPN) is built from the extracted feature layers. In the FPN, high-level semantic information is fused downward to the lower layers, followed by a reverse (bottom-up) fusion that concatenates the two feature layers along the depth dimension.

For the neck, YOLO-v4 adds the SPP module before the output layer. The activation function is changed from the Leaky ReLU of YOLO-v3 to the Mish activation function, and DIoU-NMS is used for non-maximum suppression.
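As a concrete reference, a minimal PyTorch sketch of the Mish activation is shown below; recent PyTorch versions also ship this as torch.nn.Mish.

```python
# A minimal sketch of the Mish activation used in YOLO-v4's backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish(x) = x * tanh(softplus(x)); smooth and non-monotonic,
    unlike the piecewise-linear Leaky ReLU it replaces."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.tanh(F.softplus(x))
```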

Assuming an input image of size 416 × 416, the backbone first performs an ordinary Darknet convolution followed by a series of residual blocks, each of which halves the width and height of the feature map while expanding the number of channels. To obtain higher-level semantic information, the output of the last convolutional stage is fed into the SPP block, where three pooling kernels of sizes 13 × 13, 9 × 9 and 5 × 5 perform max-pooling; the pooled results are then stacked and convolved three times. The SPP output is up-sampled and fused with the last three feature layers of the backbone network in the feature pyramid, yielding highly effective features. The YOLO head is the same in YOLO-v4 as in YOLO-v3: it is essentially a 3 × 3 convolution followed by a 1 × 1 convolution, where the 3 × 3 convolution can be regarded as combining features and the 1 × 1 convolution as transforming those features into prediction results.
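To make the pooling arithmetic concrete, below is a minimal PyTorch sketch of such an SPP block; the convolutions that surround it in the full YOLO-v4 neck are omitted.

```python
# A minimal SPP sketch: three max-pooling branches (13x13, 9x9, 5x5)
# plus the identity are concatenated along the channel axis, so the
# spatial size is preserved while the channel count is quadrupled.
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        # stride 1 with symmetric padding keeps height and width unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in pool_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# e.g. a 13x13 backbone feature map with 512 channels -> 2048 channels
features = torch.randn(1, 512, 13, 13)
print(SPP()(features).shape)  # torch.Size([1, 2048, 13, 13])
```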

2.2 Attention Module

Attention mechanisms have been widely used in deep learning. Three attention mechanisms are applied in this paper: SENet [15], CBAM [20] and CANet [21].

SENet is a channel attention mechanism. In deep neural networks, different channels of a feature map usually represent different objects [22]. SENet focuses on the relationships between channels: according to the varying importance of the channels, channel weights are learned from the feature input during training and then applied to the features. Its core is the squeeze-and-excitation (SE) block [23], which compresses a Height × Width × Channel feature map into a 1 × 1 × Channel descriptor through global average pooling to establish dependencies between channels. SENet is a lightweight plug-in module [24]; one of its obvious advantages is that it can be applied to existing networks at small cost, giving it a wide range of applications.

SENet improves the network structure itself by introducing a new structural unit, the SE block, which refines features by modeling the interdependence between feature map channels, emphasizes important feature channels, suppresses channels that are less relevant to the target task, and enhances the representational ability of the features. After an SE block is embedded into a layer of the network, a global average pooling operation first compresses the two-dimensional feature map of each channel into a single real value. A bottleneck structure is then formed by passing this vector through a fully connected layer, an activation layer, a second fully connected layer, and a sigmoid layer to model channel correlations. Finally, the learned normalized weights are multiplied element-wise with the original feature map, so that each channel is scaled by its learned weight. SENet can easily be embedded into other convolutional neural networks to improve model accuracy.
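A minimal PyTorch sketch of this squeeze-and-excitation flow is shown below; the reduction ratio of 16 is the common default from the SENet paper, not necessarily the value used in this work.

```python
# A minimal SE block sketch: squeeze (global average pooling) followed
# by an excitation bottleneck (FC -> ReLU -> FC -> sigmoid).
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # H x W x C -> 1 x 1 x C
        self.excite = nn.Sequential(             # bottleneck FC layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                         # normalized channel weights
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight each channel of the original feature map
```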

CBAM is not only a channel attention module but also a spatial attention mechanism. It consists of two parts: the channel attention module (CAM) and the spatial attention module (SAM). Its authors verified that using max pooling and average pooling together in the channel attention module works best, and also compared how the ordering of the channel and spatial attention modules affects model performance [20]. Wang et al. [25] introduced a novel backbone network and inserted the CBAM attention module into it for better identification and diagnosis of COVID-19. Zhu et al. [26] combined CBAM with the efficient channel attention network (ECA-Net) in YOLO-v5, improving accuracy by 3.4%. Ubaid et al. [27] inserted CBAM into the PANet of YOLO-v4 and added an attention module at the end of the feature fusion network to extract effective feature weights. CBAM and SENet are relatively lightweight, plug-and-play modules with high portability.

CBAM derives attention weights from the original feature map, attending to information in both the channel and spatial dimensions, and then multiplies the spatial and channel attention maps with the original feature map to obtain a new feature map. CBAM performs global max pooling and average pooling on the input features in both the channel attention and spatial attention branches. In the channel branch, the pooled results are passed through a shared fully connected network and converted into one-dimensional vectors; multiplying the resulting channel attention with the features yields a channel-refined feature map. In the spatial branch, pooling first generates two-dimensional spatial maps, which are then concatenated and convolved to obtain the spatial attention feature map.
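The following minimal PyTorch sketch illustrates this channel-then-spatial ordering; hyperparameters such as the reduction ratio and the 7 × 7 spatial kernel follow the original CBAM paper [20] rather than this paper's exact configuration.

```python
# A minimal CBAM sketch: channel attention, then spatial attention,
# each combining max pooling and average pooling.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(  # shared MLP for both pooled vectors
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # concatenate channel-wise average and max maps, then convolve
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # refine channels first
        return x * self.sa(x)  # then refine spatial locations
```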

CANet captures channel, position and direction-aware information simultaneously by embedding position information into the channel attention. Combining horizontal attention with vertical attention gives the best effect and makes the module easy to insert into channel attention. In classical network structures, this module enhances the representational ability of mobile networks. Zha et al. [28] used an improved feature fusion structure to embed the CANet mechanism in MobileNetV2, extracted feature maps at different levels through the SPP structure, and finally analyzed the feature results with the YOLO head. CANet addresses the loss of position information caused by global pooling: it encodes the feature tensor one-dimensionally along the horizontal and vertical directions, capturing channel-, direction- and position-aware information, and fuses the spatial information through channel-wise weighting.
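A minimal PyTorch sketch of this coordinate attention idea follows; the reduction ratio and layer details are illustrative defaults rather than the exact configuration used here.

```python
# A minimal coordinate attention sketch: pool separately along width
# and height so that 1-D positional information survives, then learn
# direction-aware channel weights.
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # 1-D encodings: pool over width (keeps H) and over height (keeps W)
        x_h = torch.mean(x, dim=3, keepdim=True)                       # b,c,h,1
        x_w = torch.mean(x, dim=2, keepdim=True).permute(0, 1, 3, 2)   # b,c,w,1
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                          # b,c,h,1
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))      # b,c,1,w
        return x * a_h * a_w  # fuse directional weights back onto channels
```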

2.3 The Attention Module Introduced in YOLO-v4

The attention mechanism is introduced into YOLO-v4 by adding it to the feature extraction network module [29], so that more attention is allocated to the features that matter. A major advantage of attention mechanisms is that they enable better visualization and explanation of the entire model [30]. In computer vision, much of the research combining deep learning with visual attention has focused on the use of masks [31]. Of the modules used in this paper, SENet belongs to channel attention, CBAM is a mixed-domain attention mechanism, and CANet embeds location information into channel attention. Jiang et al. [32] proposed a fast object detection method based on YOLOv4-tiny, combining channel and spatial attention to extract more effective feature information and improve the network structure. Wang et al. [33] used a convolutional neural network with an attention mechanism that can be incorporated into the network architecture and trained end-to-end to improve detection accuracy. Gao et al. [34] introduced a channel attention mechanism into YOLO-v4, applying global average pooling to the extracted features to strengthen the correlation between channel features; the mAP value increased by 0.62%. Axiang et al. [35] applied a multi-attention mechanism to a mask-wearing detection model, improving the feature mining ability of the network and achieving an average accuracy of 93.18%. To reduce the speed of virus transmission and the spread of droplets, Mao et al. [36] improved a CNN with an attention mechanism, which can effectively help combat virus transmission. In object detection, genetic algorithms are often chosen to optimize the structure and initial weights of convolutional neural networks to improve detection accuracy [37, 38]. Guo et al. [39] combined an attention mechanism with YOLO-v5 to detect face mask wearing; even in dim light the accuracy remained around 92%. Xue et al. [40] combined an attention mechanism with the RetinaFace algorithm, improving its network structure and computing the locations of mask key points and facial features. An attention mechanism and multiple residual skip connections have also been introduced to recognize faces occluded by masks [41]. Liu et al. [42] combined an improved SSD algorithm with an attention mechanism to strengthen facial feature extraction and improve the accuracy and speed of face mask recognition.

YOLO-v4 is an end-to-end real-time object detection algorithm [43]. To reduce network complexity and avoid unnecessary redundant computation, the attention modules should be added at appropriate layers of the network structure.

In a deep convolutional neural network, shallow features are more generic and conform to the general characteristics of images, while deep features are more complex and their representations are more task-specific, making them better suited to attention adjustment.

Based on the above principles, this paper adds SENet, CBAM and CANet to the PANet structure of YOLO-v4 and before the YOLO head for comparative experiments, in order to obtain richer semantic information. The face mask detection flow chart is shown in Fig. 1.

Fig. 1 Flow chart of face mask detection

For mask detection, the first step is to define the specific detection task and extract images with the relevant features. The labeled mask dataset is then fed to the model for training to obtain pre-trained weights. The model extracts key feature points from the dataset images, predicts bounding boxes from this key point information, and compares their overlap with the ground-truth boxes. When the overlap ratio exceeds the specified threshold, the mask data are marked with feature points, and finally the probability of wearing the mask correctly is annotated on the image. This is the complete mask detection process.

2.4 Evaluation Indicators

Precision is the proportion of true positive samples among all results judged to be positive; the expression is as follows:

$$ {\text{Precision}} = \frac{{{\text{TP}}}}{{\left( {{\text{TP}} + {\text{FP}}} \right)}}. $$
(1)

Recall is the proportion of actual positive samples that are correctly predicted as positive; the expression is as follows:

$$ {\text{Recall}} = \frac{{{\text{TP}}}}{{\left( {{\text{TP}} + {\text{FN}}} \right)}}. $$
(2)

The recall and precision rates are calculated from the four types of model prediction results, where TP is true positive, FP is false positive, FN is false negative, and TN is true negative.

The precision-recall curve (P-R curve) plots precision against recall. Each point on the curve corresponds to a confidence threshold: predictions scoring above the threshold are judged positive samples and the rest negative, yielding the recall and precision of the prediction results at that threshold.

IoU (intersection over union) is the ratio of the intersection to the union of the predicted bounding box and the ground-truth bounding box; the expression is as follows:

$$ {\text{IoU}} = \frac{{{\text{Area}}\;{\text{of}}\;{\text{overlap}}}}{{{\text{Area}}\;{\text{of}}\;{\text{union}}}}. $$
(3)
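As a concrete reference, a minimal sketch of Eq. (3) for axis-aligned boxes in corner form is shown below.

```python
# A minimal IoU sketch, assuming boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # overlap rectangle (clamped to zero if the boxes do not intersect)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    overlap = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```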

AP (average precision) measures the quality of the model on each category, and mAP averages the AP over all categories. The mAP value directly reflects the performance of the system and model, that is:

$$ {\text{mAP}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {{\text{AP}}_{i} } $$
(4)
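A minimal sketch tying Eqs. (1), (2) and (4) together follows; the counts and per-class AP values are hypothetical, for illustration only.

```python
# Hedged sketch of the evaluation indicators; all numbers are made up.
def precision(tp, fp):
    return tp / (tp + fp)  # Eq. (1)

def recall(tp, fn):
    return tp / (tp + fn)  # Eq. (2)

ap_per_class = [0.98, 0.85, 0.95]  # hypothetical AP for mask/face/face_mask
mAP = sum(ap_per_class) / len(ap_per_class)  # Eq. (4)
print(f"precision={precision(90, 10):.2f}, "
      f"recall={recall(90, 20):.2f}, mAP={mAP:.4f}")
```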

2.5 Dataset

The dataset used in the experiments contains 7959 pictures in total: 3894 pictures from the WIDER Face dataset and 4064 pictures of faces wearing masks from the MAFA dataset, open-sourced by Ge Shiming of the Institute of Information Engineering, Chinese Academy of Sciences.

The dataset was divided into training and validation sets, with 6120 images for training and 1839 images for validation. The dataset divides the images into three categories: wearing a mask, not wearing a mask, and the mask position. Not only are masked faces annotated; to cope with the varying lighting and angles of real application scenes, the training pictures also include faces covered by occluders other than masks (Fig. 2).
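As an illustration, a minimal sketch of a reproducible train/validation split is shown below; the annotation file name is a hypothetical placeholder, and the ratios follow the splits discussed in Sect. 4.2.

```python
# Hedged sketch of a reproducible dataset split (one annotation per line).
import random

with open("annotations.txt") as f:  # hypothetical annotation list
    lines = f.readlines()

random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(lines)
cut = int(0.9 * len(lines))  # 9:1 split; use 0.76 for the 76:24 variant
train, val = lines[:cut], lines[cut:]
print(len(train), len(val))
```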

Fig. 2 Display of single-target and multi-target datasets

3 The Algorithm Proposed in this Paper

The main research content of this paper is to insert attention modules into the neck module (PANet) of the YOLO-v4 network and before the YOLO head, and to conduct comparative experiments, thereby improving the YOLO-v4 network structure model.

The algorithm model parameters are improved as follows: at the neck and YOLO head of the YOLO-v4 network, an attention module is integrated into each feature fusion area of these two parts (after the Add and concat layers, and before the YOLO head) to perform feature extraction.

The existing public datasets are fully used for training, and the performance of the three attention modules (CBAM, SENet and CANet) combined with YOLO-v4 is evaluated and compared on the face mask detection task.

The improved YOLO-v4 network architecture is shown in Fig. 3. Taking CBAM as an example, four CBAM modules are added behind the three branches of the PANet feature fusion network and in front of the YOLO head, marked with serial numbers (number ① is placed after the concat layer of the PANet network, and numbers ②, ③ and ④ are placed before the three YOLO detection heads of different sizes). Two new network structures are derived: the model obtained by adding all four CBAM modules to the YOLO-v4 network is called YOLOv4-CBAM-A, and when the CBAM module numbered ① is removed, the model is named YOLOv4-CBAM-B.

Fig. 3 The YOLO-v4 models with embedded attention modules

Compared with the original model, the improved YOLO model adds attention mechanisms at four places in the network: three before the three detection heads of different sizes, and one after the concat layer of the neck network.
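As an illustration of this wiring, the following minimal PyTorch sketch applies one attention module (here the CBAM sketched in Sect. 2.2) at each of the four positions; the channel sizes and feature map names (p3, p4, p5) are hypothetical placeholders, not the authors' actual code.

```python
# Hedged sketch of the YOLOv4-CBAM-A wiring described above.
import torch.nn as nn

class AttentionNeck(nn.Module):
    def __init__(self, channels=(128, 256, 512), attn=CBAM):
        super().__init__()
        self.attn1 = attn(channels[0])  # position 1: after the PANet concat
        # positions 2-4: one module before each of the three YOLO heads
        self.pre_heads = nn.ModuleList(attn(c) for c in channels)

    def forward(self, p3, p4, p5):
        p3 = self.attn1(p3)  # omit this line to obtain the "B" variant
        # refined maps are then passed to the three YOLO heads (not shown)
        return [m(f) for m, f in zip(self.pre_heads, (p3, p4, p5))]
```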

4 Experiment

4.1 Experimental Environment

The experimental environment in this paper: processor: Intel(R) Xeon(R) Gold 5218R CPU @ 2.10 GHz; memory: 64 GB; graphics card: NVIDIA GeForce RTX 3090; development environment: PyCharm; language: Python 3.7.

4.2 Model Training

Transfer learning is used to train the model from pre-trained weights. The PASCAL Visual Object Classes (VOC) pre-trained weights are used to initialize the improved model, and the attention module is added at the appropriate network level.

When training YOLO-v4 and YOLO-v4 with attention modules, in most experiments the data are divided into training and validation sets at a ratio of 9:1, while in a small number of experiments the ratio is 76:24. Two input image sizes are used: 416 × 416 × 3 and 608 × 608 × 3. The Adam optimizer with a weight decay of 0.0005 is used for optimization. The learning rate is decayed by cosine annealing, with the minimum learning rate set to 0.00001. A total of 150 epochs are trained: during the first 50 epochs the parameters of the backbone feature network are frozen, and for the last 100 epochs the network is unfrozen and trained as a whole.
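This schedule can be sketched as follows in PyTorch; `model`, its `backbone` attribute, and `train_one_epoch` are hypothetical placeholders rather than the authors' actual code.

```python
# Hedged sketch of the freeze/unfreeze schedule with cosine annealing.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=150, eta_min=1e-5)  # cosine decay down to 1e-5

for epoch in range(150):
    freeze = epoch < 50  # first 50 epochs: backbone parameters frozen
    for p in model.backbone.parameters():
        p.requires_grad = not freeze
    train_one_epoch(model, optimizer)  # hypothetical per-epoch training step
    scheduler.step()
```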

5 Results and Data Analysis

To evaluate the effectiveness of the attention-augmented networks introduced into YOLO-v4, this paper combines the above work with the selected WIDER Face and MAFA datasets to train and test the network performance indicators. In the experiments, the three attention modules CBAM, SENet and CANet were added to different network layers of YOLO-v4, and performance was tested on three indicators: mAP, detection time and model size. In the ablation experiments, the input picture sizes are 416 × 416 × 3 and 608 × 608 × 3.

Figure 4 shows the actual effect of the benchmark network YOLO-v4 and the improved models on single-target and multi-target image detection. In the four sets of images in Fig. 4, panels a and b compare the detection results of the original YOLO-v4 model and YOLOv4-CBAM-A under multiple and single targets, respectively. Panels c and d compare the YOLO-v4 benchmark network with the attention-augmented YOLOv4-SENet-A and YOLOv4-CANet-A network models under multiple targets.

Fig. 4 Comparison between the original YOLO-v4 model and the network with the attention mechanism added

As shown in Fig. 5, the original model is compared directly with the highest-precision model, YOLOv4-CBAM-A. With an input image size of 608 × 608 × 3, the test results on single-target and multi-target images are compared:

Fig. 5 Model comparison diagram

The experiments record the mAP obtained by adding the three attention modules at the YOLO-v4 network levels, the FPS under single-target and multi-target detection, and the AP values for the mask, face, and face-mask categories. The models were trained for a total of 150 epochs: the first 50 epochs were trained with the parameters of the backbone feature network frozen, followed by 100 epochs after unfreezing. The learning rate before unfreezing is 0.0001 and after unfreezing is 0.00001; the batch size before unfreezing is 20 and after unfreezing is 50. The model training environment is shown in Table 1.

Table 1 Specific environment for experimental operation

Some previous research on face mask detection adopted RGB color information extraction for mask detection, but lacked consideration of irregularly worn masks; moreover, when the person to be detected is far from the camera, the resulting small target is hard to detect. The method proposed in this paper treats detection as a three-class problem, fully considering whether the mask is worn in a standard way, and adds an attention mechanism to facilitate the detection of small target objects.

As shown in Table 2, with the same input image size of 608 × 608 × 3, when the dataset is divided 9:1 the model receives more training samples and achieves better detection results on the test set, indicating that the detection ability of the model is effectively improved through learning. Therefore, in the improved YOLO-v4 network training, increasing the number of training images improves the feature learning ability of the network model. The mAP value with the 9:1 split is 1.0% higher than with the 76:24 split.

Table 2 Comparison of experimental results under different partitions of the dataset

As shown in Table 3, YOLOv4-CBAM-A, which fuses attention modules at four levels of the network, performs better than fusing attention only before the three YOLO detection heads. The AP values for the face, face-mask and mask categories detected by the first network structure, YOLOv4-CBAM-A, reach 85.06%, 95.63% and 98.32%, respectively, while the second structure, YOLOv4-CBAM-B, reaches 84.41%, 95.40% and 98.24%. The first structure thus improves the three categories by 0.65%, 0.23% and 0.08%, respectively, with the face category improving most clearly, so the first network structure is more effective.

Table 3 Comparative data on the effectiveness of two different improved networks

As shown in Table 4, when the input image size is 416 × 416 × 3, adding the CANet attention mechanism in configuration A gives the largest improvement: compared with the benchmark YOLO-v4 network, the mAP value increases by 4.2%.

Table 4 Comparison of three different attention mechanisms added to the benchmark network

In Table 5, mAP (mean average precision) measures the detection accuracy of the model, and fps (frames per second) measures its inference speed.

Table 5 Recorded experimental results

From the ablation results in Table 5, detection with an input image size of 608 × 608 × 3 is better than with 416 × 416 × 3. Testing the two improved networks shows that adding the attention module simultaneously in the PANet feature fusion region and before the YOLO head is better than adding it only in the PANet feature fusion region. Comparing the three mechanisms CBAM, SENet and CANet, at an input size of 608 × 608 × 3 the best result is obtained by adding the CBAM dual (channel and spatial) attention mechanism before both the feature fusion region and the detection heads.

From the result data, adding attention modules to the YOLO-v4 network slightly improves accuracy, and adding CBAM works better than CANet or SENet. The experimental data also show that dividing the data into training and validation sets at a 9:1 ratio gives clearly better results than a 76:24 division. The experiment is a three-class problem with the IoU threshold set to 0.5. The best accuracy after adding the CANet module is 93.18%. An input image size of 608 × 608 × 3 yields better results after training than 416 × 416 × 3, an increase of 0.08%. Comparison of the ablation experiments shows that adding the attention module both before the YOLO detection heads and in the neck PANet gives the best detection results; compared with the original model, detection accuracy is improved. After adding the attention module, the most obvious improvement is in the AP value for face detection, as shown in Fig. 6.

Fig. 6 AP values for the three categories: mask, face, and face mask

In this experiment, the model is first trained; the resulting model is then verified with the weights obtained after training and evaluated on the test set to classify and identify face masks. The recall and precision values for the three categories (mask, face and face mask) are shown in Fig. 7.

Fig. 7 Recall value and precision value of each category

From the experimental data in Fig. 7, among the three categories (mask, face and face_mask), the recall and precision of the mask category are relatively high. The original YOLO model achieves a recall of 96.61% for the mask category; after adding the CBAM module, the improved model achieves a mask recall of 97.65%, an increase of 1.04%. The improved models achieve their largest gains on the face category: the model with the highest face recall is YOLOv4-CANet-A, an increase of 13.02%, and the model with the highest face precision is YOLOv4-SENet-B, 3.89% higher than the original model. It follows that to improve the performance of the detection model, the key is to improve the recall and precision of the face category.

The purpose of introducing attention modules into YOLO-v4 in this paper is to minimize unnecessary redundant computation, reduce network complexity, and add suitable modules at appropriate places in the network structure. In convolutional neural networks, shallow features are relatively generic, capturing a wide range of general image characteristics, while deep features are more abstract and complex, and their stronger representational capacity makes them better suited to attention-based adjustment. Therefore, introducing attention mechanisms into the YOLO-v4 network as done in this paper benefits the collection of object feature information.

6 Conclusion

To detect whether faces wear masks correctly in practical application scenarios, this paper studies the YOLO-v4 object detection network combined with attention modules to enhance the key facial feature point information. The algorithm model is improved by adding attention modules at appropriate network layers and comparing them in ablation experiments: CBAM, SENet and CANet are added in the feature fusion area of YOLO-v4's PANet and before the YOLO head. The experimental results show that fusing CBAM with the YOLO-v4 model yields clearly better detection, and that adding the attention module both in PANet and before the YOLO head is better than adding it only in the PANet area. The application of this model can address problems such as incomplete detection when there are many targets in the scene, or misjudging a face as masked when it is covered by a non-mask object or body part, thereby improving public health safety.

This work builds on the YOLO-v4 detection algorithm, which is itself a one-stage object detector that is lightweight and has strong real-time performance. This paper proposes a new feature fusion network combining the PANet module with attention mechanisms, enhancing the model's ability to fuse multi-scale features and improving detection performance. Training on the face mask dataset, the optimized algorithm improved accuracy by 4.66% compared with the original benchmark model. However, YOLO-v4 is not the optimal network: the YOLO series is updated rapidly, and integrating attention mechanisms with newer versions of YOLO may yield better detection performance. This is what we will study in the future.