1 Introduction

The coronavirus outbreak began at the end of 2019, and it is still wreaking havoc on the livelihoods and businesses of millions of people around the world [13]. As the world recovers from the pandemic, people intend to return to the same state of normality as before it. However, there is widespread unease about returning to normal routines, because the virus spreads through droplets of saliva from an infected person and can affect people within a range of approximately 6 feet. The main symptoms of the infection are fever, headache, cough, respiratory difficulties, and loss of the senses of taste and smell, and in severe cases it can lead to the death of the infected person [41]. The incidence rate of COVID-19 is higher than that of other acute respiratory diseases such as severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS).

To prevent this deadly virus, the World Health Organization (WHO) [35] issued guidelines and SOPs such as wearing a face mask and maintaining social distance in public spots. In this regard, several research studies have also reported that maintaining distance during physical interaction between people can prevent the spread of most respiratory diseases [21]. Tangana et al. [1] presented a mathematical model to demonstrate the impact of physical distance during interaction on the transmission of the virus among people. Another study [15] demonstrated that wearing a face mask is highly effective in mitigating the reproduction of coronavirus. However, manual monitoring and enforcement of the aforementioned SOPs in public places such as schools, universities, shopping malls, and parks is a quite challenging task.

In step with the rapid advancement of Artificial Intelligence (AI), and Deep Learning in particular, the computer vision community has contributed various state-of-the-art methods for intelligent surveillance [65], object detection [6] and recognition [5, 7], and scene understanding [46]. These methods can be employed to develop an intelligent monitoring system for face mask detection and social distance measurement in public places. However, there are two main challenges in this direction. Firstly, to the best of our knowledge, there is no South Asian standard benchmark available to evaluate facial mask detection and social distance measurement methods. Secondly, there is no pipeline available for the development of an end-to-end real-time intelligent monitoring system for facial mask detection and social distance measurement. It is important to mention that several research studies have employed standard single- and multi-stage object detectors such as Faster-RCNN, SSD, and RetinaNet to perform face mask detection [17]. However, these methods do not consider social distance measurement, which makes them insufficient for deployment in actual public places.

To address the aforementioned shortcomings of existing state-of-the-art methods, we make the following contributions in this paper.

  1. A local dataset containing 10,000 images based on two classes (i.e., masked face and unmasked face) has been collected from public places. It is worth noting that these classes are unique in orientation and dress codes, which are not covered in the existing datasets.

  2. Existing state-of-the-art single- and multi-stage object detectors are fine-tuned on the proposed dataset. Based on this analysis, an improved YOLO-v3 based object detection architecture is presented to enhance the robustness of real-time surveillance systems.

  3. Alongside, a machine-vision based distance measurement method is proposed to ensure social distancing in public places.

  4. Lastly, an extensive comparative study is carried out between state-of-the-art face mask detection methods and the proposed method to demonstrate the effectiveness of our method in terms of higher detection and recognition accuracy, and lower inference time.

The rest of the paper is organized as follows. In Section 2, we briefly discuss existing state-of-the-art facial mask detection and social distance measurement methods, along with the available datasets. In Section 3, we present a detailed overview of our proposed end-to-end pipeline for face mask detection and social distance measurement. The experimental results are presented in Section 4. Finally, the paper is concluded in Section 5.

2 Related work

Real-time object detection and recognition methods can play an important role in developing intelligent monitoring methods for face mask detection and social distance measurement to prevent coronavirus transmission. In this section, we analyze the existing state-of-the-art methods employed in developing such intelligent monitoring systems, which include: (i) single- and multi-stage detection methods for masked and non-masked face detection, (ii) available datasets for developing generalized face detection systems, and (iii) social distance measurement methods.

2.1 Facial mask detection

In the majority of existing research works, researchers have focused on face reconstruction and identity recognition while wearing face masks. The aim of this study, however, is to detect the human face in both states (wearing a mask or not wearing a mask) in order to assist in reducing COVID-19 transmission and spread. In recent studies, researchers have demonstrated that wearing face masks minimizes the rate of COVID-19 spread, as masks can intercept airborne germs effectively [38]. However, monitoring people in public places is still a challenging task. In this regard, Zhang et al. [62] proposed a single-shot refinement face detector, namely RefineFace, to detect people not wearing a face mask. In another research work, Jagadeeswari et al. [19] proposed an SSD-based face mask detection method for outdoor environments. Khandelwal et al. [22] presented a deep learning approach for classifying human faces with and without masks. Onyema et al. [40] proposed a convolutional neural network based method for facial expression recognition. Hussain et al. [16] proposed a deep learning based IoT system to detect face masks using a transfer learning approach.

The aforementioned approaches achieved good accuracy on their respective test data; however, real-time face mask detection is still a critical challenge for system developers. In this regard, Snyder et al. [56] introduced a deep learning based approach for mask detection to prevent COVID-19 transmission. Kodali et al. [23] presented a custom CNN-based model to detect faces wearing masks in public spots. Similarly, Sagayam et al. [54] proposed a deep neural network based method for binary-class (i.e., masked and non-masked) face state recognition. Degadwala et al. [9] proposed a YOLO-v4 based face detection method trained and tested on the WIDER-FACE and MAFA datasets. Likewise, Taneja et al. [58] presented a facial mask detection system based on the lightweight MobileNetV2 CNN and achieved 99.98% accuracy. On the other hand, Sethi et al. [55] detect masks using ResNet-50; compared with the RetinaFaceMask detector, their model achieves 11.07% higher precision and 6.44% higher recall.

In another research work, Loey et al. [30] presented a multi-stage detection method for detecting faces with and without masks. Alongside, an ensemble method combined with a deep learning model has been used to detect face masks on real-world and synthetic data to improve the generalizability of machine learning models. These research works, along with their strengths and limitations, are summarized in Table 1. To this end, we conclude that the deployment of the above-discussed face mask detection systems encounters several constraints at the development and deployment levels, such as diverse types of face masks, face orientations, and illumination conditions [52]. Further challenges include stabilizing the accuracy of the object detection model under real-time conditions and placing the detector on systems with limited computing capacity. In the circumstances of the epidemic, facial mask detection is still under-explored in images, videos, and closed-circuit television (CCTV) footage for controlling the transmission chain of the virus [37].

Table 1 An Overview of Existing Machine Learning Methods Used for Face Mask Detection and Recognition Tasks

2.2 Available datasets

In the context of COVID-19, face datasets play an essential role in training deep models for masked and non-masked face detection. Recently, several datasets have been proposed to accelerate research in this direction. In this regard, Ge et al. [12] proposed the MAFA dataset, which contains 30,811 images collected from the Internet. These images cover distinct types of masks, several occlusion degrees, and orientations. Furthermore, Laxel [33] introduced the Face Mask Dataset (FMA), which holds 853 images across three classes, collected from Kaggle. An extended version of the Kaggle dataset, also denoted FMA, was proposed by Wobot [18] and contains 6,024 images across 20 classes. Rahmani et al. [45] proposed the Medical Mask Dataset (MMD), which consists of 9,067 images with three classes and is used to detect only medical masks. On the other hand, Wang et al. [26] studied database configurations across multiple sensor technologies, such as cameras, LiDAR, inertial gyroscopes, and wireless sensors, used as data acquisition stages. Liang et al. [27] utilized various sensors to obtain image and geographic location information simultaneously and built an indoor 3D map using geographic coordinates. Niu et al. [39] addressed the social distancing problem in 3D by using monocular cameras for pedestrian 3D localization. Furthermore, Magoo et al. [31] set up a bird's-eye view framework with the YOLO-v3 model to monitor social distance in public areas. Although the research community has contributed several social distance measurement methods, the deployment of such systems in real-world environments is still a challenging task.

3 The method

To address the above-mentioned issues, we propose a novel pipeline for developing an end-to-end face mask detection method to monitor public spots in order to mitigate the spread of COVID-19, as shown in Fig. 1. Firstly, we present the large-scale MUST Face Dataset (MFD), containing 10,000 images along with binary-class bounding box annotations, i.e., face wearing a mask and face not wearing a mask. Alongside, we analyze existing state-of-the-art single-stage and multi-stage object detectors on our proposed dataset. Specifically, we fine-tuned the existing YOLO-v3 [49], SSD [63], RetinaNet-50 [28], Fast-RCNN [50], Faster R-CNN (FPN) [32], Faster-RCNN (ResNet-50) [25] and Faster-RCNN (ResNet-101) [29] on our proposed dataset through transfer learning. Based on its better performance, we further improved the YOLO-v3 architecture to robustify its performance in outdoor environments. On top of our face detector, we employ our proposed social distance measurement method, which takes input from the face detector and computes the distance between two human beings to mitigate the spread of COVID-19 in public spots.

Fig. 1
figure 1

The Proposed Pipeline for Developing Face Mask Detection and Social Distance Measurement in Public Places

3.1 MUST face dataset

To this end, we collect and release the MUST Face Dataset (MFD), a large-scale dataset to accelerate the development of generalized methods for end-to-end face mask detection in public places. Our MFD contains 10,000 images along with binary-class (i.e., masked face, non-masked face) bounding box annotations. The proposed dataset is generated from video sequences captured by surveillance cameras installed outside departmental buildings. The average height of the installed cameras is in the range of 12 to 15 feet from the ground. After video sequence collection, crowded frames were manually extracted while ensuring quality control parameters such as the positioning of the people and the clarity of the images. It is important to mention that we complied with the regulatory bodies and collected the data from permitted areas. To protect privacy, we do not disclose or release personal identities, geo-locations, or information on people's incoming and outgoing patterns.

After completing frame extraction, and considering the use case of our proposed method, we defined two classes for annotation, i.e., masked face and non-masked face. For this purpose, we employed the LabelImg annotation tool to label the human faces according to the aforementioned classes. One of the reasons for manual annotation instead of automated labeling is to maintain the accuracy of the ground-truth coordinates, which plays an important role in training a robust face detection model. All annotations were cross-validated by a team of experts to ensure the quality of the ground truth. Some samples of our dataset are shown in Fig. 2.
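As a sketch of such an annotation pipeline, the Pascal VOC XML files that LabelImg produces can be converted into YOLO-style normalized labels. The class names and their integer ids below are illustrative assumptions, not the dataset's released format:

```python
import xml.etree.ElementTree as ET

# Assumed mapping from class names to integer ids (hypothetical labels).
CLASSES = {"masked_face": 0, "non_masked_face": 1}

def voc_to_yolo(xml_string):
    """Convert one LabelImg Pascal VOC annotation into YOLO txt lines."""
    root = ET.fromstring(xml_string)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = CLASSES[obj.find("name").text]
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, all normalized.
        xc = (xmin + xmax) / 2.0 / img_w
        yc = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines
```

Keeping the conversion explicit like this makes it easy to spot-check the ground-truth coordinates that the experts cross-validate.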

Fig. 2
figure 2

Sample Images From Our MUST Face Dataset

3.2 Suitable face detection method selection

Recently, deep object detection methods have demonstrated good applicability in various real-time object detection and recognition tasks [24]. To select a suitable deep learning object detector, we first fine-tuned the existing state-of-the-art single-stage and multi-stage detection methods, including YOLO-v3 [49], SSD [63], RetinaNet-50 [28], Fast-RCNN [50], Faster R-CNN (FPN) [32], Faster-RCNN (ResNet-50) [25] and Faster-RCNN (ResNet-101) [29], on our proposed MFD through transfer learning. The results show that YOLO-v3 outperformed the other employed detection methods in terms of inference time and accuracy. Based on this better performance, we further improved the YOLO-v3 architecture to robustify its performance in outdoor environments.

3.3 Proposed facial mask detection architecture

In the proposed framework, we employ the YOLO-v3 architecture to perform facial mask detection in real time. YOLO-v3, proposed by Joseph Redmon and Ali Farhadi in 2018 [48], is one of the most outstanding deep learning object detectors and has demonstrated consistent performance on object detection and recognition tasks. One of the main issues in earlier detection networks was the vanishing gradient problem, which commonly occurs as the number of network layers increases. Therefore, the multi-scale YOLO-v3 introduced residual connections, which add the input of a layer to the output of a later layer, similar to the ResNet architecture. As a result, YOLO-v3 achieves good performance even on low-resolution images due to its multi-scale feature extraction property. To this end, we take the existing YOLO-v3 architecture and use k-means clustering to compute 9 anchor boxes, which are then split across three detection scales to obtain more bounding boxes per image than the baseline version.
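The k-means anchoring step can be sketched as below. This is an illustrative implementation that clusters ground-truth (width, height) pairs with 1 - IoU as the distance measure, as is commonly done for YOLO-v3's 9 anchors, and is not the exact code used in the pipeline:

```python
import numpy as np

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchor boxes; for YOLO-v3,
    k=9 anchors are later split into 3 groups of 3, one per scale."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # IoU of every box against every anchor, assuming aligned corners.
        inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
                 np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
        union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
                (anchors[:, 0] * anchors[:, 1])[None, :] - inter
        assign = np.argmax(inter / union, axis=1)  # highest IoU = closest
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    # Sort by area; in YOLO-v3 the smallest anchors serve the finest scale.
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```

Using 1 - IoU instead of Euclidean distance keeps large boxes from dominating the clustering, which is why it is the customary choice for anchor selection.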

The input layer takes an RGB image with a size of 416x416 pixels. As the backbone network, we employ DarkNet-53 to achieve the maximum number of floating-point operations per second. The internal structure of the model is a fully convolutional network that does not contain max-pooling layers. As depicted in Fig. 1, the network contains convolution blocks, residual blocks, and scale output layers. In a convolution block, strided convolutions are used instead of max pooling to reduce the size of the input feature maps; each convolution is followed by batch normalization and a ReLU activation. A residual block combines two convolution blocks with different kernel sizes into what we call a mega-block. In the existing YOLO-v3 architecture, the convolution blocks are repeated 1x, 2x, 4x, and 8x. However, considering the use case of our application, we reduced the repetitions of the convolution blocks to 1x, 2x, and 4x in order to improve learning performance and inference time. At the bottom of the architecture, an average pooling layer, followed by a fully connected layer and softmax activation, is employed to down-sample the feature map and obtain the binary-class output probabilities, respectively. To improve the learning process, we applied transfer learning, which utilizes the stored knowledge of a neural network for new tasks by learning only new weights. The ultimate aim of employing this technique is to speed up the learning process.
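A minimal PyTorch sketch of the building blocks described above (strided convolution blocks in place of pooling, and residual mega-blocks); the channel sizes are illustrative assumptions, not the exact layer configuration:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution + batch norm + activation; stride-2 variants of this
    block replace max pooling for down-sampling, as described above."""
    def __init__(self, in_ch, out_ch, kernel, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride,
                              padding=kernel // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        # ReLU per the text; DarkNet-53 implementations often use LeakyReLU.
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class MegaBlock(nn.Module):
    """Residual 'mega-block': two convolution blocks with different kernel
    sizes plus a skip connection that eases gradient flow."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(ConvBlock(ch, ch // 2, 1),
                                  ConvBlock(ch // 2, ch, 3))

    def forward(self, x):
        return x + self.body(x)
```

Stacking such mega-blocks 1x, 2x, and 4x between stride-2 ConvBlocks reproduces the reduced-repetition backbone outlined above.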

3.4 Social distance measurement methods

With the recent advancement of AI, computer vision based methods have demonstrated good applicability in several applications such as scene understanding, object recognition, and speed and distance estimation [14]. Some research has used proportional-integral-derivative (PID) control [57] due to its simplicity, despite its non-optimal performance; it is suitable for distance measurement and consumes little power and memory. Zhang et al. [64] proposed a distance estimation method for localizing an object in the camera coordinate frame. Their method contains three steps: the first concerns camera calibration, the second constitutes a model for distance measurement between the camera coordinate frame and its projection frame, and the third performs absolute distance estimation.

The distance is computed with respect to the pivot point of the bounding box, known as the centroid, which is calculated using (1):

$$ C_{(x,y)} = \left( \frac{x_{min} + x_{max}}{2} , \frac{y_{min} + y_{max}}{2} \right) $$
(1)

In (1), C denotes the centroid, x_min and x_max are the minimum and maximum horizontal coordinates (width) of the bounding box, and y_min and y_max are its minimum and maximum vertical coordinates (height). After calculating the centroids, the Euclidean distance formula in (2) is used to measure the distance between two centroids (x_1, y_1) and (x_2, y_2), which is then compared with the ground-truth value.

$$ D(C_{1(x,y)}, C_{2(x,y)}) = \sqrt{(x_{2}-x_{1})^{2} + (y_{2}-y_{1})^{2}} $$
(2)

After calculating the centroid of each bounding box, a unique ID is assigned to each centroid. In the next step, the distance between every pair of detected centroids is computed using the Euclidean distance. To validate correctness, the Root Mean Square Error (RMSE), given in (3), is used to estimate the error between the actual value and the value predicted by the model.

$$ RMSE = \sqrt{\frac{1}{N}{\sum}_{i=1}^{N}\left(Predicted_{i} - Actual_{i}\right)^{2}} $$
(3)
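Equations (1)-(3) translate directly into code. The following is a minimal sketch, with bounding boxes given as (x_min, y_min, x_max, y_max) tuples:

```python
import math

def centroid(box):
    """Centre of a (x_min, y_min, x_max, y_max) bounding box, per Eq. (1)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def euclidean(c1, c2):
    """Euclidean distance between two centroids, per Eq. (2)."""
    return math.hypot(c2[0] - c1[0], c2[1] - c1[1])

def rmse(predicted, actual):
    """Root mean square error over paired distance samples, per Eq. (3)."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
```

For example, `euclidean(centroid(box_a), centroid(box_b))` gives the pixel-space centroid distance that is then mapped against the ground-truth measurement.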

3.5 Proposed algorithm for real-time face mask detection

Here we present a novel algorithm, depicted in Algorithm 1, for developing and deploying an end-to-end face mask detection and social distance monitoring system in public spots.

In the first step, visual frames are captured from the real-time camera stream and passed to our developed face mask detection method for inference. Our proposed method analyzes each frame; if no face is detected, the network returns null. If faces are detected, the method also computes the distance between them. The precautionary measures, according to facial mask status and measured social distance, follow the discussion in Section 2 and cover the following scenarios: if a person wears a mask and the distance is greater than 6 feet, no action is taken. If a person does not wear a mask but the social distance is greater than 6 feet, an alert is raised. Likewise, when a person wears a mask but the social distance is less than 6 feet, an alarm is generated. Finally, if a person neither wears a mask nor maintains social distance, a warning of the highest priority is generated.
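The scenario logic above can be sketched as a small decision function; the alert label strings and the threshold parameter name are illustrative, not part of the deployed system:

```python
def alert_level(wearing_mask, distance_ft, threshold_ft=6.0):
    """Map the four mask/distance combinations to an action.

    Returns "no_action" when both precautions are observed, "alert" when
    exactly one is violated, and "warning" when both are violated.
    """
    safe_distance = distance_ft >= threshold_ft
    if wearing_mask and safe_distance:
        return "no_action"
    if wearing_mask or safe_distance:  # exactly one precaution violated
        return "alert"
    return "warning"                   # no mask and too close
```

Applied per detected pair of faces, this function drives the surveillance system's response at each frame.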

Algorithm 1
figure a

Real-time surveillance Pro.

4 Experiments and results

In this section, we evaluate the effectiveness of the proposed masked/non-masked face detection method and present a comparative study with current cutting-edge techniques. The experiments are conducted on a powerful computer running 64-bit Windows 10 with an RTX 2080 Ti graphics card with 11 GB of GPU memory, a Core i9-9900K CPU, and 32 GB of RAM.

4.1 Training setup

The training process of the proposed pipeline is divided into three fundamental steps: data pre-processing, model training, and model evaluation. Firstly, the whole dataset is randomly split into training, validation, and test sets with an 80:10:10 ratio, and the inputs are normalized to a 416x416 pixel resolution. The PyTorch library is used for the implementation of the proposed pipeline. Moreover, the experiments are organized into three phases: (i) evaluation of existing state-of-the-art object detection networks on the proposed dataset, (ii) evaluation of the improved YOLO-v3 network on the proposed dataset, and (iii) evaluation of the proposed distance measurement method.
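The random 80:10:10 split can be sketched as follows; the fixed seed is an illustrative assumption added for reproducibility:

```python
import random

def split_dataset(items, seed=42):
    """Shuffle and split items into train/validation/test at 80:10:10."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n = len(items)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

Assigning the test slice as the remainder guarantees every sample lands in exactly one split even when the dataset size is not a multiple of ten.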

4.2 Evaluation of existing state-of-the-art object detection networks on proposed dataset

The existing state-of-the-art deep object detection models YOLO-v3, SSD, RetinaNet-50, RetinaNet-101, Fast-RCNN, Faster R-CNN (FPN), Faster-RCNN (ResNet-50) and Faster-RCNN (ResNet-101) are fine-tuned on the proposed face mask detection dataset. PyTorch 1.4.0 and CUDA 11.0 are used to configure the training runs. The hyper-parameters, namely learning rate, batch size, and number of epochs, are set to 0.0001, 32, and 100, respectively, with the stochastic gradient descent optimizer used to update the model weights. The performance metrics of the employed models are shown in Table 2.

Table 2 Evaluation of existing state-of-the-art object detection networks on proposed dataset

It can be seen from Table 2 that the single-stage detectors demonstrate better applicability in terms of low inference time due to their less parametric architectures, whereas the multi-stage object detectors are computationally expensive and incur significantly higher inference times. It is also important to mention that YOLO-v3, with its 53-layer backbone, demonstrated better accuracy than SSD, RetinaNet-50, RetinaNet-101, Fast-RCNN, Faster R-CNN (FPN), Faster-RCNN (ResNet-50) and Faster-RCNN (ResNet-101). For instance, YOLO-v3 achieved 64.1% mean accuracy, 59.6% mAP, and 53.1% mAP @ 0.95 with a 28ms inference time. Similarly, SSD achieved 61.8% mean accuracy, 56.2% mAP, and 48.6% mAP @ 0.95 with a 34ms inference time. RetinaNet-50 demonstrated 55.2% mean accuracy, 51.9% mAP, and 44.7% mAP @ 0.95 with an inference time of 37ms on the test set, whereas RetinaNet-101 achieved 51.0% mean accuracy, 46.3% mAP, and 44.7% mAP @ 0.95 with a 39ms inference time, which is comparatively higher than RetinaNet-50. We next analyze the multi-stage object detectors. Fast R-CNN demonstrated 41.7% mean accuracy, 39.4% mAP, and 37.1% mAP @ 0.95 with a 132ms inference time on our test set, which is significantly higher than the employed single-shot detectors. In another experiment, Faster R-CNN based on FPN achieved a mean accuracy of 47.3%, 44% mAP, and 41.5% mAP @ 0.95, whereas Faster R-CNN with a ResNet-50 feature extraction network achieved a mean accuracy of 59.0%, 44% mAP, and 57.4% mAP @ 0.95 with an inference time of 108ms. With ResNet-101 as the backbone, Faster-RCNN showed a mean accuracy of 62.7%, 61.3% mAP, and 59.0% mAP @ 0.95 with an inference time of 98ms. Consequently, we infer that YOLO-v3 with DarkNet-53 can achieve better accuracy after further architectural fine-tuning.

4.3 Evaluation of improved YOLO-V3 architecture on proposed dataset

Based on the above analysis, the architecture of YOLO-v3 is further improved by trimming the less contributing convolutional layers and residual connections. The improved DarkNet feature extractor is then evaluated on the proposed dataset. In order to train the network faster, we employed transfer learning to learn high-level features from the proposed dataset. In the training setup, we employed the SGD optimization algorithm with momentum to train and evaluate the improved network on our proposed dataset for masked/non-masked face detection. The well-known performance metrics mean accuracy, mAP, mAP @ 0.95, and inference time are used to evaluate the performance of our improved masked/non-masked face detector. Mean accuracy refers to the number of correct predictions divided by the total number of data samples, mAP denotes mean average precision, and mAP @ 0.95 is the average precision at an intersection-over-union threshold of 0.95. Furthermore, inference time refers to the total time taken from receiving an input to producing an output.
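For reference, the intersection over union underlying mAP @ 0.95 can be computed as follows; this is a generic sketch rather than the exact evaluation code:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes.

    Under mAP @ 0.95, a detection counts as a true positive only when its
    IoU with a ground-truth box is at least 0.95.
    """
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

The strict 0.95 threshold explains why the mAP @ 0.95 figures in Tables 2 and 3 sit well below the plain mAP figures.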

Table 3 Evaluation of improved YOLO-v3 on proposed dataset

It can be seen from Table 3 that our improved YOLO-v3 based detection network outperformed the baseline YOLO-v3 on masked/non-masked face detection on our proposed dataset. One of the main reasons behind the increase in accuracy is the trimming of the less contributing residual connections, which accelerated the performance of our model compared to the baseline. Some sample results are shown in Fig. 3 to demonstrate the effectiveness of our proposed masked/non-masked face detection method.

Fig. 3
figure 3

Qualitative examples of our masked/non-masked face detection method on our face mask dataset

4.4 Evaluation of proposed distance measurement method

After evaluating our proposed masked/non-masked face detection method, we next evaluated our proposed machine-vision based distance measurement method for ensuring social distancing in public places. Following standard performance metrics, we employed the root mean square error to analyze the correctness of our method against the ground truth. A quantitative analysis is shown in Table 4.

Table 4 Results of proposed distance measurement methods

The vision-based system detects the faces of people and returns the bounding box information. It then determines the central point (centroid) of each bounding box around a face and measures the distance between two centroids using the standard Euclidean distance equation. The error rate is computed using the RMSE, which captures the difference between the ground-truth and predicted values of the model. For instance, in the Distance 1 sample, the actual distance (ground truth) between two persons is 2.44 feet, whereas our proposed vision-based distance measurement method predicts 2.37 feet, with a rather small error rate of 0.035 RMSE. In the next data sample, Distance 2, the actual distance is 2.99 feet, and our model inferred 2.95 feet with an RMSE of 0.020. Similarly, in the Distance 3 sample, the ground-truth value is 3.16 feet, whereas the proposed method predicts 3.10 feet with an error rate of 0.030, which is quite effective performance on our test set.

5 Conclusion

In this paper, a novel pipeline for developing an end-to-end masked/non-masked face detection method is proposed to improve the effectiveness of real-time surveillance systems in public places. Alongside, a new dataset containing 10,000 images of two classes (masked face, non-masked face) is constructed to support generalized masked/non-masked face detection and social distance measurement in outdoor public places. While fine-tuning existing state-of-the-art single-stage and multi-stage detection methods, we observed that YOLO-v3 outperformed the other networks in terms of accuracy and inference time. Based on this analysis, we further improved the baseline YOLO-v3 by eliminating the less contributing residual connections in the network. Consequently, the results indicate that our customized YOLO-v3 performed better than the baseline version, showing an improvement of 5.3% in accuracy. In the future, we aim to extend our work to develop an image segmentation based system that can provide accurate pixel-level information and greater clarity for detecting face masks.