1 Introduction

Over the past few decades, growing security threats in urban areas have increased the relevance of vision-based surveillance systems capable of detecting pedestrians in high-density areas. Pedestrian detection has countless applications in computer vision; the most obvious is video surveillance [1].

Need and importance of pedestrian detection

Recently, owing to the COVID-19 pandemic, surveillance systems have been used to monitor crowded places. They help to identify hot spots, i.e., areas where the virus has more opportunities to spread. Since human interaction increases in highly congested areas, so do the chances of spreading the COVID-19 virus, and the demand for pedestrian detection systems has therefore increased. However, currently installed systems require conventional manual inspection of the video, which is in most cases time-consuming and infeasible. Pedestrian detection is an essential task in any intelligent video surveillance system because it provides information for the semantic understanding of video scenes. Researchers are focused on making surveillance smarter, using deep learning to move from passive to active surveillance. Several promising frameworks have been proposed to improve the accuracy and speed of pedestrian detection; nevertheless, deep learning frameworks still have room for improvement.

Issues and challenges in pedestrian detection

Detection accuracy is affected by various factors such as changes in human body appearance, trajectory, posture, abrupt motion, scale changes, complex backgrounds, pedestrian deformation, partial or complete occlusion, and shadows; the problem therefore remains unresolved. Factors related to the video capture method, such as a low frame rate, the unavailability of color information, camera sensor stability, and the compression technology, directly affect the quality of the video sequence and may constrain the design of object detection algorithms. In addition, different cameras have different sensors, lenses, resolutions, and frame rates, resulting in different image quality, and a low-quality image sequence can degrade moving object detection. In some cases, different parts of a moving object may move with different speeds and orientations. When detecting moving objects with a moving camera, estimating and compensating for the camera motion is inevitable; however, this is not an easy task because of possible depth changes and the camera's complex movements (Table 1).

The development history of CNN-based deep learning architectures can be summarized as follows

A widely used state-of-the-art Convolutional Neural Network (CNN) architecture for object detection is You Only Look Once (YOLO) [2]. Compared with other CNN-based object detection frameworks, it is among the fastest while maintaining high accuracy.

PP-YOLO (PaddlePaddle YOLO) is an efficient and fast real-time object detection model [3]. YOLO has several variants, such as YOLO v1, YOLO v2, YOLO v3, YOLO v4, and YOLO v5 [4]. However, YOLO-based detection struggles with multiple overlapping objects and occlusion. Mask R-CNN, in particular, performs better in crowded environments and can also be used for semantic segmentation, instance segmentation, and classification [5, 6]. Other architectures used for pedestrian detection include Faster R-CNN [7, 8], Fast R-CNN [9], the Region-based Fully Convolutional Network (R-FCN) [10], the Single Shot MultiBox Detector (SSD) [11], the Fully Convolutional Network (FCN) [12], the Deep Convolutional Generative Adversarial Network (DCGAN) [13], the Residual Neural Network (ResNet) [14], GoogLeNet [15], the Visual Geometry Group network (VGG Net) [16], ZFNet [17], AlexNet [18], the Deep Belief Network (DBN) [19], and LeNet [20]. These frameworks perform significantly better than classical classifiers and shallow networks such as the Support Vector Machine (SVM) [21], AdaBoost [22], the Probabilistic Neural Network (PNN) [23], the Radial Basis Network (RBN) [24], and the Artificial Neural Network (ANN) [25].

Fig. 1

Illustration of the motivation for the proposed SIRA Mask R-CNN. (a) Scale variation: the top row of the first column shows pedestrian images from the Caltech [1] and ETH [2] databases; the bottom row of the first column shows that the visual appearance changes significantly with scale. (b) Illumination variation: the top row of the second column shows pedestrian images from the Caltech [1] and ETH [2] databases; the bottom row of the second column shows a significant change in visual appearance with lighting after zooming. (c) Rotation variation: the top row of the third column shows pedestrian images from the Caltech [1] and ETH [2] databases; the bottom row of the third column shows a significant change in visual appearance when rotated after zooming

The motivation for the proposed research work

Figure 1 illustrates the motivation for the proposed contribution. Changes in the appearance of pedestrian instances make pedestrian detection difficult. Images of natural scenes usually vary in scale, orientation, and lighting, and these artifacts make pedestrian instance detection and classification a challenging task. Other problems include 1) uneven lighting, 2) blurry and hazy appearance, 3) over-exposed (highlighted) pedestrians, and 4) changes in size combined with posture changes. Figure 2 describes the key ideas of the new SIRA Mask R-CNN framework, which is built on the Faster R-CNN pipeline [12]. The illumination recognition algorithm first checks whether the input image needs illumination enhancement; if the input image has low contrast, histogram-based enhancement is performed. Afterwards, for object proposal generation, the processed image is passed through the bottom shared convolutional layers to extract its feature maps. These feature maps, together with the scale, rotation, and affine invariant feature map, are used to generate the final detection result at the recommended size. Conclusive results are improved by the proposed feature maps and the two-stage backbone network, which adapt to the current input at a given scale. Therefore, SIRA Mask R-CNN can achieve excellent detection performance over a wide range of input scales. Since SIRA Mask R-CNN shares the convolutional features of the entire image across different target proposals, it is very efficient in terms of training and testing time. Mask R-CNN has the advantage of predicting a binary mask for each detected object, whereas Faster R-CNN and FCN do not: in those networks, objects are identified and classified, but no pixel-level segmentation mask is generated and individual instances are not accurately distinguished.

Fig. 2

Illustration of the proposed SIRA Mask R-CNN. Multiple predefined box filters are used to filter multi-scale images to detect instances of different sizes. According to the proposed size of the target, the final result is obtained by fusing the two-stage detector with the scale, illumination, rotation, and affine invariant feature maps

The proposed region generation step produces RoIs, which may or may not contain the required objects. In the classification step, each RoI is classified as an object or background. However, despite its outstanding detection accuracy, Mask R-CNN is computationally costly when a new image is fed to the system: the region proposal generation process takes time to generate the RoIs. Similarly, because meaningful information about the object's scale and rotation-invariant features is unavailable, the existing Mask R-CNN cannot effectively classify small, rotated objects. These two drawbacks restrict the use of Mask R-CNN in real-world video surveillance applications.

The main objective of this paper is to propose a method that reduces the computational cost of the pre-trained Mask R-CNN in the testing phase and detects small, rotated objects efficiently. The key idea is to modify the Region Proposal Network (RPN) so that it generates a feature map combined with a scale, rotation, and affine invariant feature map, which provides additional object scale and orientation information. In particular, this feature map can locate objects in the image. Hence, proposal regions for a new input image are generated with scale- and orientation-aware information in the trained network, which results in efficient detection of small and rotated objects at a lower computational cost. Detection accuracy increases because object information is available at multiple scales. Joint feature maps of multiple resolutions, extracted from different layers of the two networks, are used for pedestrian detection, which results in a low false-positive rate. Experimental results show improved detection rates due to the proposed feature map.

R. Girshick et al. introduced Fast R-CNN [9], which addresses scale variance by brute-force rescaling of the image at the cost of time and computational complexity. Y. Gong et al. [29] recommended a multi-scale filtering model for objects of different sizes. Mask R-CNN remains very popular for object detection; however, its region proposal generation takes time to produce the RoIs. Similarly, since there is no scaling factor, objects whose size changes during detection can be falsely detected. Each region proposal is convolved with a fixed-size filter mask to obtain CNN features, and these features become inaccurate when the image exhibits significant contrast, lighting, or scale changes, or is unclear. These three shortcomings limit the application of Mask R-CNN in the real world.

The contributions of this paper are as follows:

  1. A new SIRA Mask R-CNN is proposed to overcome scale, rotation, and illumination artifacts.

  2. In the proposed method, first, the illumination recognition and detection algorithm addresses illumination changes. Second, the contourlet transform and the directional filter bank address rotation changes. Third, ASIFT addresses translation and scale invariance.

  3. A comprehensive evaluation of the proposed method on Caltech, INRIA, MS COCO, ETH, KITTI, and the proposed academic environment database demonstrates the effectiveness and robustness of the proposed SIRA Mask R-CNN.

The rest of the paper is organized as follows

The latest architectural innovations in pedestrian detection and deep learning are described in Section 2. In Section 3, we describe the theoretical background of using deep neural networks for pedestrian detection. Section 4 describes the proposed approach to pedestrian detection. Section 5 presents the experimental results and a comparative analysis of the proposed approach and related methods. The final section concludes the paper and outlines directions for further research on pedestrian detection.

2 Literature survey

Researchers have proposed many deep learning-based pedestrian detection frameworks to improve detection accuracy [30,31,32,33,34,35]. However, the accuracy of pedestrian detectors is affected by complex backgrounds, pedestrian scale changes, object occlusion, and illumination changes, and these issues are only partially addressed. The most commonly used classical method for pedestrian detection is the Histogram of Oriented Gradients (HOG) [36], in which oriented gradients are used to detect objects in the image. It fails to detect objects under occlusion, and the HOG feature extraction process is time-consuming; hence, it is not well suited for real-time use. In addition, [37] proposed a combination of HOG and Local Binary Patterns (LBP) to deal with partial occlusion of pedestrians. In this method, LBP is used to obtain texture features of the image, and the HOG and texture features are used together to detect pedestrians. P. Dollar et al. [38] introduced a method that combines HOG features with color features. Compared with HOG alone, this method is faster because it uses integral images; the feature pyramid is based on oriented gradients at different scales, which leads to fast feature computation in real time. Texture and oriented gradient features only partially address the issues and challenges of pedestrian detection. Therefore, researchers turned to region-based deep learning techniques to overcome the limitations of detectors based on texture and oriented gradient features [39, 40]. Region-based techniques use pedestrian region proposals, and their results are more accurate and simpler to obtain. In [41], instance segmentation using a coverage-loss method for object segmentation was proposed.

In that method, CNN features and the scale-invariant feature transform (SIFT) are combined to detect objects. Its limitation is that, owing to the hierarchical structure, when objects in the scene are occluded, non-adjacent areas will not merge to form a single part. A model was proposed in [42] that integrates a patch-based CNN and a global multi-receptive field (MRF) network. In this method, patches of different sizes are used to identify the RoI, a CNN is used to label each patch, and the MRF contains a graph of labeled patches in which each vertex depicts an instance label for each pixel. Its limitation is that it is not suitable for detecting multiple related occluded objects. In [43], a Relief R2-CNN for real-time pedestrian detection was presented; its main aim was faster RoI identification using CNN features. The weaknesses of this approach are that 1) it was not validated and tested in real time and 2) the classification needs more time. In [31], a unified joint detection model for cyclist and pedestrian detection was presented. This approach utilizes Fast R-CNN. Its limitations are that 1) it is not verified in real time and 2) the detection rate for cyclists could be improved; the method also needs verification in intelligent vehicles. Further, [30] employed Mask R-CNN and an optical-flow-based method for detecting active football players. The shortcomings of this method are that 1) non-active football players are misclassified as active players and 2) the segmentation process is manual and time-consuming, and hence not suitable for real time. In [30], a scale-aware Fast R-CNN (SAF R-CNN) method for pedestrian detection is introduced, in which two sub-networks are employed to detect pedestrians at varying scales. Its major disadvantage is that training and testing time increases because of the two sub-networks and the additional computation. The proposed SIRA Mask R-CNN framework addresses the aforementioned problems by detecting pedestrians at different scales, orientations, and illuminations; it is described in Section 4.

3 Theoretical background

Conventional pedestrian detection systems reported in the literature consist of Region of Interest (RoI) detection, feature extraction, and candidate classification. RoI detection is most often based on stereo vision. The most commonly used hand-crafted features are HOG [36], LBP [37], integral channel features [38], the gray-level co-occurrence matrix [47], CNN features [31], Haar wavelets [48], and oriented gradients [49]. A classifier is trained using these features; classifiers such as the SVM, Mask R-CNN, neural networks, and boosted cascades are used for classification. Sliding-window schemes [30] and hand-crafted feature-based methods only partially address the issues of pedestrian detection (Table 1).

Table 1 Feature extraction algorithms used in state-of-the-art pedestrian detectors

Hence, researchers moved towards region-based deep learning approaches to overcome the limitations of existing hand-crafted feature-based systems [39, 40], as shown in Table 2. LeNet, the first CNN architecture, was introduced by LeCun et al. [22] in 1998. It includes only two convolutional layers, a pooling layer, and a backpropagation network for training. LeNet was trained on the Modified National Institute of Standards and Technology (MNIST) database, which contains 60,000 training images of handwritten digits in 10 categories. At the time, it was successfully used commercially for object detection and handwritten signature detection. Its error rate is 26.2%. Next, AlexNet was introduced by Krizhevsky et al. [20] in 2012. It is very similar to LeNet but uses the ReLU activation and a cross-entropy loss function.

AlexNet has more convolutional layers. It was trained on the large ImageNet database, which contains more than 1 million images from 1,000 categories. AlexNet achieves higher object detection accuracy than LeNet and is used for many object detection purposes; it reduced the error rate to 15.4%. Zeiler et al. [19] introduced ZFNet, a modified version of AlexNet, in 2013. ZFNet uses 7 x 7 filters in the first convolutional layer instead of the 11 x 11 filters used in AlexNet and achieves an error rate of 11.2%. Liu et al. [21] introduced an unsupervised deep neural network architecture in 2009: the DBN consists of several layers with multiple feature detectors or hidden units. Simonyan et al. [19] introduced VGGNet in 2014, which reduced the error rate to 7.2%; it expanded the number of convolutional layers to 19 and limited the filter size to 3 x 3. GoogLeNet and ResNet use similar architectural patterns. Vanhoucke et al. [17, 18] introduced the Google Inception network in 2014; it achieves an error rate of 6.7%, marginally better than VGGNet, at the cost of a more complex architectural design. He et al. [16] introduced ResNet in 2015, which achieves an error rate of 3.57% and contains 152 convolutional layers; ResNet adds identity (skip) connections, so that residual mappings are learned during the forward and backward passes of the backpropagation algorithm. The problem with these approaches is the selection of regions before convolution: objects in the image may occur at different spatial positions and aspect ratios, so a huge number of regions must be examined to capture the objects of interest. Hence, algorithms such as YOLO and R-CNN were developed to solve the problem of selecting the regions that constitute an object of interest. Ross Girshick et al. [28] proposed R-CNN, in which a selective search algorithm is used to find regions of interest (RoIs) in an image, called region proposals; the selective search algorithm extracts 2,000 regions from the image. The problems with R-CNN are that 1) it requires an enormous amount of time to process the 2,000 region proposals per image, 2) it cannot be used for real-time applications because it takes about 47 seconds to test one image, and 3) the selective search algorithm is fixed and not adaptive, so it sometimes generates bad region proposals for complex images.

Fast R-CNN addresses the shortcomings of R-CNN. Ross Girshick et al. [14] proposed this enhanced version of R-CNN, which, instead of feeding each region proposal to the CNN, feeds the entire image into the CNN once to generate a convolutional feature map. Each region proposal is then max-pooled over the feature map into a fixed-size 7 x 7 grid, which is fed into fully connected layers with a softmax output to predict the object category; results are represented with bounding boxes. Fast R-CNN is more efficient and faster than R-CNN because it does not run the CNN separately on 2,000 region proposals to classify objects. However, both architectures still rely on the selective search algorithm to find region proposals, and this process is slow and time-consuming, which affects the performance of the network.
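The fixed-size pooling step described above can be illustrated with a short sketch. The snippet below uses torchvision's RoI pooling operator to pool an arbitrary region of a feature map into a 7 x 7 grid; the tensor shapes and box coordinates are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of RoI pooling as used in Fast R-CNN (illustrative values only).
import torch
from torchvision.ops import roi_pool

# A dummy convolutional feature map: batch of 1, 256 channels, 50 x 50 spatial size.
feature_map = torch.randn(1, 256, 50, 50)

# One region proposal in (batch_index, x1, y1, x2, y2) format, in feature-map coordinates.
proposals = torch.tensor([[0.0, 4.0, 4.0, 30.0, 44.0]])

# Pool the proposal into a fixed 7 x 7 grid, regardless of its original size.
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([1, 256, 7, 7]) -> fed to fully connected layers
```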

Shaoqing Ren et al. [12, 13] introduced Faster R-CNN to solve the speed problem of Fast R-CNN by eliminating the selective search algorithm. In this architecture, the network itself learns the region proposals: a separate Region Proposal Network operates on the convolutional feature map. It is therefore faster than Fast R-CNN, but still not fast enough for many real-time object detection applications. In all previous versions of the R-CNN series, regions are used to locate objects in the image. Joseph Redmon et al. [8] introduced a new architecture, YOLO, in which a single convolutional network predicts bounding boxes and their class labels. The input image is divided into an S x S grid, and m bounding boxes are generated for each grid cell. For each bounding box, the network outputs class probabilities and bounding box coordinates for locating the object in the image. YOLO processes 45 frames per second, which is faster than other object detection algorithms; its problem is that it struggles to detect small objects because of the spatial constraints imposed by the grid. Semantic segmentation algorithms classify objects at the pixel level. A more recent architecture, Mask R-CNN, is used for instance segmentation: recognizing each instance of an object in the image and locating its pixels. Instance segmentation is difficult because it requires accurate localization and detection of all moving and non-moving objects in the image; hence, it combines object detection and semantic segmentation. In object detection, objects are localized and classified and represented with bounding boxes, whereas in semantic segmentation each pixel is assigned to a meaningful group without distinguishing object instances.

Table 2 Deep learning architecture used for pedestrian detection

4 Proposed methodology

The proposed SIRA Mask-RCNN framework includes three stages: 1) input image/video enhancement, 2) illumination, rotation, scaling and translation invariant feature extraction, and 3) classification using Mask R-CNN. We describe each of these steps in brief as follows.

4.1 Input image/video enhancement

The input to the proposed system is an image or a video frame. If the input is a video, we split it into frames. We conducted experiments on publicly available standard databases and on our own academic environment database. The main enhancement step uses histogram analysis for illumination recognition and detection. These steps are essential for enhancing the quality of the input image/video. We describe each of the steps in the next subsections.

4.2 Illumination identification and detection

The main aim of the proposed algorithm is to identify, based on histogram analysis, whether an input image/frame requires illumination enhancement. First, we divide input images into three categories, 1) uneven images, 2) glare images, and 3) uniform images, based on histogram and pixel-level analysis. The input image/frame is converted from RGB to HSV color space; the conversion is needed because our focus is on the intensity of the image. We plot the histogram of the Value (V) channel to identify the type of image. After several experiments on different images, we obtained the optimal thresholds for distinguishing darker and brighter intensities. We use the percentage of low and high pixel values to classify the image into one of the three categories.

We describe the sequence of steps for lighting recognition and detection below.

An unevenly illuminated image is composed of light that is unevenly distributed across different intensity ranges. Glare images include specific areas that are overexposed by light; these areas comprise brighter pixels. An ideal image does not require enhancement, as its intensities are uniformly distributed. If the input image has a high variance, it is a case of irregular brightness; after histogram analysis, we observed that a uniform background and low variance mean the input image does not need to be enhanced. In Fig. 3, three small peaks can be seen, corresponding to three separately illuminated areas; the highest peak in the center results from setting all dark-region black pixels to the mean value. After several experiments, we found that a standard deviation above 25 indicates a case of uneven illumination.

Bright pixels can also be identified easily when there is glare on the image; apart from the objects, a glare image looks like a good image. If the image comprises very bright regions such that the mean intensity of the input image is above 200, there is no need to detect objects, and we can eliminate those frames, as shown in Fig. 3.

Fig. 3

The first row shows evenly illuminated, unevenly illuminated, and glare images. The second row shows the histogram of each image

Finally, the proposed illumination recognition and detection algorithm determines whether illumination enhancement is needed. In the next subsection, we describe image enhancement using the logarithmic transformation.

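As a rough illustration of the illumination identification logic described above, the sketch below classifies a frame using V-channel statistics and the empirically chosen thresholds mentioned in the text (standard deviation above 25 for uneven illumination, mean above 200 for glare). The rule ordering and helper names are our own assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the illumination identification step, assuming the thresholds
# reported in the text (std > 25 -> uneven, mean > 200 -> glare/skip frame).
import cv2
import numpy as np

def classify_illumination(bgr_frame, std_threshold=25.0, glare_mean=200.0):
    """Return one of 'glare', 'uneven', or 'uniform' for an input BGR frame."""
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    value = hsv[:, :, 2].astype(np.float64)  # intensity (V) channel

    if value.mean() > glare_mean:
        return "glare"        # overexposed frame, can be skipped
    if value.std() > std_threshold:
        return "uneven"       # needs enhancement (e.g., log transform)
    return "uniform"          # no enhancement required

# Example usage (hypothetical file name):
# frame = cv2.imread("frame_0001.jpg")
# print(classify_illumination(frame))
```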

4.3 Image enhancement

The logarithmic transformation replaces all pixel values with their logarithmic values. We use it for image enhancement because it expands dark pixel values much more than bright ones. We compute the log transformation using Eq. 1.

$$\begin{aligned} L(x,y) = s* \log (1+F(x,y)) \end{aligned}$$
(1)

Here, F(x,y) is the input image, s is a scaling constant, and L(x,y) is the output image. The value of s is chosen so that the maximum output corresponds to the bit depth used; s is therefore calculated using Eq. 2.

$$\begin{aligned} s = 255/ \log (1 + \max (F(x,y))) \end{aligned}$$
(2)

However, if any pixel intensity is 0, its logarithm is undefined. Hence, we add 1 to each pixel intensity during the log transformation, so that a pixel with intensity 0 becomes 1 and its log value becomes 0. Figure 4 shows the input image enhancement result after the log transformation.
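A compact sketch of Eqs. 1 and 2 is given below, assuming an 8-bit grayscale image; the function name is ours.

```python
# Minimal sketch of the log transformation (Eqs. 1 and 2) for an 8-bit image.
import numpy as np

def log_enhance(image_u8):
    """Apply L(x,y) = s * log(1 + F(x,y)) with s = 255 / log(1 + max(F))."""
    f = image_u8.astype(np.float64)
    s = 255.0 / np.log(1.0 + max(f.max(), 1.0))  # Eq. 2: scale to the 8-bit range
    enhanced = s * np.log(1.0 + f)               # Eq. 1: expand dark intensities
    return enhanced.astype(np.uint8)
```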

Fig. 4

Image enhancement results. The first row shows images before the log transformation and the second row shows them after the log transformation

In this transformation, darker (low) intensities are mapped to brighter (higher) values, so that features in the darker or gray areas of the image become more visible to the human eye; brighter intensities are compressed towards lower values. The result is a better-illuminated output image, as shown in Fig. 4.

Fig. 5

SIRA Mask R-CNN: Improved Mask R-CNN Architecture

However, the bright intensities are not scaled down into the low-intensity region; the scaling constant determines how the higher intensities are transformed. In the next subsection, we describe the rotation-invariant feature extraction.

4.4 Rotation invariant feature extraction

The rotation-invariant features are extracted using the contourlet transform, which decomposes the enhanced image into multi-resolution components. It is a combination of a Laplacian pyramid and a directional filter bank. The features obtained from the contourlet transform are scale invariant, and the transform can represent smooth edges in any direction. We did not use the wavelet transform because, although it captures edges, it provides directional information in only four directions (vertical, horizontal, +45\(^{\circ }\), -45\(^{\circ }\)); with the contourlet transform, we extract edges in eight directions. In the Laplacian decomposition, the two main scaling operations are reduce and grow. Each operation is described briefly below, followed by a short sketch of the pyramid construction.

  1. The reduce operation applies a low-pass filter to the transformed image and down-samples it by a factor of two.

  2. The grow operation applies a predefined interpolation technique and up-samples the image by a factor of two.

  3. Next, the Gaussian pyramid, a series of replicas \(\mathrm {Gp_{1},Gp_{2},\ldots ,Gp_{n}}\) of the transformed image \(lf(x,y)\) at different scales, is built by setting \(\mathrm {Gp_{1}}=lf(x,y)\) and iteratively applying \(\mathrm {Gp_{i+1}} = reduce(\mathrm {Gp_{i}})\).

  4. The Gaussian pyramid is processed to obtain the Laplacian pyramid \(\mathrm {Lp_{1},Lp_{2},Lp_{3},\ldots ,Lp_{n}}\), by setting \(\mathrm {Lp_{n}}=\mathrm {Gp_{n}}\) and iteratively applying \(\mathrm {Lp_{i}}=\mathrm {Gp_{i}} - grow(\mathrm {Gp_{i+1}})\).
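The sketch below implements the reduce/grow operations and the pyramid construction just described, using OpenCV's pyrDown/pyrUp as the low-pass and interpolation steps; the number of levels and the function name are illustrative assumptions.

```python
# Minimal sketch of the Gaussian/Laplacian pyramid construction described above.
import cv2

def laplacian_pyramid(image, levels=3):
    """Build Gaussian levels Gp_1..Gp_n and Laplacian levels Lp_1..Lp_n."""
    gp = [image]
    for _ in range(levels - 1):
        gp.append(cv2.pyrDown(gp[-1]))            # reduce: low-pass + downsample by 2

    lp = []
    for i in range(levels - 1):
        grown = cv2.pyrUp(gp[i + 1], dstsize=(gp[i].shape[1], gp[i].shape[0]))  # grow
        lp.append(cv2.subtract(gp[i], grown))     # Lp_i = Gp_i - grow(Gp_{i+1})
    lp.append(gp[-1])                             # Lp_n = Gp_n
    return gp, lp
```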

Later, the Laplacian pyramid is fed into the directional filter bank (DFB). The image obtained from Level 1 of the Laplacian pyramid is modulated; in modulation, the constituent frequency components are shifted before re-sampling the image. In re-sampling, we use a 2 x 2 frequency re-sampling matrix, which reshapes the diamond-shaped component into different parallelograms. Finally, the image is down-sampled, which reduces the number of grid points in the original image; smoothing and sharpening are performed during down-sampling. We then subtract the filtered responses to obtain the contours, and the resulting edges lie in eight directions. Each step of the directional filter bank is described briefly below, followed by a short sketch of the filtering step.

  1. The first operation is Gaussian low-pass filtering to smooth the image. It repeatedly replaces each pixel with a local average, using a cutoff frequency \(\mathrm {D_0}\) of 85. It is represented by \(Lo(\omega )\). The transfer function of a Gaussian low-pass filter is defined as:

    $$\begin{aligned} Lo(\omega ) = \mathrm {e^{-D^2(u,v)/2D_0^2}} \end{aligned}$$
    (3)

  2. The next operation is Gaussian high-pass filtering. It retains the high-frequency components and suppresses the low-frequency components, using a cutoff frequency \(\mathrm {D_0}\) of 15. It is represented by \(Ho(\omega )\). The transfer function of a Gaussian high-pass filter is defined as:

    $$\begin{aligned} Ho(\omega ) = 1-\mathrm {e^{-D^2(u,v)/2D_0^2}} \end{aligned}$$
    (4)

  3. Finally, the two responses are subtracted to obtain the contours, as in Eq. 5. The result is represented by \(Co(\omega )\).

    $$\begin{aligned} Co(\omega ) = Ho(\omega ) - Lo(\omega ) \end{aligned}$$
    (5)

  4. The same process is applied repeatedly, up to a third iteration.

  5. Three levels of information, covering eight directions, are extracted from the transformation. Sub-bands 0-3 are taken as the horizontal directions and sub-bands 4-7 as the vertical directions. These features are later combined with the CNN features of Mask R-CNN to obtain the detection result.

  6. Histogram equalization is applied to the transformed image to obtain the normalized image.
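As a rough sketch of the frequency-domain filtering in Eqs. 3-5, the code below builds Gaussian low-pass and high-pass transfer functions and subtracts the two filtered responses to obtain a contour image. The cutoff values follow the text, while the FFT-based implementation details, helper names, and the grayscale-image assumption are our own.

```python
# Minimal sketch of the Gaussian low-pass/high-pass filtering and contour
# extraction (Eqs. 3-5), applied in the frequency domain to a 2-D grayscale image.
import numpy as np

def gaussian_transfer(shape, d0):
    """Gaussian low-pass transfer function Lo(w) = exp(-D^2(u,v) / (2 * D0^2))."""
    rows, cols = shape
    u = np.arange(rows) - rows / 2.0
    v = np.arange(cols) - cols / 2.0
    d2 = u[:, None] ** 2 + v[None, :] ** 2        # squared distance D^2(u, v)
    return np.exp(-d2 / (2.0 * d0 ** 2))

def contour_response(image, d0_low=85.0, d0_high=15.0):
    spectrum = np.fft.fftshift(np.fft.fft2(image.astype(np.float64)))
    lo = gaussian_transfer(image.shape, d0_low)           # Eq. 3
    ho = 1.0 - gaussian_transfer(image.shape, d0_high)    # Eq. 4
    low_img = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * lo)))
    high_img = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * ho)))
    return high_img - low_img                              # Eq. 5: Co = Ho - Lo
```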

The next subsection describes the ASIFT feature extraction and the architecture of the improved Mask R-CNN shown in Fig. 5, which is used for pedestrian detection and classification.

4.5 ASIFT feature extraction and Mask R-CNN

A new scale-invariant feature map generation algorithm is placed in the Region Proposal Network (RPN) to generate multi-scale feature maps for pedestrian instances of different scales. We divide the proposed framework into three stages: 1) scale-invariant feature map generation, 2) region proposal generation, and 3) extraction of the detected pedestrian targets from the image. The input to the Mask R-CNN is the transformed image, and it outputs each object surrounded by a bounding box, with a class label and a pixel-level mask. We use Mask R-CNN for pixel-level mask generation, as shown in Fig. 5. The algorithm for generating the scale- and translation-invariant CNN features is described below.


As shown in Fig. 6, the enhanced image is transformed by simulating all possible orientation and tilt changes of the camera position. The tilt, represented by t, is computed with Eq. 6.

$$\begin{aligned} t = \dfrac{1}{\cos \theta } \end{aligned}$$
(6)

The directional transformation with tilt t maps the input image \(u(x,y)\) to \(u(tx,y)\). Gaussian filtering is used to reduce aliasing error. The ASIFT key points, the rotation-invariant features, and the CNN features are combined to obtain the detection result (Fig. 6).
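A minimal sketch of the tilt simulation in Eq. 6 is shown below: for a given latitude angle θ, the image is anti-aliased with a Gaussian filter and compressed along one axis by the tilt factor t = 1/cos θ. The filter strength and the use of OpenCV resizing are our assumptions; ASIFT key point extraction itself is not reproduced here.

```python
# Minimal sketch of the ASIFT-style tilt simulation: u(x, y) -> u(t x, y)
# with t = 1 / cos(theta), preceded by Gaussian filtering against aliasing.
import math
import cv2

def simulate_tilt(image, theta_deg):
    """Return the image compressed along x by the tilt factor t = 1 / cos(theta)."""
    t = 1.0 / math.cos(math.radians(theta_deg))        # Eq. 6
    # Anti-aliasing blur along the axis that will be subsampled
    # (0.8 * sqrt(t^2 - 1) is the sigma suggested in the ASIFT literature).
    sigma = 0.8 * math.sqrt(max(t * t - 1.0, 1e-6))
    blurred = cv2.GaussianBlur(image, ksize=(0, 0), sigmaX=sigma, sigmaY=0.01)
    h, w = blurred.shape[:2]
    return cv2.resize(blurred, (max(1, int(round(w / t))), h),
                      interpolation=cv2.INTER_LINEAR)

# Example: simulate a 60 degree tilt (t = 2) of an enhanced frame.
# tilted = simulate_tilt(enhanced_frame, 60.0)
```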

Fig. 6

Affine scale-invariant feature map generation. Points along the ray from the initial point

We divide the process of pedestrian detection into three stages: 1) region proposal generation, 2) multi-scale, rotation, and affine invariant feature extraction, and 3) bounding box and category prediction. We convolve masks of different scales with anchor box filters of various sizes to obtain the confidence score, without omitting the low-resolution layer that is skipped in the RPN of the existing Mask R-CNN. This score is then integrated with the rotation, scale, and affine invariant features obtained in the earlier stage. The primary purpose is to improve the accuracy of the region proposals. In the RPN, we convolve the enhanced image with bottom-up and top-down pathways to extract features; ResNet is used to extract high-level features from the enhanced image.

The previous layer is convolved with a 3x3 sliding window to generate multiple feature maps. Next, 1x1 convolution filters are applied for class prediction and bounding box regression. We use a 3x3 sliding window because applying the mask from the center yields accurate location information. K anchor boxes of different sizes are used during convolution so that objects of different shapes are detected; the output is 2k scores from the classification layer and 4k values from the bounding box regression layer. The scale-invariant feature map is fused with the other feature maps to locate objects of different scales effectively. Next, each region proposal is aligned using RoIAlign. In the end, we represent all detected objects with a bounding box and a pixel-level segmentation mask in the original image. Finally, the class IDs corresponding to detected humans are separated from all detected objects.
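The sketch below illustrates the RPN head just described: a 3x3 convolution over a shared feature map followed by two 1x1 convolutions producing 2k objectness scores and 4k box-regression values for k anchors per location. The channel sizes and the value of k are illustrative assumptions, and the fusion with the invariant feature maps is not shown.

```python
# Minimal sketch of an RPN head: 3x3 sliding window, then 1x1 convolutions
# producing 2k class scores and 4k box offsets per spatial location.
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=9):   # k = 9 anchors (assumption)
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.cls_logits = nn.Conv2d(in_channels, 2 * num_anchors, kernel_size=1)
        self.bbox_pred = nn.Conv2d(in_channels, 4 * num_anchors, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))     # 3x3 sliding window
        return self.cls_logits(x), self.bbox_pred(x)

# Example: a dummy 256-channel feature map of size 50 x 50.
scores, boxes = RPNHead()(torch.randn(1, 256, 50, 50))
print(scores.shape, boxes.shape)   # [1, 18, 50, 50] and [1, 36, 50, 50]
```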

5 Implementation

We assessed the performance of the proposed SIRA Mask R-CNN on benchmark pedestrian databases, namely Caltech [1], INRIA [2], MS COCO [26], ETH [27], and KITTI [28], as well as on our own pedestrian database. The experiments and the proposed deep learning framework were implemented on a single NVIDIA GPU with 16 GB of graphics memory and an Intel Core i5 3.4 GHz CPU with 16 GB of RAM.

6 Experiment

State-of-the-art pedestrian analysis research is conducted on publicly available benchmark datasets such as Caltech [1], INRIA [2], MS COCO [26], ETH [27], and KITTI [28]. The limitations of these datasets are: 1) they cover a limited range of pedestrian poses recorded on city streets in a controlled environment; 2) they contain data with short time lapses between successive observations of each ID within a single day, which allows clothing appearance features to be used in identity matching; and 3) they are recorded in places such as streets and parking lots, but do not cover student behavior in an academic environment. This paper therefore proposes a new dataset recorded in an academic environment. Human experts annotated student pedestrian behavior on each frame of the video sequences, providing three types of information.

  1. Pedestrian positioning with a bounding box. The position of each pedestrian in a video frame is represented as a bounding box; these data can be used for pedestrian detection, tracking, and instance and semantic segmentation.

  2. Physical, behavioral, or adhered human characteristics. Each pedestrian is fully characterized by labels such as 'face' (eyes, eyebrows, forehead, nose, ears, mouth, facial hair, moustache, glasses, beard), 'hairstyle', 'hair color', 'age', 'body volume', 'gender', 'height', 'body accessories', 'ethnicity', 'head accessories', 'action', and 'clothing data'.

  3. Annotated class label and ID. Each pedestrian has a unique identifier that is consistent across all video frames. These characteristics make the dataset suitable for various recognition tasks.

6.1 Experimental setup

We propose a pedestrian database composed of different student behaviors under different conditions in academic activities, such as students studying in a practical lab, examination hall scenarios, classrooms, a student cheating in an exam hall, a student taking an answer book outside the exam hall, a student stealing a mobile phone or other electronic devices such as a mouse or keyboard, a student stealing lab equipment, a student dispute on the college premises, a student disturbing another student, and a student threatening another student, as shown in Fig. 7. Students' behavior on the college premises was recorded using a high-quality DSLR camera from different viewing angles. The video was recorded at 30 f/s. The database includes approximately 100 sample videos, each 20 to 30 minutes long.

6.1.1 Image/video acquisition framework

Students' behavior on the college premises was recorded using a high-quality DSLR camera from different viewing angles. We recorded video at 30 f/s with 4K recording enabled, at a resolution of 3840\(\times\)2160, H.264 compressed to .mp4 format. The database includes roughly 100 sample videos, each 20-30 minutes long. The camera tilt angle varies from 45\(^{\circ }\) to 90\(^{\circ }\). The pedestrians are students of Yeshwantrao Chavan College of Engineering, Nagpur; over 90% are aged 22-27 years, 65% are male and 35% female, and they are mainly of Indian ethnicity. The key features of the data acquisition settings are summarized in Table 3.

Table 3 Video acquisition configuration of the proposed dataset
Fig. 7

Sample images from the proposed database. The first row illustrates two girls in a dispute in the lab. The second row illustrates a scenario of stealing a mobile phone in the lab. The third row illustrates a scenario of a student threatening another. The fourth row shows the same threatening scenario (front view). The fifth row shows a scenario of students stealing lab equipment. The sixth row shows a scenario of cheating in the exam hall

6.1.2 Data pre-processing

In the proposed dataset, we organized the video information into three directories: "Train", "Test", and "Validation". We describe the following three entities:

  1. The annotations directory contains an XML file for each image; this file contains all the information about the image.

  2. The frames extracted from each video are divided into training and validation sets in separate directories.

  3. Details about the frames are stored in a .txt file, which stores a unique identity number for each image.

For classification, we defined a SIRA Mask R-CNN model with "ReLU" as the activation function. In the proposed approach, we used sparse categorical cross-entropy instead of categorical cross-entropy when compiling the SIRA Mask R-CNN model. The major advantage of sparse categorical cross-entropy is that it saves time and computational effort, because it uses a single integer per class label instead of a one-hot vector. We also used callbacks; a callback is a mechanism that executes processes at different stages of the training phase. Each callback is described as follows, and a configuration sketch is given after the list:

  1. Early stopping: when the monitored metric stops improving, the training process stops.

  2. Reduce learning rate on plateau: when the monitored metric stops improving, the learning rate is reduced.
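A minimal configuration sketch of the loss and callbacks described above is given below, using the Keras API; the monitored metric, patience values, and model variable are illustrative assumptions rather than the exact settings used in the paper.

```python
# Minimal sketch of compiling a model with sparse categorical cross-entropy
# and the two callbacks described above (Keras API, illustrative settings).
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=3),
]

# `model` is assumed to be the classification head of the detector.
# model.compile(optimizer="adam",
#               loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
```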

The detailed configuration of the proposed SIRA Mask R-CNN model is given in Table 4. Most of the images classified by our model are correct; as there is always room for improvement, the model could perform better with a larger dataset. A summary of the layers with their parameters during the training and validation phases of the proposed SIRA Mask R-CNN is shown in Table 5.

Table 4 The proposed SIRA Mask R-CNN model configuration
Table 5 The proposed SIRA Mask R-CNN model summary

The model achieved a notable accuracy of 96% on the testing set and 95% on the validation set. Next, we computed the training and validation loss and accuracy. Figure 8(a) shows the accuracy-versus-epochs plot for the training and validation accuracy, and Fig. 8(b) shows the loss-versus-epochs plot for the training and validation loss. Here, loss denotes the training loss, accuracy the training accuracy, valLoss the validation loss, and valAcc the validation accuracy.

Fig. 8

Training and validation accuracy and loss. (a) Accuracy vs epochs showing training and validation accuracy. (b) Loss vs epochs showing training and validation loss

6.1.3 Pedestrian annotation

The proposed dataset is completely annotated at the frame level by human specialists. We provide a CSV file for each video sequence, using the file naming protocol videofilename.csv. The labeling process is divided into three stages: 1) human detection, 2) tracking, and 3) suspicious activity recognition and soft biometric features. First, the Mask R-CNN [12] method provides an initial estimate of each pedestrian's position in the scene, and the resulting data are manually verified and corrected. Next, the Deep SORT method [14] provides preliminary tracking information, which is again corrected manually. As a result of these two initial steps, we obtain a rectangular bounding box representing the region of interest (RoI) for each pedestrian in each frame. The final stage of annotation is carried out manually, where a human expert who personally knows the students of the college assigns ID information and characterizes the samples with soft labels. Table 6 lists the labels annotated for each pedestrian instance in a frame, together with the ID information, the bounding box defining the RoI, and the frame information; for each label, we also provide a list of its possible values.

Table 6 The 16 annotated attributes of the proposed dataset, with other soft biometric labels of the pedestrians in the frame

6.2 Experimental results and analysis

AP (Average Precision) is a commonly used metric for measuring the accuracy of object detectors (such as Faster R-CNN, Fast R-CNN, and SSD). Average precision is the average of the precision values over recall values in the range 0 to 1. We considered the following criteria in our experiments.

  1. If IoU \(\ge\) 0.5, the detection is classified as a True Positive (TP). We calculate AP at IoU thresholds of 0.50 and 0.75, denoted \(\mathrm {AP_{50}}\) and \(\mathrm {AP_{75}}\), and across small, medium, and large object scales, denoted \(\mathrm {AP_{S}}\), \(\mathrm {AP_{M}}\), and \(\mathrm {AP_{L}}\), as shown in Table 7.

  2. If IoU < 0.5, the detection is wrong and is classified as a False Positive (FP).

  3. When a ground-truth object is present in the image and the model fails to detect it, we classify it as a False Negative (FN).

  4. A True Negative (TN) would be every part of the image where no object was predicted; this metric is not useful for object detection, so we ignore TN. We compute precision, recall, and mean average precision (mAP) as follows (a small computation sketch follows this list):

    $$\begin{aligned} Precision = \dfrac{TP}{TP+FP} \end{aligned}$$
    (7)
    $$\begin{aligned} Recall = \dfrac{TP}{TP+FN} \end{aligned}$$
    (8)
    $$\begin{aligned} mAP = \dfrac{1}{N}\sum _{i=1}^{N} \mathrm {AP_{i}} \end{aligned}$$
    (9)

We compare SIRA Mask R-CNN with state-of-the-art instance segmentation techniques in Table 8. Our proposed model outperforms baseline variants of previous state-of-the-art models, including MNC [50] and FCIS [51], the winners of the COCO 2015 and 2016 challenges, respectively. SIRA Mask R-CNN with a ResNet-101-FPN backbone performs better than FCIS [51]. We achieve a benchmark AP of 79%, \(\mathrm {AP_{50}}\) of 68%, and \(\mathrm {AP_{75}}\) of 64%. We also validate our system at multiple scales and obtain \(\mathrm {AP_{S}}\) of 16.5%, \(\mathrm {AP_{M}}\) of 39.5%, and \(\mathrm {AP_{L}}\) of 54.5%. SIRA Mask R-CNN results under challenging scenarios are illustrated in Fig. 16.
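The sketch referenced in item 4 above implements the IoU criterion and the precision/recall/mAP computations of Eqs. 7-9 from raw TP/FP/FN counts; the variable names and the simple per-class averaging are our own choices.

```python
# Minimal sketch of the IoU criterion and the precision/recall/mAP
# computations in Eqs. 7-9 (illustrative helper names).
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp + 1e-9)   # Eq. 7
    recall = tp / (tp + fn + 1e-9)      # Eq. 8
    return precision, recall

def mean_average_precision(per_class_ap):
    return float(np.mean(per_class_ap))  # Eq. 9: mAP = (1/N) * sum(AP_i)

# Example: a detection counts as TP when iou(pred, gt) >= 0.5.
```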

Table 7 Evaluation metrics used for SIRA Mask R-CNN pedestrian detector
Table 8 SIRA Mask R-CNN average precision results: comparative analysis of the average precision of the proposed model against the state-of-the-art approaches in the literature

6.3 Comparison with state-of-the-art pedestrian detection methods

6.3.1 Caltech

The proposed framework was trained using Caltech database images. Figure 9 compares the proposed method with techniques such as SAF R-CNN [30], Side Fusion FPN (SF-FPN) [44], the Small-scale sense network (SSN) [45], and the Attention-Enhanced Multi-Scale Region Proposal Network (AEMS-RPN) [46]. It can be observed that SIRA Mask R-CNN outperforms the other methods by a large margin, achieving the lowest miss rate of 8.31%; this is state-of-the-art performance for Mask R-CNN-based detection, as shown in Table 9.

Fig. 9

Caltech dataset-SIRA Mask R-CNN gives the lowest log-average miss rate of 8.30%

6.3.2 INRIA and ETH

SIRA Mask R-CNN was also trained and tested on the INRIA and ETH database images; the comparative results are shown in Figs. 10 and 11. First, on the INRIA dataset, the proposed method gives a miss rate of 7.31%, outperforming the existing methods YOLOv4 [2], Mask R-CNN [5], SAF R-CNN [30], and AEMS-RPN [46]. Second, on the ETH dataset, the miss rate of the proposed model is 32.63%, compared with 33.87% for [41] and 36.46% for [46]. In general, the proposed method achieves a higher detection rate and a lower miss rate on both datasets.

Fig. 10

INRIA dataset-SIRA Mask R-CNN gives the lowest log-average miss rate of 7.31%

Fig. 11

ETH dataset-SIRA Mask R-CNN gives a miss rate of 32.63%

6.3.3 KITTI

SIRA Mask R-CNN was also tested on the challenging KITTI dataset. Figure 12 compares the pedestrian detection performance of SIRA Mask R-CNN with the existing methods SAF R-CNN [30], SF-FPN [44], SSN [45], and AEMS-RPN [46]. The proposed approach gives promising results on the KITTI dataset, i.e., 79%, 68%, and 64%.

Fig. 12

KITTI dataset-SIRA Mask R-CNN gives promising results of 79%, 68%, and 64%

6.3.4 MS COCO and proposed pedestrian dataset

The results are shown in Figs. 13 and 14. The proposed approach is compared with the existing methods YOLOv4 [2], SAF R-CNN [30], MNPrioriBoxes-Yolo [41], and AEMS-RPN [46]. It gives a miss rate of 8.56% on MS COCO and a miss rate of 8.68% on the proposed pedestrian dataset.

Fig. 13

MS COCO dataset-SIRA Mask R-CNN gives a miss rate of 8.56%

Figure 15 shows the precision and recall results of the proposed model compared with YOLOv4 [2], SAF R-CNN [30], MNPrioriBoxes-Yolo [41], and AEMS-RPN [46].

Fig. 14

Proposed pedestrian dataset-SIRA Mask R-CNN gives a miss rate of 8.68%

It can be observed that SIRA Mask R-CNN outperforms the existing methods in terms of accuracy, speed, and time needed for pedestrian detection, as shown in Fig. 15, on the various pedestrian datasets. Table 9 gives a comparative analysis of the proposed SIRA Mask R-CNN against state-of-the-art deep learning frameworks such as Faster R-CNN [7], YOLO [2], Fast R-CNN [9], and SAF R-CNN [30]. The proposed SIRA Mask R-CNN detects pedestrians efficiently in the presence of artifacts, with a pedestrian detection accuracy of 96%, which is notable on challenging benchmark pedestrian databases.

Fig. 15

Precision vs recall: SIRA Mask R-CNN compared with recent state-of-the-art methods

Fig. 16

SIRA Mask R-CNN results. First row: proposed database images. Second row: results on the proposed database. Third row: MS COCO dataset images [3]. Fourth row: proposed method results on the MS COCO dataset. Fifth row: Caltech dataset images [1]. Sixth row: proposed method results on the Caltech dataset

Table 9 The SIRA Mask R-CNN pedestrian detector compared with state-of-the-art pedestrian detectors

7 Conclusion and future scope

In this paper, we proposed a scale, illumination, rotation, and affine invariant Mask R-CNN-based pedestrian detector. The proposed framework detects pedestrians efficiently in the presence of scale, rotation, and illumination artifacts. It delivers competitive results on benchmark datasets such as Caltech [1], INRIA [2], MS COCO [26], ETH [27], and KITTI [28], as well as on our proposed academic environment database. The experimental results confirmed that the proposed SIRA Mask R-CNN delivers 1) the lowest miss rate of 8.31% on the Caltech dataset, 2) the lowest log-average miss rate of 7.31% on INRIA, 3) a miss rate of 32.63% on the ETH dataset and a pedestrian detection accuracy of 79% on the KITTI dataset, and 4) a miss rate of 8.68% on the proposed database. The proposed method is superior at detecting pedestrians of different sizes, under varying illumination, and with variations in orientation compared with existing state-of-the-art techniques such as YOLOv4 [2], Mask R-CNN [5], SAF R-CNN [30], MNPrioriBoxes-Yolo [41], SSN [45], and AEMS-RPN [46].

In the future, the proposed framework can be enhanced in the following ways. First, the proposed work addresses scaling, rotation, and lighting issues; other problems, such as motion blur and partial or complete occlusion, could be tackled by extracting pedestrian features that are invariant to occlusion and motion blur, improving the feature extraction process. The pedestrian detection time should also be reduced to meet real-time requirements; an unsupervised approach to pedestrian detection could be useful for reducing detection time. The proposed model can also be extended to detect human poses and trajectories at different scales and to handle occluded pedestrians.