Abstract
Identifying criminals in serious crimes from digital images is a challenging forensic task because their faces are covered in most cases, and often the only available information is the hand. A single robust technique that identifies criminals from arm’s hair patterns can also be a cost-effective and unobtrusive solution in other areas, for example in criminal psychiatric hospitals, where patients could be identified and tracked during rehabilitation without barcoding, radio frequency identification (RFID), or biometrics. The existing state-of-the-art methods for person identification use convolutional neural network (CNN) and long short-term memory (LSTM)-based architectures, which must be retrained on the entire data whenever new data arrive. To address these issues, we propose a novel Siamese network-based architecture, the color threshold (CT)-twofold Siamese network (CTTSN), which not only reduces this retraining burden but also performs better than several existing methods. Since no standard dataset exists for person identification from arm’s hair patterns, we created a database by collecting hand images from several voluntary participants. Several data augmentation techniques are also used to make the database more robust. The experimental results show that the proposed architecture performs well on the created database, with mAP, mINP, and R1 of 94.8, 90.0, and 93.5, respectively. The proposed CTTSN performs well for the closed world person re-identification problem using soft biometric features in real time (52 frames per second).
Introduction
Person identification can use several human parts or traits, which are classified as primary biometric and soft biometric traits [1]. Primary biometric traits include fingerprint [2], hand [3], body [4], gait [5], face [6], and voice [24]. Soft biometric traits such as androgenic/arm’s hair patterns, gender, age, weight, skin marks, height, and color (skin, hair, and eye) are used along with the primary biometric traits to obtain improved accuracy [1].
Often, the evidence collected is in the form of digital images captured in uncontrolled situations [7]. As most perpetrators cover their faces, the only available information in these images is often their hands. Though hands are primary biometric traits, they have less variability than faces. Facial features are generally more complex and visible, making the face a more robust biometric trait for identification. With the advent of more sophisticated digital cameras and higher-resolution closed-circuit television (CCTV) cameras in public places, several security systems have used hand vein patterns and androgenic hair patterns for person identification [8].
There are several methods to recognize humans from primary and soft biometric traits. Afifi [9] used a two-stream convolutional neural network and a support vector machine classifier for hand-based person identification. That work treated both hands of a subject as the same, which is unusual and less accurate. Baisa et al. [3] proposed global and part-aware deep feature representation learning for hand-based person identification. Similarly, several other deep learning architectures, such as part-based convolutional baseline (PCB), multiple granularity network (MGN), pyramidal representations network (PyrNet), attentive but diverse network (ABD-Net), omni-scale network (OSNet), discriminative and generative learning network (DGNet), dual part-aligned representations network (P2Net), and interaction and aggregation network (IANet), are used for person identification from digital images, but all these networks must be retrained entirely whenever a new set of data arrives for person re-identification [6, 8]. In serious crimes, new criminals are added over time, and retraining on the entire database for each addition is very tedious. There are very few works on person identification from arm’s or androgenic hair patterns [10,11,12]. The existing methods used grayscale, local binary patterns (LBP), and histogram of oriented gradients (HOG) features, all of which are hand-crafted. It is evident from the literature that state-of-the-art deep learning techniques perform better than these machine learning techniques based on hand-crafted features. The earlier person re-identification (re-ID) methods extracted local descriptors, low-level features or high-level semantic attributes, and global representations through sophisticated but time-consuming hand-crafted features.
In addition, hand-crafted feature representations failed to perform well when image variations such as occlusion, background clutter, pose, illumination, cultural and regional background, intra-class variations, cropped images, multipoint view, and deformations were present in the data. Deep neural networks were introduced to person re-ID in 2014, which completely changed the feature extraction methodology. Deep learned features perform better in end-to-end learning and are robust to these image variations. The improved feature representation of deep learning architectures makes them more popular than machine learning methods for person re-ID [6, 8, 13, 14].
To address these issues, we propose and implement a novel architecture based on Siamese networks to identify a person from their arm’s hair patterns. Since no standard database dedicated to arm’s hair pattern recognition exists, we created one and analyzed arm’s hair pattern-based person identification on it with several state-of-the-art deep learning architectures.
The key contributions of this paper are as follows:
-
Person identification with a novel color threshold (CT)-twofold Siamese network architecture using arm’s androgenic hair patterns.
-
A database of hand images for person identification, collected from Indian subjects.
The rest of the paper is organized as follows. The next section discusses the literature on existing methods of person identification. The third section describes the proposed methodology. The fourth section discusses the experimental results, and the last section concludes the paper with future directions.
Literature survey
Person re-identification from the arm’s androgenic hair falls under closed world person re-ID. Here, a single modality is used with bounding boxes, sufficient and correctly annotated data exist, and the query exists in the gallery. The three standard components in a closed world re-ID system are feature representation learning, deep metric learning, and ranking optimization [
The collected high-resolution images are reduced to lower resolutions of around 244 \(\times \) 244, depending on the preprocessing steps and the deep learning architecture used. Hence, the image resolutions in this study range between 40 and 12.5 dpi (dots per inch). After reducing the resolution of the images, we observed that the quality of some cropped images had deteriorated drastically. To address this issue, we divided such images into two or more parts (with the same person id), and hence the total number of images increased from 343 to 424. None of the 424 images contains any tattoos or external markings on the hands.
The naming convention used for the collected images is shown in Fig. 3. The first three digits correspond to the subject, and hence this becomes the unique part of the image name w.r.t. each subject. The following two digits are either 00 or 11, representing right hand and left hand, respectively. The last two digits are the sequence numbers representing the sequence of images taken for each hand. The naming convention is used so that the training and validation process becomes smooth when we use functions like data generators while using deep learning architecture.
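The naming convention can be illustrated with a small parser. This is only a sketch under stated assumptions: the names `HandImageName`, `HAND_CODES`, and `parse_image_name` are hypothetical, and the `.jpg` extension in the usage note is assumed, since the file format is not stated in the text.

```python
from pathlib import Path
from typing import NamedTuple


class HandImageName(NamedTuple):
    subject_id: str  # first three digits, unique per subject
    hand: str        # "right" for code 00, "left" for code 11
    sequence: int    # last two digits, order of capture for that hand


# Hand codes as described in the naming convention.
HAND_CODES = {"00": "right", "11": "left"}


def parse_image_name(filename: str) -> HandImageName:
    """Split a seven-digit image name into subject, hand, and sequence number."""
    stem = Path(filename).stem  # drop a file extension if one is present
    if len(stem) != 7 or not stem.isdigit():
        raise ValueError(f"expected seven digits, got {stem!r}")
    hand_code = stem[3:5]
    if hand_code not in HAND_CODES:
        raise ValueError(f"unknown hand code {hand_code!r}")
    return HandImageName(stem[:3], HAND_CODES[hand_code], int(stem[5:]))
```

For instance, `parse_image_name("0020003.jpg")` yields subject `002`, the right hand, and sequence number 3. Grouping images by `subject_id` is one way a data generator could assign identity labels.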
Preprocessing
The database of criminals in forensic analysis is generally created in controlled environments. The created database contains images from the controlled environment as well, but the crime scene data comes from uncontrolled situations. Therefore, it has different angles, resolutions, illuminations, and so on. To make both the database and the deep learning architecture more robust, we have used data augmentation techniques and preprocessing.
Rotation range, height shift range, width shift range, zoom range, fill mode, horizontal flip, channel shift range, and ZCA whitening are the eight data augmentation techniques used in this study. The corresponding augmentation values/parameters are 40, 0.2, 0.2, 0.2, nearest, true, 20, and true, respectively. The data augmentation techniques used are from the standard literature [13, 14, 36, 44]. We used all the data augmentation techniques given in the TensorFlow documentation (footnote 2) except the color space transformations. Color space transformations change the color of the hand and therefore alter one of its unique features, the skin tone, so the existing literature does not recommend them for person re-ID. Regarding the values used in the data augmentation, we used the standard values from the literature and cross-verified them manually in an empirical study. The standard values perform the best even in person re-ID.
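The eight settings can be collected in one configuration, sketched below. The values are those listed above; the keyword names are assumed to match the arguments of Keras' `ImageDataGenerator`, which the text does not name explicitly, and `AUGMENTATION_PARAMS` is a hypothetical variable name.

```python
# Augmentation settings from the text, keyed by the (assumed) Keras
# ImageDataGenerator argument names.
AUGMENTATION_PARAMS = {
    "rotation_range": 40,        # random rotations, in degrees
    "height_shift_range": 0.2,   # vertical shift, fraction of image height
    "width_shift_range": 0.2,    # horizontal shift, fraction of image width
    "zoom_range": 0.2,           # random zoom factor range
    "fill_mode": "nearest",      # how newly exposed pixels are filled
    "horizontal_flip": True,     # random left-right flips
    "channel_shift_range": 20,   # random channel intensity shifts
    "zca_whitening": True,       # ZCA whitening of the inputs
}

# Possible usage, assuming TensorFlow is installed (not executed here):
#   from tensorflow.keras.preprocessing.image import ImageDataGenerator
#   datagen = ImageDataGenerator(**AUGMENTATION_PARAMS)
```

If this generator is used, ZCA whitening additionally requires fitting the generator on sample data before it can be applied.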
The proposed architecture uses both the color image as well as the thresholded image as input. The following steps were used to convert the color image to the thresholded image.
-
Step 1 (grayscale image, footnote 3): The input color image is first converted to a grayscale image. The Sobel edge detector is used to smoothen the grayscale image.
-
Step 2 (black-hat transform, footnote 4): This morphological operation is used in digital image processing to extract small elements and details from an image; the extracted objects are highlighted as white on a dark background, as shown in Fig. 4. The parameters anchor, iterations, borderType, and borderValue are set to Point\((-1,-1)\), 1, BORDER_CONSTANT, and morphologyDefaultBorderValue(), respectively.
-
Step 3 (binary thresholding, footnote 5): This produces the thresholded image, in which a pixel value is set to 255 if it is greater than the threshold and to zero otherwise.
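Steps 2 and 3 can be sketched in plain Python on a tiny grayscale image. This is an illustration, not the pipeline used in the study: it starts from an already grayscale image and omits the Sobel step, and the 3 x 3 window and the threshold of 100 are assumptions, since neither value is given in the text. Pixels outside the image are ignored at the borders.

```python
from typing import Callable, Iterable, List

Image = List[List[int]]  # grayscale image as rows of 0..255 intensities


def _window_op(img: Image, op: Callable[[Iterable[int]], int], size: int = 3) -> Image:
    """Apply min or max over a size x size window, ignoring out-of-image pixels."""
    h, w, r = len(img), len(img[0]), size // 2
    return [
        [
            op(
                img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def black_hat(img: Image, size: int = 3) -> Image:
    """Morphological closing (dilate, then erode) minus the input image."""
    closed = _window_op(_window_op(img, max, size), min, size)
    return [[c - p for c, p in zip(c_row, p_row)] for c_row, p_row in zip(closed, img)]


def binary_threshold(img: Image, thresh: int) -> Image:
    """Set pixels strictly above thresh to 255 and all others to 0."""
    return [[255 if p > thresh else 0 for p in row] for row in img]


# Toy example: bright skin (200) with a one-pixel-wide dark hair (50).
skin_with_hair = [[200, 200, 50, 200, 200] for _ in range(5)]
hair_mask = binary_threshold(black_hat(skin_with_hair), thresh=100)
```

In the toy example the dark hair column becomes white (255) and the surrounding skin becomes black (0), which is the kind of thresholded image described in the steps above.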
Figure 4 shows a sample screenshot of all the preprocessing steps followed. The output images in Fig. 4 are sample outputs for a portion of a single input color image. After the preprocessing, the resulting images are stored in a separate folder with the same names as the corresponding input color images. Figure 5 shows the sample thresholded image of subject 002.
The input image given to the preprocessing step is manually cropped to the hand part of the image. Although the complete picture is not sent for preprocessing, the uncropped color image of the same subject is given as input here to show which parts were cropped and removed for the image in Fig. 5; the output is shown in Fig. 6. The parts that are not used for the computation are marked as cropped and unused. Only the part that contains the arm’s hair (the middle part of the image) is considered for the computation, which is also shown in Fig. 5.
After performing preprocessing, we obtained another set of 424 thresholded images. We applied eight different data augmentation techniques to these 424 color (actual) images and 424 thresholded images. A total of 6784 images ((424 (color) \(\times \) 8) + (424 (thresholded) \(\times \) 8)) were obtained from the color and thresholded images. Manual verification was performed by two different human observers to avoid bias (mostly in cropping and discarding unrelated areas, checking the similarity of two images after augmentation or thresholding, and discarding the distorted images after augmentation or after lowering image resolutions). We calculated the inter-rater reliability using Cohen’s kappa and observed that the two human observers agree with a \(\kappa \) value of 0.96 (this \(\kappa \) value calculation includes all the steps in which the human observers are used).
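For reference, Cohen's kappa compares the observed agreement \(p_o\) between two raters with the agreement \(p_e\) expected by chance, \(\kappa = (p_o - p_e)/(1 - p_e)\). The sketch below computes it for two label sequences; the function name and the keep/discard labels are hypothetical and are not the observers' actual decisions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two raters' labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

With illustrative labels `["keep", "keep", "discard", "discard"]` and `["keep", "keep", "discard", "keep"]`, the observed agreement is 0.75, the chance agreement is 0.5, and kappa is 0.5.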
After data augmentation and thresholding, all the images were cross-verified manually again. Images that do not contain hair were discarded; for example, after cropping in augmentation, some images show only the part of the hand close to the wrist, which has little hair. After discarding these images, the total number of images obtained is 6500 (284 images are discarded in this step). The complete details of the created database are given in Table 2.
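The image counts reported in this section can be checked with a few lines of arithmetic. The constant names are hypothetical; the numbers are the ones stated in the text.

```python
BASE_IMAGES = 424    # color images after splitting low-quality crops
AUGMENTATIONS = 8    # augmentation techniques applied to each image
DISCARDED = 284      # images removed in the final manual verification

color_augmented = BASE_IMAGES * AUGMENTATIONS
thresholded_augmented = BASE_IMAGES * AUGMENTATIONS
total_before_check = color_augmented + thresholded_augmented
total_after_check = total_before_check - DISCARDED
```

This gives 3392 images per input type, 6784 images before the final check, and 6500 images in the final database.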
Proposed methodology
Person identification using visual features can be modeled as a similarity learning problem. Siamese architectures are extensively used in deep CNN-based models for similarity learning, because fewer parameters need to be trained whenever a new entry is added to the database for person identification. According to the literature, two types of input images perform better for person identification from arm’s hair [8]: the thresholded image and the color image. Both are used in our proposed architecture.
Figure 7 shows the complete methodology of the proposed work. The proposed color threshold (CT)-twofold Siamese network is composed of two different CNN-based networks. The notations used in Fig. 7 are \(c\), \(c^t\), and \(X\), which represent the color image, the thresholded image, and the search region, respectively. The size of \(c^t\) and \(X\) is \(W_t \times H_t \times 3\). \(X\) is the collection of image patches with the same dimension as \(c\); hence, the target size is \(W_r \times H_r \times 3\), where \(H_r<H_t\) and \(W_r<W_t\), and the target is located at the centre of \(c^t\). The C-Net and T-Net are not combined until the testing time, which is similar to [46].
T-Net: The network that takes the thresholded images as input clones its architecture from the fully convolutional Siamese (SiamFC) network [47]. Its convolutional network extracts the features from the thresholded image (denoted by \(f_{a}(.)\)) and is known as T-Net. The following equation shows the appearance branch response map, where corr(.) is the correlation operation:
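As an illustration of the correlation operation, the sketch below slides a target feature map over a search-region feature map and sums the elementwise products at each valid position. It is a simplification under stated assumptions: the inputs stand in for the outputs of \(f_{a}(.)\), only a single channel is used, and the function name `corr` is borrowed from the notation above rather than from any released code.

```python
from typing import List

FeatureMap = List[List[float]]  # one channel, rows of activations


def corr(target: FeatureMap, search: FeatureMap) -> FeatureMap:
    """Cross-correlate target with search over all valid positions."""
    th, tw = len(target), len(target[0])
    sh, sw = len(search), len(search[0])
    return [
        [
            sum(
                target[i][j] * search[y + i][x + j]
                for i in range(th)
                for j in range(tw)
            )
            for x in range(sw - tw + 1)
        ]
        for y in range(sh - th + 1)
    ]


# Toy example: the diagonal pattern of the target matches best at the
# bottom-right position of the search region.
target_features = [[1.0, 0.0], [0.0, 1.0]]
search_features = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
response_map = corr(target_features, search_features)
```

The response map has one value per candidate position, and larger values indicate a closer match between the target and that patch of the search region.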
All the parameters of the T-Net are trained from scratch for similarity learning. The following equation shows the logistic loss function, which was minimized to optimize the T-Net. Here, \(Y_i\), \(\theta _a\) and N are the search region, parameters of T-Net and number of training samples, respectively:
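A logistic loss over a response map can be sketched as follows. This assumes the SiamFC-style form, in which each response value \(v\) is compared with a label \(y \in \{+1, -1\}\) through \(\log(1 + e^{-yv})\) and the terms are averaged over the map; the exact form used for T-Net is not reproduced in this text, and the function name is hypothetical.

```python
import math
from typing import List


def logistic_loss(response: List[List[float]], labels: List[List[int]]) -> float:
    """Mean of log(1 + exp(-y * v)) over all response-map positions."""
    terms = [
        math.log1p(math.exp(-y * v))
        for v_row, y_row in zip(response, labels)
        for v, y in zip(v_row, y_row)
    ]
    return sum(terms) / len(terms)
```

An uninformative response map of zeros gives a loss of \(\log 2\), and the loss decreases as positive positions receive large positive scores and negative positions receive large negative scores.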
C-Net: The second network, named C-Net, takes color images as its input. Here, the Inception v3 architecture is used as the pre-trained network, and only the parameters of its last two convolutional layers are updated (all other layers are frozen). The low-level features are not extracted from the pre-trained networks as they provide different levels of abstraction. Each convolutional layer’s features have a different spatial resolution, and they need to be concatenated (represented by f(.)). After the feature extraction, a \(1\times 1\) ConvNet is used as a fusion module to make these features suitable for correlation. This fusion is performed within the same layer features. \(g(f_s(X))\) gives the feature vector for the search region after the fusion.
In target processing, \(c^t\) is taken as a target input by C-Net. This target input contains the contextual features denoted by t. The features obtained from this module include high-level features and are robust for changes in the object; hence, they are more generalized and less discriminative. Channel attention modules were introduced to enhance the discriminative power of the architecture. The attention modules use \(c^t\) as a feature map instead of t to provide importance to the surrounding context along with the target. Channel wise operations are used in the attention module, and the attention process for \(i\mathrm{th}\) channel is shown in Fig. 8.
Several operations are performed. For example, a feature map of conv5 has a spatial dimension of \(22 \times 22\) and is divided into a \(3 \times 3\) grid whose central cell, of dimension \(6 \times 6\), corresponds to the target. Max pooling is performed within each grid cell, and then a coefficient is produced using a two-layer multi-layer perceptron. Here the perceptron uses the same convolutional layer to share the weights across the channels. The final output \(w_i\) is obtained using a sigmoid function with bias. A single crop operation is used on \(f_t(c^t)\) to obtain \(f_t(c)\). The input of the attention module is \(f_t(c^t)\), and its output is the channel weights \(w_i\). The following equation provides the response map, where the dimension of w is the same as that of \(f_t(c)\) and the elementwise operation is represented by a dot (.). Here, only the channel attention module and the fusion modules are trained:
The logistic loss function (Eq. 4) is minimized to optimize the response map. The training pairs are \(((c^t)_i, X_i)\), and the response map is \(y_i\):
where N denotes the number of training samples, and \(\theta _s\) denotes the trainable parameters. A weighted average of heatmaps (Eq. 5) is used to get the overall heatmap of the two branches at test time. Here, \(\lambda \) is the weighting parameter, which can be estimated using the validation set. The best-matched location in re-ID is the position at which \(h(c_{t},X)\) has the largest value:
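The test-time combination can be sketched as a weighted average of the two branch heatmaps followed by an argmax. Which branch is multiplied by \(\lambda \) and which by \(1-\lambda \) is an assumption here (the sketch weights the T-Net heatmap by \(\lambda \)), and the function names and toy values are hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]


def fuse_heatmaps(h_t: Heatmap, h_c: Heatmap, lam: float) -> Heatmap:
    """Weighted average lam * h_t + (1 - lam) * h_c of two same-sized heatmaps."""
    return [
        [lam * t + (1.0 - lam) * c for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(h_t, h_c)
    ]


def best_location(heatmap: Heatmap) -> Tuple[int, int]:
    """Return the (row, column) of the largest heatmap value."""
    return max(
        ((y, x) for y, row in enumerate(heatmap) for x, _ in enumerate(row)),
        key=lambda pos: heatmap[pos[0]][pos[1]],
    )


# Toy heatmaps: the two branches prefer different locations.
t_net_map = [[0.2, 0.8], [0.1, 0.3]]
c_net_map = [[0.9, 0.4], [0.2, 0.1]]
match_small_lambda = best_location(fuse_heatmaps(t_net_map, c_net_map, 0.3))
match_large_lambda = best_location(fuse_heatmaps(t_net_map, c_net_map, 0.9))
```

With a small \(\lambda \) the fused result follows the C-Net heatmap and selects position (0, 0); with a large \(\lambda \) it follows the T-Net heatmap and selects position (0, 1).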
A VGGNet (visual geometry group network)-like architecture is used as the base network for both the T-Net and C-Net. As mentioned earlier, T-Net is a replica of the SiamFC network. The C-Net is loaded from a VGGNet pre-trained on ImageNet. The C-Net strides are adjusted so that the last layers of C-Net and T-Net have the same dimension. To avoid channels being suppressed to zero in the attention module, a nine-dimensional vector is used to hold the pooled features of each layer. Therefore, the layers in the MLP (multilayer perceptron) have nine neurons with the ReLU (rectified linear unit) non-linear function, and after the MLP a sigmoid function is used with 0.5 as bias.
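The channel attention computation for one channel can be sketched as follows. Several details are assumptions: the \(22 \times 22\) map is split at rows and columns 8 and 14 so that the central cell is \(6 \times 6\), the perceptron has no hidden-layer bias terms, and the weight matrices passed in are placeholders rather than trained values.

```python
import math
from typing import List, Sequence


def grid_max_pool(channel: Sequence[Sequence[float]],
                  bounds: Sequence[int] = (0, 8, 14, 22)) -> List[float]:
    """Max-pool one channel over a 3 x 3 grid whose centre cell is 6 x 6."""
    cells = list(zip(bounds, bounds[1:]))  # (start, end) of each grid band
    return [
        max(channel[y][x] for y in range(y0, y1) for x in range(x0, x1))
        for y0, y1 in cells
        for x0, x1 in cells
    ]


def channel_weight(pooled: Sequence[float],
                   w_hidden: Sequence[Sequence[float]],
                   w_out: Sequence[float],
                   bias: float = 0.5) -> float:
    """Two-layer perceptron with a ReLU hidden layer, then sigmoid plus bias."""
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_hidden]
    score = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-score)) + bias
```

Because the sigmoid output lies between 0 and 1, adding the bias of 0.5 keeps every channel weight between 0.5 and 1.5, so no channel is scaled to zero.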
Results and analysis
The standard metrics used in person re-identification methods are cumulative matching characteristics (CMC) and mAP (mean average precision). Generally, they are used in biometric systems that operate on closed-set identification tasks. The test images (templates) are compared with the annotated images present in the database (biometric subject) and ranked based on similarity. Based on the match rate, the rank versus identification task is compared using the CMC. Suppose each test sample (single-gallery-shot) identity has only one instance; then, for every query, the algorithm ranks the test samples, and the CMC top-k accuracy is a step function given in the following equation:
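The step-function behaviour of CMC top-k accuracy can be sketched as below: a query scores 1 at rank k if its identity appears among the top-k ranked gallery samples and 0 otherwise, and the curve averages this over all queries. The function names and the toy rankings are hypothetical.

```python
from typing import List, Sequence


def cmc_top_k(ranked_ids: Sequence[str], query_id: str, k: int) -> int:
    """Return 1 if the query identity is among the top-k ranked gallery ids."""
    return int(query_id in ranked_ids[:k])


def cmc_curve(rankings: Sequence[Sequence[str]],
              query_ids: Sequence[str],
              max_rank: int) -> List[float]:
    """Average top-k accuracy over all queries for k = 1..max_rank."""
    return [
        sum(cmc_top_k(r, q, k) for r, q in zip(rankings, query_ids)) / len(query_ids)
        for k in range(1, max_rank + 1)
    ]


# Toy example: the first query is matched at rank 1, the second at rank 2.
toy_rankings = [["002", "005", "001"], ["007", "003", "004"]]
toy_queries = ["002", "003"]
toy_curve = cmc_curve(toy_rankings, toy_queries, max_rank=3)
```

In the toy example the Rank-1 accuracy is 0.5 and the accuracy from Rank-2 onward is 1.0.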
Because of data augmentation and the use of multiple images of both the right and left hands of the same person, the test samples can contain multiple instances of the same person (multi-gallery-shot setting). To address this case with a more suitable performance metric, we also used mINP (mean inverse negative penalty) to check the performance of the model on the created database. The penalty of the hardest correct match is measured using the negative penalty (NP), shown in the following equation, where \(Q_j\) indicates the total number of correct matches for query j and \(H_i^\mathrm{hard}\) indicates the rank position of the hardest match:
INP (inverse negative penalty) is the inverse of NP, and we have used mINP as shown in Eq. 8. CMC and mAP evaluations are dominated by the easy matches, which mINP avoids. The limitation of mINP applies to larger datasets; since our dataset contains only 50 subjects, it is used as a supplementary evaluation metric along with the widely used CMC and mAP metrics:
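A sketch of NP, INP, and mINP for ranked gallery lists is given below. It follows the commonly used definition, assumed here to match the equations referred to above: NP is (rank of the hardest correct match minus number of correct matches) divided by the rank of the hardest correct match, and INP = 1 - NP. The function names and toy rankings are hypothetical.

```python
from typing import List, Sequence


def inverse_negative_penalty(ranked_ids: Sequence[str], query_id: str) -> float:
    """INP = 1 - NP = (number of correct matches) / (rank of the hardest match)."""
    match_ranks = [
        rank for rank, gid in enumerate(ranked_ids, start=1) if gid == query_id
    ]
    if not match_ranks:
        raise ValueError("closed-world setting: the query must be in the gallery")
    return len(match_ranks) / match_ranks[-1]


def mean_inp(rankings: Sequence[Sequence[str]], query_ids: Sequence[str]) -> float:
    """mINP: the average INP over all queries."""
    scores: List[float] = [
        inverse_negative_penalty(r, q) for r, q in zip(rankings, query_ids)
    ]
    return sum(scores) / len(scores)
```

For a ranking `["a", "b", "a", "c", "a"]` with query `"a"`, the three correct matches sit at ranks 1, 3, and 5, so NP is 2/5 and INP is 3/5. A query whose correct matches all occupy the top ranks has INP equal to 1.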
Implementation details: A small weight \(\lambda \) is used to combine the branches, and it is chosen using the validation set. The model performs best when \(\lambda = 0.3\). The attention module has one hidden layer with a 9-dimensional vector and ReLU as the non-linear function of the hidden layer. From the empirical study, the proposed model performed best when the learning rate was 0.01. The average training speed of the CTTSN was 52 frames per second (fps). The grid search was performed from 0.1 to 0.9 with step 0.2. Three scales are searched in this study to handle scale variations for evaluation and testing. Weight decay and momentum were empirically set to 0.0005 and 0.9, respectively.
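The grid from 0.1 to 0.9 with step 0.2 can be enumerated and searched as sketched below. That the grid search was run over \(\lambda \) is an assumption based on the reported best value of 0.3 lying on this grid; the scoring function in the example is a placeholder, not a measured validation result.

```python
from typing import Callable, List, Sequence, Tuple


def lambda_grid(start: float = 0.1, stop: float = 0.9, step: float = 0.2) -> List[float]:
    """Grid values from start to stop (inclusive), rounded to avoid float drift."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 10) for i in range(count)]


def select_lambda(score_fn: Callable[[float], float],
                  grid: Sequence[float]) -> Tuple[float, float]:
    """Return (best lambda, best score) according to a validation score function."""
    return max(((lam, score_fn(lam)) for lam in grid), key=lambda pair: pair[1])


def toy_validation_score(lam: float) -> float:
    """Placeholder score peaking at 0.3; stands in for a real validation metric."""
    return -((lam - 0.3) ** 2)


candidates = lambda_grid()
best_lam, _ = select_lambda(toy_validation_score, candidates)
```

The grid contains the five candidates 0.1, 0.3, 0.5, 0.7, and 0.9. In practice `score_fn` would evaluate the fused model on the validation set for each candidate.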
Table 3 compares the CMC, mAP, and mINP results of the proposed method with those of other methods. The proposed CT-twofold Siamese network (CTTSN) uses a VGG network trained on ImageNet. We also replaced it with other popular networks such as Inception v4, ResNet, AlexNet, and Xception (all four architectures are also trained on ImageNet). The results with these networks are comparatively lower, as the ImageNet weights are not a significant contributor compared to other features used in the data. SiamFC [47] is the base architecture of the Siamese network, and the DSiam (Dynamic Siamese Network) architecture for visual object tracking [48] is also tested and compared with the other architectures.
Figure 9 shows the Rank-1 results of the proposed architecture. We compared it with the Siamese network and the modified versions of our network. CTTSN (color) contains two C-Net architectures, and CTTSN (threshold) contains only T-Net architectures. It is observed from Fig. 9 that the proposed architecture with both C-Net and T-Net performs better (up to rank 30). This indicates the importance of complementary features in strengthening the proposed model’s performance.
We also used data augmentation and used those images during the training process. Figure 10 shows the comparison of results with and without data augmentation. It is evident from the figure that the data augmentation increases performance, and it is in line with the existing literature [44].
The dataset contains thresholded images, data augmented images, and color images. Figures 11, 12, and 13 show the performance of the proposed model when only color images, only thresholded images, and only augmented images, respectively, are provided in the testing phase.
Color images (Fig. 11) yield higher accuracy because the proposed architecture also considers several other features, such as hair color, and hence it performs better. The data augmented images have different resolutions and include cropped images that may show a part of the input image with very little hair. Hence, performance on them is lower than on color or thresholded images. We observed a similar fall in person re-identification when comparing male and female subjects: mAP and mINP dipped by approximately 7% for females, as the hair was sparser in some parts of their hands.
The results are also compared with different image resolutions, and we observe that the performance of the proposed method decreases with a decrease in input image size as shown in Fig. 14. The different image sizes taken for the comparison are as per the standard sizes used in the literature [10] for the comparison in the criminology department. The graphs reflect the importance of clearly visible hair features as the performance decreases with a decrease in the image resolution.
Grad-CAM (gradient-weighted class activation mapping): We used the Grad-CAM class activation visualization (as per the Keras documentation, footnote 6). Grad-CAM uses heat maps to depict the discriminative features that are responsible for person identification. It uses the gradient information flowing into the last convolutional layer of the proposed architecture to understand the importance of each neuron for a decision of interest [49]. A sample image from the database {Subject: 0020003} is shown in Fig. 15, where the features from the highlighted part are responsible for person re-identification {Values related to \(h(c_{t},X)\)}. In addition, Fig. 16 shows the heat map on the image, where the part of the hand with the highest probability of person re-identification is shown in yellow. From Figs. 15 and 16, it is evident that the proposed method effectively identifies the discriminative features needed to re-identify the person and that it generalizes in choosing the region of interest.
Apart from the detailed results shown in the graphs, we have also performed some ablation analysis on the proposed method with the created dataset.
-
Both T-Net and C-Net were trained with random initialization and obtained an mAP of 81.5, suggesting the requirement of an initialization function.
-
We used only the color images in both T-Net and C-Net and did not obtain better performance (mAP 79.11); the same holds when only thresholded images are used (mAP 83.17). This indicates the importance of complementary features in the proposed method.
-
We removed the channel attention module in the C-Net and observed a drastic decrease in the performance (mAP 76.11). This shows the importance of balancing the intra-layer and inter-layer channels.
-
In the study, the two architectures are trained separately. For this ablation, we jointly trained both network branches and observed mAPs of 83.67 and 84.21, respectively, suggesting the importance of optimizing multilevel features independently.
-
We computed the channel weights for hand images of two different subjects. We used a sigmoid function with a bias of 0.5, so the weights range from 0.5 to 1.5. We observed a different weight distribution for the conv4 and conv5 layers for each image, indicating the importance of channel weights in the proposed method.
The results show that, on the created database with 50 subjects, the proposed method performs best when both the C-Net and T-Net are used and trained with color, thresholded, and data augmented images. Performance is higher when the test inputs are color images, and it is also higher for male subjects and for high-resolution input images. Thresholded images contain only hair patterns. Person re-identification from hair patterns falls under soft biometrics, and the results observed with only thresholded images (Fig. 12) suggest the huge potential of using arm’s hair patterns for person re-identification from digital images.
Limitations
The created database contains Asian subjects only. The performance of the proposed model is not sufficient when the input image is taken with low-resolution cameras (some CCTV cameras). When color images are provided during training, the deep learning architecture by default uses all the features present in the image, which makes it difficult to analyze the impact of hair patterns alone.
Conclusion
In criminology, the data obtained for identifying a criminal are in most cases collected from uncontrolled situations. Perpetrators generally wear a mask during the crime, and other body parts may not be as clearly visible as the hands. This paper discussed the created database of 6500 hand images, with androgenic hair as a soft biometric parameter for the person re-identification problem. We proposed a CT-twofold Siamese network and analyzed its performance on the created database. The results show that there is potential to recognize a person from arm’s androgenic hair patterns. We observed that the proposed model performs well on the created database, with a Rank-1 cumulative match of 93.5. The proposed methodology is targeted to work primarily in forensic psychiatric hospitals, as the target users are generally non-cooperative. This set of person re-identification problems falls under closed world re-ID and should work unobtrusively in real time with less training and testing time. The method should train fast and perform well whenever new data are obtained. In this work, the proposed CTTSN performs well for the closed world person re-identification problem and uses androgenic hair patterns (soft biometric) for person re-ID. The trained model is robust due to data augmentation, and the methodology is both discriminative and generalized. The proposed method works in real time at 52 fps for the test images.
Future directions for the methodology include the use of other intra-image modalities, such as skin color, skin marks, and veins, together with hair patterns to identify the person. In addition, more diverse data need to be collected to test the proposed model and improve its robustness. The proposed architecture needs to be tested in various settings, such as forensic psychiatric hospitals with different CCTV locations and subjects, to check its robustness and identify further improvements. Generally, hospitals have CCTV installed. Instead of using barcoding, radio frequency identification (RFID), and biometrics to track and identify patients during the rehabilitation process, the proposed technique can be a cost-effective and unobtrusive alternative.