1 Introduction

Person Re-identification (Re-ID) has recently attracted increasing research interest. It is a fundamental task in multi-camera automated video surveillance systems. The task aims at identifying a target captured at different times and/or locations among a large set of candidates [1]. It has widespread applications such as tracking criminals, analyzing suspect movements, and finding missing people.

A typical Re-ID system involves two main steps: (1) extracting an appearance signature \(AS_{p_{i}}\) for a given probe person image \(p_{i}\), then (2) matching it against a gallery set \(G=\left\{ g_{1},...,g_{N} \right\} \) using a similarity metric, so that the most similar \(g_{p}\) is assigned to the probe image. Modeling people’s appearance is a crucial and challenging problem because people are often monitored at low resolution, under occlusions, and with varying lighting conditions, viewpoints, and poses. Conventional biometric features such as face or iris are inappropriate due to the uncontrolled acquisition conditions and the insufficient image detail available for extracting robust biometric features [5]. Instead, bodily features, in terms of clothes and carried objects, provide more reliable characteristics for modeling people’s visual appearance. State-of-the-art approaches have mainly focused on modeling people’s appearance by extracting discriminative and robust visual characteristics. In the literature, appearance-based approaches can be further divided into two groups: (1) single-shot approaches and (2) multi-shot approaches.

Typical person re-identification approaches extract visual characteristics from a single image depicting the target’s appearance [13, 18,19,20,21,22]. It is worth noting that the literature abounds with single-shot approaches, as this represents the simplest and most general case. However, the performance of these systems is limited by the lack of tracking information: the extracted appearance signature risks being blurred by the large intra-class variation (e.g. pose and viewpoint variation, partial occlusion).

In real-world video surveillance systems, the tracking algorithm provides multiple frames depicting the same person during their movement in front of the camera view. Multi-shot re-identification approaches exploit multiple images of each target to model an appearance signature [2,3,4, 6,7,8,9,10]. These approaches aim to extract a more complete and invariant signature. However, they require an additional step that automatically selects the set of appearance images to be considered in the re-id task. Different image selection methods have been proposed. Bird et al. [2] modeled people by the median color value of each body part accumulated over different frames. In [3, 4], 5 samples were randomly chosen for each person; in this case, the selected images risk being redundant and/or containing incomplete appearance features (e.g. partial occlusion). In [6,7,8], an unsupervised Gaussian clustering method [11] was applied to the HSV histograms of the person images, and one image was then randomly selected from each cluster; the re-id is carried out by considering 2, 5, and 10 images per target. Bak et al. [9] proposed to cluster the trajectory based on the body’s estimated pose using 3D scene information and then generated a signature for every different appearance. However, the calibration information of each camera is needed, which is not always available. Wang et al. [10, 23] selected video fragments from image sequences based on a combination of HOG3D features and the optic flow energy profile over each image sequence, which is computationally expensive for real applications.

Due to the uncontrolled acquisition conditions of video surveillance systems, several challenges need to be addressed: (1) the frame rate may differ from one camera to another, (2) the length of a person’s trajectory is not constant, and (3) people’s unpredictable paths may cause different body postures. Hence, extracting representative appearance images uniformly for each person is not appropriate.

In addition, video surveillance systems generate a vast quantity of video sequences containing an immense amount of information, so it is important to take as much advantage as possible of these data. In light of these shortcomings, we propose a new multi-shot person re-identification approach based on key frame selection. We propose to automatically and adaptively select a set of representative appearance images that depict the different body posture variations of each target. The key idea is to eliminate redundant and noisy images from the target’s trajectory. The selected images are then modeled into a global appearance signature. Finally, the matching score is computed as the average of all pairwise distances, since the probe and gallery sets are made of multiple images per subject instead of a single image as in traditional state-of-the-art approaches.

The remainder of the paper is organized as follows. The different steps of the proposed approach are described in Sect. 2. The experimental results and discussions are reported in Sect. 3. Finally, our conclusions are given in Sect. 4.

2 Proposed Approach

In this section, we describe the proposed multi-shot person re-identification approach for multi-camera video surveillance systems based on key frame selection. Figure 1 shows the framework of the proposed approach. In general, the single-camera tracking algorithm produces an image sequence \(Q_{p_{i},c_{j}}\) that estimates the trajectory of each person \(p_{i}\), \(i\in 1..P\) where P is the number of people, in the camera view \(c_{j}\), \(j\in 1..C\) where C is the number of cameras. Each image sequence Q is defined by a set of consecutive images I as \(Q=\left\{ I_{1},...,I_{S}\right\} \), where S is the number of frames. It must be noted that the value of S is not constant and depends on the length of the person’s trajectory.

The first step, namely Key Frame Selection, automatically selects a small set of representative appearance images \(Q_{p_i,c_j}^{'}=\left\{ I_{m},...,I_{n}\right\} \) depicting the different body posture variations during the person’s movement in front of the camera view. The main idea is to discard redundant and noisy images from the person’s trajectory and to obtain a multi-appearance representation. We start by modeling the local silhouette contour variations along the person’s trajectory by extracting the Histogram of Oriented Gradients (HOG) descriptor. Then, we automatically detect the appearance transitions and select a representative image for each segment.

The second step, namely Appearance Signature Extraction, models the discriminative visual characteristics of the selected images \(Q_{p_i,c_j}^{'}\) into a Global Appearance Signature \(GAS_{p_i,c_j}\) for each person \(p_i\). It starts by alleviating the heterogeneous lighting conditions caused by the uncontrolled acquisition settings through a pre-processing step based on Grayworld normalization [17]. Then, we extract the Multi-Channel Co-occurrence Matrix [13] descriptor, which encodes both color and texture visual features. These feature vectors are gathered into a global appearance signature \(GAS_{p_i,c_j}=\left\{ AS_{m},...,AS_{n}\right\} \).

Finally, the Appearance Signature Matching step compares a given probe person \(p_i\) against a gallery set \(G=\left\{ GAS_{1},...,GAS_{N}\right\} \). A set matching strategy based on the average of all pairwise distances is adopted to compute the similarity score between the corresponding probe and gallery global appearance signatures.

Fig. 1. Proposed approach for multi-shot person re-identification based on key frame selection.

2.1 Key Frame Selection

Usually, video surveillance systems generate a vast quantity of video sequences containing an immense amount of information. However, the temporal correlation between consecutive images may produce redundant information. In addition, since cameras are usually mounted above people’s heads (e.g. fixed on the roof), some body parts may be truncated. Such images contain incomplete appearance information that should not be considered in the appearance signature extraction step.

Our goal is to benefit from the available data by automatically selecting a set of representative person images, \(Q_{p_i,c_j}^{'}=\left\{ I_{m},...,I_{n}\right\} \), from the target’s trajectory (i.e. video sequence), \(Q_{p_i,c_j}=\left\{ I_{1},...,I_{S}\right\} \), in order to capture the different appearance variations (e.g. different body postures). This step consists of two sub-steps, as presented in Fig. 2: (1) Key Frame Extraction and (2) Key Frame Filtering, which eliminate redundant and noisy information from the trajectory, respectively.

Fig. 2. Key frame selection process.

Key Frame Extraction. Our goal is to eliminate the redundancy among the consecutive images present in a single trajectory \(Q=\left\{ I_{1},...,I_{S}\right\} \). First, we model the body’s contour variations during tracking. Then, we automatically segment the trajectory by detecting the posture transitions. Next, we select a representative appearance image for each segment. More details are provided in the following.

Modeling body’s contour. The different body postures are modeled by extracting the characteristics of the silhouette contours. We opted for Histogram of Oriented Gradients (HOG) features to model each person appearance image I. The HOG descriptor was first introduced for the pedestrian detection problem [12]. It describes object appearance by analyzing the local distribution of gradients. Extracting the HOG features consists of a series of steps, as shown in Fig. 3.

Fig. 3. Overview of the HOG feature extraction process.

First, the gradient of the image is computed by filtering it with the one-dimensional mask [\(-1\) 0 1] along the horizontal and vertical directions. For color images, each RGB color channel is processed separately, and each pixel keeps the gradient vector with the largest norm. After that, the magnitude and the orientation of the gradient are computed by Eqs. (1) and (2), respectively.

$$\begin{aligned} \left| G \right| = \sqrt{gradient_{x}^{2}+gradient_{y}^{2}} \; . \end{aligned}$$
(1)
$$\begin{aligned} \theta = \arctan \frac{gradient_{y}}{gradient_{x}} \; . \end{aligned}$$
(2)

The gradient image is divided into cells (\(C_w \times C_h\) pixels) with a fixed size. For each cell, the orientation is quantized into \(C_b\) orientation bins, then, the magnitude in each orientation is accumulated to make a histogram (i.e. the feature vector for a cell).

Next, we group cells into larger blocks. Adjacent blocks overlap by half of the block size, and a local histogram is computed for each block. These histograms are normalized to ensure better invariance to illumination by accumulating a measure of the local histogram energy over each block and using the result to normalize all cells in the block, as formulated by (3).

$$\begin{aligned} h=\frac{h}{\sqrt{\left\| h \right\| ^{2}+\varepsilon ^{2}}} \; . \end{aligned}$$
(3)

This produces a feature vector for each block. These histograms are concatenated into a single feature vector that represents the global HOG description for the whole appearance image.
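The paper does not include an implementation, but the pipeline described above maps closely onto scikit-image’s HOG routine. The following minimal sketch (our own illustrative code, not the authors’) extracts the per-frame descriptor, assuming 128 \(\times \) 64 person images and the cell/block sizes retained in Sect. 3.2; the number of orientation bins (9) is an assumed value.

```python
# Minimal sketch of per-frame HOG extraction (illustrative, not the authors' code).
from skimage.feature import hog

def extract_hog(image_rgb):
    """Return the concatenated HOG feature vector for one person image.

    Assumes images are already normalized to 128x64 pixels (Sect. 3.1);
    cell and block sizes follow the best setting reported in Sect. 3.2.
    """
    return hog(
        image_rgb,
        orientations=9,            # number of orientation bins C_b (assumed value)
        pixels_per_cell=(6, 6),    # cell size C_w x C_h
        cells_per_block=(2, 2),    # block size
        block_norm="L2",           # block normalization of Eq. (3)
        channel_axis=-1,           # per-channel gradients; strongest norm kept, as in [12]
    )
```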

Appearance Transition Detection. We compute the similarity score between the contours of consecutive frames \(I_i\) and \(I_{i+1}\). The similarity between HOG feature vectors is computed using the Bhattacharyya coefficient [16], as formulated by (4).

$$\begin{aligned} Similarity\left( I_{i},I_{i+1} \right) =\sum _{x=1}^{X} \sqrt{HOG^{I_{i}}\left( x \right) \times HOG^{I_{i+1}}\left( x \right) } \; . \end{aligned}$$
(4)

where:

\(HOG^{I_{i}}\) denotes the feature vector of \(I_i\), and X is the size of the feature vector.

Next, we plot the curve of consecutive image similarities to represent the contour shape variation along a trajectory. Local minima of the curve are automatically detected, as shown in Fig. 4. These points correspond to high dissimilarity and hence to a strong variation in appearance. Based on their locations, we detect the appearance transitions and divide the trajectory into several parts \(Q_{p_i,c_j}=\left\{ S_{1},...,S_{T}\right\} \), where T is the number of detected segments, which is not constant. Each part models a posture variation. Clusters containing few images (\(\le 3\) in our experiments) are merged with the previous cluster.

Fig. 4. Consecutive image similarity curve.

Frame Extraction. For each segment \(S_i\), we select a representative appearance image: the middle image of the segment is taken as its key frame.
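To make the key frame extraction step concrete, the sketch below (our own illustrative code; the function names and the use of scipy.signal.find_peaks are assumptions, not the authors’ implementation) computes the consecutive-frame similarities of Eq. (4), detects local minima as appearance transitions, merges short segments, and keeps the middle frame of each segment.

```python
# Hedged sketch of key frame extraction: Eq. (4) + transition detection + middle frames.
import numpy as np
from scipy.signal import find_peaks

def bhattacharyya_similarity(h1, h2, eps=1e-12):
    """Eq. (4): sum of sqrt(h1 * h2); vectors are normalized to sum to 1 first."""
    h1 = h1 / (h1.sum() + eps)
    h2 = h2 / (h2.sum() + eps)
    return np.sum(np.sqrt(h1 * h2))

def extract_key_frames(hog_vectors, min_segment_len=3):
    """Return the index of one representative (middle) frame per detected segment."""
    sims = np.array([bhattacharyya_similarity(hog_vectors[i], hog_vectors[i + 1])
                     for i in range(len(hog_vectors) - 1)])
    # Local minima of the similarity curve mark the appearance transitions.
    minima, _ = find_peaks(-sims)
    boundaries = [0] + [m + 1 for m in minima] + [len(hog_vectors)]
    segments = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        if segments and end - start <= min_segment_len:
            segments[-1] = (segments[-1][0], end)   # merge short segment with the previous one
        else:
            segments.append((start, end))
    return [(start + end) // 2 for start, end in segments]
```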

Key Frame Filtering. Surveillance cameras are usually mounted above people’s heads (e.g. fixed on the roof). In addition, a person enters and exits the camera field of view progressively. In both cases, some body parts may be partially occluded. If we rely on such incomplete appearance images to model the visual features, some important characteristics are neglected. We therefore propose to eliminate truncated images by analyzing the position of the person image within the camera frame.
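The paper does not specify the exact positional criterion. As one plausible instantiation, the hypothetical sketch below discards key frames whose bounding box touches the frame border, assuming the tracker provides a per-frame bounding box; both the criterion and the names are our own assumptions.

```python
# Hypothetical sketch of key frame filtering; the border-margin rule is an assumption.
def is_truncated(bbox, frame_width, frame_height, margin=2):
    """bbox = (x_min, y_min, x_max, y_max) of the detected person in the full frame."""
    x_min, y_min, x_max, y_max = bbox
    return (x_min <= margin or y_min <= margin
            or x_max >= frame_width - margin or y_max >= frame_height - margin)

def filter_key_frames(key_frame_indices, bboxes, frame_width, frame_height):
    """Keep only key frames whose bounding box lies fully inside the camera frame."""
    return [i for i in key_frame_indices
            if not is_truncated(bboxes[i], frame_width, frame_height)]
```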

Finally, the selected key frames are modeled in order to extract a discriminative appearance signature for the re-id task.

2.2 Appearance Signature Extraction

The uncontrolled lighting conditions of video surveillance systems largely affect people’s appearance in terms of color correspondence between the different camera views. In this step, we start by alleviating the heterogeneous lighting conditions for each selected image using Grayworld normalization [17]. This method aims to measure clothing colors independently of the light source and camera characteristics, as formulated by Eq. (5). It assumes that the average reflectance in the scene is achromatic.

$$\begin{aligned} R^{'}=\frac{R}{\mu _{R}}, G^{'}=\frac{G}{\mu _{G}},B^{'}=\frac{B}{\mu _{B}} \end{aligned}$$
(5)

where:

\(\mu _{R}\), \(\mu _{G}\), and \(\mu _{B}\) denote the averages of the Red, Green, and Blue image components, respectively.
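As a minimal illustration of Eq. (5) (our own sketch, not the authors’ code), the per-channel division can be implemented directly; the final rescaling by the overall mean, used only to keep values in a convenient range, is our own assumption.

```python
# Minimal sketch of the Grayworld normalization of Eq. (5).
import numpy as np

def grayworld_normalize(image_rgb):
    """Divide each channel by its mean so that the average reflectance becomes achromatic."""
    img = image_rgb.astype(np.float64)
    channel_means = img.reshape(-1, 3).mean(axis=0)   # (mu_R, mu_G, mu_B)
    normalized = img / (channel_means + 1e-8)         # R' = R / mu_R, etc. (Eq. (5))
    return normalized * channel_means.mean()          # rescaling step is our assumption
```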

Then, we model the visual characteristics of each corrected image by extracting the Multi-Channel Co-occurrence Matrix (MCCM) [13] descriptor. The MCCM descriptor encodes both color and texture visual features. Its main advantage is that it allows distinguishing between people wearing similar colors with different patterns. It encodes neighborhood information by counting how frequently a pixel with color value i is adjacent to a pixel with color value j within each body stripe \(BS_i\), \(i\in 1..SBS\), of size \(N \times M\), as expressed in (6).

$$\begin{aligned} MCCM_{p_{i}}^{BS}\left( i,j,k\right) = \sum _{x=1}^N \sum _{y=1}^M {\left\{ \begin{array}{ll}1, &{} BS\left( x,y,k\right) = i \wedge BS\left( x+\triangle _{x},y+\triangle _{y} ,k\right) =j \\ 0, &{} otherwise \end{array}\right. } \end{aligned}$$
(6)

where:

k denotes the color component (H, S, V), and \(\triangle _{x}\) and \(\triangle _{y}\) are the offsets. We compute the MCCM over the direct neighbors by considering 4 directions (\(0^{\circ }\), \(45^{\circ }\), \(90^{\circ }\), and \(135^{\circ }\)).

Then, the four adjacency matrices are summed and converted into a vector. Summing the matrices encodes the distribution of intensities while ignoring the relative position of neighboring pixels, which makes the descriptor rotation invariant. Moreover, the HSV color space is close to the way the human brain perceives color and is known for separating the image intensity from the color information, which makes it more robust to lighting variations. In video surveillance systems, the unpredictable camera positions and people’s paths aggravate the problem of appearance variation across different viewpoints and poses: a person’s appearance in one view often differs from that in other views, e.g. frontal vs. back views. We have previously demonstrated that some body regions are more informative and expose more invariant appearance characteristics against viewpoint and pose variations [13]. We automatically selected the salient body stripes from the most widely adopted body parts model, i.e. 6 equal body stripes, using the Sequential Forward Floating Selection algorithm (SFFS) [14]. As a result, we showed that the regions defined by the upper torso, lower torso, and upper legs, \( SBS_p=\left\{ UT_p, LT_p, UL_p \right\} \), are the most stable and informative body parts for capturing the salient appearance characteristics under uncontrolled pose and viewpoint conditions. The MCCM descriptor is therefore extracted only from the salient body stripes.
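A hedged sketch of the MCCM computation for one body stripe is given below: a co-occurrence matrix per HSV channel over the four direct-neighbor directions, summed and flattened, following Eq. (6). The 32-level quantization and the use of scikit-image’s graycomatrix are our own assumptions; the paper does not specify these implementation details. One such descriptor would be kept for each of the three salient stripes (UT, LT, UL).

```python
# Hedged sketch of the MCCM descriptor for one body stripe (assumed parameters).
import numpy as np
from skimage.color import rgb2hsv
from skimage.feature import graycomatrix

def mccm_descriptor(stripe_rgb, levels=32):
    """Co-occurrence matrix per HSV channel over 4 directions (Eq. (6)), summed and flattened."""
    hsv = rgb2hsv(stripe_rgb)                                   # channel values in [0, 1]
    quantized = np.minimum((hsv * levels).astype(np.uint8), levels - 1)
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]           # 0, 45, 90, 135 degrees
    features = []
    for k in range(3):                                          # H, S, V channels
        glcm = graycomatrix(quantized[:, :, k], distances=[1],
                            angles=angles, levels=levels)
        features.append(glcm[:, :, 0, :].sum(axis=-1).ravel())  # sum the 4 direction matrices
    return np.concatenate(features)
```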

Finally, these feature vectors are gathered into a global appearance signature \(GAS_{p_i,c_j}=\left\{ AS_{1},...,AS_{n}\right\} \).

2.3 Appearance Signature Matching

This step consists of associating each person in the probe set \(P=\left\{ p_{1},...,p_{N} \right\} \) with the corresponding one in the gallery set \(G=\left\{ g_{1},...,g_{N} \right\} \). The appearance characteristics of each person are modeled by a global appearance signature for both the probe set, \(GAS_{P}=\left\{ GAS_{p_{1}},...,GAS_{p_{N}} \right\} \), and the gallery set, \(GAS_{G}=\left\{ GAS_{g_{1}},...,GAS_{g_{N}} \right\} \). Given a probe person \(p_{i}\), the re-identification is formulated as a maximum likelihood estimation problem [8], as expressed in (7).

$$\begin{aligned} ID\left( p_{i} \right) = \underset{j\in 1..N}{\arg \min }\left( d\left( GAS_{p_i},GAS_{g_j} \right) \right) \end{aligned}$$
(7)

In our approach, each person is modeled by a global appearance signature that consists of multiple image signatures. To compare two global appearance signatures, we calculate the average of all pairwise distances between the two pedestrians, as formulated in (8).

$$\begin{aligned} d\left( GAS_{p_i}, GAS_{g_j} \right) =\frac{1}{M\times N} \sum _{m=1}^{M} \sum _{n=1}^{N}d\left( AS_{m}^{p_i},AS_{n}^{g_j} \right) \end{aligned}$$
(8)

The distance between two image signatures \(AS_{m}^{p_i}\) and \(AS_{n}^{g_j}\) is computed using the Bhattacharyya distance [16] between the corresponding MCCM feature vectors, as formulated in (9).

$$\begin{aligned} d\left( AS_{m}^{p_i}, AS_{n}^{g_j} \right) =\frac{1}{\left| SBS \right| } \sum _{u\in \left\{ UT,LT,UL \right\} }^{ } d\left( MCCM_{u}^{p_{i}}, MCCM_{u}^{g_{j}}\right) \end{aligned}$$
(9)
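The matching stage of Eqs. (7)–(9) can be sketched as follows (our own illustrative code). Each person is assumed to be represented as a list of per-key-frame signatures, each signature being a dictionary mapping a stripe name ('UT', 'LT', 'UL') to its MCCM feature vector; the \(-\ln \) form of the Bhattacharyya distance is an assumed choice.

```python
# Hedged sketch of set matching (Eqs. (7)-(9)); data layout and names are assumptions.
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """-ln of the Bhattacharyya coefficient between two (normalized) feature vectors."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return -np.log(np.sum(np.sqrt(p * q)) + eps)

def image_signature_distance(as_m, as_n):
    """Eq. (9): average distance over the salient body stripes."""
    stripes = ('UT', 'LT', 'UL')
    return np.mean([bhattacharyya_distance(as_m[s], as_n[s]) for s in stripes])

def set_distance(gas_probe, gas_gallery):
    """Eq. (8): average of all pairwise image-signature distances."""
    return np.mean([image_signature_distance(a, b)
                    for a in gas_probe for b in gas_gallery])

def reidentify(gas_probe, gallery):
    """Eq. (7): index of the gallery signature with the smallest set distance."""
    return int(np.argmin([set_distance(gas_probe, g) for g in gallery]))
```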

3 Experimental Results

In order to evaluate the performance of our proposed approach, we carried out a series of experiments on the HDA+ dataset [15]. The first experiment validates the choice of HOG parameters for the key frame extraction step. The second experiment evaluates the performance of the proposed multi-shot person re-identification approach.

3.1 Presentation of HDA+ Dataset

The proposed approach has been experimentally validated on the High-Definition Analytics dataset (HDA+). It was acquired by 13 indoor cameras in an office environment distributed over three floors of a research department (e.g. corridors, leisure areas, halls, entries/exits), recording simultaneously for nearly 30 min. The cameras were set to acquire video from different points of view, affecting the geometry of the imaged scene and resulting in different distance ranges. The dataset contains a high degree of viewpoint, pose, and lighting variations. In our experiments, we selected 50 image sequences of 20 people, chosen among those who reappear across the different camera views of the network. All images are normalized to 128 \(\times \) 64 pixels.

3.2 Performance Validation of the Used HOG Parameters

Our aim is to find the block size and cell size parameters of the HOG descriptor that give the best trajectory segmentation results. The selected people sequences were manually segmented. Then, we evaluated the segmentation results obtained with different combinations of cell and block sizes in terms of Recall, Precision, and F-measure, as defined by Eqs. (10), (11), and (12), respectively. Table 1 presents the performance of the different combinations for the HOG descriptor.

$$\begin{aligned} Recall=\frac{number \; of \; correctly\; detected \;transitions}{number \;of \; transitions} \end{aligned}$$
(10)
$$\begin{aligned} Precision=\frac{number \; of \; correctly\; detected \;transitions}{number \;of \;detected\; transitions} \end{aligned}$$
(11)
$$\begin{aligned} F-measure=2\times \frac{Precision\times Recall}{Precision+ Recall} \end{aligned}$$
(12)
Table 1. Results of the different cell and block combinations for the HOG descriptor.

As shown in Table 1, the best results are obtained with a cell size of \( \left[ 6\times 6 \right] \) and a block size of \(\left[ 2\times 2 \right] \). This combination allows the detection of fine variations in the silhouette’s contour. These parameters are used in the rest of the experiments.

3.3 Performance Evaluation of Our Proposed Approach

Following the evaluation protocol of state-of-the-art person re-identification approaches, for each pedestrian we randomly select one image sequence to build the gallery set and one image sequence, captured by a different camera, to build the probe set. We repeat the experiments for 10 trials and report the average performance. Among the variety of metrics used to evaluate the effectiveness of a re-identification system, we report the Cumulative Matching Characteristics (CMC) curve, which measures the expectation of finding the correct match within the top r ranks. We report the performance at different ranks; rank-1 accuracy refers to the percentage of probe images that are correctly matched to their corresponding gallery image. In addition, we report the normalized Area Under the Curve (nAUC), which is derived from the CMC curve: it is the area under the entire CMC curve normalized by the total area of the graph (a perfect nAUC is 1.0). Since we target real-world applications, the first ranks of the CMC curve are the most important, as they indicate how well fully automatic re-identification performs.
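For clarity, the sketch below (our own, with assumed variable names) shows one common way to compute the CMC curve and its nAUC from a probe-by-gallery distance matrix; here the nAUC is approximated as the mean of the CMC values.

```python
# Hedged sketch of CMC and nAUC computation from a distance matrix.
import numpy as np

def cmc_and_nauc(distances, true_gallery_index):
    """distances: (n_probes, n_gallery) matrix; true_gallery_index[i] is the
    gallery column holding probe i's correct identity."""
    n_probes, n_gallery = distances.shape
    ranks = np.empty(n_probes, dtype=int)
    for i in range(n_probes):
        order = np.argsort(distances[i])                   # ascending distance
        ranks[i] = int(np.where(order == true_gallery_index[i])[0][0])
    cmc = np.array([(ranks < r).mean() for r in range(1, n_gallery + 1)])
    nauc = cmc.mean()                                      # discrete area under the CMC curve
    return cmc, nauc
```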

In this section, we first evaluate the set matching method adopted for multi-shot re-identification based on key frame selection, which calculates the average of all pairwise distances (i.e. Our method). For this evaluation, we compared our solution with the method adopted by Cheng et al. [3] and Farenzena et al. [7], which calculates the maximum of all sample similarities between two targets (i.e. Method A). The re-identification results are given in Table 2 and Fig. 5.

Fig. 5. CMC curve for the set matching evaluation.

Table 2. Re-identification performance for set matching evaluation.

We can see that our method consistently outperforms Method A: the re-identification rate at rank 1 is 23% for our method versus 18.5% for Method A. This is because Method A relies on a single sample pair from each set, which makes it sensitive to outliers. As an example, consider two pedestrians \(p_1\) and \(p_2\): \(p_1\) wears an open dark coat over a multi-color plaid shirt and dark trousers, while \(p_2\) wears a dark shirt and trousers. Seen from the rear, they look very similar; seen from the front, the appearance differences are clearly visible. In contrast, our method takes advantage of all the pairwise distances and combines them.

The second part is dedicated to evaluating the proposed key frame selection method for multi-shot person re-identification (i.e. Our method). For this evaluation, we compared our solution with the methods proposed by Cheng et al. [3] and Zeng et al. [4], which randomly select 5 images for each person. The matching policy of Cheng et al. [3] (i.e. Method B) is based on finding the maximum of all sample similarities between two targets. Method C refers to the case where 5 samples are randomly selected and the matching is based on the average of all pairwise similarities. The re-identification results are given in Table 3 and Fig. 6.

Fig. 6. CMC curve for the selected images evaluation.

Table 3. Re-identification performance for selected images evaluation.

We can see that the performance of our method is much better than that of Methods B and C in the first ranks of the CMC curve and in terms of nAUC. The major drawback of randomly selected images is that they may carry redundant information or incomplete visual details into the global appearance signature. Moreover, these methods operate uniformly on every target and do not take the unpredictable target path into consideration. On the other hand, our method automatically selects a set of representative appearance images based on the silhouette’s contour variations along the target’s trajectory, operating on each target independently. In addition, no learning phase is needed, which makes it suitable for real scenarios.

4 Conclusion

In this paper, we have presented a novel unsupervised appearance key frame selection approach for the multi-shot person re-identification problem. More precisely, we have proposed to automatically select a small set of representative appearance images depicting the different body postures by analyzing the silhouette’s contour variations along the target’s trajectory. Our approach does not need any training data, which is a great advantage for real applications. The visual characteristics of the selected images are modeled into a global appearance signature, and the similarity is computed based on the average of all pairwise distances. We evaluated our method on the challenging HDA+ dataset and demonstrated the robustness of the proposed approach against state-of-the-art approaches.

Encouraged by the promising performance of our method, we are currently examining how to improve the results by grouping the selected appearance images into different semantic clusters based on the estimation of the body’s pose, so that images with the same pose are compared in the matching step. Such a step could also reduce the temporal and spatial complexity of the re-identification problem. In addition, we aim to evaluate the proposed approach on other person re-identification datasets.