1 Introduction

A target tracking system consists of three main parts: target detection, display, and tracking. Detecting targets from the background has been done using single-frame, two-frame, or multi-frame processes [1]. To display targets, different representations have been used, such as a point, the center of gravity, or the contour [2]. The most complex part of multiple-target tracking is data association [3]. All probability-based methods use the Chapman-Kolmogorov and Bayesian equations simultaneously to solve the tracking problem. Depending on the problem and reasonable assumptions, methods such as the EKF and the particle filter have been introduced [4]. The theory of random finite sets has also been presented for target tracking in order to use the probability hypothesis density filter [5]. The Kalman filter (KF) provides a closed-form state estimate for a linear system using noisy measurements. However, applying the KF imposes two important constraints: a linear model for the target kinematic equations and a Gaussian model for the measurement noise. In the tracking process, as the number of targets and the complexity of tracking increase, the error rate of the tracking system increases as well. Due to the recursive nature of the KF, error correction is also a complex matter that increases the computational load and tracking time [6]. On the other hand, multiple-target tracking requires data association methods. Data association methods are divided into local and global categories [7]. Global methods solve the data assignment problem over a large set of frames [8], whereas local methods consider data assignment only over two or a few consecutive frames. Examples of local data association methods are the Global Nearest Neighborhood (GNN) and JPDA filters [9]. So far, research has resulted in compromises between accuracy and speed in the tracking process, but tracking multiple targets automatically with full accuracy and low execution time remains an open problem. The major difficulty is data assignment when measurements become ambiguous due to target collisions and maneuvering movements. A collision is a state in which a target is placed in front of other targets; after the collision, each target must still be detected and tracked with its previous identifier (ID).

In this paper, a method is proposed for multiple-target tracking based on combining the EKF and PSO with the JPDA filter. First, to detect moving targets and obtain their boundaries, frame subtraction and background modeling are used along with the Canny edge detector. Second, the centers of gravity of the targets are extracted and used as the EKF inputs, and the presence area of each target in the next frames is estimated using the EKF. Third, by applying PSO, the EKF performance is improved, as the effects of measurement noise are reduced. To compensate for the increase in execution time caused by adding PSO, the number of frames per second (fps) is reduced from 30 to 10 without losing tracking accuracy. Fourth, incorrect state assignments in the state estimation process are eliminated using the JPDA filter, so a one-to-one correspondence between targets and measurements is established.

In the following, related past work is discussed in Sect. 2. The proposed method is presented in Sect. 3. The simulation results of the proposed method are shown in Sect. 4 and compared with the results of similar methods. Finally, the paper is concluded in Sect. 5.

2 Past Work

We investigated past research in the field of multiple-target tracking; the methods most similar to ours are briefly explained below.

The authors in [3] first used frame subtraction and background modeling to extract target boundaries. Next, to represent the targets, features such as color, local binary patterns, and a slope-based histogram were obtained. Then, the JPDA filter was used to match targets with measurements. Finally, a mixed Kalman/\({H}_{\infty }\) filter was applied for target state estimation. An algorithm named "CFMD" [10] was proposed that removed camera movement and lens shake; target motion prediction was then performed, which improved the tracking accuracy. This algorithm was evaluated on the "OBT-100" and "VOT-2018" databases. The authors in [11] presented a tracking framework and a "semantic matching strategy" combined with a "scene-aware affinity measurement algorithm". They also introduced an indoor tracking database and increased the variety of the "existing indoor tracking benchmark dataset". Experiments were done on their own database and "the MOT benchmarks, MOT17 and MOT20 datasets". A multilevel framework [12] was introduced for target tracking in which targets were detected using a fuzzy inference system. Multiple-target tracking was done by applying the Hungarian method and the Kalman filter; the Hungarian method was applied to recognize a particular person in the video frames. However, they assumed "each human movement was independent of others". A "tracking-by-detection" method [13] was presented to track targets from "an overhead camera". The method was a combination of movement recognition using "adaptive accumulated frame difference" and "Shi-Tomasi corner detection". An algorithm based on the "GMPHD" filter and an "occlusion group management scheme" [14] was presented for multiple-target tracking and evaluated on the "benchmarks of MOT15 and MOT17". A method improving the "existing color local entropy particle filter tracking algorithm" [15] was proposed, and the "SIFT feature tracking method" was also enhanced. These two tracking methods were then combined and used for detecting moving vehicles in the video frames; finally, a Kalman filter was used to predict the moving vehicles. A "multiple information fusion-based multiple hypotheses tracking algorithm integrated with appearance features, local motion pattern feature, and repulsion inertia" [16] was introduced for multiple-target tracking. The adaptive spatial temporal context-aware (ASTCA) model [17] was presented based on the discriminative correlation filter (DCF) tracking framework for unmanned aerial vehicle (UAV) tracking. The ASTCA model could learn a spatial-temporal context weight and distinguish the target from the background in UAV-tracking scenarios, whose main challenges are the small target scale and the aerial view. The self-supervised learning-based tracker in a deep correlation framework (self-SDCT) [18] was introduced based on forward-backward tracking consistency and a multi-cycle consistency loss as self-supervised information for learning a feature extraction network from adjacent video frames. At the training step, pseudo-labels of consecutive video frames were generated by forward-backward prediction, and the multi-cycle consistency loss was used to learn the feature extraction network. At the tracking step, the pre-trained feature extraction network was employed to locate the target. An anchor-free tracker (SiamCorners) [19] was presented that was end-to-end trained offline on large-scale image pairs. A modified corner pooling layer was also introduced to convert the bounding box estimate of the target into a pair of corner predictions (bottom-right and top-left). In the network design, a layer-wise feature aggregation strategy was introduced that enabled the corner pooling module to predict multiple corners for a tracking target in deep networks, and a penalty term was applied to select an optimal tracking box among the candidate corners. A dual-level feature model [20] was introduced containing a thermal infrared (TIR)-specific discriminative feature and a fine-grained correlation feature for TIR object tracking. First, an auxiliary multi-classification network was designed to learn the TIR-specific discriminative feature. Then, to recognize intra-class TIR objects, a fine-grained aware module was proposed to learn the fine-grained correlation feature.

3 Proposed Method

The steps of the proposed method, shown in Fig. 1, are target detection, tracking, and data association. First, targets and their boundaries are detected using frame subtraction and background modeling along with the Canny edge detector. Next, the centers of gravity of the targets are extracted. Then, these centers are input to the EKF to estimate the state of the targets in the next video frame.

Fig. 1

Flowchart of the proposed method for the multi targets video tracking

To reduce the effect of measurement noise, PSO is applied, which reduces the estimation error and increases the accuracy of multiple-target tracking. In order to reduce the running time of the proposed method after adding PSO, the fps is reduced from 30 to 10. The JPDA filter is then used to prevent incorrect assignment of target states and the loss of targets during occlusion and maneuvering movement in the state estimation process, so a one-to-one correspondence between targets and measurements is established. In the following, each block of the flowchart in Fig. 1 is described.

3.1 Target detection

In order to detect targets, frame subtraction and background modeling are applied along with the Canny edge detector. To prevent incorrect extraction of the target boundaries from the video frame background due to illumination changes, shadows, etc., the background must be continuously updated throughout the tracking process. If we assume that \({B}_{K}\) is the background model at moment k, and \({\left.{P}_{k-1}(i)\right|}_{i=1}^{N}\) represents the bounding boxes around the targets at moment k-1, Eq. (1) describes the background update at moment k [3].

$$ B_{K} = \alpha B_{K - 1} + \left( {1 - \alpha } \right)\left( {I_{K} - \mathop \sum \limits_{i = 1}^{n} P_{K} \left[ i \right]} \right) $$
(1)

where \(\alpha \) is the background updating parameter in the range [0.9, 0.99]. \({B}_{0}\) is an initial background model that can be determined as the mean intensity of the initial n frames by Eq. (2); n is obtained experimentally. A short sketch of this update is given after Eq. (2).

$$ B_{0} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} I\left( i \right){ } $$
(2)
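As a minimal illustration of Eqs. (1)-(2), the following Python sketch initializes and updates the background model. Interpreting the removal of the target boxes in Eq. (1) as keeping the previous background inside those regions, and the (x, y, w, h) box format, are our assumptions rather than details given above.

```python
import numpy as np

def initial_background(frames):
    """Eq. (2): initial background as the mean intensity of the first n frames."""
    return np.mean(np.stack([f.astype(np.float64) for f in frames]), axis=0)

def update_background(B_prev, frame, target_boxes, alpha=0.95):
    """Eq. (1): exponential background update. Pixels inside the previously
    detected target boxes keep the old background value, which is how the
    removal of the P terms is interpreted here (an assumption)."""
    observation = frame.astype(np.float64).copy()
    for (x, y, w, h) in target_boxes:  # assumed (x, y, w, h) box format
        observation[y:y + h, x:x + w] = B_prev[y:y + h, x:x + w]
    return alpha * B_prev + (1.0 - alpha) * observation
```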

The procedure of target detection in the proposed method is shown in Fig. 2. Initially, three consecutive video frames are subtracted. Next, because the targets move only slightly between two consecutive frames, holes and gaps in the targets are filled by applying morphological operators. Then, the Canny edge detector is applied to extract the target boundaries more accurately. Finally, the center of gravity of each target boundary is identified and used as the target representative in the tracking process; a sketch of this pipeline is given after Fig. 2.

Fig. 2

Procedure of targets detection in the proposed method
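The following OpenCV-based sketch illustrates the detection pipeline of Fig. 2 under simplifying assumptions: three-frame differencing, a morphological closing to fill holes, Canny edge detection, and extraction of the centers of gravity via image moments. The thresholds, kernel size, and iteration count are illustrative, not the values used in the paper.

```python
import cv2
import numpy as np

def detect_targets(frames, diff_thresh=25):
    """Detection sketch: three-frame differencing, morphology, Canny edges,
    and centroid (center of gravity) extraction. `frames` holds three
    consecutive BGR frames."""
    f0, f1, f2 = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    # A pixel is considered moving if it changed in both consecutive frame pairs
    d1 = cv2.absdiff(f1, f0)
    d2 = cv2.absdiff(f2, f1)
    motion = cv2.bitwise_and(d1, d2)
    _, mask = cv2.threshold(motion, diff_thresh, 255, cv2.THRESH_BINARY)
    # Morphological closing fills holes and gaps left by small inter-frame displacement
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel, iterations=2)
    # Canny edges on the masked frame give sharper target boundaries
    edges = cv2.Canny(cv2.bitwise_and(f2, f2, mask=mask), 50, 150)
    # Centers of gravity of the detected boundaries via image moments
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centers = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] > 0:
            centers.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return centers
```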

3.2 Target Tracking

3.2.1 Kalman Filter

The KF is an analytical method for the state estimation of a linear system using noisy measurements. Using the KF imposes two important constraints: a linear model for the target kinematic equations and a Gaussian model for the measurement noise [12]. First, the system's initial state, including its state equations and covariance matrix, is determined. The state vector \({X}_{k}\) in Eq. (3) contains the coordinates of the target's center of gravity \(({P}_{x}\text{,}{P}_{y})\) and the target velocity \(({V}_{x}\text{,}{V}_{y})\) in the x and y directions, which specify the location and speed of the target.

$$ X_{k} = \left[ {\begin{array}{*{20}c} {P_{x} } \\ {P_{y} } \\ {V_{x} } \\ {V_{y} } \\ \end{array} } \right] $$
(3)

The KF equations are shown in Table 1. The target state and noise covariance matrix at moment k are updated by Eq. (4) using a linear model with constant speed and without considering system acceleration. Here, \({X}_{k-1}\) is the target state at moment k-1, A is the state transition matrix between moments k-1 and k, B is the control matrix of the input values, \({\mu }_{k}\) is the system acceleration, which is set to zero, and \({\omega }_{k}\) represents the system noise.

Table 1 The KF equations

The system noise is considered Gaussian with zero mean and covariance matrix \({\text{Q}}_{\text{k}}\). The noise covariance matrix is updated by Eq. (5). Data is read from the imaging sensor and the measurement is updated by Eq. (6). \({v}_{k}\) is the measurement noise, which has a Gaussian probability distribution with zero mean and covariance matrix \({R}_{K}\). The estimation error, called the innovation, is calculated by Eq. (7). The innovation covariance is calculated by Eq. (8), and the Kalman gain \({k}_{g}\) is computed by Eq. (9). The Kalman gain leads to a more accurate estimate of the system state by combining the estimation error with the measurement data from the sensor. Once the Kalman gain is determined, the state estimate and covariance matrix are updated by Eq. (10) and reported as the outputs. Due to the recursive structure of the KF, its outputs at moment k are used as the inputs for the target state estimation at moment k + 1, and this cycle continues to the last video frame. A sketch of this cycle is given below.
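Since Eqs. (4)-(10) are only referenced through Table 1 here, the following Python sketch spells out one predict/update cycle of a constant-velocity Kalman filter consistent with the state vector of Eq. (3); for a linear model the EKF reduces to this form. The sampling period dt and the noise levels q and r are illustrative values, not the paper's tuned parameters.

```python
import numpy as np

def kf_cycle(x, P, z, dt=0.1, q=1.0, r=4.0):
    """One predict/update cycle of a constant-velocity Kalman filter.
    State x = [px, py, vx, vy]^T as in Eq. (3); z = [px, py] measurement."""
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)      # state transition matrix
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)      # only positions are measured
    Q = q * np.eye(4)                              # process noise covariance
    R = r * np.eye(2)                              # measurement noise covariance
    # Prediction of state and covariance (Eqs. 4-5)
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Innovation and its covariance (Eqs. 7-8)
    y = z - H @ x_pred
    S = H @ P_pred @ H.T + R
    # Kalman gain and update (Eqs. 9-10)
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ y
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new
```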

3.2.2 PSO

PSO has been modeled by studying the social behavior of insects such as ants and the swarm behavior of animals such as birds and fish [21]. A number of particles are scattered in the search space. Each particle evaluates the objective function at its current location in this space. It then chooses a direction to move by combining information from its current location and the best location it has previously visited, as well as information from one or more of the best particles in the space. Each particle has four features: position, velocity, the value of the objective function at its current position, and its best experience in terms of position and objective function value. The update equations for particle velocity and position are given in Eqs. (11) and (12). Each particle tends to maintain its speed in the direction of its previous motion while moving towards the best reported group position and its own best individual experience.

$$ v_{i} \left( {t + 1} \right) = wv_{i} \left( t \right) + r_{1} c_{1} \left( {pos_{i} \left( t \right) - x_{i} \left( t \right)} \right) + r_{2} c_{2} \left( {g\left( t \right) - x_{i} \left( t \right)} \right) $$
(11)
$$ x_{i} \left( {t + 1} \right) = x_{i} \left( t \right) + v_{i} \left( {t + 1} \right) $$
(12)

where \({v}_{i}\) represents the instantaneous velocity, w is the inertia coefficient, between 0.4 and 0.9, \({pos}_{i}\) is the particle's best previous position, \({x}_{i}\) is the current particle position, \({c}_{1}\) and \({c}_{2}\) are the individual and group learning coefficients in the range [0, 2], \({r}_{1}\) and \({r}_{2}\) are random numbers with uniform distribution in the range (0, 1), and g is the best experience of the whole group [21]. Figure 3 shows how a particle moves according to Eqs. (11) and (12) in one step of PSO; a code sketch of these updates follows Fig. 3.

Fig. 3

Particle movement in one step of PSO
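A minimal Python sketch of the velocity and position updates of Eqs. (11)-(12) is given below; the swarm size, iteration count, and search bounds are illustrative assumptions.

```python
import numpy as np

def pso(cost, dim, lo=-10.0, hi=10.0, n_particles=30, iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO implementing the velocity/position updates of Eqs. (11)-(12)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))       # current positions
    v = np.zeros((n_particles, dim))                  # current velocities
    pbest = x.copy()                                  # personal best positions
    pbest_val = np.array([cost(p) for p in x])        # personal best costs
    g = pbest[np.argmin(pbest_val)].copy()            # global best position
    for _ in range(iters):
        r1 = rng.uniform(size=(n_particles, dim))
        r2 = rng.uniform(size=(n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)   # Eq. (11)
        x = x + v                                               # Eq. (12)
        vals = np.array([cost(p) for p in x])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[np.argmin(pbest_val)].copy()
    return g, float(pbest_val.min())

# Usage: minimize a simple quadratic in two dimensions
best_pos, best_cost = pso(lambda p: float(np.sum(p ** 2)), dim=2)
```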

3.2.3 EKF

The proposed algorithm uses a combination of the EKF and PSO to create an intermediate tracker for multiple targets. The general diagram of this intermediate tracker is shown in Fig. 4. The purpose of combining the EKF with PSO is to minimize the measurement noise covariance and thus reduce the measurement error. With a smaller measurement error, the Kalman gain gives more weight to the sensor data and reduces the uncertainty in the state estimation process. The cost function is defined by Eq. (13), where \({p}_{x}\) and \({p}_{y}\) represent the coordinates of the target's center of gravity obtained from the measurement, and \(\widehat{{p}_{x}}\) and \(\widehat{{p}_{y}}\) indicate the position estimated by the EKF.

$$ z = \sqrt {\left( {p_{x} - \widehat{{p_{x} }}} \right)^{2} + \left( {p_{y} - \widehat{{p_{y} }}} \right)^{2} } $$
(13)
Fig. 4

Multiple targets intermediate tracker model in the proposed method

Each position obtained by the EKF is subtracted from the position obtained by the sensor. Ideally, this difference would be zero for the correct position; however, due to the measurement noise, there is an error. PSO reduces this error and brings it closer to zero by estimating the covariance matrix of the measurement noise, which is shown in step 6 of Fig. 4. Finally, the stop conditions are checked; if they are met, the results are reported as the outputs of the intermediate tracker. A sketch of this loop is given below.
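The following sketch illustrates, under simplifying assumptions, the loop of Fig. 4: PSO searches a single scalar measurement-noise level r for the Kalman cycle so that the average position error of Eq. (13) over a window of measurements is minimized. The functions `kf_cycle` and `pso` refer to the sketches given earlier, and representing the covariance R by one scalar is our assumption, not a detail of the proposed method.

```python
import numpy as np

def tune_measurement_noise(kf_cycle_fn, pso_fn, x0, P0, measurements):
    """Sketch of the intermediate-tracker idea: PSO picks the measurement-noise
    scale that minimizes the average distance of Eq. (13) between estimated and
    measured target positions over a window of measurements."""
    def cost(params):
        r = abs(float(params[0])) + 1e-6          # keep the noise level positive
        x, P, err = x0.copy(), P0.copy(), 0.0
        for z in measurements:
            x, P = kf_cycle_fn(x, P, z, r=r)      # one KF predict/update step
            err += np.hypot(z[0] - x[0], z[1] - x[1])   # Eq. (13)
        return err / len(measurements)
    best_r, _ = pso_fn(cost, dim=1, lo=0.01, hi=20.0)
    return abs(float(best_r[0])) + 1e-6
```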

3.3 Data Association

3.3.1 JPDA Filter

Around each target i, a hypothetical gate \({G}_{K}^{(i)}\) is defined as in Eq. (14), and only data contained within this gate can be measurement data related to this target. \({Z}_{K}\) is the set of measurements, \({\widehat{Z}}_{K|K-1}\) is the measurement estimated by the EKF, \({S}_{K}\) is the innovation covariance matrix, and δ is a threshold value: if it is too small, the data association algorithm fails, and if it is too large, the computational load of the algorithm increases [22]. In the association process using the JPDA filter, the probability of target identification is usually assumed to be fixed; however, in the proposed method this probability can be calculated directly thanks to the EKF. A minimal gating check is sketched after Eq. (14).

$$ G_{K}^{\left( i \right)} = \left\{ {\left( {Z_{K} - \hat{Z}_{K|K - 1} } \right)^{T} \left( {S_{K} } \right)^{ - 1} \left( {Z_{K} - \hat{Z}_{K|K - 1} } \right) \le \delta } \right\} $$
(14)
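As a minimal illustration of the gate test in Eq. (14), the following function checks whether a measurement falls inside a target's validation gate; the threshold value shown is the 99% chi-square quantile for two degrees of freedom and is only an example, not the δ used in the paper.

```python
import numpy as np

def in_gate(z, z_pred, S, delta=9.21):
    """Eq. (14): accept measurement z for a target only if its Mahalanobis
    distance to the predicted measurement z_pred, weighted by the innovation
    covariance S, is below the threshold delta."""
    d = np.asarray(z, dtype=float) - np.asarray(z_pred, dtype=float)
    return float(d @ np.linalg.inv(S) @ d) <= delta
```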

Due to the nature of the EKF, the likelihood of identifying targets increases. The probability of target identification is defined in Eq. (15), where \({N}_{D}\) and \({N}_{O}\) respectively specify the number of identified targets within the hypothetical association gate and the number of targets in each frame. If a target is detected in a hypothetical gate, its contribution is counted as 1, otherwise 0; the summation symbol (∑) accumulates these counts over all video frames. The JPDA filter equations are given in Table 2.

$$ P_{D} = \frac{{\sum\nolimits_{{i = 1}}^{{tot\,frame}} {N_{D} } }}{{\sum\nolimits_{{i = 1}}^{{tot\,frame}} {N_{O} } }} $$
(15)
Table 2 The JPDA equations

In the equations of Table 2, \({\beta }_{k}^{i\text{,}j}\) is the joint marginal association probability, i.e., the probability of associating target i with measurement j, as defined in Eq. (16).

$$ \beta_{k}^{{i{,}j}} = \Pr \left( {\theta_{k}^{i} = j{|}Z_{1:k - 1} } \right) $$
(16)

The probability that target i is not identified at time k is denoted by \({\beta }_{k}^{i\text{,}0}\) and given by Eq. (17).

$$ \beta_{k}^{{i{,}0}} = \Pr \left( {\theta_{k}^{i} = 0{|}Z_{1:k - 1} } \right) = 1 - \mathop \sum \limits_{j} \beta_{k}^{{i{,}j}} $$
(17)

where \({\theta }_{k}^{i}\) is the data association variable, defined by Eq. (18).

$$ \theta_{k}^{i} = \left\{ {\begin{array}{*{20}l} {j,} & {{\text{if target }}i{\text{ is associated with measurement }}j} \\ {0,} & {{\text{otherwise (target }}i{\text{ is not identified)}}} \\ \end{array} } \right. $$
(18)

Overall, adding the JPDA filter at the end of the proposed method helps the proposed tracker label targets accurately during occlusion and maneuvering movements; a simplified sketch of how the marginal association probabilities can be computed is given below.
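The JPDA equations themselves are only referenced through Table 2 here, so the following Python sketch shows, under simplifying assumptions, how the marginal association probabilities of Eqs. (16)-(17) can be obtained by enumerating feasible joint association events. The detection probability, the clutter factor, and the exhaustive enumeration (practical only for a few targets and gated measurements) are illustrative choices, not the authors' implementation.

```python
import numpy as np

def jpda_marginals(likelihood, P_D=0.9, clutter=1e-3):
    """Simplified JPDA marginals. likelihood[i, j] is the Gaussian likelihood of
    measurement j under target i's predicted measurement (zero outside the gate).
    Returns beta with one extra column for the 'not identified' case (Eq. 17)."""
    n_t, n_m = likelihood.shape
    beta = np.zeros((n_t, n_m + 1))

    def events(i, used):
        """Enumerate joint events for targets i..n_t-1: each target is either
        missed or assigned to an unused gated measurement."""
        if i == n_t:
            yield {}
            return
        for rest in events(i + 1, used):                 # target i not detected
            yield {i: None, **rest}
        for j in range(n_m):
            if j not in used and likelihood[i, j] > 0.0:
                for rest in events(i + 1, used | {j}):   # target i gets measurement j
                    yield {i: j, **rest}

    total = 0.0
    for ev in events(0, frozenset()):
        assigned = [j for j in ev.values() if j is not None]
        p = 1.0
        for i, j in ev.items():
            p *= (1.0 - P_D) if j is None else P_D * likelihood[i, j]
        p *= clutter ** (n_m - len(assigned))   # leftover measurements treated as clutter
        total += p
        for i, j in ev.items():
            beta[i, n_m if j is None else j] += p
    return beta / total if total > 0 else beta

# Usage: two targets, two gated measurements with pre-computed likelihoods
L = np.array([[0.8, 0.1],
              [0.2, 0.7]])
print(jpda_marginals(L))   # rows: targets; columns: measurement 1, measurement 2, missed
```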

4 Simulation Results

The simulation results of the proposed method are presented in two steps: the output of the intermediate tracker and the output of the JPDA filter. These results are also compared with the simulation results of similar methods. To evaluate the performance of the proposed method, the PETS2009 database [23, 24] is used, since it provides a suitable and realistic model of real-world scenarios.

PETS2009 provides scenarios with moderate brightness changes, no shadows, maneuvering targets, and targets with variable speed and collisions, captured by eight cameras.

The S1 scenario is provided for estimating the number of people and tracking sets of individuals, while the S2 scenario is for tracking individuals. We used the S2.L1 scenario captured by camera 1 to track people in a video sequence. Simulations were performed in MATLAB R2019b on a laptop with an Intel Core i7 processor.

4.1 Results of Intermediate Tracker

Table 3 shows the initial parameter values of the identifier, EKF, and PSO parts of the proposed method.

Table 3 Initialization of the intermediate tracker parameters

Figure 5 illustrates qualitative results of the intermediate tracker for frames 3, 10, 20, 30, 45, 55, 70, and 80. In Fig. 5, detected people are marked with ellipses of different colors to make the tracking results easier to compare. The centers of gravity of the ellipses in each frame are fed to the intermediate tracker, which is a combination of the EKF and PSO. As can be seen, people are tracked with good accuracy by the proposed method, even after collisions and during maneuvering movements.

Fig. 5

Simulation results of the proposed intermediate tracker in frames 3, 10, 20, 30, 45, 55, 70, and 80 (from top left to bottom right) of the S2.L1 scenario of PETS2009

4.2 Results of the JPDA Filter

Table 4 shows the initial values of the JPDA parameters. The JPDA filter also requires the coordinates of the targets' centers of gravity and their initial velocities in an input frame. Trajectories of 7 targets during the first eight seconds of the S2.L1 scenario of PETS2009 are shown in Fig. 6; these targets are listed in Table 5. Each trajectory is marked with a different symbol and color. Figure 7 also demonstrates the tracking of each of these 7 targets up to frame 80 by the proposed method. It can be seen that all of these targets are tracked accurately; for example, target 7 has been tracked through its maneuvering movement and occlusion. Furthermore, Fig. 8 illustrates the final results of the proposed method for frames 3, 27, 56, and 80 of the S2.L1 scenario of the PETS2009 database.

Table 4 Initialization of the JPDA filter parameters
Fig. 6

Trajectories of 7 targets at the first eight seconds of S2.L1 scenario of PETS2009

Table 5 Characteristics of 7 targets at the first eight seconds of S2.L1 scenario
Fig. 7

Tracking of targets numbered 1 to 7 (Table 5) up to the frame numbered 80 by the proposed method

Fig. 8

Results of the JPDA filter in the proposed method for frames 3, 27, 56, and 80 (from top left to bottom right) of the S2.L1 scenario of PETS2009

4.3 Results Evaluation

So far, various criteria have been introduced by researchers to evaluate the performance of a tracking system, and the appropriate choice of criteria depends on the application. Summarizing the performance of a tracking system in a single number makes it easy to compare different tracking systems; however, a single number may not be enough to accurately express the performance of the different parts of a tracking system. The evaluation criteria used here for multiple-target tracking are listed in Table 6, where TT is the total number of targets and GT is the Ground-Truth, which represents the actual situation of the targets.

Table 6 Different criteria for the targets tracking evaluation

Table 7 shows a performance comparison of the proposed method with a number of similar methods on the S2.L1 scenario of the PETS2009 database. Table 8 compares the mean error in calculating the centers of gravity of targets 5 and 6 (Table 5) for the proposed method, the GMPHD filter [14], and the particle-multiple features method [15]. As can be seen, the proposed method performs better than the particle-multiple features method in terms of both accuracy and center-of-gravity error for these targets. The GMPHD filter performs slightly better than the proposed method in terms of the center-of-gravity error for target 6, but its MOTA accuracy is lower than that of the proposed method. Therefore, according to MOTA, the proposed method is overall the better multiple-target tracker.

Table 7 Performance comparison of the proposed method with a number of similar methods using S2.L1 scenario of PETS2009
Table 8 Comparison of mean of errors in calculation of the center of gravity of targets numbered 5 and 6 by the proposed method, and methods in [14] and [15]

Table 9 shows the performance evaluation of the proposed method based on the precision and mean precision criteria on the S2.L1 scenario of the PETS2009 database for tracking targets 1 to 7 (Table 5) up to frame 80.

Table 9 Precision of the proposed method on the S2.L1 scenario of the PETS2009 database for tracking targets 1 to 7 (Table 5) up to frame 80

Furthermore, the performance of the proposed method without the PSO part is also evaluated on video sequences at 10 and 30 fps using MOTA, as shown in Table 7. As can be seen, applying the EKF without PSO reduces MOTA, and reducing the sampling rate of the EKF without PSO significantly reduces the accuracy of the proposed method, because the EKF and JPDA alone cannot compensate for the effect of a reduced sampling rate on the accuracy of a multiple-target tracker. Finally, the full proposed method reduces the measurement noise and achieves a performance accuracy of 98%. Note also that applying the EKF without an appropriate data association filter is not feasible.

5 Conclusion

In the proposed multiple-target tracking method, combining the EKF with PSO reduced the measurement noise and the uncertainty of the EKF outputs, so the EKF gain and the accuracy of the proposed method were increased. However, applying PSO led to an increase in the computational load and execution time of the proposed method; these problems were addressed by reducing the fps from 30 to 10. In the estimation process, incorrect assignment of target states and the loss of targets due to occlusion and maneuvering movement were eliminated using the JPDA filter, so the search area for each target in the next frames became smaller and the interactions between targets were modeled.

The proposed method was implemented on the S2.L1 scenario of the PETS2009 database. A comparison of the simulation results of the proposed method with a number of similar methods showed that the proposed method can be considered a multiple-target tracker with a performance accuracy of 98% according to the MOTA criterion.

It is important to note that in some video sequences, reducing the fps to 10 may result in lower tracking accuracy. In that case, the fps has to be increased, for example to 12 and at most to 30, which increases the running time of the proposed tracker. Therefore, for future work it is recommended to test optimization algorithms other than PSO in order to achieve a lower running time without losing the tracking performance accuracy of 98%.

It should also be noted that if the number of selected targets increases, especially when targets are very close to each other, the performance accuracy of the proposed method decreases due to the uncertainty of the EKF outputs.