Introduction

A multi-object tracker identifies all instances of a specific object type in a video and tracks each instance across the frames of the camera. Despite remarkable progress [1,2,3,4], the multi-object tracking (MOT) task still faces many real-world challenges, such as the scarcity of datasets and the difficulty of annotation. Taking the MOT15 [5] dataset as an example, annotating a 6-minute video took at least 22 h using the standard process [6]. Therefore, self-supervised or unsupervised multi-object tracking is urgently needed for practical MOT applications.

Advances in object detection [7,8,9,10] have made the supervised two-stage tracking-by-detection paradigm the dominant MOT approach over the past decade. However, this paradigm is inefficient and cannot achieve global optimality. Recently, the joint detection and embedding (JDE) paradigm [11] was proposed to balance efficiency and accuracy for MOT. It combines the object detector and the object feature embedding into a single-stage tracking model. JDE-based MOT methods attach a feature embedding branch to a one-stage detector to learn object embeddings, and then associate the detected objects with existing trajectories through the Hungarian matching algorithm. However, because JDE-based MOT methods treat each trajectory as a category, the tracking objective is essentially trained as a classification task. This makes the tracking performance rely heavily on the object detection results. Furthermore, as the number of IDs increases, the number of categories grows as well, which makes the classification harder to fit. Hence, the highly efficient JDE-based MOT methods are prone to missed tracks and ID switches [4, 11, 12].

Recently, some studies [4, 13,14,15] have attempted to reduce the reliance on trajectory annotation. They still use a supervised approach for the detection part of the method, and either do not use labels in the tracking part [13, 14] or use a small number of labels [15]. For example, the weakly supervised MOT method [13] used detection results and Kalman filtering to generate pseudo-labels of trajectories for MOT on long video segments; however, it still treats multi-object tracking as a trajectory classification problem. [14] used an unlabeled dataset to train a separate feature extractor as an embedding model. However, the trained feature extractor generalizes poorly to the current tracking dataset (MOT Challenge data), and the method requires additional feature-extraction steps, is multistage, and is inefficient. [15] applied contrastive learning to the training of embedding branches, which verifies the efficacy of the contrastive loss. However, it still relies on long video segments, which are prone to error accumulation and partial trajectory labeling. The unsupervised pretraining of [4] relies on the repetition-free object dataset CrowdHuman, which has been shown to have limited generalization ability on the MOT17 and MOT20 datasets.

Inspired by self-supervised methods and the high efficiency of the JDE paradigm, in this paper we design a self-supervised JDE-based multi-object tracking method (SS-MOT). Our method uses a supervised approach for the detection task and a self-supervised approach for the tracking task. To reduce the accumulation of pseudo-labeling errors over long video segments, we rely on only two critical priors derived from the characteristics of a target video frame and its adjacent frames: (1) objects within the same frame must be different from each other, and (2) objects in adjacent frames can be matched into pairs with high accuracy based on their embedding features. We treat the matched pairs obtained from prior (2) as positive sample pairs in self-supervised contrastive learning and use the embedding features of the other objects as negative samples, thus achieving self-supervised training of the embedding branch. This training approach solves two problems of supervised training simultaneously: (1) trajectory annotation is no longer required, and the coupling between the size of the dataset and the difficulty of fitting is reduced; and (2) the method can also be trained on static images without temporal information, improving the scalability of the joint model.

Following YOLOX [16], we observed that detection and embedding in the JDE paradigm suffer from a competition problem. For the detection task in crowded scenes, all bounding boxes belong to the same class in foreground-background classification, so the optimization pulls the features of the targets closer together. For the trajectory ID classification task, however, it is necessary to distinguish who each person is, which requires increasing the feature distance between different targets. We therefore borrowed the multitask decoupling approach from YOLOX [16] and found that the number of convolution layers between the feature map and the head is also a key factor in achieving decoupling. Consequently, we propose a decoupling module to mitigate the competition problem while balancing the number of parameters and tracking performance.

We conducted experiments on the MOT17 and MOT20 datasets to assess the efficacy of the proposed method. We first trained FairMOT and SiamMOT on the same training/testing data using both the self-supervised and the supervised approach. The experimental results demonstrate that our self-supervised method achieves the best results among unsupervised methods and obtains results comparable to supervised methods.

The contributions of this study can be divided into three aspects.

  1. We present the first self-supervised MOT method based on the efficient JDE paradigm, trained in a self-supervised manner using only information within a video frame and its adjacent frames. Furthermore, we validate the competition problem between the detection branch and the feature embedding branch of JDE-based trackers and propose a method to choose an appropriate neural network depth for decoupling while balancing the number of parameters and performance metrics.

  2. Our SS-MOT method addresses the challenges of existing approaches that rely on pseudo-labels or training with long video sequences, which struggle to fit an increasing number of targets and suffer from error accumulation. Our method also tackles the high annotation cost of object tracking datasets.

  3. We demonstrate through extensive comparative experiments on the MOT17 and MOT20 datasets that our method achieves results comparable to supervised methods while achieving the best results among unsupervised methods.

Methods

Overall network framework

Fig. 1 The network architecture of our proposed self-supervised MOT method based on the JDE paradigm. Our main contributions are the two components on a green background: the decoupling module in front of the head branches and the metric learning module for the relevance of object features in our self-supervised training

The overview of our proposed self-supervised multi-object tracking framework based on the JDE paradigm is shown in Fig. 1: we input two successive frames, extract features with the backbone network, and embed objects based on these features to achieve detection and tracking, respectively. To address the costly problem of manual annotation, we designed a self-supervised metric learning module. We also added a lightweight decoupling module in front of the heads to alleviate the problem of knowledge competition.

The tracking problem involves finding correspondences of the same object across successive frames, and current supervised MOT methods require a large amount of manual annotation. For example, existing JDE-based MOT methods [4, 11, 12, 17, 18] treat the MOT problem as a classification problem: they treat each trajectory as a separate class that constrains the feature embedding representation of the different objects during training. This training method achieves good results when the number of trajectories is small. However, when the number of trajectories is extremely large, fitting the model becomes difficult. This condition can limit the performance of JDE-based trackers, because trajectory lengths in the dataset are inconsistent, leading to an unbalanced number of samples per category. In this study, we propose a self-supervised metric learning method to learn the feature embedding representations of objects. Our method uses two priors, namely that objects within the same frame are different from each other and that the same object in adjacent frames forms a positive pair, to construct training pairs with positive and negative samples, and it trains the embedding with a contrastive learning objective over these within-frame and cross-frame relationships. Compared with existing supervised methods, our method does not rely on a large amount of annotation. Compared with existing unsupervised methods, our method fully exploits the information of the consecutive frames themselves and does not need to generate intermediate pseudo-labels.

In the MOT Challenge scenario, the classification tasks include foreground-background classification and trajectory ID classification. These tasks conflict when features are extracted for two different people: foreground-background classification pushes their feature distance to be smaller, whereas trajectory ID classification requires their features to be farther apart. To alleviate this conflict, we add lightweight decoupling modules.

For ease of understanding, we provide the following symbols and notations:

\(I_t\):

The image at frame t.

\(B_t\):

The positions of k objects in the t-th frame image.

\(\hat{B_t}\):

The object position output from the forward pass of the tracker.

\(y_t\):

The trajectory numbers of k objects in the t-th frame image.

\(\hat{y_t}\):

The predicted trajectory numbers for the k objects.

\(\hat{E_t}\):

The features extracted from the backbone network for image \(I_t\).

\(k_t\):

The number of objects in the t-th frame.

M:

The computed cosine similarity matrix.

sim():

The function computing the similarity between two samples.

\(m_{i,j}\):

The cosine similarity between two object embedding vectors.

Self-supervised embedding learning

Fig. 2 Our proposed metric learning module for the relevance of object features. We use the embedding vector obtained at the object position from the detector as the feature vector of that object. The cosine distance matrix is computed from these features. When calculating the positive-sample loss, the embedding vectors of other objects are filled in as negative samples. The darker the color in the matrix, the higher the similarity

Before introducing our self-supervised approach, we recall how the joint tracker of the supervised approach is trained. The joint tracker [4, 11, 18] uses a dataset denoted by \(\{ I,B,y\}_{i=1}^N\) for supervised training, where \({I_t}\in {{\mathbb {R}}^{c \times h \times w}}\) denotes an image frame, \({B_t}\in {{\mathbb {R}}^{{k_t}\times 4}}\) denotes the positions of the \(k_t\) objects in the current frame t, and \({y_t} \in {Z^{{k_t}}}\) denotes the trajectory numbers to which the \(k_t\) objects of the current frame belong. These joint trackers output the object locations \(\hat{B}_t\in {{\mathbb {R}}^{\hat{k}_t\times 4}}\) and the embedded features \(\hat{E}_t\in {{\mathbb {R}}^{\hat{k}_t\times D}}\) in a single forward pass, where D is the dimensionality of the feature vector. The loss of the joint tracker is as follows:

$$\begin{aligned} L_{joint} = L_{det } + L_{id} \end{aligned}$$
(1)

where \({L_{det}}\) is the detection loss determined by the gap between \(\hat{B}_t\) and \({B_t}\), and \({L_{id}}\) is the loss of the embedding branch. The embedded features \(\hat{E}_t\) are fed into a fully connected layer (used only during training) for classification to obtain \(\hat{y}_t \in {Z^{\hat{k}_t}}\); \({L_{id}}\) is then the cross-entropy loss between \(\hat{y}_t\) and \({y_t}\).
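For concreteness, a minimal sketch of this supervised ID loss is given below; the shapes and the number of trajectory IDs are illustrative assumptions, and the actual FairMOT/Cstrack heads differ in detail:

```python
import torch
import torch.nn as nn

num_tracks = 500   # hypothetical total number of trajectory IDs in the training set
feat_dim = 128     # D: dimensionality of the embedding vector

id_classifier = nn.Linear(feat_dim, num_tracks)   # used only during training
ce_loss = nn.CrossEntropyLoss()

def supervised_id_loss(E_hat, y):
    """E_hat: (k_t, D) embeddings of detected objects; y: (k_t,) trajectory IDs."""
    logits = id_classifier(E_hat)    # (k_t, num_tracks)
    return ce_loss(logits, y)        # L_id; the joint loss is L_det + L_id
```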

In self-supervised training, we no longer have trajectory labels. Therefore, our dataset is represented as \(\{ I,B\} _{i = 1}^N\), and we can no longer use a classification loss to train the embedding branch. Our priors are illustrated in Fig. 3. We construct the contrastive loss based on two key ideas: (1) objects within the same frame must be negative samples of each other; (2) similar objects can be found in adjacent frames, from which positive samples can be constructed. We verify these priors experimentally in Sect. “Validation experiment and ablation study”. Our self-supervised training framework is shown in Fig. 2. To establish associations across frames, we follow [18,19,20] and use pairs of consecutive frames to construct short sub-video segments as input to the model. Each sub-video segment can be represented as \(\{ I_i,B_i\} _{i = t}^{t + 1}\). After feeding these sub-videos into the network, the corresponding feature vectors \(\hat{E}_t = \{ {x_1},{x_2},\ldots ,{x_{{k_t}}}\} \) and \(\hat{E}_{t+1} = \{ {x_1},{x_2},\ldots ,{x_{{k_{t + 1}}}}\} \) are obtained based on the detection labels of frames t and \(t+1\), where x denotes the feature vector of the corresponding object, and \({k_t}\) and \({k_{t + 1}}\) denote the numbers of objects in the corresponding frames.

Fig. 3 Illustration of our motivation for designing the metric learning: objects within the same frame are different from each other, but similar objects can be found in neighboring frames. We use the contrastive loss to push negatives farther apart and pull positives closer

A. Feature embedding learning within frames. We learn the distinguishability of different object features via self-supervised contrastive learning over objects within the same frame. Because contrastive learning normally requires positive sample pairs, and no positive pairs exist among objects within the same frame, we retain the softmax-like operation of the contrastive loss but replace the positive-pair similarity in the numerator with negative-pair similarities, and we no longer take the logarithm and negate. The standard self-supervised contrastive loss is as follows:

$$\begin{aligned} {L_{contrastive}} = - \log \left[ {\frac{{{e^{sim({x_i},x_i^ + )/\tau }}}}{{\sum \nolimits _{j \ne i} {{e^{sim({x_i},{x_j})/\tau }}} }}} \right] \end{aligned}$$
(2)

where \(sim({x_i},x_i^ + )\) denotes the cosine similarity between the i-th sample and its positive sample, \(sim({x_i},{x_j})\) denotes the similarity between the i-th object and the other samples except itself, and \(\tau \) is the temperature that controls the weighting of difficult samples. We concatenate \(\hat{E}_t\) with \(\hat{E}_{t+1}\) and calculate the cosine similarity matrix \(M \in {\mathbb {R}^{({k_t} + {k_{t + 1}})\times ({k_t} + {k_{t + 1}})}}\). The value \({m_{i,j}}\) at each position of the matrix is calculated by Eq. 3.

$$\begin{aligned} m_{i,j} = \frac{x_i \cdot x_j}{\Vert x_i \Vert _2 \Vert x_j \Vert _2},\quad i,j \in [0,{k_t} + {k_{t + 1}} - 1] \end{aligned}$$
(3)

where \(m_{i,j}\) denotes the cosine similarity between two object embedding vectors. \(M_{t,t}\) and \(M_{t+1,t+1}\) denote the similarities among objects within frame t and within frame \(t+1\), respectively, while \(M_{t,t+1}\) and \(M_{t+1,t}\) denote the similarities between objects of frame t and objects of frame \(t+1\). We design the loss \(L_{self}\) for negative samples based on the prior that objects within the same frame must be negative samples of each other.
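A sketch of how the stacked cosine similarity matrix M of Eq. 3 could be computed (PyTorch; tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def similarity_matrix(E_t, E_t1):
    """E_t: (k_t, D) and E_t1: (k_{t+1}, D) object embeddings of two frames.
    Returns M of shape (k_t + k_{t+1}, k_t + k_{t+1}) of cosine similarities (Eq. 3)."""
    X = torch.cat([E_t, E_t1], dim=0)   # stack the objects of the two frames
    X = F.normalize(X, dim=1)           # x / ||x||_2
    return X @ X.t()                    # m_{i,j} = cos(x_i, x_j)
```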

$$\begin{aligned} L_{self}= & {} \sum \limits _{i \ne j}^{0 \le i,j \le {k_t} - 1} {\frac{{{e^{{m_{i,j}}/\tau }}}}{{\sum \nolimits _{l \ne i}^{{k_t} + {k_{t + 1}} - 1} {{e^{{m_{i,l}}/\tau }}} }}} \nonumber \\{} & {} + \sum \limits _{i \ne j}^{{k_t} \le i,j \le {k_t} + {k_{t + 1}} - 1} {\frac{{{e^{{m_{i,j}}/\tau }}}}{{\sum \nolimits _{l \ne i}^{{k_t} + {k_{t + 1}} - 1} {{e^{{m_{i,l}}/\tau }}} }}} \end{aligned}$$
(4)

The denominator of the first term of \(L_{self}\) is the sum of all elements in the corresponding row except the diagonal element, which tends to push apart the features of all objects in frame t. The second term performs the same operation on \(M_{t + 1,t + 1}\). The denominators are kept the same as the denominator of the contrastive loss, but the numerator of the contrastive loss is the similarity of a positive sample pair, and positive pairs cannot exist within the same image frame. Therefore, \(L_{self}\) replaces the positive-pair similarity in the numerator with the negative-pair similarity while retaining the softmax-like operation of the contrastive loss; it no longer takes the logarithm and negates, so that minimizing the loss pushes the negative samples farther apart.
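The following sketch implements \(L_{self}\) of Eq. 4 under these definitions; the block boundaries and the exclusion of the diagonal follow the text, while variable names are ours:

```python
import torch

def self_loss(M, k_t, k_t1, tau=0.5):
    """Within-frame negative-sample loss L_self (Eq. 4), a sketch.
    M: (k_t + k_t1, k_t + k_t1) cosine similarity matrix."""
    n = k_t + k_t1
    exp_M = torch.exp(M / tau)
    # denominator: for each anchor i, sum over all l != i (both frames)
    denom = exp_M.sum(dim=1) - torch.diagonal(exp_M)              # (n,)
    loss = M.new_zeros(())
    for lo, hi in [(0, k_t), (k_t, n)]:                           # frame t block, frame t+1 block
        block = exp_M[lo:hi, lo:hi] / denom[lo:hi, None]          # softmax-like ratios
        loss = loss + block.sum() - torch.diagonal(block).sum()   # drop the i == j terms
    return loss
```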

B. Feature embedding learning across consecutive frames. \({L_{self}}\) only requires objects within the same frame to have distinct features; it does not constrain the features of the same target across frames. To learn the similarity of identical targets, we apply the Hungarian algorithm to \(M_{t,t+1}\) as a forward match from the objects of frame t to the objects of frame \(t+1\), thereby obtaining matched pairs of identical objects in adjacent frames. These matched pairs are treated as positive sample pairs, and we use the contrastive loss to compute \(L_{cross}\):

$$\begin{aligned} L_{cross} = \sum \limits _{i,j \in matched} - \log \left( \frac{{{e^{{m_{i,j}}/\tau }}}}{{\sum \nolimits _{l \ne i}^{{k_t} + {k_{t + 1}} - 1} {{e^{{m_{i,l}}/\tau }}} }}\right) \end{aligned}$$
(5)
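A possible implementation of the forward matching and \(L_{cross}\) (Eq. 5); the 0.7 similarity threshold follows the linear-assignment ablation later in the paper, and SciPy's Hungarian solver stands in for the matching step:

```python
import torch
from scipy.optimize import linear_sum_assignment

def cross_loss(M, k_t, k_t1, tau=0.5, thresh=0.7):
    """Cross-frame positive-pair loss L_cross (Eq. 5), a sketch.
    Positive pairs come from Hungarian matching on the M_{t,t+1} block,
    kept only when their similarity exceeds `thresh`."""
    cross = M[:k_t, k_t:k_t + k_t1]                                      # frame t -> frame t+1
    rows, cols = linear_sum_assignment(-cross.detach().cpu().numpy())    # maximize similarity
    matched = [(i, j) for i, j in zip(rows, cols) if cross[i, j] >= thresh]

    exp_M = torch.exp(M / tau)
    denom = exp_M.sum(dim=1) - torch.diagonal(exp_M)                     # sum over l != i
    loss = M.new_zeros(())
    for i, j in matched:
        loss = loss - torch.log(exp_M[i, k_t + j] / denom[i])
    return loss, matched
```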

C. Feature embedding learning with a cycle consistency constraint. The purpose of learning the similarity of identical objects is to increase the similarity of matched pairs between adjacent frames. We interpret the matching operation on \(M_{t,t + 1}\) as forward tracking of the objects at frame t to the objects at frame \(t+1\). Inspired by UDT [21], we argue that the forward tracking result should be consistent with the result of reverse tracking from the objects at frame \(t+1\) back to the objects at frame t. Therefore, we introduce a constraint that enhances the consistency of the features of the same object and construct the loss \({L_{cycle}}\):

$$\begin{aligned} L_{cycle} = \sum \limits _{i,j \in matched} {1 - \frac{{{e^{{m_{j + i,i}}/\tau }}}}{{\sum \nolimits _{l \ne j + i}^{{k_t} + {k_{t + 1}} - 1} {{e^{{m_{j + i,l}}/\tau }}} }}} \end{aligned}$$
(6)

\(L_{cycle}\) acts on \(M_{t + 1,t}\); it uses the diagonal elements given by the forward matched pairs as the reverse matched pairs and requires no additional matching operation, that is, it is the reverse of the forward matching shown in Fig. 2. This further reduces the feature distance between matched pairs. We define the metric learning loss as the sum of the above three losses, that is,

$$\begin{aligned} L_{id} = L_{self} + L_{cross} + L_{cycle} \end{aligned}$$
(7)
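A sketch of \(L_{cycle}\) (Eq. 6) and the combined loss of Eq. 7; treating the stacked index \(k_t + j\) of the matched object in frame \(t+1\) as the reverse anchor is our reading of the notation \(m_{j+i,i}\) and should be taken as an assumption:

```python
import torch

def cycle_loss(M, matched, k_t, tau=0.5):
    """Cycle-consistency loss L_cycle (Eq. 6), a sketch.
    For each forward match (i, j), the row of object j of frame t+1
    (stacked index k_t + j, an assumption) is the reverse anchor pulled back to object i."""
    exp_M = torch.exp(M / tau)
    denom = exp_M.sum(dim=1) - torch.diagonal(exp_M)    # sum over l != anchor
    loss = M.new_zeros(())
    for i, j in matched:
        r = k_t + j                                     # reverse anchor index
        loss = loss + (1.0 - exp_M[r, i] / denom[r])
    return loss

# Total metric-learning loss (Eq. 7):
# L_id = self_loss(M, k_t, k_t1) + L_cross + cycle_loss(M, matched, k_t)
```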

Considering that the number of negative samples greatly affects self-supervised contrastive learning, we use the object boxes from different scenes within the same batch as additional negative samples, that is, the green squares in Fig. 2. We discuss the effect of the number of negative samples on the training results in Sect. “Validation experiment and ablation study”. The additional negative samples are spliced onto \({\hat{E}_{t + 1}}\), and \(M' \in {{\mathbb {R}}^{({k_t} + {k_{t + 1}})\times ({k_t} + {k_{t + 1}} + {k_n})}}\) is then computed to replace the original M for metric learning.
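A short sketch of how the spliced matrix \(M'\) could be formed, with the extra negatives appended only on the target side (names are illustrative):

```python
import torch
import torch.nn.functional as F

def similarity_matrix_with_extras(E_t, E_t1, E_neg):
    """Sketch of M': embeddings of objects from other videos in the batch
    (E_neg, shape (k_n, D)) are appended as extra negative columns only."""
    anchors = F.normalize(torch.cat([E_t, E_t1], dim=0), dim=1)          # (k_t + k_{t+1}, D)
    targets = F.normalize(torch.cat([E_t, E_t1, E_neg], dim=0), dim=1)   # (k_t + k_{t+1} + k_n, D)
    return anchors @ targets.t()   # M' in R^{(k_t+k_{t+1}) x (k_t+k_{t+1}+k_n)}
```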

Decoupling module

Inspired by YOLOX [16], we argue that the effectiveness of its multitask decoupling is due not only to the separate output of each head; the depth of each head also affects the effectiveness of the decoupling. We therefore propose the Simple Decoupling (SD) module. For FairMOT, which already has multi-head outputs, we simply insert the SD module before each task head to better alleviate the competition between detection and embedding. For Cstrack, which uses its own decoupling module (CCN), we replace CCN with the SD module directly.

Fig. 4 The heads of FairMOT and Cstrack before and after using the decoupling module. (a) FairMOT’s heads and the proposed decoupled head. (b) Cstrack’s heads and the proposed decoupled head, where CCN is the decoupling method proposed by Cstrack

As shown in Fig. 4, in the SD module for FairMOT we add 3\(\times \)3 convolution layers between the output feature map and each head to increase the nonlinear representation capacity of the decoupled features. To reduce computation, we unify the number of intermediate channels of the id head to 128. For Cstrack, we first reduce the number of channels of the feature maps obtained from the FPN to 256 with a 1\(\times \)1 convolution, and we do not further decouple the detection head because our focus is only on the competition between detection and embedding. We retain Cstrack’s subsequent operation of fusing the features of the three embedding heads. We conduct experiments comparing the CCN module with the SD module in terms of parameter count and effectiveness, and verify that Cstrack’s competition problem remains severe even with the CCN module.
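A minimal sketch of what such an SD branch could look like; the exact layer layout (BatchNorm, activation, one instance per task head) is an assumption based on the description above:

```python
import torch.nn as nn

class SimpleDecouple(nn.Module):
    """Sketch of an SD branch: a small stack of 3x3 convolutions inserted between
    the shared feature map and one task head, so that each head receives its own
    decoupled features. depth=2 and mid_channels=128 follow the settings in the text."""
    def __init__(self, in_channels, mid_channels=128, depth=2):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(depth):
            layers += [nn.Conv2d(c, mid_channels, kernel_size=3, padding=1),
                       nn.BatchNorm2d(mid_channels),
                       nn.ReLU(inplace=True)]
            c = mid_channels
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

# e.g., sd_det = SimpleDecouple(256); sd_id = SimpleDecouple(256)
# so that the detection head and the id-embedding head no longer share
# the convolutions immediately before their outputs.
```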

Experiments

We present the datasets used in our experiments, the evaluation metrics, and the parameter settings of our experiments in Sect. “Datasets, evaluation metrics, and details”. We describe our validation experiments and ablation experiments in Sect. “Validation experiment and ablation study”, and present our results on the MOT17 and MOT20 test datasets in Sect. “MOTChallenge results”.

Datasets, evaluation metrics, and details

Datasets and evaluation metrics

We used the MOT Challenge datasets MOT17 and MOT20. The MOT17 dataset contains a training set with 5316 frames from 7 videos and a test set with 5919 frames from another 7 videos. MOT20 is a denser dataset than MOT17. Consistent with [4, 12], all experiments except the test experiments use the first half of the MOT17 training data as the training set and the second half as the validation set. In the test experiments, we use additional datasets, such as CrowdHuman, ETH, CityPersons, CalTech, CUHK-SYSU, and PRW, in line with [4, 11, 12].

We will use the standard MOT Challenge evaluation metrics, focusing on MOTA, IDF1, MT (Mostly Tracked objects), ML (Mostly Lost objects), and IDS (Number of Identity Switches) metrics.

Details

We apply the self-supervised training and decoupling modules to FairMOT, Cstrack, and SiamMOT. To ensure a fair comparison, we keep the hyperparameters of these networks consistent. Cstrack and SiamMOT are trained for 30 epochs using the SGD optimizer. The learning rate is initialized to 5\(\times \)10\(^{-4}\) and decays to 5\(\times \)10\(^{-5}\) after 20 epochs. The weights of the detection loss and id loss are set to 1:0.02, as in the original papers. FairMOT is trained with the Adam optimizer for 30 epochs with a learning rate of 1\(\times \)10\(^{-4}\); its detection and id losses use learnable weights. All training is performed on a single Tesla V100 GPU. In self-supervised training, the second frame of each pair is randomly drawn from within 10 frames before or after the first frame, based on the video frame rate.
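For reference, a hedged sketch of these training settings; the SGD momentum value and the function names are assumptions not stated in the text:

```python
import torch

def build_optimizer(model, variant="cstrack"):
    """Illustrative reproduction of the reported hyperparameters."""
    if variant in ("cstrack", "siammot"):
        # SGD, lr 5e-4 decayed by 10x to 5e-5 at epoch 20, 30 epochs total
        opt = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9)  # momentum assumed
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[20], gamma=0.1)
        loss_weights = {"det": 1.0, "id": 0.02}
    else:  # FairMOT
        # Adam, lr 1e-4, 30 epochs, learnable (uncertainty-style) loss weights
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        sched = None
        loss_weights = "learnable"
    return opt, sched, loss_weights
```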

Fig. 5 Similarity matrices between the features extracted from the selected frame and its subsequent 1st, 5th, 10th, and 20th frames. A threshold of 0.7 is used for matching, with correct matches marked by red boxes and incorrect matches by yellow boxes. The axes represent the IDs of the objects that can be matched

Validation experiment and ablation study

In this section, we perform the validation experiments mentioned above and the ablation experiments for the self-supervised embedding training and the decoupling module.

Table 1 Effect of trained/untrained embedding branches on metrics

Validation experiment

In this section, we first verify whether the untrained embedding branch can still distinguish objects across short frame intervals and show the effect of \({L_{self}}\), which uses only same-frame information, on the matching results during training. We then verify whether the competition problem still exists in Cstrack and FairMOT. Finally, we conduct experiments on FairMOT to demonstrate that our method achieves comparable efficiency.

We feed the 28th image of the MOT17-09 sequence together with its subsequent 1st, 5th, 10th, and 20th frames into the network loaded with only COCO pretrained weights. We then calculate the similarity matrix M of the resulting embedded features and match objects based on their similarity, obtaining the results shown in Fig. 5. These results demonstrate that the untrained embedding branch still provides effective features over short intervals, and that this effect decreases as the time gap increases. To ensure that matched pairs are found during training, we randomly select the second frame from within the 10 frames before or after the first frame.

Fig. 6 The matching rate curve and correct matching rate curve when training with Eq. 4 and Eq. 8, respectively. The two values are computed separately for each image within each iteration and averaged over the entire epoch. Using Eq. 4 guarantees a high accuracy rate throughout, and the accuracy rate also strongly affects the results of \({L_{cross}}\) and \({L_{cycle}}\)

A simpler, easier-to-understand form of the \(L_{self}\) constraint directly sums the off-diagonal values of \({M_{t,t}}\) and \({M_{t + 1,t + 1}}\) as the loss, driving the similarity between objects within the same frame toward zero. We analyze the effect of this simple constraint; when it is used, Eq. 4 is replaced by Eq. 8.

$$\begin{aligned} L_{self} = \sum \limits _{j \ne i}^{0 \le i,j \le {k_t} - 1} {{m_{i,j}}} + \sum \limits _{j \ne i}^{{k_t} \le i,j \le {k_{t + 1}} + {k_t} - 1} {{m_{i,j}}} \end{aligned}$$
(8)

Figure 6 shows the average number of successful matches and correct matches per epoch during training. Using Eq. 4 maintains a relatively high correct matching rate, and the number of matches increases steadily over the training epochs. By contrast, using Eq. 8 quickly achieves a high number of matches, but its accuracy is not guaranteed. This is because Eq. 8 only pushes apart the features of objects within the current frame, which reduces the correlation between the objects of the two frames.

Both [4] and [12] mention the problem of branch competition and provide corresponding solutions. We conducted a simple experiment to verify whether the competition problem persists. Table 1 shows the results of Cstrack with and without training the embedding branch, along with the corresponding results for FairMOT. Considering that IDF1 is more sensitive to tracking and MOTA is more sensitive to detection, we use IDF1 to represent the tracking effect and MOTA to represent the detection effect. The results in Table 1 demonstrate that training the embedding branch substantially improves tracking performance in terms of IDF1 but reduces detection performance in terms of MOTA. This problem is mitigated by the proposed decoupling strategy; the results are discussed further in Sect. “Ablation study”.

Although SS-MOT is a self-supervised MOT method, it requires neither an additional feature extraction network nor a complex pseudo-label generation step. We conducted an efficiency comparison with the state-of-the-art method FairMOT [4] on the MOT17 dataset, keeping our experimental setup consistent with FairMOT [4]. Both SS-MOT and FairMOT run inference on a single Tesla V100 GPU, and the comparison results are shown in Table 2.

Table 2 Comparison of efficiency between SS-MOT and FairMOT
Table 3 Analysis of the proposed loss on the MOT17 validation set

Ablation study

We study the ablation of the losses, the number of negative samples, the temperature, and the linear assignment threshold for self-supervised embedding training, and we show visualization results. All experiments in this subsection are based on the FairMOT implementation.

Losses

Our loss consists of three sub-losses: \(L_{self}\) pushes apart the features of objects within the same frame; \(L_{cross}\) pulls together the positive sample pairs that are successfully matched in adjacent frames; and \(L_{cycle}\) ensures that the forward and reverse matching results remain consistent. Table 3 shows the effect of each loss on the validation set, where the fourth row shows the result of supervised training. As shown in Table 3, using \(L_{self}\) alone already achieves results comparable to supervision. Adding \(L_{cross}\) and \(L_{cycle}\) considerably improves IDF1 and reduces IDS, that is, it improves the embedding branch, but it also causes a drop in recall (higher FN) and a drop in MOTA, which we attribute to the competition between the embedding and detection branches.

Negative sample numbers

Given that both \(L_{cross}\) and \(L_{cycle}\) are based on the contrastive loss, whose effectiveness depends strongly on the number of negative samples, we investigated the negative sample size. Both \(L_{cross}\) and \(L_{cycle}\) constrain the successfully matched positive pairs, and the remaining objects within the current two frames are treated as negative samples. Since the MOT17 dataset is composed of multiple video segments, we can also treat the objects of different videos within the same batch as negative samples. We populate additional negative samples from these other video segments and analyze their number. Table 4 shows the results of FairMOT with different numbers of negative samples, where \({N_t}\) is the number of objects in the first frame. As shown in Table 4, more negative samples yield higher IDF1 but lower MOTA. Therefore, we finally choose \({N_{neg}}/{N_t} = 2\) to balance the two most important indicators, MOTA and IDF1.

Temperature

The self-supervised contrastive loss uses a temperature to control the weight of difficult samples. [22] set the temperature to 0.5 and noted that the optimal value differs across tasks. Therefore, we compared different fixed values of T and also an adaptive T in Table 5. From the results, \(T=1/2\) achieves the best results among fixed values, but a T obtained dynamically from the number of targets performs best overall. Therefore, we set \(T = 1/2(\log ({N_t} + {N_{t + 1}} + 1))\).
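Since the expression above is typographically ambiguous, the snippet below encodes one possible reading, \(T = 1/(2\log(N_t + N_{t+1} + 1))\), and should be treated as an assumption rather than the exact formula:

```python
import math

def adaptive_temperature(n_t, n_t1):
    """One possible reading of the adaptive temperature: T = 1 / (2 * log(N_t + N_{t+1} + 1)).
    This interpretation is an assumption; the original expression is ambiguous."""
    return 1.0 / (2.0 * math.log(n_t + n_t1 + 1))
```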

Linear assignment thresh

We compare the effect of different thresholds in Table 6, where \({N_{match}}\) and \({N_{right}}\) denote the ratio of the number of successful matches to the total number of targets and the ratio of the number of correct matches to the number of successful matches in the last training epoch, respectively. A higher threshold leads to a significant decrease in the number of successful matches without increasing the correct rate much, whereas a lower threshold increases the number of matches but decreases the correct rate more. Therefore, we set \(thresh = 0.7\) based on the final results.
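A sketch of how the two reported ratios could be computed from the matching results; the ground-truth correspondences are available only for this evaluation, and the names are ours:

```python
def match_stats(matched, gt_pairs, n_targets):
    """N_match = successful matches / total targets;
    N_right = correct matches / successful matches.
    matched: list of predicted (i, j) pairs; gt_pairs: set of ground-truth (i, j) pairs."""
    n_right = sum(1 for pair in matched if pair in gt_pairs)
    n_match_rate = len(matched) / max(n_targets, 1)
    n_right_rate = n_right / max(len(matched), 1)
    return n_match_rate, n_right_rate
```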

Ablation study for decoupling

Considering that SiamMOT uses the same model as Cstrack and that FairMOT already has decoupled outputs, we first analyze, in the supervised setting, the effect of different depths and widths of the decoupling module in Cstrack on the number of parameters and the results. We then analyze the effect of the decoupling module in the supervised and self-supervised training of FairMOT and SiamMOT.

Table 4 Analysis of additional Negative sample numbers
Table 5 Analysis of temperature
Table 6 Analysis of linear assignment thresh
Table 7 Analysis of decoupling model’s depth and channel

We study the depth and channel width of the decoupling module separately, as shown in Table 7. Most depth and channel settings achieve better results than Cstrack. The depth experiments reveal that the relationship between depth and performance is not linear: when \(depth = 3\), the metrics are significantly lower, and \(depth = 2\) is generally better than \(depth = 1\) for decoupling. We therefore choose \(depth = 2\). The channel experiments show that \(channel=256\) gives the better results. We attribute this to the fact that Cstrack’s FPN output dimensions are 1024, 512, and 256: setting the channel to 128 overcompresses the features and degrades the metrics, whereas setting it to 512 results in an excessive number of parameters and possibly redundant dimensions, which is why the gap between the 256 and 512 settings is small.

We conducted ablation experiments on the decoupling module using supervised and self-supervised training for FairMOT and SiamMOT. Table 8 shows the effect of the decoupling module in the supervised and self-supervised scenarios: the decoupling module significantly improves the metrics in all settings.

Table 8 Analysis of decoupling effect
Table 9 Comparison of our method with recent online trackers in MOT17 and MOT20 benchmark

MOTChallenge results

We submitted the results to the MOT Challenge benchmark to evaluate the performance of our proposed tracker. We compared our method with current state-of-the-art supervised and unsupervised MOT algorithms and reported the test results for MOT17 and MOT20 datasets.

We applied the self-supervised training and decoupling modules to FairMOT, Cstrack, and SiamMOT. As shown in Table 9, our self-supervised method achieves results comparable to the supervised method on both the MOT17 and MOT20 datasets. Our self-supervised training method achieves better results than all existing unsupervised methods except OUTrack, which uses an additional supervised signal. This is a strong result, especially considering the high generality of our approach within the JDE paradigm.

Conclusion

In this paper, we propose SS-MOT, a self-supervised MOT method based on the efficient JDE paradigm and trained with self-supervised contrastive learning. We introduce the SD module to decouple the multitask heads of JDE-paradigm models. Our method overcomes the challenge that most trackers require a large number of instance-level trajectory annotations for training and avoids the fitting difficulty of models that learn the embedding network as a classification task. Extensive experimental results show that SS-MOT achieves the best results among unsupervised methods on the MOT17 and MOT20 benchmarks and achieves results comparable to supervised learning. SS-MOT is a generic training method for all JDE-paradigm models; it can be applied to the training of real-time MOT models and also to online learning updates of MOT models.