Towards Frame Rate Agnostic Multi-object Tracking

Abstract

Multi-object Tracking (MOT) is one of the most fundamental computer vision tasks and contributes to various video analysis applications. Despite the recent promising progress, current MOT research is still limited to a fixed sampling frame rate of the input stream: existing trackers are neither as flexible as humans nor well matched to industrial scenarios, which require trackers to be insensitive to frame rate under complicated conditions. In fact, we empirically found that the accuracy of all recent state-of-the-art trackers drops dramatically when the input frame rate changes. Toward a more intelligent tracking solution, we shift the attention of our research work to the problem of Frame Rate Agnostic MOT (FraMOT), which takes frame rate insensitivity into consideration. In this paper, we propose a Frame Rate Agnostic MOT framework with a Periodic training Scheme (FAPS) to tackle the FraMOT problem for the first time. Specifically, we propose a Frame Rate Agnostic Association Module (FAAM) that infers and encodes frame rate information to aid identity matching across multi-frame-rate inputs, improving the capability of the learned model in handling complex motion-appearance relations in FraMOT. Moreover, the association gap between training and inference is enlarged in FraMOT because post-processing steps not included in training make a larger difference in lower frame rate scenarios. To address this, we propose a Periodic Training Scheme (PTS) that reflects all post-processing steps in training via tracking pattern matching and fusion. Along with the proposed approaches, we make the first attempt to establish an evaluation method for this new task. Besides providing simulations and evaluation metrics, we address the new challenges in two different modes, i.e., known frame rate and unknown frame rate, aiming to handle more complex situations. Quantitative experiments on the challenging MOT17/MOT20 datasets (FraMOT versions) clearly demonstrate that the proposed approaches handle different frame rates better and thus improve robustness in complicated scenarios.
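
To make the FraMOT setting concrete, the sketch below illustrates how multi-frame-rate inputs can be simulated from a normal frame rate sequence by temporal sub-sampling with a factor \(k\); the function names and example factors are our own illustration, not the paper's released code.

```python
# Minimal sketch (our own naming and factors, not the paper's released code):
# simulating FraMOT inputs by sub-sampling a normal frame-rate sequence with
# a sampling factor k, so one tracker can be evaluated at several frame rates.
from typing import Iterator, List, Tuple


def subsample(frames: List, k: int) -> List:
    """Keep every k-th frame to emulate a 1/k frame-rate input stream."""
    return frames[::k]


def framot_inputs(
    frames: List, factors: Tuple[int, ...] = (1, 5, 10, 25, 50)
) -> Iterator[Tuple[int, List]]:
    """Yield (sampling factor, sub-sampled sequence) pairs for evaluation."""
    for k in factors:
        yield k, subsample(frames, k)
```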

Data Availability

All data used in this paper are publicly available on the corresponding websites. MOT17/MOT20: motchallenge.net; CrowdHuman: www.crowdhuman.org; Cityscapes: www.cityscapes-dataset.com; HIE: humaninevents.org; SOMPT22: sompt22.github.io.

References

  • Bernardin, K., & Stiefelhagen, R. (2008). Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008, 1.

  • Brasó, G., & Leal-Taixé, L. (2020). Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6247–6257).

  • Chu, P., & Ling, H. (2019). FAMNet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In The IEEE International Conference on Computer Vision (ICCV).

  • Chu, Q., Ouyang, W., Li, H., Wang, X., Liu, B., & Yu, N. (2017). Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism. In ICCV.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., et al. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  • Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., & Leal-Taixe, L. (2020). MOT20: A benchmark for multi object tracking in crowded scenes. arXiv:2003.09003

  • Dendorfer, P., Osep, A., Milan, A., Schindler, K., Cremers, D., Reid, I., Roth, S., & Leal-Taixe, L. (2021). Motchallenge: A benchmark for single-camera multiple target tracking. International Journal of Computer Vision, 129(4), 845–881.

  • Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). YOLOX: Exceeding YOLO series in 2021. arXiv:2107.08430

  • Han, T., Bai, L., Gao, J., Wang, Q., & Ouyang, W. (2022). DR.VIC: Decomposition and reasoning for video individual counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3083–3092).

  • Hornakova, A., Henschel, R., Rosenhahn, B., & Swoboda, P. (2020). Lifted disjoint paths with application in multiple object tracking. In International Conference on Machine Learning, PMLR (pp. 4364–4375).

  • Hu, W., Shi, X., Zhou, Z., & Xing, J. (2020). Dual L1-normalized context aware tensor power iteration and its applications to multi-object tracking and multi-graph matching. International Journal of Computer Vision, 128(2), 360–392.

  • Kieritz, H., Hubner, W., & Arens, M. (2018). Joint detection and online multi-object tracking. In CVPRW.

  • Li, J., Gao, X., & Jiang, T. (2020). Graph networks for multiple object tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 719–728).

  • Lin, W., Liu, H., Liu, S., Li, Y., Qian, R., Wang, T., Xu, N., Xiong, H., Qi, G.-J., & Sebe, N. (2020). Human in events: A large-scale benchmark for human-centric video analysis in complex events. arXiv:2005.04490

  • Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixe, L., & Leibe, B. (2021). HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 129(2), 548–578.

  • Ma, C., Yang, F., Li, Y., Jia, H., Xie, X., & Gao, W. (2021). Deep trajectory post-processing and position projection for single & multiple camera multiple object tracking. International Journal of Computer Vision, 129(12), 3255–3278.

  • Maksai, A., & Fua, P. (2019). Eliminating exposure bias and metric mismatch in multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4639–4648).

  • Milan, A., Leal-Taixé, L., Reid, I., Roth, S., & Schindler, K. (2016). MOT16: A benchmark for multi-object tracking. arXiv:1603.00831

  • Milan, A., Rezatofighi, S. H., Dick, A. R., Reid, I., & Schindler, K. (2017). Online multi-target tracking using recurrent neural networks. In AAAI.

  • Ristani, E., Solera, F., Zou, R., Cucchiara, R., & Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision (pp. 17–35). Springer.

  • Sadeghian, A., Alahi, A., & Savarese, S. (2017). Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In ICCV.

  • Saleh, F., Aliakbarian, S., Rezatofighi, H., & Salzmann, M. (2021). Probabilistic tracklet scoring and inpainting for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14329–14339).

  • Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., & Sun, J. (2018). CrowdHuman: A benchmark for detecting human in a crowd. arXiv:1805.00123

  • Simsek, F. E., Cigla, C., & Kayabol, K. (2023). SOMPT22: A surveillance oriented multi-pedestrian tracking dataset. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part V (pp. 659–675). Springer.

  • Takala, V., & Pietikainen, M. (2007). Multi-object tracking using color, texture and motion. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE (pp. 1–7).

  • Tang, S., Andres, B., Andriluka, M., & Schiele, B. (2016). Multi-person tracking by multicut and deep matching. In ECCV.

  • Tang, S., Andriluka, M., Andres, B., & Schiele, B. (2017). Multiple people tracking by lifted multicut and person reidentification. In CVPR.

  • Wang, Y., Kitani, K., & Weng, X. (2021). Joint object detection and multi-object tracking with graph neural networks. In 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE (pp. 13708–13715).

  • Wen, L., Lei, Z., Chang, M. C., Qi, H., & Lyu, S. (2017). Multi-camera multi-target tracking with space-time-view hyper-graph. International Journal of Computer Vision, 122(2), 313–333.

  • Wojke, N., Bewley, A., & Paulus, D. (2017). Simple online and realtime tracking with a deep association metric. In ICIP.

  • Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., & Yuan, J. (2021). Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12352–12361).

  • Xu, J., Cao, Y., Zhang, Z., & Hu, H. (2019). Spatial-temporal relation networks for multi-object tracking. In ICCV.

  • Yoon, J. H., Lee, C. R., Yang, M. H., & Yoon, K.-J. (2019). Structural constraint data association for online multi-object tracking. International Journal of Computer Vision, 127(1), 1–21.

  • Yu, F., Li, W., Li, Q., Liu, Y., Shi, X., & Yan, J. (2016). POI: Multiple object tracking with high performance detection and appearance feature. In ECCV.

  • Zhang, L., Li, Y., & Nevatia, R. (2008). Global data association for multi-object tracking using network flows. In CVPR.

  • Zhang, Y., Wang, C., Wang, X., Zeng, W., & Liu, W. (2021). FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11), 3069–3087.

  • Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., & Wang, X. (2022). ByteTrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision (ECCV).

  • Zhou, X., Koltun, V., & Krähenbühl, P. (2020). Tracking objects as points. In ECCV.

Acknowledgements

This work is partially supported by the National Key R&D Program of China (No. 2022ZD0160100) and by Shanghai AI Laboratory. Wanli Ouyang was supported by the Australian Research Council Grant DP200103223, the Australian Medical Research Future Fund MRFAI000085, the CRC-P Smart Material Recovery Facility (SMRF) - Curby Soft Plastics, and the CRC-P ARIA - Bionic Visual-Spatial Prosthesis for the Blind.

Author information

Corresponding author

Correspondence to Lei Bai.

Additional information

Communicated by Matej Kristan.

Appendix A: Tracking Results Demo

Fig. 12: Tracking results on testing sequence MOT20-06, with \(k=25\). For better demonstration, only 20 trajectories are shown.

Fig. 13: Tracking results on testing sequence MOT20-04, with \(k=50\).

Figures 12 and 13 show selected tracking results of our approach on the MOT20 testing set. Different colors and numbers represent different trajectories. At lower frame rates (i.e., sampling factor \(k\ge 25\)), our method maintains stable performance. More importantly, all results come from the same model checkpoint, showing that our method handles various frame rates robustly. In these scenarios, targets move much farther between adjacent frames than they do at normal frame rates.
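
For reference, the sampling factor \(k\) maps directly to an effective frame rate; the helper below is our own illustration (MOT20 sequences are recorded at roughly 25 fps), not code from the paper.

```python
# Our own helper, not from the paper: effective frame rate after sub-sampling
# with factor k. MOT20 sequences are recorded at roughly 25 fps, so k = 25
# yields about one frame per second and k = 50 one frame every two seconds.
def effective_fps(original_fps: float, k: int) -> float:
    return original_fps / k


print(effective_fps(25.0, 25))  # 1.0 fps
print(effective_fps(25.0, 50))  # 0.5 fps
```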

In Fig. 13, targets 2, 6, and 19 have little bounding box overlap between adjacent frames, leading to less reliable motion cues. At normal frame rates, such a large motion gap between adjacent frames usually indicates a different identity, which is quite different from lower frame rate scenarios. Thanks to the FAAM design and the PTS strategy, our method is able to make correct predictions in these multi-frame-rate settings simultaneously.
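
To see why the motion cue degrades, the toy example below (our own illustration, not the paper's FAAM implementation) computes the IoU between a target's boxes in adjacent frames; once the inter-frame displacement exceeds the box width, the IoU collapses to zero, so purely IoU-gated association would discard the true match.

```python
# Toy example (ours, not the paper's FAAM code): IoU between a target's boxes
# in adjacent frames. At low frame rates the displacement can exceed the box
# width, IoU drops to zero, and association must rely more on appearance.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)


prev_box = (100, 100, 150, 200)    # target at frame t
curr_box = (170, 100, 220, 200)    # same target at frame t + k, 70 px away
print(iou(prev_box, curr_box))     # 0.0 -> pure IoU gating loses this match
```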

Cite this article

Feng, W., Bai, L., Yao, Y. et al. Towards Frame Rate Agnostic Multi-object Tracking. Int J Comput Vis 132, 1443–1462 (2024). https://doi.org/10.1007/s11263-023-01943-2
