On the Relevance of Temporal Features for Medical Ultrasound Video Recognition

  • Conference paper
  • In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 (MICCAI 2023)

Abstract

Many medical ultrasound video recognition tasks involve identifying key anatomical features regardless of when they appear in the video, suggesting that modeling such tasks may not benefit from temporal features. Correspondingly, model architectures that exclude temporal features may have better sample efficiency. We propose a novel multi-head attention architecture that incorporates these hypotheses as inductive priors to achieve better sample efficiency on common ultrasound tasks. We compare the performance of our architecture to an efficient 3D CNN video recognition model in two settings: one where we expect not to require temporal features and one where we do. In the former setting, our model outperforms the 3D CNN, especially when we artificially limit the training data. In the latter, the outcome reverses. These results suggest that expressive time-independent models may be more effective than state-of-the-art video recognition models for some common ultrasound tasks in the low-data regime. Code is available at https://github.com/MedAI-Clemson/pda_detection.
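To make the core architectural idea concrete, the sketch below shows one way a time-independent video classifier can be built: each frame is embedded independently, and a multi-head attention layer with a single learned query pools the frame embeddings without any positional encoding, so the output is unchanged if key frames are shuffled in time. This is a minimal illustration, not the authors' released implementation (see the repository above for that); the toy frame encoder, layer sizes, and class count are all assumptions.

```python
# Minimal sketch of a time-independent video classifier (illustrative only,
# not the authors' code). Frames are embedded independently and pooled by
# multi-head attention with no positional encoding, so predictions do not
# depend on when key frames appear in the clip.
import torch
import torch.nn as nn


class TimeIndependentVideoClassifier(nn.Module):
    def __init__(self, embed_dim: int = 256, num_heads: int = 4, num_classes: int = 2):
        super().__init__()
        # Hypothetical per-frame encoder; any 2D CNN backbone would do here.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # One learned query attends over the frame embeddings. With no
        # positional encoding, the pooled result is permutation-invariant
        # across the time axis.
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        frames = self.frame_encoder(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        pooled, _ = self.attn(self.query.expand(b, -1, -1), frames, frames)
        return self.head(pooled.squeeze(1))  # (batch, num_classes)


# Usage: logits for two 8-frame clips; shuffling the frame order of a clip
# leaves its logits unchanged.
model = TimeIndependentVideoClassifier()
clip = torch.randn(2, 8, 3, 64, 64)
print(model(clip).shape)  # torch.Size([2, 2])
```

Discarding temporal structure in this way is the trade-off the abstract describes: such a model cannot represent motion, but it also has fewer ways to overfit a small ultrasound dataset.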

Acknowledgement

We thank Clemson University for their generous allotment of compute time on the Palmetto Cluster.

Author information

Corresponding author

Correspondence to D. Hudson Smith.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2688 KB)

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Smith, D.H., Lineberger, J.P., Baker, G.H. (2023). On the Relevance of Temporal Features for Medical Ultrasound Video Recognition. In: Greenspan, H., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, vol 14221. Springer, Cham. https://doi.org/10.1007/978-3-031-43895-0_70

  • DOI: https://doi.org/10.1007/978-3-031-43895-0_70

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43894-3

  • Online ISBN: 978-3-031-43895-0

  • eBook Packages: Computer Science, Computer Science (R0)
