Multimodal Depression Recognition Using Audio and Visual

Conference paper

Applied Intelligence (ICAI 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2014)


Abstract

Depression, one of the most prominent challenges in global mental health, affects the quality of life and psychological well-being of hundreds of millions of people. Its high prevalence, frequent recurrence, and strong association with other health problems make early diagnosis and treatment crucial. With advances in technology, audio and visual data are increasingly recognized as biomarkers for identifying depression. However, many existing studies focus on a single modality and overlook the potential complementarity between modalities. This study therefore proposes an approach that integrates convolutional neural networks (CNN) and bidirectional long short-term memory networks (BiLSTM) with attention mechanisms to extract deeper features from speech data. For facial expressions, a hybrid model comprising temporal convolutional networks (TCN) and long short-term memory networks (LSTM) is used. To integrate the two modalities, we design a cross-attention fusion strategy that combines speech and facial information within a unified framework. Experiments on the E-DAIC dataset confirm the efficacy of our method: the multimodal fusion strategy detects depression with higher precision and reliability than either single modality.
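
To make the pipeline concrete, the following is a minimal PyTorch sketch of the three components named in the abstract: a CNN + BiLSTM audio branch with an attention mechanism, a TCN + LSTM visual branch, and a cross-attention fusion head. All layer sizes, feature dimensions (40-dim speech frames, 49-dim facial features), and the binary classification head are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn


class AudioBranch(nn.Module):
    """CNN front-end + BiLSTM, with self-attention over the time axis."""

    def __init__(self, n_feats=40, hidden=128):
        super().__init__()
        # 1-D convolutions over time capture local spectral patterns.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(128, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)

    def forward(self, x):                                # x: (B, T_a, n_feats)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # (B, T_a, 128)
        h, _ = self.bilstm(h)                            # (B, T_a, 256)
        h, _ = self.attn(h, h, h)                        # attention reweights time steps
        return h


class TCNBlock(nn.Module):
    """Dilated temporal convolution with a residual connection."""

    def __init__(self, ch, dilation):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, kernel_size=3,
                              padding=dilation, dilation=dilation)

    def forward(self, x):                                # x: (B, ch, T)
        return torch.relu(x + self.conv(x))


class VisualBranch(nn.Module):
    """TCN front-end + LSTM over per-frame facial features."""

    def __init__(self, n_feats=49, d=256):
        super().__init__()
        self.proj = nn.Conv1d(n_feats, d, kernel_size=1)
        # Increasing dilations widen the temporal receptive field.
        self.tcn = nn.Sequential(TCNBlock(d, 1), TCNBlock(d, 2), TCNBlock(d, 4))
        self.lstm = nn.LSTM(d, d, batch_first=True)

    def forward(self, x):                                # x: (B, T_v, n_feats)
        h = self.tcn(self.proj(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)                              # (B, T_v, 256)
        return h


class CrossAttentionFusion(nn.Module):
    """Each modality queries the other; pooled results are concatenated."""

    def __init__(self, d=256, n_classes=2):
        super().__init__()
        self.a2v = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.v2a = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * d, n_classes)

    def forward(self, a, v):              # a: (B, T_a, d), v: (B, T_v, d)
        av, _ = self.a2v(a, v, v)         # audio queries attend to visual steps
        va, _ = self.v2a(v, a, a)         # visual queries attend to audio steps
        fused = torch.cat([av.mean(dim=1), va.mean(dim=1)], dim=-1)
        return self.head(fused)           # assumed binary depression logits


class MultimodalDepressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio, self.visual = AudioBranch(), VisualBranch()
        self.fusion = CrossAttentionFusion()

    def forward(self, speech, face):
        return self.fusion(self.audio(speech), self.visual(face))


if __name__ == "__main__":
    speech = torch.randn(2, 300, 40)      # 300 frames of 40-dim speech features
    face = torch.randn(2, 150, 49)        # 150 frames of facial AU/landmark features
    print(MultimodalDepressionNet()(speech, face).shape)  # torch.Size([2, 2])
```

In this sketch, cross-attention lets each modality query the other's time steps before pooling, so a vocal cue can be weighted by what the face is doing at the corresponding moment. This is one plausible reading of the fusion strategy described above; the paper's exact design may differ.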



Acknowledgement

The authors acknowledge the Key Research and Development Plan of Anhui Province (202104d07020006), the Natural Science Foundation of Anhui Province (2108085MF223), the University Natural Sciences Research Project of Anhui Province (KJ2021A0991), and the Key Research and Development Plan of Hefei (2021GJ030).

Author information

Corresponding author

Correspondence to Xia Xu.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Xu, X., Zhang, G., Mao, X., Lu, Q. (2024). Multimodal Depression Recognition Using Audio and Visual. In: Huang, DS., Premaratne, P., Yuan, C. (eds) Applied Intelligence. ICAI 2023. Communications in Computer and Information Science, vol 2014. Springer, Singapore. https://doi.org/10.1007/978-981-97-0903-8_22


  • DOI: https://doi.org/10.1007/978-981-97-0903-8_22

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0902-1

  • Online ISBN: 978-981-97-0903-8

  • eBook Packages: Computer Science, Computer Science (R0)
