Multimodal Depression Recognition Using Audio and Visual

Conference paper

Applied Intelligence (ICAI 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2014)


Abstract

Depression, one of the most prominent challenges in global mental health, affects the quality of life and psychological well-being of hundreds of millions of people. Its high prevalence, frequent recurrence, and strong association with other health problems make early diagnosis and treatment crucial. With advances in technology, audio and visual data are increasingly recognized as biomarkers for identifying depression. However, many existing studies focus on a single modality and overlook the potential complementarity between modalities. This study therefore proposes an approach that integrates convolutional neural networks (CNN) and bidirectional long short-term memory networks (BiLSTM) with attention mechanisms to extract deeper features from speech data. For facial expressions, a hybrid model comprising temporal convolutional networks (TCN) and long short-term memory networks (LSTM) is used. To integrate the two modalities, we design a cross-attention fusion strategy that combines speech and facial information within a unified framework. Experiments on the E-DAIC dataset confirm the efficacy of our method: the multimodal fusion strategy detects depression with higher precision and reliability than either single modality.
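
To make the pipeline concrete, the following is a minimal PyTorch sketch of the three components named in the abstract: a CNN + BiLSTM audio branch with an attention mechanism, a TCN + LSTM visual branch, and a cross-attention fusion head. All layer sizes, feature dimensions (40-dim speech frames, 49-dim facial features), and the binary classification head are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn


class AudioBranch(nn.Module):
    """CNN front-end + BiLSTM, with self-attention over the time axis."""

    def __init__(self, n_feats=40, hidden=128):
        super().__init__()
        # 1-D convolutions over time capture local spectral patterns.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(128, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)

    def forward(self, x):                                # x: (B, T_a, n_feats)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # (B, T_a, 128)
        h, _ = self.bilstm(h)                            # (B, T_a, 256)
        h, _ = self.attn(h, h, h)                        # attention reweights time steps
        return h


class TCNBlock(nn.Module):
    """Dilated temporal convolution with a residual connection."""

    def __init__(self, ch, dilation):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, kernel_size=3,
                              padding=dilation, dilation=dilation)

    def forward(self, x):                                # x: (B, ch, T)
        return torch.relu(x + self.conv(x))


class VisualBranch(nn.Module):
    """TCN front-end + LSTM over per-frame facial features."""

    def __init__(self, n_feats=49, d=256):
        super().__init__()
        self.proj = nn.Conv1d(n_feats, d, kernel_size=1)
        # Increasing dilations widen the temporal receptive field.
        self.tcn = nn.Sequential(TCNBlock(d, 1), TCNBlock(d, 2), TCNBlock(d, 4))
        self.lstm = nn.LSTM(d, d, batch_first=True)

    def forward(self, x):                                # x: (B, T_v, n_feats)
        h = self.tcn(self.proj(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)                              # (B, T_v, 256)
        return h


class CrossAttentionFusion(nn.Module):
    """Each modality queries the other; pooled results are concatenated."""

    def __init__(self, d=256, n_classes=2):
        super().__init__()
        self.a2v = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.v2a = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * d, n_classes)

    def forward(self, a, v):              # a: (B, T_a, d), v: (B, T_v, d)
        av, _ = self.a2v(a, v, v)         # audio queries attend to visual steps
        va, _ = self.v2a(v, a, a)         # visual queries attend to audio steps
        fused = torch.cat([av.mean(dim=1), va.mean(dim=1)], dim=-1)
        return self.head(fused)           # assumed binary depression logits


class MultimodalDepressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio, self.visual = AudioBranch(), VisualBranch()
        self.fusion = CrossAttentionFusion()

    def forward(self, speech, face):
        return self.fusion(self.audio(speech), self.visual(face))


if __name__ == "__main__":
    speech = torch.randn(2, 300, 40)      # 300 frames of 40-dim speech features
    face = torch.randn(2, 150, 49)        # 150 frames of facial AU/landmark features
    print(MultimodalDepressionNet()(speech, face).shape)  # torch.Size([2, 2])
```

In this sketch, cross-attention lets each modality query the other's time steps before pooling, so a vocal cue can be weighted by what the face is doing at the corresponding moment. This is one plausible reading of the fusion strategy described above; the paper's exact design may differ.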



Acknowledgement

The authors acknowledge the Key Research and Development Plan of Anhui Province (202104d07020006), the Natural Science Foundation of Anhui Province (2108085MF223), the University Natural Sciences Research Project of Anhui Province (KJ2021A0991), and the Key Research and Development Plan of Hefei (2021GJ030).

Author information

Corresponding author

Correspondence to Xia Xu.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Xu, X., Zhang, G., Mao, X., Lu, Q. (2024). Multimodal Depression Recognition Using Audio and Visual. In: Huang, DS., Premaratne, P., Yuan, C. (eds) Applied Intelligence. ICAI 2023. Communications in Computer and Information Science, vol 2014. Springer, Singapore. https://doi.org/10.1007/978-981-97-0903-8_22


  • DOI: https://doi.org/10.1007/978-981-97-0903-8_22

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0902-1

  • Online ISBN: 978-981-97-0903-8

  • eBook Packages: Computer Science, Computer Science (R0)
