Abstract
Humans are arguably innately prepared to comprehend others’ emotional expressions from subtle body movements. If robots or computers can be endowed with this capability, a number of robotic applications become possible. Automatically recognizing human bodily expression in unconstrained situations, however, is daunting given our incomplete understanding of the relationship between emotional expressions and body movements. The current research, a multidisciplinary effort among computer and information sciences, psychology, and statistics, proposes a scalable and reliable crowdsourcing approach for collecting in-the-wild perceived emotion data so that computers can learn to recognize the body language of humans. To accomplish this task, a large and growing annotated dataset, named the Body Language Dataset (BoLD), has been created with 9876 video clips of body movements and 13,239 human characters. Comprehensive statistical analysis of the dataset revealed many interesting insights. A system that models emotional expression based on bodily movement, named Automated Recognition of Bodily Expression of Emotion (ARBEE), has also been developed and evaluated. Our analysis shows the effectiveness of Laban Movement Analysis (LMA) features in characterizing arousal, and our experiments using LMA features further demonstrate the computability of bodily expression. We also report and compare results from several other baseline methods, originally developed for action recognition, based on two different modalities: body skeleton and raw image. The dataset and findings presented in this work will likely serve as a launchpad for future discoveries in body language understanding, enabling future robots to interact and collaborate more effectively with humans.
References
Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
Aristidou, A., Charalambous, P., & Chrysanthou, Y. (2015). Emotion analysis and classification: understanding the performers’ emotions using the lma entities. Computer Graphics Forum, 34(6), 262–276.
Aristidou, A., Zeng, Q., Stavrakis, E., Yin, K., Cohen-Or, D., Chrysanthou, Y., & Chen, B. (2017). Emotion control of unstructured dance movements. In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, article 9.
Aviezer, H., Trope, Y., & Todorov, A. (2012). Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science, 338(6111), 1225–1229.
Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. Proceedings of the IEEE International Conference on Image Processing, pp. 3464–3468. https://doi.org/10.1109/ICIP.2016.7533003.
Biel, J. I., & Gatica-Perez, D. (2013). The youtube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs. IEEE Transactions on Multimedia, 15(1), 41–55.
Caba Heilbron, F., Escorcia, V., Ghanem, B., & Carlos Niebles, J. (2015). Activitynet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970.
Cao, Z., Simon, T., Wei, S.E., & Sheikh, Y. (2017). Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299.
Carmichael, L., Roberts, S., & Wessell, N. (1937). A study of the judgment of manual expression as presented in still and motion pictures. The Journal of Social Psychology, 8(1), 115–142.
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 4724–4733.
Noroozi, F., Kaminska, D., Corneanu, C., Sapinski, T., Escalera, S., & Anbarjafari, G. (2018). Survey on emotional body gesture recognition. IEEE Transactions on Affective Computing. https://doi.org/10.1109/TAFFC.2018.2874986.
Dael, N., Mortillaro, M., & Scherer, K. R. (2012). Emotion expression in body action and posture. Emotion, 12(5), 1085.
Dalal, N., Triggs, B., & Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. In: Proceedings of the European Conference on Computer Vision, Springer, pp. 428–441.
Datta, R., Joshi, D., Li, J., & Wang, J.Z. (2006). Studying aesthetics in photographic images using a computational approach. In: European conference on computer vision, Springer, pp. 288–301.
Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the em algorithm. Applied Statistics, 28, 20–28.
De Gelder, B. (2006). Towards the neurobiology of emotional body language. Nature Reviews Neuroscience, 7(3), 242–249.
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
Douglas-Cowie, E., Cowie, R., Sneddon, I., Cox, C., Lowry, L., McRorie, M., Martin, J. C., Devillers, L., Abrilian, S., Batliner, A., et al. (2007). The humaine database: addressing the needs of the affective computing community. In: Proceedings of the International Conference on Affective Computing and Intelligent Interaction, pp. 488–500.
Ekman, P. (1992). Are there basic emotions? Psychological Review, 99(3), 550–553.
Ekman, P. (1993). Facial expression and emotion. American Psychologist, 48(4), 384.
Ekman, P., & Friesen, W. V. (1977). Facial Action Coding System: A technique for the measurement of facial movement. Palo Alto: Consulting Psychologists Press.
Ekman, P., & Friesen, W. V. (1986). A new pan-cultural facial expression of emotion. Motivation and Emotion, 10(2), 159–168.
Eleftheriadis, S., Rudovic, O., & Pantic, M. (2015). Discriminative shared gaussian processes for multiview and view-invariant facial expression recognition. IEEE Transactions on Image Processing, 24(1), 189–204.
Fabian Benitez-Quiroz, C., Srinivasan, R., & Martinez, A.M. (2016). Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5562–5570.
Gu, C., Sun, C., Ross, D.A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., et al. (2018). Ava: A video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056.
Gunes, H., & Piccardi, M. (2005). Affect recognition from face and body: early fusion vs. late fusion. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 4, 3437–3443.
Gunes, H., & Piccardi, M. (2007). Bi-modal emotion recognition from expressive face and body gestures. Journal of Network and Computer Applications, 30(4), 1334–1345.
Gwet, K.L. (2014). Handbook of Inter-rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics, LLC.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Iqbal, U., Milan, A., & Gall, J. (2017). Posetrack: Joint multi-person pose estimation and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2011–2020.
Kantorov, V., & Laptev, I. (2014). Efficient feature extraction, encoding and classification for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2593–2600.
Karg, M., Samadani, A. A., Gorbet, R., Kühnlenz, K., Hoey, J., & Kulić, D. (2013). Body movements for affective expression: A survey of automatic recognition and generation. IEEE Transactions on Affective Computing, 4(4), 341–359.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., & Natsev, P., et al. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
Kipf, T.N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Kleinsmith, A., & Bianchi-Berthouze, N. (2013). Affective body expression perception and recognition: A survey. IEEE Transactions on Affective Computing, 4(1), 15–33.
Kleinsmith, A., De Silva, P. R., & Bianchi-Berthouze, N. (2006). Cross-cultural differences in recognizing affect from body posture. Interacting with Computers, 18(6), 1371–1389.
Kleinsmith, A., Bianchi-Berthouze, N., & Steed, A. (2011). Automatic recognition of non-acted affective postures. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(4), 1027–1038.
Kosti, R., Alvarez, J.M., Recasens, A., & Lapedriza, A. (2017). Emotion recognition in context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1667–1675.
Krakovsky, M. (2018). Artificial (emotional) intelligence. Communications of the ACM, 61(4), 18–19.
Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.
Laban, R., & Ullmann, L. (1971). The Mastery of Movement. London: Macdonald & Evans.
Lu, X., Suryanarayan, P., Adams Jr, R.B., Li, J., Newman, M.G., & Wang, J.Z. (2012). On shape and the computability of emotions. In: Proceedings of the ACM International Conference on Multimedia, ACM, pp. 229–238.
Lu, X., Adams Jr, R.B., Li, J., Newman, M.G., Wang, J.Z. (2017). An investigation into three visual characteristics of complex scenes that evoke human emotion. In: Proceedings of the International Conference on Affective Computing and Intelligent Interaction, pp. 440–447.
Luvizon, D.C., Picard, D., & Tabia, H. (2018). 2d/3d pose estimation and action recognition using multitask deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5137–5146.
Martinez, J., Hossain, R., Romero, J., & Little, J.J. (2017). A simple yet effective baseline for 3d human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649.
Meeren, H. K., van Heijnsbergen, C. C., & de Gelder, B. (2005). Rapid perceptual integration of facial expression and emotional body language. Proceedings of the National Academy of Sciences of the United States of America, 102(45), 16518–16523.
Mehrabian, A. (1980). Basic dimensions for a general psychological theory: Implications for personality, social, environmental, and developmental studies. Cambridge: The MIT Press.
Mehrabian, A. (1996). Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Current Psychology, 14(4), 261–292.
Nicolaou, M. A., Gunes, H., & Pantic, M. (2011). Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Transactions on Affective Computing, 2(2), 92–105.
Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
Potapov, D., Douze, M., Harchaoui, Z., & Schmid, C. (2014). Category-specific video summarization. In: European Conference on Computer Vision, Springer, pp. 540–555.
Ruggero Ronchi, M., & Perona, P. (2017). Benchmarking and error diagnosis in multi-instance pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 369–378.
Schindler, K., Van Gool, L., & de Gelder, B. (2008). Recognizing emotions expressed by body pose: A biologically inspired neural model. Neural Networks, 21(9), 1238–1246.
Shiffrar, M., Kaiser, M.D., & Chouchourelou, A. (2011). Seeing human movement as inherently social. The Science of Social Vision, pp. 248–264.
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576.
Soomro, K., Zamir, A.R., & Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402v1.
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., et al. (2016). Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2), 64–73.
Towns, J., Cockerill, T., Dahan, M., Foster, I., Gaither, K., Grimshaw, A., et al. (2014). Xsede: Accelerating scientific discovery. Computing in Science & Engineering, 16(5), 62–74.
Wakabayashi, A., Baron-Cohen, S., Wheelwright, S., Goldenfeld, N., Delaney, J., Fine, D., et al. (2006). Development of short forms of the empathy quotient (eq-short) and the systemizing quotient (sq-short). Personality and Individual Differences, 41(5), 929–940.
Wallbott, H. G. (1998). Bodily expression of emotion. European Journal of Social Psychology, 28(6), 879–896.
Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3551–3558.
Wang, H., Kläser, A., Schmid, C., & Liu, C.L. (2011). Action recognition by dense trajectories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3169–3176.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In: European Conference on Computer Vision, Springer, pp. 20–36.
Xu, F., Zhang, J., & Wang, J. Z. (2017). Microexpression identification and categorization using a facial dynamics map. IEEE Transactions on Affective Computing, 8(2), 254–267.
Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence.
Ye, J., Li, J., Newman, M. G., Adams, R. B., & Wang, J. Z. (2019). Probabilistic multigraph modeling for improving the quality of crowdsourced affective data. IEEE Transactions on Affective Computing, 10(1), 115–128.
Zach, C., Pock, T., & Bischof, H. (2007). A duality based approach for realtime TV-L1 optical flow. In: Proceedings of the Joint Pattern Recognition Symposium, Springer, pp. 214–223.
Acknowledgements
This material is based upon work supported in part by The Pennsylvania State University. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation Grant No. ACI-1548562 (Towns et al. 2014). The work was also supported through a GPU gift from the NVIDIA Corporation. The authors are grateful to the thousands of Amazon Mechanical Turk independent contractors for their time and dedication in providing invaluable emotion ground truth labels for the video collection. Hanjoo Kim contributed in some of the discussions. Jeremy Yuya Ong supported the data collection and visualization effort. We thank Amazon.com, Inc. for supporting the expansion of this line of research.
Cite this article
Luo, Y., Ye, J., Adams, R.B. et al. ARBEE: Towards Automated Recognition of Bodily Expression of Emotion in the Wild. Int J Comput Vis 128, 1–25 (2020). https://doi.org/10.1007/s11263-019-01215-y