Emotion Aware Reinforcement Network for Visual Storytelling

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2022 (ICANN 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13530)


Abstract

Visual storytelling is the task of generating a sequence of human-like sentences (i.e., a story) for an ordered stream of images. Unlike traditional image captioning, the story contains not only factual descriptions but also concepts and objects that do not explicitly appear in the input images. Recent works utilize either end-to-end or multi-stage frameworks to produce more relevant and coherent stories, but they usually ignore latent emotional information. In this work, to generate an affective story, we propose an Emotion Aware Reinforcement Network for VIsual StoryTelling (EARN-VIST). Specifically, our network leverages lexicon-based attention to encourage the model to attend to emotional words. We then apply two emotional-consistency reinforcement learning rewards, computed with an emotion classifier and a commonsense transformer respectively, to measure the gap between the generated story and the human-labeled story and thereby refine the generation process. Experimental results on the VIST dataset and human evaluation demonstrate that our model outperforms most cutting-edge models across multiple evaluation metrics.
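The abstract's reward recipe can be made concrete with a small sketch: score how often the generated story's per-sentence emotions agree with the gold story's, then use that agreement as a sequence-level reward. Everything below is illustrative, not the authors' implementation: `classify` is a toy lexicon stand-in for the paper's fast-bert emotion classifier (the COMET-based reward is analogous), and `reinforce_loss` is a generic REINFORCE-with-baseline formulation in the spirit of sequence-level training [18].

```python
# Hedged sketch of an emotional-consistency RL reward. The tiny lexicon and
# the stand-in classifier exist only so the example runs end to end; the
# paper uses a trained fast-bert emotion classifier and COMET instead.

EMOTION_LEXICON = {
    "furious": "anger", "terrified": "fear",
    "wept": "sadness", "sad": "sadness",
    "delighted": "happiness", "happy": "happiness",
}

def classify(sentence: str) -> str:
    """Toy stand-in classifier: first lexicon emotion found, else 'neutral'."""
    for word in sentence.lower().split():
        if word in EMOTION_LEXICON:
            return EMOTION_LEXICON[word]
    return "neutral"

def emotion_consistency_reward(generated: list[str], gold: list[str]) -> float:
    """Fraction of sentences whose predicted emotion matches the emotion of
    the gold (human-labeled) story; usable as a sequence-level RL reward."""
    matches = sum(classify(g) == classify(h) for g, h in zip(generated, gold))
    return matches / len(gold)

def reinforce_loss(log_prob_sum: float, reward: float, baseline: float) -> float:
    """REINFORCE-style loss: the reward advantage scales the story's
    summed token log-likelihood (hypothetical objective, for illustration)."""
    return -(reward - baseline) * log_prob_sum

if __name__ == "__main__":
    gen = ["she was delighted to see the cake .", "he wept at the station ."]
    ref = ["she was so happy at the party .", "he was sad to leave ."]
    print(f"reward = {emotion_consistency_reward(gen, ref):.2f}")  # reward = 1.00
```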


Notes

  1. We build the emotion vocabulary from the NRC Affect Intensity Lexicon [15]. The vocabulary contains four emotion categories: anger, fear, sadness, and happiness (see the sketch after these notes).

  2. https://github.com/utterworks/fast-bert.

  3. https://github.com/atcbosselut/comet-commonsense.

  4. Note that the gold story refers to the manually annotated story in the VIST dataset.

  5. https://visionandlanguage.net/VIST/.

  6. https://competitions.codalab.org/competitions/17751.
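For concreteness, here is a minimal sketch of how the footnote-1 emotion vocabulary could be assembled from the lexicon file. The tab-separated "term, score, emotion" layout, the `min_intensity` threshold, and the mapping of the lexicon's joy dimension onto the paper's happiness category are all assumptions of this sketch, not details given by the paper.

```python
# Minimal sketch: bucket NRC Affect Intensity Lexicon [15] terms into the
# four categories used by the paper. Assumes tab-separated lines of the
# form "term<TAB>score<TAB>emotion"; verify against the header of the
# lexicon release you download.

from collections import defaultdict

def build_emotion_vocabulary(path: str,
                             min_intensity: float = 0.5) -> dict[str, set[str]]:
    """Group lexicon terms by emotion, keeping only high-intensity words.

    The 0.5 threshold is an illustrative choice, and the lexicon's 'joy'
    dimension is renamed to 'happiness' to match the paper's categories.
    """
    rename = {"anger": "anger", "fear": "fear",
              "sadness": "sadness", "joy": "happiness"}
    vocab: dict[str, set[str]] = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            term, score, emotion = parts
            try:
                intensity = float(score)
            except ValueError:
                continue  # skip the header row
            if emotion in rename and intensity >= min_intensity:
                vocab[rename[emotion]].add(term)
    return dict(vocab)
```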

References

  1. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72 (2005)

  2. Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., Choi, Y.: COMET: commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317 (2019)

  3. Brahman, F., Chaturvedi, S.: Modeling protagonist emotions for emotion-aware storytelling. arXiv preprint arXiv:2010.06822 (2020)

  4. Chen, H., Huang, Y., Takamura, H., Nakayama, H.: Commonsense knowledge aware concept selection for diverse and informative visual storytelling. arXiv preprint arXiv:2102.02963 (2021)

  5. Gu, S., Wang, W., Wang, F., Huang, J.H.: Neuromodulator and emotion biomarker for stress induced mental disorders. Neural Plasticity 2016 (2016)

  6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  7. Hsu, C.C., et al.: Knowledge-enriched visual storytelling. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 7952–7960 (2020)

  8. Hsu, C.Y., Chu, Y.W., Huang, T.H., Ku, L.W.: Plot and rework: modeling storylines for visual storytelling. arXiv preprint arXiv:2105.06950 (2021)

  9. Hu, J., Cheng, Y., Gan, Z., Liu, J., Gao, J., Neubig, G.: What makes a good story? Designing composite rewards for visual storytelling. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7969–7976 (2020)

  10. Huang, Q., Gan, Z., Celikyilmaz, A., Wu, D., Wang, J., He, X.: Hierarchically structured reinforcement learning for topically coherent visual story generation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8465–8472 (2019)

  11. Huang, T.H., et al.: Visual storytelling. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1233–1239 (2016)

  12. Jung, Y., Kim, D., Woo, S., Kim, K., Kim, S., Kweon, I.S.: Hide-and-tell: learning to bridge photo streams for visual storytelling. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11213–11220 (2020)

  13. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)

  14. Mohammad, S., Bravo-Marquez, F., Salameh, M., Kiritchenko, S.: SemEval-2018 Task 1: affect in tweets. In: Proceedings of the 12th International Workshop on Semantic Evaluation, pp. 1–17 (2018)

  15. Mohammad, S.M.: Word affect intensities. arXiv preprint arXiv:1704.08798 (2017)

  16. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)

  17. Qi, M., Qin, J., Huang, D., Shen, Z., Yang, Y., Luo, J.: Latent memory-augmented graph transformer for visual storytelling. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 4892–4901 (2021)

  18. Ranzato, M., Chopra, S., Auli, M., Zaremba, W.: Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015)

  19. Sap, M., et al.: ATOMIC: an atlas of machine commonsense for if-then reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3027–3035 (2019)

  20. Speer, R., Chin, J., Havasi, C.: ConceptNet 5.5: an open multilingual graph of general knowledge. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)

  21. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)

  22. Wang, R., Wei, Z., Li, P., Zhang, Q., Huang, X.: Storytelling from an image stream using scene graphs. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 9185–9192 (2020)

  23. Wang, X., Chen, W., Wang, Y.F., Wang, W.Y.: No metrics are perfect: adversarial reward learning for visual storytelling. arXiv preprint arXiv:1804.09160 (2018)

  24. Xu, C., Yang, M., Li, C., Shen, Y., Ao, X., Xu, R.: Imagine, reason and write: visual storytelling with graph knowledge and relational reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3022–3029 (2021)


Acknowledgements

Supported by the National Natural Science Foundation of China (Nos. 61972059, 61773272, 61602332); the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 19KJA230001); the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (No. 93K172016K08); and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).

Author information


Corresponding author

Correspondence to Yi Ji.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, X., Cai, H., Jiang, T., Liu, C., Ji, Y. (2022). Emotion Aware Reinforcement Network for Visual Storytelling. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13530. Springer, Cham. https://doi.org/10.1007/978-3-031-15931-2_3


  • DOI: https://doi.org/10.1007/978-3-031-15931-2_3


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15930-5

  • Online ISBN: 978-3-031-15931-2

  • eBook Packages: Computer Science, Computer Science (R0)
