Connecting Vision and Language with Localized Narratives

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12350)

Included in the following conference series:

  • ECCV: European Conference on Computer Vision

Abstract

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.
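
Because every word is time-aligned with the mouse pointer, the released annotations can be paired with their trace segments in a few lines of code. Below is a minimal Python sketch of that pairing; it is not the authors' tooling, and it assumes a JSON Lines release format with fields such as `timed_caption` (per-word utterances with start/end times) and `traces` (timestamped x, y points), so the field names and the file name are illustrative assumptions.

```python
import json


def load_narratives(jsonl_path):
    """Yield one Localized Narrative annotation (a dict) per JSON line."""
    with open(jsonl_path) as f:
        for line in f:
            yield json.loads(line)


def trace_segment_for_word(narrative, word_index):
    """Return the (x, y) trace points recorded while the given word was spoken.

    Assumes each `timed_caption` entry carries `start_time`/`end_time` and each
    trace point carries a timestamp `t`, so the segment is the set of points
    whose timestamp falls inside the word's speaking interval.
    """
    word = narrative["timed_caption"][word_index]
    start, end = word["start_time"], word["end_time"]
    points = [p for trace in narrative["traces"] for p in trace]
    return [(p["x"], p["y"]) for p in points if start <= p["t"] <= end]


if __name__ == "__main__":
    # Hypothetical file name for one of the released annotation shards.
    for narrative in load_narratives("localized_narratives_shard.jsonl"):
        first_word = narrative["timed_caption"][0]["utterance"]
        print(narrative["image_id"], first_word,
              trace_segment_for_word(narrative, 0))
        break
```

Matching by timestamps rather than by word index reflects how the data was collected: it is the synchronization of voice and pointer that yields a trace segment per word.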



Author information


Corresponding author

Correspondence to Jordi Pont-Tuset.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 11,697 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Pont-Tuset, J., Uijlings, J., Changpinyo, S., Soricut, R., Ferrari, V. (2020). Connecting Vision and Language with Localized Narratives. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12350. Springer, Cham. https://doi.org/10.1007/978-3-030-58558-7_38


  • DOI: https://doi.org/10.1007/978-3-030-58558-7_38


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58557-0

  • Online ISBN: 978-3-030-58558-7

  • eBook Packages: Computer Science, Computer Science (R0)
