Visual language navigation: a survey and open challenges

Abstract

With the recent development of deep learning, AI models are widely used across domains. They perform well on well-defined tasks such as image classification and text generation, and with the advent of large generative models (e.g., BigGAN, GPT-3) they also achieve impressive results on diverse generation tasks (e.g., photo-realistic image synthesis and paragraph generation). As the performance of individual AI models improves, interest is growing in comprehensive tasks such as visual language navigation (VLN), in which an agent follows a language instruction using an egocentric view. However, integrating models for VLN is difficult because of model complexity, modal heterogeneity, and the shortage of paired data. This study provides a comprehensive survey of VLN with a systematic approach to reviewing recent trends. We first define a taxonomy of the fundamental techniques required to perform VLN. We then analyze VLN from four perspectives: representation learning, reinforcement learning, components, and evaluation, and examine the pros and cons of recently proposed components and methodologies. Unlike conventional surveys, this survey categorizes the approaches of major research institutes using the taxonomy defined from these four perspectives. Finally, we discuss current open challenges and conclude by suggesting possible future directions.
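
To make the VLN setting described above concrete, the sketch below illustrates the basic interaction loop: the agent receives a natural language instruction, observes the environment step by step, and emits discrete navigation actions until it decides to stop. It is a minimal illustration only; the toy environment, action names, and heuristic policy are invented placeholders and do not correspond to any benchmark, simulator, or model discussed in the survey.

```python
# Minimal, self-contained sketch of a VLN episode (illustrative placeholders only).
from typing import Tuple

class ToyEnv:
    """Toy 1-D corridor: the goal is `goal` forward steps from the start."""
    def __init__(self, instruction: str, goal: int = 3):
        self.instruction, self.goal, self.pos = instruction, goal, 0

    def reset(self) -> Tuple[str, int]:
        self.pos = 0
        return self.instruction, self.pos      # (language instruction, agent state)

    def step(self, action: str) -> Tuple[str, int]:
        if action == "forward":
            self.pos += 1
        return self.instruction, self.pos

    def success(self) -> bool:
        return self.pos == self.goal

def heuristic_agent(instruction: str, pos: int) -> str:
    # Stand-in for a learned policy: real VLN agents ground the instruction in
    # egocentric visual features and are trained with imitation or reinforcement learning.
    return "forward" if pos < 3 else "stop"

def run_episode(env: ToyEnv, max_steps: int = 10) -> bool:
    instruction, pos = env.reset()
    for _ in range(max_steps):
        action = heuristic_agent(instruction, pos)
        if action == "stop":                   # the agent decides it has arrived
            break
        instruction, pos = env.step(action)
    return env.success()

print(run_episode(ToyEnv("walk three steps down the corridor and stop")))  # True
```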

Figs. 1–8 (including figures adapted from Baker et al. 2019 and Shah et al. 2018)

References

  • Abbasnejad E, Teney D, Parvaneh A, Shi J, Hengel AVD (2020) Counterfactual vision and language learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10044–10054

  • Agarwal R, Schuurmans D, Norouzi M (2020) An optimistic perspective on offline reinforcement learning. In: International Conference on Machine Learning. PMLR, pp 104–114

  • Alamri H, Hori C, Marks TK, Batra D, Parikh D (2018) Audio visual scene-aware dialog (AVSD) track for natural language generation in DSTC7. In: DSTC7 at AAAI 2019 Workshop 2

  • Alikhani M, Sharma P, Li S, Soricut R, Stone M (2020) Clue: Cross-modal coherence modeling for caption generation. In: Association for Computational Linguistics (ACL), 2020

  • Ammanabrolu P, Hausknecht M (2020) Graph constrained reinforcement learning for natural language action spaces. In: International Conference on Learning Representations (ICLR), 2020

  • Anand A, Belilovsky E, Kastner K, Larochelle H, Courville A (2018) Blindfold baselines for embodied QA. In: NIPS 2018 Visually-Grounded Interaction and Language (ViGilL) Workshop

  • Anderson P et al (2018) Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3674–3683

  • Arumugam D, Karamcheti S, Gopalan N, Williams EC, Rhee M, Wong LL, Tellex S (2019) Grounding natural language instructions to semantic goal representations for abstraction and generalization. Auton Robot 43(2):449–468

  • Baker B, Kanitscheider I, Markov T, Wu Y, Powell G, McGrew B, Mordatch I (2019) Emergent tool use from multi-agent autocurricula. In: International Conference on Learning Representations, 2019

  • Banino A et al (2018) Vector-based navigation using grid-like representations in artificial agents. Nature 557(7705):429–433

  • Banino A et al (2020) Memo: a deep network for flexible combination of episodic memories. In: International Conference on Learning Representations (ICLR), 2020

  • Batra D et al (2020) Objectnav revisited: on evaluation of embodied agents navigating to objects. CoRR 2020

  • Bear DM et al (2020) Learning physical graph representations from visual scenes. In: 34th Conference on Neural Information Processing Systems (NeurIPS), 2020

  • Bertasius G, Torresani L (2020) COBE: contextualized object embeddings from narrated instructional video. In: NeurIPS 2020

  • Blukis V, Brukhim N, Bennett A, Knepper RA, Artzi Y (2018) Following high-level navigation instructions on a simulated quadcopter with imitation learning. Robot Sci Syst (RSS)

  • Blukis V, Terme Y, Niklasson E, Knepper RA, Artzi Y (2019) Learning to map natural language instructions to physical quadcopter control using simulated flight. In: Conference on Robot Learning (CoRL) 2019

  • Blukis V, Knepper RA, Artzi Y (2020) Few-shot object grounding and mapping for natural language robot instruction following. In: 4th Conference on Robot Learning (CoRL 2020)

  • Brown TB et al (2020) Language models are few-shot learners. In: 34th Conference on Neural Information Processing Systems (NeurIPS), 2020

  • Bruce J, Sünderhauf N, Mirowski P, Hadsell R, Milford M (2018) Learning deployable navigation policies at kilometer scale from a single traversal. In: Proceedings of The 2nd Conference on Robot Learning. PMLR 87, pp 346–361

  • Cangea C, Belilovsky E, Liò P, Courville A (2019) VideoNavQA: bridging the gap between visual and embodied question answering. In: BMVC 2019

  • Cerda-Mardini P, Araujo V, Soto A (2020) Translating natural language instructions for behavioral robot navigation with a multi-head attention mechanism. In: ACL 2020 WiNLP Workshop

  • Chang M, Gupta A, Gupta S (2020) Semantic visual navigation by watching youtube videos. In: NeurIPS 2020

  • Chaplot DS, Gandhi D, Gupta S, Gupta A, Salakhutdinov R (2020a) Learning to explore using active neural SLAM. In: International Conference on Learning Representations (ICLR), 2020

  • Chaplot DS, Gandhi DP, Gupta A, Salakhutdinov RR (2020b) Object goal navigation using goal-oriented semantic exploration. Adv Neural Inf Process Syst 33:4247

  • Chaplot DS, Jiang H, Gupta S, Gupta A (2020c) Semantic curiosity for active visual learning. In: European Conference on Computer Vision. Springer, Cham, pp 309–326

  • Chaplot DS, Salakhutdinov R, Gupta A, Gupta S (2020d) Neural topological slam for visual navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12875–12884

  • Chen B, Song S, Lipson H, Vondrick C (2019a) Visual hide and seek. In: Artificial Life Conference Proceedings. One Rogers Street, MIT Press, Cambridge, MA

  • Chen H, Suhr A, Misra D, Snavely N, Artzi Y (2019b) Touchdown: natural language navigation and spatial reasoning in visual street environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12538–12547

  • Chen B et al (2020a) Robust policies via mid-level visual representations: an experimental study in manipulation and navigation. CoRL 2020

  • Chen C et al (2020b) Soundspaces: audio-visual navigation in 3d environments. In: Computer Vision–ECCV 2020: 16th European Conference, vol 16. Springer, pp 17–36

  • Chen H, Ding G, Liu X, Lin Z, Liu J, Han J (2020c) Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12655–12663

  • Chen V, Gupta A, Marino K (2020d) Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning. In: ICLR 2021

  • Chen Y, Tian Y, He M (2020e) Monocular human pose estimation: a survey of deep learning-based methods. Comput Vision Image Underst 192:102897

  • Chen W, Gan Z, Li L, Cheng Y, Wang W, Liu J (2021) Meta module network for compositional visual reasoning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 655–664

  • Chevalier-Boisvert M, Bahdanau D, Lahlou S, Willems L, Saharia C, Nguyen TH, Bengio Y (2019) BabyAI: first steps towards grounded language learning with a human in the loop. In: International Conference on Learning Representations, p 105

  • Chu YW, Lin KY, Hsu CC, Ku LW (2020) Multi-step joint-modality attention network for scene-aware dialogue system. In: DSTC8 collocated with Association for the Advancement of Artificial Intelligence (AAAI) 2020

  • Colas C, Akakzia A, Oudeyer PY, Chetouani M, Sigaud O (2020a) Language-conditioned goal generation: a new approach to language grounding for RL. In: ICML 2020 Workshop

  • Colas C, Karch T, Lair N, Dussoux JM, Moulin-Frier C, Dominey PF, Oudeyer PY (2020b) Language as a cognitive tool to imagine goals in curiosity-driven exploration. In: NeurIPS 2020

  • Co-Reyes JD et al (2019) Guiding policies with language via meta-learning. In: International Conference on Learning Representations (ICLR), 2019

  • Crook PA, Poddar S, De A, Shafi S, Whitney D, Geramifard A, Subba R (2019) SIMMC: situated Interactive Multi-Modal Conversational Data Collection And Evaluation Platform. In: ASRU 2019

  • Das A, Datta S, Gkioxari G, Lee S, Parikh D, Batra D (2018a) Embodied question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1–10

  • Das A, Gkioxari G, Lee S, Parikh D, Batra D (2018b) Neural modular control for embodied question answering. In: Conference on Robot Learning. PMLR, pp 53–62

  • Das A et al. (2020) Probing emergent semantics in predictive agents via question answering. In: International Conference on Machine Learning (ICML), 2020

  • Datta S, Sikka K, Roy A, Ahuja K, Parikh D, Divakaran A (2019) Align2ground: Weakly supervised phrase grounding guided by image-caption alignment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2601–2610

  • Dean V, Tulsiani S, Gupta A (2020) See, hear, explore: curiosity via audio-visual association. In: NeurIPS 2020

  • Deitke M et al (2020) Robothor: an open simulation-to-real embodied ai platform. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3164–3174

  • Deng Z, Narasimhan K, Russakovsky O (2020) Evolving graphical planner: contextual global planning for vision-and-language navigation. In: NeurIPS 2020

  • Do V, Camburu OM, Akata Z, Lukasiewicz T (2020) e-SNLI-VE-2.0: Corrected Visual-Textual Entailment with Natural Language Explanations. In: IEEE CVPR Workshop on Fair, Data Efficient and Trusted Computer Vision

  • Du H, Yu X, Zheng L (2020) Learning object relation graph and tentative policy for visual navigation. In: European Conference on Computer Vision. Springer, Cham, pp 19–34

  • Engelcke M, Kosiorek AR, Parker Jones O, Posner H (2020) GENESIS: generative scene inference and sampling of object-centric latent representations. In: Proceedings of the ICLR, 2020

  • Eysenbach B, Gupta A, Ibarz J, Levine S (2019) Diversity is all you need: learning skills without a reward function. In: ICLR 2019 Conference 752

  • Fan A, Jernite Y, Perez E, Grangier D, Weston J, Auli M (2019) ELI5: long form question answering. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

  • Fan A et al (2020) Generating interactive worlds with text. Proc AAAI Conf Artif Intell 34(02):1693–1700

  • Fang K, Toshev A, Fei-Fei L, Savarese S (2019) Scene memory transformer for embodied agents in long-horizon tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 538–547

  • Fang Z, Gokhale T, Banerjee P, Baral C, Yang Y (2020) Video2commonsense: generating commonsense descriptions to enrich video captioning. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

  • Feng Q, Ablavsky V, Bai Q, Li G, Sclaroff S (2020) Real-time visual object tracking with natural language description. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 700–709

  • Fried D et al (2018) Speaker-follower models for vision-and-language navigation. In: Advances in Neural Information Processing Systems

  • Fu S, Xiong K, Ge X, Tang S, Chen W, Wu Y (2020a) Quda: natural language queries for visual data analytics. CoRR 2020

  • Fu TJ, Wang XE, Peterson MF, Grafton ST, Eckstein MP, Wang WY (2020b) Counterfactual vision-and-language navigation via adversarial path sampler. In: European Conference on Computer Vision. Springer, Cham, pp 71–86

  • Gabeur V, Sun C, Alahari K, Schmid C (2020) Multi-modal transformer for video retrieval. In: Computer Vision–ECCV 2020: 16th European Conference, vol 16. Springer, Berlin, pp 214–229

  • Gafni O, Wolf L, Taigman Y (2019) Vid2game: controllable characters extracted from real-world videos. In: ICLR 2020

  • Gan C, Zhang Y, Wu J, Gong B, Tenenbaum JB (2020) Look, listen, and act: towards audio-visual embodied navigation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp 9701–9707

  • Gao R, Chen C, Al-Halah Z, Schissler C, Grauman K (2020) Visualechoes: spatial image representation learning through echolocation. In: European Conference on Computer Vision. Springer, Cham, pp 658–676

  • Garcia-Ceja E, Riegler M, Nordgreen T, Jakobsen P, Oedegaard KJ, Tørresen J (2018) Mental health monitoring with multimodal sensing and machine learning: a survey. Pervasive Mob Comput 51:1–26

  • Gidaris S, Bursuc A, Komodakis N, Pérez P, Cord M (2020) Learning representations by predicting bags of visual words. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6928–6938

  • Gordon D, Kembhavi A, Rastegari M, Redmon J, Fox D, Farhadi A (2018) Iqa: visual question answering in interactive environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4089–4098

  • Gordon D, Kadian A, Parikh D, Hoffman J, Batra D (2019) Splitnet: Sim2sim and task2task transfer for embodied visual navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1022–1031

  • Goyal P, Niekum S, Mooney RJ (2020) PixL2R: guiding reinforcement learning using natural language by mapping pixels to rewards. In: Conference on Robot Learning (CoRL) 2020

  • Gruslys A et al (2020) The advantage regret-matching actor-critic. CoRR 2020

  • Guo Y, Cheng Z, Nie L, Liu Y, Wang Y, Kankanhalli M (2019) Quantifying and alleviating the language prior problem in visual question answering. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 75–84

  • Hao W, Li C, Li X, Carin L, Gao J (2020) Towards learning a generic agent for vision-and-language navigation via pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13137–13146

  • Harish YVS, Pandya H, Gaud A, Terupally S, Shankar S, Krishna KM (2020) DFVS: deep flow guided scene agnostic image based visual servoing. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp 9000–9006

  • He Z et al (2021) ActionBert: leveraging user actions for semantic understanding of user interfaces. In: AAAI Conference on Artificial Intelligence (AAAI-21) 2021

  • Heinrich S et al (2020) Crossmodal language grounding in an embodied neurocognitive model. Front Neurorobot. https://doi.org/10.3389/fnbot.2020.00052

  • Hermann KM, Malinowski M, Mirowski P, Banki-Horvath A, Anderson K, Hadsell R (2020) Learning to follow directions in street view. Proc AAAI Conf Artif Intell 34(07):11773–11781

  • Hill F, Tieleman O, von Glehn T, Wong N, Merzic H, Clark S (2020) Grounded language learning fast and slow. In: ICLR 2021

  • Hong R, Liu D, Mo X, He X, Zhang H (2019) Learning to compose and reason with language tree structures for visual grounding. IEEE Trans Pattern Anal Mach Intell 44:684

  • Hong Y, Rodriguez-Opazo C, Qi Y, Wu Q, Gould S (2020) Language and visual entity relationship graph for agent navigation. In: NeurIPS 2020

  • Hu H, Yarats D, Gong Q, Tian Y, Lewis M (2019) Hierarchical decision making by generating and following natural language instructions. In: Advances in neural information processing systems, 2019

  • Hu R, Singh A, Darrell T, Rohrbach M (2020) Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9992–10002

  • Huang H, Jain V, Mehta H, Ku A, Magalhaes G, Baldridge J, Ie E (2019) Transferable representation learning in vision-and-language navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 7404–7413

  • Hutsebaut-Buysse M, Mets K, Latré S (2020) Pre-trained word embeddings for goal-conditional transfer learning in reinforcement learning. In: International Conference on Machine Learning (ICML) 2020 Language in Reinforcement Learning (LaReL) Workshop

  • Ilinykh N, Zarrieß S, Schlangen D (2019) Meetup! a corpus of joint activity dialogues in a visual environment. In: Proceedings of the 23rd Workshop on the Semantics and Pragmatics of Dialogue (semdial/LondonLogue)

  • Jaderberg M, Mnih V, Czarnecki WM, Schaul T, Leibo JZ, Silver D, Kavukcuoglu K (2017) Reinforcement learning with unsupervised auxiliary tasks. In: ICLR 2017

  • Jain U et al (2019) Two body problem: collaborative visual task completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6689–6699

  • Jaunet T, Vuillemot R, Wolf C (2020) DRLViz: understanding decisions and memory in deep reinforcement learning. Comput Gr Forum 39(3):49–61

  • Ji J, Krishna R, Fei-Fei L, Niebles JC (2020) Action genome: actions as compositions of spatio-temporal scene graphs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10236–10247

  • Jia B, Chen Y, Huang S, Zhu Y, Zhu SC (2020) Lemma: a multi-view dataset for learning multi-agent multi-task activities. In: European Conference on Computer Vision. Springer, Cham, pp 767–786

  • Jiang Y, Gu SS, Murphy KP, Finn C (2019) Language as an abstraction for hierarchical deep reinforcement learning. Adv Neural Inf Process Syst 32:9419–9431

  • Jiang H, Misra I, Rohrbach M, Learned-Miller E, Chen X (2020a) In defense of grid features for visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10267–10276

  • Jiang M, Luketina J, Nardelli N, Minervini P, Torr PH, Whiteson S, Rocktäschel T (2020b) WordCraft: an environment for benchmarking commonsense agents. In: ICML 2020 Workshop

  • Joze HRV, Shaban A, Iuzzolino ML, Koishida K (2020) MMTM: multimodal transfer module for CNN fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13289–13299

  • Juliani A et al (2019) Obstacle tower: a generalization challenge in vision, control, and planning. In: International Joint Conferences on Artificial Intelligence (IJCAI), 2019

  • Kadian A et al (2019) Are we making real progress in simulated environments? Measuring the sim2real gap in embodied visual navigation. In: IEEE Robotics and Automation Letters (RA-L), 2019

  • Karch T, Lair N, Colas C, Dussoux JM, Moulin-Frier C, Dominey PF, Oudeyer PY (2020) Language-goal imagination to foster creative exploration in Deep RL. In: ICML 2020 Workshop

  • Khetarpal K, Ahmed Z, Comanici G, Abel D, Precup D (2020) What can I do here? A theory of affordances in reinforcement learning. In: International Conference on Machine Learning. PMLR, pp 5243–5253

  • Kipf T et al (2019) Compile: Compositional imitation learning and execution. In: International Conference on Machine Learning. PMLR, pp 3418–3428

  • Koh JY, Baldridge J, Lee H, Yang Y (2021) Text-to-image generation grounded by fine-grained user attention. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 237–246

  • Krantz J, Wijmans E, Majumdar A, Batra D, Lee S (2020) Beyond the nav-graph: vision-and-language navigation in continuous environments. In: European Conference on Computer Vision. Springer, Cham, pp 104–120

  • Kreutzer J, Riezler S, Lawrence C (2020) Learning from human feedback: Challenges for real-world reinforcement learning in nlp. In: Real-World RL Workshop at NeurIPS, 2020

  • Ku A, Anderson P, Patel R, Ie E, Baldridge J (2020) Room-across-room: multilingual vision-and-language navigation with dense spatiotemporal grounding. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

  • Kulhánek J, Derner E, De Bruin T, Babuška R (2019) Vision-based navigation using deep reinforcement learning. In: 2019 European Conference on Mobile Robots (ECMR). IEEE, pp 1–8

  • Landi F, Baraldi L, Corsini M, Cucchiara R (2019) Embodied vision-and-language navigation with dynamic convolutional filters. In: The British Machine Vision Conference (BMVC), 2019

  • Landi F, Baraldi L, Cornia M, Corsini M, Cucchiara R (2021) Multimodal attention networks for low-level vision-and-language navigation. Comput Vision Image Underst 210:103255

  • Le H, Chen NF (2020) Multimodal transformer with pointer network for the dstc8 avsd challenge. In: DSTC Workshop at Association for the Advancement of Artificial Intelligence (AAAI), 2020

  • Le H, Hoi SC (2020) Video-grounded dialogues with pretrained generation language models. Assoc Comput Linguist (ACL). https://doi.org/10.48550/arXiv.2006.15319

  • Lewis M, Fan A (2018) Generative question answering: learning to answer the whole question. In: International Conference on Learning Representations

  • Li Y, Košecka J (2020) Learning view and target invariant visual servoing for navigation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp 658–664

  • Li A, Hu H, Mirowski P, Farajtabar M (2019a) Cross-view policy learning for street navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 8100–8109

  • Li J, Tang S, Wu F, Zhuang Y (2019b) Walking with mind: Mental imagery enhanced embodied qa. In: Proceedings of the 27th ACM International Conference on Multimedia, pp 1211–1219

  • Li A, Bansal S, Giovanis G, Tolani V, Tomlin C, Chen M (2020a) Generating robust supervision for learning-based visual navigation using hamilton-jacobi reachability. In: Learning for Dynamics and Control. PMLR, pp 500–510

  • Li D, Yu X, Xu C, Petersson L, Li H (2020b) Transferring cross-domain knowledge for video sign language recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6205–6214

  • Li J, Wang X, Tang S, Shi H, Wu F, Zhuang Y, Wang WY (2020c) Unsupervised reinforcement learning of transferable meta-skills for embodied navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12123–12132

  • Li L, Chen YC, Cheng Y, Gan Z, Yu L, Liu J (2020d) Hero: hierarchical encoder for video+language omni-representation pre-training. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

  • Li S, Chaplot DS, Tsai YHH, Wu Y, Morency LP, Salakhutdinov R (2020e) Unsupervised domain adaptation for visual navigation. In: Deep Reinforcement Learning Workshop at NeurIPS, 2020

  • Li Z, Li Z, Zhang J, Feng Y, Zhou J (2021) Bridging text and video: a universal multimodal transformer for video-audio scene-aware dialog. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing

  • Liang M, Yang B, Zeng W, Chen Y, Hu R, Casas S, Urtasun R (2020) Pnpnet: end-to-end perception and prediction with tracking in the loop. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11553–11562

  • Lin AS, Wu L, Corona R, Tai K, Huang Q, Mooney RJ (2018) Generating animated videos of human activities from natural language descriptions. Learning 1

  • Liu A et al (2020a) Spatiotemporal attacks for embodied agents. In: European Conference on Computer Vision. Springer, Cham, pp 122–138

  • Liu J, Chen W, Cheng Y, Gan Z, Yu L, Yang Y, Liu J (2020b) Violin: a large-scale dataset for video-and-language inference. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10900–10910

  • Liu YT, Li YJ, Wang YCF (2020c) Transforming multi-concept attention into video summarization. In: Proceedings of the Asian Conference on Computer Vision

  • Loynd R, Fernandez R, Celikyilmaz A, Swaminathan A, Hausknecht M (2020) Working memory graphs. In: International Conference on Machine Learning. PMLR, pp 6404–6414

  • Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems 2019

  • Lu J, Goswami V, Rohrbach M, Parikh D, Lee S (2020) 12-in-1: Multi-task vision and language representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10437–10446

  • Ma CY, Lu J, Wu Z, AlRegib G, Kira Z, Socher R, Xiong C (2019a) Self-monitoring navigation agent via auxiliary progress estimation. In: International Conference on Learning Representations (ICLR), 2019

  • Ma CY, Wu Z, AlRegib G, Xiong C, Kira Z (2019b) The regretful agent: heuristic-aided navigation through progress estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6732–6740

  • Madureira B, Schlangen D (2020) An overview of natural language state representation for reinforcement learning. In: ICML 2020 Workshop on Language in Reinforcement Learning (LaReL), vol 4

  • Majumdar A, Shrivastava A, Lee S, Anderson P, Parikh D, Batra D (2020) Improving vision-and-language navigation with image-text pairs from the web. In: European Conference on Computer Vision. Springer, Cham, pp 259–274

  • Marasović A, Bhagavatula C, Park JS, Bras RL, Smith NA, Choi Y (2020) Natural language rationales with full-stack visual reasoning: from pixels to semantic frames to commonsense graphs. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings

  • Martins R, Bersan D, Campos MF, Nascimento ER (2020) Extending maps with semantic and contextual object information for robot navigation: a learning-based framework using visual and depth cues. J Intell Robot Syst 2020:1–15

  • Mei T, Zhang W, Yao T (2020) Vision and language: from visual perception to content creation. APSIPA Trans Signal Inf Process. https://doi.org/10.1017/ATSIP.2020.10

  • Miech A, Alayrac JB, Smaira L, Laptev I, Sivic J, Zisserman A (2020) End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9879–9889

  • Mirowski P et al (2017) Learning to navigate in complex environments. In: International Conference on Learning Representations (ICLR), 2017

  • Mirowski P et al (2018) Learning to navigate in cities without a map. Adv Neural Inf Process Syst 31:2419–2430

  • Mirowski P et al (2019) The streetlearn environment and dataset. CoRR 2019

  • Mogadala A, Kalimuthu M, Klakow D (2021) Trends in integration of vision and language research: A survey of tasks, datasets, and methods. J Artif Intell Res 71:1183

  • Moghaddam MK, Wu Q, Abbasnejad E, Shi J (2021) Optimistic agent: accurate graph-based value estimation for more successful visual navigation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 3733–3742

  • Moon S et al (2020) Situated and interactive multimodal conversations. In: Proceedings of the 28th International Conference on Computational Linguistics, pp 1103–1121

  • Morad SD, Mecca R, Poudel RP, Liwicki S, Cipolla R (2021) Embodied visual navigation with automatic curriculum learning in real environments. IEEE Robot Autom Lett 6(2):683–690

  • Mou X, Sigouin B, Steenstra I, Su H (2020) Multimodal dialogue state tracking by qa approach with data augmentation. In: Association for the Advancement of Artificial Intelligence (AAAI) DSTC8 Workshop

  • Mshali H, Lemlouma T, Moloney M, Magoni D (2018) A survey on health monitoring systems for health smart homes. Int J Ind Ergon 66:26–56

  • Nagarajan T, Grauman K (2020) Learning affordance landscapes for interaction exploration in 3D environments. In: NeurIPS 2020

  • Nagarajan T, Li Y, Feichtenhofer C, Grauman K (2020) Ego-topo: environment affordances from egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 163–172

  • Narasimhan M, Wijmans E, Chen X, Darrell T, Batra D, Parikh D, Singh A (2020) Seeing the un-scene: Learning amodal semantic maps for room navigation. In: European Conference on Computer Vision. Springer, Cham, pp 513–529

  • Nguyen K, Daumé III H (2019a) Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

  • Nguyen K, Dey D, Brockett C, Dolan B (2019b) Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12527–12537

  • Pan X, Zhang T, Ichter B, Faust A, Tan J, Ha S (2020) Zero-shot imitation learning from demonstrations for legged robot visual navigation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp 679–685

  • Pan J, Chen S, Shou MZ, Liu Y, Shao J, Li H (2021) Actor-context-actor relation network for spatio-temporal action localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 464–474

  • Park SM, Kim YG (2021) Survey and challenges of story generation models-A multimodal perspective with five steps: data embedding, topic modeling, storyline generation, draft story generation, and story evaluation. Inf Fusion 67:41–63

  • Patel R, Rodriguez-Sanchez R, Konidaris G (2020) On the relationship between structure in natural language and models of sequential decision processes. In: The 1st Workshop on Language in Reinforcement Learning, International Conference on Machine Learning (ICML), 2020

  • Patro B, Namboodiri VP (2018) Differential attention for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7680–7688

  • Perez E, Lewis P, Yih WT, Cho K, Kiela D (2020) Unsupervised question decomposition for question answering. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

  • Prabhudesai M, Tung HYF, Javed SA, Sieb M, Harley AW, Fragkiadaki K (2020) Embodied language grounding with 3d visual feature representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2220–2229

  • Puig X et al (2020) Watch-and-help: a challenge for social perception and human-AI collaboration. In: ICLR2021

  • Qi W, Mullapudi RT, Gupta S, Ramanan D (2020a) Learning to move with affordance maps. In: International Conference on Learning Representations (ICLR), 2020

  • Qi Y, Pan Z, Zhang S, van den Hengel A, Wu Q (2020b) Object-and-action aware model for visual language navigation. In: Computer Vision–ECCV 2020: 16th European Conference, vol 16. Springer, pp 303–317

  • Qi Y, Wu Q, Anderson P, Wang X, Wang WY, Shen C, Hengel AVD (2020c) Reverie: remote embodied visual referring expression in real indoor environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9982–9991

  • Qiu Y, Pal A, Christensen HI (2020) Target driven visual navigation exploiting object relationships. In: CoRL 2020

  • Raffel C et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21:1–67

  • Ramakrishnan SK, Al-Halah Z, Grauman K (2020) Occupancy anticipation for efficient exploration and navigation. In: European Conference on Computer Vision. Springer, Cham, pp 400–418

  • Rao M, Raju A, Dheram P, Bui B, Rastrow A (2020) Speech to semantics: improve asr and nlu jointly via all-neural interfaces. In: Proceedings of INTERSPEECH, 2020

  • Ren M, Iuzzolino ML, Mozer MC, Zemel RS (2020) Wandering within a world: online contextualized few-shot learning. In: ICML 2020 Workshop LifelongML

  • Ritter S, Faulkner R, Sartran L, Santoro A, Botvinick M, Raposo D (2020) Rapid task-solving in novel environments. In: ICLR 2021

  • Rosano M, Furnari A, Gulino L, Farinella GM (2020) A comparison of visual navigation approaches based on localization and reinforcement learning in virtual and real environments. In: VISIGRAPP, pp 628–635

  • Rosenberger P, Cosgun A, Newbury R, Kwan J, Ortenzi V, Corke P, Grafinger M (2020) Object-independent human-to-robot handovers using real time robotic vision. IEEE Robot Autom Lett 6(1):17–23

  • Sadhu A, Chen K, Nevatia R (2020) Video object grounding using semantic roles in language description. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10417–10427

  • Sammani F, Melas-Kyriazi L (2020) Show, edit and tell: a framework for editing image captions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4808–4816

  • Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing—NeurIPS, 2019

  • Savva M et al (2019) Habitat: a platform for embodied ai research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 9339–9347

  • Sax A, Zhang JO, Emi B, Zamir A, Savarese S, Guibas L, Malik J (2019) Learning to navigate using mid-level visual priors. In: Conference on Robot Learning, 2019

  • Shah P, Fiser M, Faust A, Kew JC, Hakkani-Tur D (2018) Follownet: robot navigation by following natural language directions with deep reinforcement learning. In: Third Machine Learning in Planning and Control of Robot Motion Workshop at ICRA, 2018

  • Shah R, Krasheninnikov D, Alexander J, Abbeel P, Dragan A (2019) The implicit preference information in an initial state. In: International Conference on Learning Representations

  • Shamsian A, Kleinfeld O, Globerson A, Chechik G (2020) Learning object permanence from video. In: European Conference on Computer Vision. Springer, Cham, pp 35–50

  • Shen WB, Xu D, Zhu Y, Guibas LJ, Fei-Fei L, Savarese S (2019) Situational fusion of visual representation for visual navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2881–2890

  • Shridhar M et al (2020) Alfred: a benchmark for interpreting grounded instructions for everyday tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10740–10749

  • Shridhar M, Yuan X, Côté MA, Bisk Y, Trischler A, Hausknecht M (2021) ALFWorld: aligning text and embodied environments for interactive learning. In: ICLR2021

  • Shuster K, Humeau S, Hu H, Bordes A, Weston J (2019) Engaging image captioning via personality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12516–12526

  • Shuster K, Urbanek J, Dinan E, Szlam A, Weston J (2020) Deploying lifelong open-domain dialogue learning. CoRR 2020

  • Sigurdsson G et al (2020) Visual grounding in video for unsupervised word translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10850–10859

  • Silva R, Vasco M, Melo FS, Paiva A, Veloso M (2020) Playing games in the Dark: an approach for cross-modality transfer in reinforcement learning. In: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, 2020

  • Singh A et al (2019) Towards vqa models that can read. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 8317–8326

  • Siriwardhana S, Weerasekera R, Nanayakkara S (2018) Target driven visual navigation with hybrid asynchronous universal successor representations. In: Deep Reinforcement Learning Workshop, NeurIPS, 2018

  • Srinivas A, Laskin M, Abbeel P (2020) Curl: Contrastive unsupervised representations for reinforcement learning. In: International Conference on Machine Learning (ICML), 2020

  • Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2020) Vl-bert: Pre-training of generic visual-linguistic representations. In: International Conference on Learning Representations (ICLR), 2020

  • Suhr A et al (2019) Executing instructions in situated collaborative interactions. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

  • Sun Z, Yu H, Song X, Liu R, Yang Y, Zhou D (2020) Mobilebert: a compact task-agnostic bert for resource-limited devices. In: ACL, 2020

  • Szlam A et al (2019) Why build an assistant in minecraft? CoRR 2019

  • Tamari R, Shani C, Hope T, Petruck MR, Abend O, Shahaf D (2020) Language (re) modelling: towards embodied language understanding. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

  • Tan H, Bansal M (2020) Vokenization: improving language understanding with contextualized, visual-grounded supervision. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

  • Tan S, Liu H, Guo D, Zhang X, Sun F (2020) Towards embodied scene description. Robot Sci Syst

  • Thomason J, Gordon D, Bisk Y (2019) Shifting the baseline: single modality performance on visual navigation qa. NAACL 2019:1977–1983

  • Thomason J, Murray M, Cakmak M, Zettlemoyer L (2020) Vision-and-dialog navigation. In: Conference on Robot Learning. PMLR, pp 394–406

  • Tsai YHH, Bai S, Liang PP, Kolter JZ, Morency LP, Salakhutdinov R (2019) Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the conference. Association for Computational Linguistics. Meeting. NIH Public Access, 2019, p 6558

  • Wang X, Xiong W, Wang H, Wang WY (2018) Look before you leap: bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 37–53

  • Wang X et al (2019a) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6629–6638

  • Wang X, Jain V, Ie E, Wang WY, Kozareva Z, Ravi S (2019b) Natural language grounded multitask navigation. In: ViGIL@ NeurIPS

  • Wang J, Zhang Y, Kim TK, Gu Y (2020a) Modelling hierarchical structure between dialogue policy and natural language generator with option framework for task-oriented dialogue system. In: ICLR 2021

  • Wang XE, Jain V, Ie E, Wang WY, Kozareva Z, Ravi S (2020b) Environment-agnostic multitask learning for natural language grounded navigation. In: Computer Vision–ECCV 2020: 16th European Conference, vol 16. Springer, pp 413–430

  • Wang Y (2021) Survey on deep multi-modal data analytics: collaboration, rivalry, and fusion. ACM Trans Multimed Comput Commun Appl TOMM 17(1s):1–25

  • Waytowich N, Barton SL, Lawhern V, Warnell G (2019) A narration-based reward shaping approach using grounded natural language commands. In: The Imitation, Intent and Interaction (I3) Workshop, ICML 2019

  • Wijmans E et al (2019) Embodied question answering in photorealistic environments with point cloud perception. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6659–6668

  • Wijmans E et al (2020) Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. In: ICML 2020 Workshop

  • Wiles O, Gkioxari G, Szeliski R, Johnson J (2020) Synsin: end-to-end view synthesis from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7467–7477

  • Wortsman M, Ehsani K, Rastegari M, Farhadi A, Mottaghi R (2019) Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6750–6759

  • Wu Y, Wu Y, Gkioxari G, Tian Y (2018a) Building generalizable agents with a realistic and rich 3d environment. In: ICLR, 2018

  • Wu Y, Wu Y, Gkioxari G, Tian Y, Tamar A, Russell S (2018b) Learning a Semantic Prior for Guided Navigation. In: European Conference on Computer Vision (ECCV), 2018

  • Wu Y, Wu Y, Tamar A, Russell S, Gkioxari G, Tian Y (2019) Bayesian relational memory for semantic visual navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2769–2779

  • Wu J, Li G, Han X, Lin L (2020a) Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1283–1291

  • Wu Q, Manocha D, Wang J, Xu K (2020b) Neonav: improving the generalization of visual navigation via generating next expected observations. Proc AAAI Conf Artif Intell 34(06):10001–10008

  • Wu SA, Wang RE, Evans JA, Tenenbaum J, Parkes DC, Kleiman-Weiner M (2020c) Too many cooks: coordinating multi-agent collaboration through inverse planning. In: CogSci

  • Xia F et al (2020) Interactive Gibson benchmark: a benchmark for interactive navigation in cluttered environments. IEEE Robot Autom Lett 5(2):713

  • Xiang F et al (2020a) Sapien: a simulated part-based interactive environment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11097–11107

  • Xiang J, Wang XE, Wang WY (2020b) Learning to stop: a simple yet effective approach to urban vision-language navigation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

  • Xie L, Markham A, Trigoni N (2020) SnapNav: learning mapless visual navigation with sparse directional guidance and visual reference. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp 1682–1688

  • Ye J, Batra D, Wijmans E, Das A (2020) Auxiliary tasks speed up learning pointgoal navigation. CoRL 2020

  • Yu H, Lian X, Zhang H, Xu W (2018) Guided feature transformation (gft): a neural language grounding module for embodied agents. In: Conference on Robot Learning. PMLR, pp 81–98

  • Yu D et al (2019a) Commonsense and semantic-guided navigation through language in embodied environment. In: ViGIL@ NeurIPS

  • Yu L, Chen X, Gkioxari G, Bansal M, Berg TL, Batra D (2019b) Multi-target embodied question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6309–6318

  • Zaheer M et al (2020) Big Bird: transformers for longer sequences. In: NeurIPS

  • Zeng F, Wang C, Ge SS (2020) A survey on visual navigation for artificial agents with deep reinforcement learning. IEEE Access 8:135426–135442

  • Zhan X, Pan X, Dai B, Liu Z, Lin D, Loy CC (2020) Self-supervised scene de-occlusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3784–3792

  • Zhang Y, Hassan M, Neumann H, Black MJ, Tang S (2020) Generating 3d people in scenes without people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6194–6204

  • Zheng L, Zhu C, Zhang J, Zhao H, Huang H, Niessner M, Xu K (2019) Active scene understanding via online semantic reconstruction. Comput Gr Forum 38(7):103–114

  • Zhong V, Rocktäschel T, Grefenstette E (2019) RTFM: generalising to novel environment dynamics via reading. In: International Conference on Learning Representations (ICLR), 2020

  • Zhou L, Small K (2020) Inverse reinforcement learning with natural language goals. CoRR 2020

  • Zhu Y, Mottaghi R, Kolve E, Lim JJ, Gupta A, Fei-Fei L, Farhadi A (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE international conference on robotics and automation (ICRA), pp 3357–3364

  • Zhu F, Zhu Y, Chang X, Liang X (2020a) Vision-language navigation with self-supervised auxiliary reasoning tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10012–10022

  • Zhu Y et al (2020b) Dark, beyond deep: a paradigm shift to cognitive ai with humanlike common sense. Engineering 6(3):310–345

  • Zhu Y, Zhu F, Zhan Z, Lin B, Jiao J, Chang X, Liang X (2020c) Vision-dialog navigation by exploring cross-modal memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10730–10739

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C2012635).

Author information

Corresponding author

Correspondence to Young-Gab Kim.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Table 8.

Table 8 List of main acronyms

About this article

Cite this article

Park, SM., Kim, YG. Visual language navigation: a survey and open challenges. Artif Intell Rev 56, 365–427 (2023). https://doi.org/10.1007/s10462-022-10174-9

  • DOI: https://doi.org/10.1007/s10462-022-10174-9

Keywords

Navigation